Commit Graph

72 Commits

Author SHA1 Message Date
Raphael Mitsch
df6e4ab055 Reformat EL test. 2022-12-12 14:05:41 +01:00
Raphael Mitsch
cb640abe81 Fix EL test. 2022-12-12 14:04:34 +01:00
Raphael Mitsch
6a30df3039 Fix newly introduced bug in EntityLinker.predict(). 2022-11-29 23:03:03 +01:00
Raphael Mitsch
60eda0d7a5 Update setup.py. Remove temporary comments. 2022-11-29 15:15:47 +01:00
Raphael Mitsch
3e668503de Finish Candidate refactoring. 2022-11-29 15:03:54 +01:00
Raphael Mitsch
4f7b535ebb Merge branch 'master' into feature/candidate-generation-by-docs
# Conflicts:
#	spacy/tests/pipeline/test_entity_linker.py
2022-10-28 10:38:18 +02:00
Paul O'Leary McCann
d61e742960
Handle Docs with no entities in EntityLinker (#11640)
* Handle docs with no entities

If a whole batch contains no entities it won't make it to the model, but
it's possible for individual Docs to have no entities. Before this
commit, those Docs would cause an error when attempting to concatenate
arrays because the dimensions didn't match.

It turns out the process of preparing the Ragged at the end of the span
maker forward was a little different from list2ragged, which just uses
the flatten function directly. Letting list2ragged do the conversion
avoids the dimension issue.

This did not come up before because in NEL demo projects it's typical
for data with no entities to be discarded before it reaches the NEL
component.

This includes a simple direct test that shows the issue and checks it's
resolved. It doesn't check if there are any downstream changes, so a
more complete test could be added. A full run was tested by adding an
example with no entities to the Emerson sample project.

* Add a blank instance to default training data in tests

Rather than adding a specific test, since not failing on instances with
no entities is basic functionality, it makes sense to add it to the
default set.

* Fix without modifying architecture

If the architecture is modified this would have to be a new version, but
this change isn't big enough to merit that.
2022-10-28 10:25:34 +02:00
Raphael Mitsch
ace5655fe1 Fix test. 2022-10-27 13:28:17 +02:00
Raphael Mitsch
ba91d0d1d9 Add test for candidate stream processing. Simplify processing of candidate streams. 2022-10-27 13:10:23 +02:00
Raphael Mitsch
b32f48c878 Change typing from Generator to Iterable. 2022-10-20 09:43:47 +02:00
Raphael Mitsch
d8183121f6 Reformat with black. 2022-10-18 17:33:37 +02:00
Raphael Mitsch
7c28424f47 Convert batched into doc-wise batched candidate generation. 2022-10-18 15:31:15 +02:00
Raphael Mitsch
1f23c615d7
Refactor KB for easier customization (#11268)
* Add implementation of batching + backwards compatibility fixes. Tests indicate issue with batch disambiguation for custom singular entity lookups.

* Fix tests. Add distinction w.r.t. batch size.

* Remove redundant and add new comments.

* Adjust comments. Fix variable naming in EL prediction.

* Fix mypy errors.

* Remove KB entity type config option. Change return types of candidate retrieval functions to Iterable from Iterator. Fix various other issues.

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/kb_base.pyx

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/kb_base.pyx

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Add error messages to NotImplementedErrors. Remove redundant comment.

* Fix imports.

* Remove redundant comments.

* Rename KnowledgeBase to InMemoryLookupKB and BaseKnowledgeBase to KnowledgeBase.

* Fix tests.

* Update spacy/errors.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move KB into subdirectory.

* Adjust imports after KB move to dedicated subdirectory.

* Fix config imports.

* Move Candidate + retrieval functions to separate module. Fix other, small issues.

* Fix docstrings and error message w.r.t. class names. Fix typing for candidate retrieval functions.

* Update spacy/kb/kb_in_memory.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix typing.

* Change typing of mentions to be Span instead of Union[Span, str].

* Update docs.

* Update EntityLinker and _architecture docs.

* Update website/docs/api/entitylinker.md

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Adjust message for E1046.

* Re-add section for Candidate in kb.md, add reference to dedicated page.

* Update docs and docstrings.

* Re-add section + reference for KnowledgeBase.get_alias_candidates() in docs.

* Update spacy/kb/candidate.pyx

* Update spacy/kb/kb_in_memory.pyx

* Update spacy/pipeline/legacy/entity_linker.py

* Remove canididate.md. Remove mistakenly added config snippet in entity_linker.py.

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-08 10:38:07 +02:00
Adriane Boyd
f55bb7470d
Clean up warnings in the test suite (#11331) 2022-08-22 12:04:30 +02:00
Raphael Mitsch
e9eb59699f
NEL confidence threshold (#11016)
* Add base for NEL abstention threshold mechanism.

* Add abstention threshold to entity linker. Add test.

* Fix entity linking tests.

* Changed abstention default threshold from 0 to None.

* Fix default values for abstention thresholds.

* Fix mypy errors.

* Replace assertion with raise of proper error code.

* Simplify threshold check. Remove thresholding from EntityLinker_v1.

* Rename test.

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/pipeline/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Make E1043 configurable.

* Update docs.

* Rephrase description in docs. Adjusting error code message.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-07-04 17:05:21 +02:00
github-actions[bot]
6172af8158
Auto-format code with black (#10857)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-05-27 10:54:54 +02:00
Paul O'Leary McCann
6be09bbd07
Fix Entity Linker with tokenization mismatches (fix #9575) (#10457)
* Add failing test

* Partial fix for issue

This kind of works. The issue with token length mismatches is gone. The
problem is that when you get empty lists of encodings to compare, it
fails because the sizes are not the same, even though they're both zero:
(0, 3) vs (0,). Not sure why that happens...

* Short circuit on empties

* Remove spurious check

The check here isn't needed now the the short circuit is fixed.

* Update spacy/tests/pipeline/test_entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use "eg", not "example"

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-05-23 20:42:26 +02:00
github-actions[bot]
1bbf232074
Auto-format code with black (#10479)
* Auto-format code with black

* Update spacy/lang/hsb/lex_attrs.py

Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-03-11 12:20:23 +01:00
Paul O'Leary McCann
61ba5450ff
Fix get_matching_ents (#10451)
* Fix get_matching_ents

Not sure what happened here - the code prior to this commit simply does
not work. It's already covered by entity linker tests, which were
succeeding in the NEL PR, but couldn't possibly succeed on master.

* Fix test

Test was indented inside another test and so doesn't seem to have been
running properly.
2022-03-07 16:56:57 +01:00
Paul O'Leary McCann
91acc3ea75
Fix entity linker batching (#9669)
* Partial fix of entity linker batching

* Add import

* Better name

* Add `use_gold_ents` option, docs

* Change to v2, create stub v1, update docs etc.

* Fix error type

Honestly no idea what the right type to use here is.
ConfigValidationError seems wrong. Maybe a NotImplementedError?

* Make mypy happy

* Add hacky fix for init issue

* Add legacy pipeline entity linker

* Fix references to class name

* Add __init__.py for legacy

* Attempted fix for loss issue

* Remove placeholder V1

* formatting

* slightly more interesting train data

* Handle batches with no usable examples

This adds a test for batches that have docs but not entities, and a
check in the component that detects such cases and skips the update step
as thought the batch were empty.

* Remove todo about data verification

Check for empty data was moved further up so this should be OK now - the
case in question shouldn't be possible.

* Fix gradient calculation

The model doesn't know which entities are not in the kb, so it generates
embeddings for the context of all of them.

However, the loss does know which entities aren't in the kb, and it
ignores them, as there's no sensible gradient.

This has the issue that the gradient will not be calculated for some of
the input embeddings, which causes a dimension mismatch in backprop.
That should have caused a clear error, but with numpyops it was causing
nans to happen, which is another problem that should be addressed
separately.

This commit changes the loss to give a zero gradient for entities not in
the kb.

* add failing test for v1 EL legacy architecture

* Add nasty but simple working check for legacy arch

* Clarify why init hack works the way it does

* Clarify use_gold_ents use case

* Fix use gold ents related handling

* Add tests for no gold ents and fix other tests

* Use aligned ents function (not working)

This doesn't actually work because the "aligned" ents are gold-only. But
if I have a different function that returns the intersection, *then*
this will work as desired.

* Use proper matching ent check

This changes the process when gold ents are not used so that the
intersection of ents in the pred and gold is used.

* Move get_matching_ents to Example

* Use model attribute to check for legacy arch

* Rename flag

* bump spacy-legacy to lower 3.0.9

Co-authored-by: svlandeg <svlandeg@github.com>
2022-03-04 09:17:36 +01:00
Lj Miranda
7d50804644
Migrate regression tests into the main test suite (#9655)
* Migrate regressions 1-1000

* Move serialize test to correct file

* Remove tests that won't work in v3

* Migrate regressions 1000-1500

Removed regression test 1250 because v3 doesn't support the old LEX
scheme anymore.

* Add missing imports in serializer tests

* Migrate tests 1500-2000

* Migrate regressions from 2000-2500

* Migrate regressions from 2501-3000

* Migrate regressions from 3000-3501

* Migrate regressions from 3501-4000

* Migrate regressions from 4001-4500

* Migrate regressions from 4501-5000

* Migrate regressions from 5001-5501

* Migrate regressions from 5501 to 7000

* Migrate regressions from 7001 to 8000

* Migrate remaining regression tests

* Fixing missing imports

* Update docs with new system [ci skip]

* Update CONTRIBUTING.md

- Fix formatting
- Update wording

* Remove lemmatizer tests in el lang

* Move a few tests into the general tokenizer

* Separate Doc and DocBin tests
2021-12-04 20:34:48 +01:00
github-actions[bot]
b0b115ff39
Auto-format code with black (#9530)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-10-22 13:03:10 +02:00
Sofie Van Landeghem
da578c3d3b
Fix kb.set_entities (#9463)
* avoid creating _vectors_table when also using c_add_vector

* write to self._vectors_table directly in set_entities
2021-10-19 09:39:17 +02:00
Adriane Boyd
86d01e9229 Tidy up with flake8: imports, comparisons, etc. 2021-06-28 12:08:15 +02:00
Adriane Boyd
5eeb25f043 Tidy up code 2021-06-28 12:08:15 +02:00
Adriane Boyd
ec71a6b572
Filter W036 for entity ruler, etc. (#8424) 2021-06-21 09:34:29 +02:00
Sofie Van Landeghem
202943bc8c
KB & NEL to/from bytes (#8113)
* unit test for pickling KB

* add pickling test for NEL

* KB to_bytes and from_bytes

* NEL to_bytes and from_bytes

* xfail pickle tests for now

* fix docs

* cleanup
2021-05-20 18:11:30 +10:00
svlandeg
d900c55061 consistently use registry as callable 2021-03-02 17:56:28 +01:00
Sofie Van Landeghem
b92f81d5da
fix NEL config and IO, and n_sents functionality (#7100)
* fix NEL config and IO, and n_sents functionality

* add docs

* fix test
2021-02-22 14:49:52 +11:00
Matthew Honnibal
f049df1715
Revert "Set annotations in update" (#6810)
* Revert "Set annotations in update (#6767)"

This reverts commit e680efc7cc.

* Fix version

* Update spacy/pipeline/entity_linker.py

* Update spacy/pipeline/entity_linker.py

* Update spacy/pipeline/tagger.pyx

* Update spacy/pipeline/tok2vec.py

* Update spacy/pipeline/tok2vec.py

* Update spacy/pipeline/transition_parser.pyx

* Update spacy/pipeline/transition_parser.pyx

* Update website/docs/api/multilabel_textcategorizer.md

* Update website/docs/api/tok2vec.md

* Update website/docs/usage/layers-architectures.md

* Update website/docs/usage/layers-architectures.md

* Update website/docs/api/transformer.md

* Update website/docs/api/textcategorizer.md

* Update website/docs/api/tagger.md

* Update spacy/pipeline/entity_linker.py

* Update website/docs/api/sentencerecognizer.md

* Update website/docs/api/pipe.md

* Update website/docs/api/morphologizer.md

* Update website/docs/api/entityrecognizer.md

* Update spacy/pipeline/entity_linker.py

* Update spacy/pipeline/multitask.pyx

* Update spacy/pipeline/tagger.pyx

* Update spacy/pipeline/tagger.pyx

* Update spacy/pipeline/textcat.py

* Update spacy/pipeline/textcat.py

* Update spacy/pipeline/textcat.py

* Update spacy/pipeline/tok2vec.py

* Update spacy/pipeline/trainable_pipe.pyx

* Update spacy/pipeline/trainable_pipe.pyx

* Update spacy/pipeline/transition_parser.pyx

* Update spacy/pipeline/transition_parser.pyx

* Update website/docs/api/entitylinker.md

* Update website/docs/api/dependencyparser.md

* Update spacy/pipeline/trainable_pipe.pyx
2021-01-25 22:18:45 +08:00
Sofie Van Landeghem
e680efc7cc
Set annotations in update (#6767)
* bump to 3.0.0rc4

* do set_annotations in component update calls

* update docs and remove set_annotations flag

* fix EL test
2021-01-20 11:49:25 +11:00
svlandeg
e94a21638e adding tests for trained models to ensure predict reproducibility 2020-10-13 21:07:13 +02:00
svlandeg
3a505e7e14 small edit to ensure the new word was indeed new 2020-10-10 21:05:28 +02:00
svlandeg
68d79796c6 add test for vocab after serializing KB 2020-10-10 20:59:48 +02:00
Ines Montani
539b0c10da Tidy up and auto-format 2020-10-10 19:14:48 +02:00
Ines Montani
bfa3931c9d
Revert added_strings change (#6236) 2020-10-10 18:55:07 +02:00
Sofie Van Landeghem
d093d6343b
TrainablePipe (#6213)
* rename Pipe to TrainablePipe

* split functionality between Pipe and TrainablePipe

* remove unnecessary methods from certain components

* cleanup

* hasattr(component, "pipe") should be sufficient again

* remove serialization and vocab/cfg from Pipe

* unify _ensure_examples and validate_examples

* small fixes

* hasattr checks for self.cfg and self.vocab

* make is_resizable and is_trainable properties

* serialize strings.json instead of vocab

* fix KB IO + tests

* fix typos

* more typos

* _added_strings as a set

* few more tests specifically for _added_strings field

* bump to 3.0.0a36
2020-10-08 21:33:49 +02:00
svlandeg
eaf5c265cb set_kb method for entity_linker 2020-10-08 10:34:01 +02:00
svlandeg
6b8bdb2d39 add init_config to nlp.create_pipe 2020-10-07 14:58:16 +02:00
Ines Montani
fa47f87924 Tidy up and auto-format 2020-09-29 21:39:28 +02:00
Ines Montani
ff9a63bfbd begin_training -> initialize 2020-09-28 21:35:09 +02:00
Sofie Van Landeghem
c7eedd3534
updates to NEL functionality (#6132)
* NEL: read sentences and ents from reference

* fiddling with sent_start annotations

* add KB serialization test

* KB write additional file with strings.json

* score_links function to calculate NEL P/R/F

* formatting

* documentation
2020-09-24 16:53:59 +02:00
Sofie Van Landeghem
e0e793be4d
fix KB IO (#6118) 2020-09-22 21:53:06 +02:00
Sofie Van Landeghem
8e7557656f
Renaming gold & annotation_setter (#6042)
* version bump to 3.0.0a16

* rename "gold" folder to "training"

* rename 'annotation_setter' to 'set_extra_annotations'

* formatting
2020-09-09 10:31:03 +02:00
Sofie Van Landeghem
60f22e1800
Pipe API (#6034)
* ensure Language passes on valid examples for initialization

* fix tagger model initialization

* check for valid get_examples across components

* assume labels were added before begin_training

* fix senter initialization

* fix morphologizer initialization

* use methods to check arguments

* test textcat init, requires thinc>=8.0.0a31

* fix tok2vec init

* fix entity linker init

* use islice

* fix simple NER

* cleanup debug model

* fix assert statements

* fix tests

* throw error when adding a label if the output layer can't be resized anymore

* fix test

* add failing test for simple_ner

* UX improvements

* morphologizer UX

* assume begin_training gets a representative set and processes the labels

* remove assumptions for output of untrained NER model

* restore test for original purpose
2020-09-08 22:44:25 +02:00
Ines Montani
5afe6447cd registry.assets -> registry.misc 2020-09-03 17:31:14 +02:00
Sofie Van Landeghem
358cbb21e3
Define candidate generator in EL config (#5876)
* candidate generator as separate part of EL config

* update comment

* ent instead of str as input for candidate generation

* Span instead of str: correct type indication

* fix types

* unit test to create new candidate generator

* fix replace_pipe argument passing

* move error message, general cleanup

* add vocab back to KB constructor

* provide KB as callable from Vocab arg

* rename to kb_loader, fix KB serialization as part of the EL pipe

* fix typo

* reformatting

* cleanup

* fix comment

* fix wrongly duplicated code from merge conflict

* rename dump to to_disk

* from_disk instead of load_bulk

* update test after recent removal of set_morphology in tagger

* remove old doc
2020-08-18 16:10:36 +02:00
Ines Montani
950832f087
Tidy up pipes (#5906)
* Tidy up pipes

* Fix init, defaults and raise custom errors

* Update docs

* Update docs [ci skip]

* Apply suggestions from code review

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>

* Tidy up error handling and validation, fix consistency

* Simplify get_examples check

* Remove unused import [ci skip]

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-11 23:29:31 +02:00
Ines Montani
e68459296d Tidy up and auto-format 2020-08-05 16:00:59 +02:00
Sofie Van Landeghem
82347110f5
Default empty KB in EL component (#5872)
* EL field documentation

* documentation consistent with docs

* default empty KB, initialize vocab separately

* formatting

* add test for changing the default entity vector length

* update comment
2020-08-04 14:34:09 +02:00