spaCy/website/docs/api
Paul O'Leary McCann 91acc3ea75
Fix entity linker batching (#9669)
* Partial fix of entity linker batching

* Add import

* Better name

* Add `use_gold_ents` option, docs

* Change to v2, create stub v1, update docs etc.

* Fix error type

Honestly no idea what the right type to use here is.
ConfigValidationError seems wrong. Maybe a NotImplementedError?

* Make mypy happy

* Add hacky fix for init issue

* Add legacy pipeline entity linker

* Fix references to class name

* Add __init__.py for legacy

* Attempted fix for loss issue

* Remove placeholder V1

* formatting

* slightly more interesting train data

* Handle batches with no usable examples

This adds a test for batches that have docs but no entities, and a
check in the component that detects such cases and skips the update step
as though the batch were empty.

* Remove todo about data verification

Check for empty data was moved further up so this should be OK now - the
case in question shouldn't be possible.

* Fix gradient calculation

The model doesn't know which entities are not in the kb, so it generates
embeddings for the context of all of them.

However, the loss does know which entities aren't in the kb, and it
ignores them, as there's no sensible gradient.

This has the issue that the gradient will not be calculated for some of
the input embeddings, which causes a dimension mismatch in backprop.
That should have caused a clear error, but with NumpyOps it produced
NaNs instead, which is another problem that should be addressed
separately.

This commit changes the loss to give a zero gradient for entities not in
the kb.

* add failing test for v1 EL legacy architecture

* Add nasty but simple working check for legacy arch

* Clarify why init hack works the way it does

* Clarify use_gold_ents use case

* Fix use gold ents related handling

* Add tests for no gold ents and fix other tests

* Use aligned ents function (not working)

This doesn't actually work because the "aligned" ents are gold-only. But
if I have a different function that returns the intersection, *then*
this will work as desired.

* Use proper matching ent check

This changes the process when gold ents are not used so that the
intersection of ents in the pred and gold is used.

* Move get_matching_ents to Example

* Use model attribute to check for legacy arch

* Rename flag

* Bump spacy-legacy lower bound to 3.0.9
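
Illustration for the "Fix gradient calculation" entry above: a minimal
sketch of a loss gradient that keeps one row per input embedding and
writes zeros for entities not in the kb, so the gradient has the same
shape as the model output. This is not the spaCy implementation. The
helper name `masked_gradient`, the use of `None` to mark entities missing
from the kb, and the squared-error gradient (chosen for simplicity) are
all assumptions made for this example.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def masked_gradient(
    predicted: Sequence[Vector], gold: Sequence[Optional[Vector]]
) -> List[Vector]:
    """Return one gradient row per predicted embedding.

    Hypothetical helper: gold[i] is the target vector for mention i,
    or None when that mention's entity is not in the kb.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    grads: List[Vector] = []
    for pred, target in zip(predicted, gold):
        if target is None:
            # No sensible gradient: emit zeros instead of dropping the row,
            # so the output keeps the same shape as the input.
            grads.append([0.0] * len(pred))
        else:
            # Gradient of 0.5 * squared error with respect to pred.
            grads.append([p - t for p, t in zip(pred, target)])
    return grads


predicted = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
gold = [[1.0, 0.0], None, [2.0, 1.0]]  # second mention is not in the kb
grads = masked_gradient(predicted, gold)
```

Dropping the rows for missing entities instead would return fewer
gradient rows than there are input embeddings, which is the dimension
mismatch described in that entry.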

Co-authored-by: svlandeg <svlandeg@github.com>
2022-03-04 09:17:36 +01:00
architectures.md Fix entity linker batching (#9669) 2022-03-04 09:17:36 +01:00
attributeruler.md Document scorers in registry and components from #8766 (#8929) 2021-08-12 12:50:03 +02:00
cli.md Fix references to config file in the docs & UX (#9961) 2022-01-04 14:31:26 +01:00
corpus.md Add shuffle parameter to Corpus API docs (#10220) 2022-02-07 14:55:53 +01:00
cython-classes.md Update docs, types and API consistency 2020-08-17 16:45:24 +02:00
cython-structs.md Update docs, types and API consistency 2020-08-17 16:45:24 +02:00
cython.md Update docs [ci skip] 2020-09-12 17:05:10 +02:00
data-formats.md Fix references to config file in the docs & UX (#9961) 2022-01-04 14:31:26 +01:00
dependencymatcher.md doc fixes 2020-09-12 17:38:54 +02:00
dependencyparser.md Merge remote-tracking branch 'upstream/master' into develop 2021-09-27 09:10:45 +02:00
doc.md Token sent attributes more consistent (#10164) 2022-02-08 08:35:37 +01:00
docbin.md Fix point typo on docbin docs (#9097) 2021-08-31 10:55:44 +02:00
entitylinker.md Fix entity linker batching (#9669) 2022-03-04 09:17:36 +01:00
entityrecognizer.md Document Tagger neg_prefix, fix typo (#9821) 2021-12-07 09:42:40 +01:00
entityruler.md Add link to pattern file info in EntityRuler.initialize docs (#10091) 2022-01-19 10:45:11 +01:00
example.md Extend score_spans for overlapping & non-labeled spans (#7209) 2021-04-08 12:19:17 +02:00
index.md Update v3 docs 2020-07-03 16:48:21 +02:00
kb.md Tidy up docs 2021-06-28 12:08:15 +02:00
language.md Merge remote-tracking branch 'upstream/develop' into chore/switch-to-master-v3.2.0 2021-11-03 15:32:18 +01:00
legacy.md Clean up loggers docs (#10351) 2022-02-25 16:29:12 +01:00
lemmatizer.md Merge remote-tracking branch 'upstream/master' into develop 2021-09-27 09:10:45 +02:00
lexeme.md fix 's typo's across code base (#8384) 2021-06-15 10:57:08 +02:00
lookups.md Update docs, types and API consistency 2020-08-17 16:45:24 +02:00
matcher.md Add ENT_IOB key to Matcher (#9649) 2022-01-20 13:18:39 +01:00
morphologizer.md Update overwrite and scorer in API docs (#9384) 2021-10-11 10:35:07 +02:00
morphology.md Document Assigned Attributes of Pipeline Components (#9041) 2021-09-01 12:09:39 +02:00
phrasematcher.md 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00
pipe.md Document scorers in registry and components from #8766 (#8929) 2021-08-12 12:50:03 +02:00
pipeline-functions.md Add doc_cleaner component (#9659) 2021-11-23 15:33:33 +01:00
scorer.md Add micro PRF for morph scoring (#9546) 2021-10-29 10:29:29 +02:00
sentencerecognizer.md Update overwrite and scorer in API docs (#9384) 2021-10-11 10:35:07 +02:00
sentencizer.md Update overwrite and scorer in API docs (#9384) 2021-10-11 10:35:07 +02:00
span.md Clarify Span.ents documentation (#10154) 2022-01-31 08:41:42 +01:00
spancategorizer.md Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
spangroup.md Warn and document spangroup.doc weakref (#8980) 2021-08-20 11:06:19 +02:00
stringstore.md Update docs, types and API consistency 2020-08-17 16:45:24 +02:00
tagger.md Document Tagger neg_prefix, fix typo (#9821) 2021-12-07 09:42:40 +01:00
textcategorizer.md Fix Scorer.score_cats for missing labels (#9443) 2021-12-29 11:04:39 +01:00
tok2vec.md Tidy up docs 2021-06-28 12:08:15 +02:00
token.md Token sent attributes more consistent (#10164) 2022-02-08 08:35:37 +01:00
tokenizer.md Tidy up docs 2021-06-28 12:08:15 +02:00
top-level.md Clean up loggers docs (#10351) 2022-02-25 16:29:12 +01:00
transformer.md Update docs for spacy-transformers v1.1 data classes (#9361) 2021-10-18 14:16:58 +02:00
vectors.md Fix Vectors.n_keys for floret vectors (#10394) 2022-03-01 09:21:25 +01:00
vocab.md 🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167) 2021-10-14 15:21:40 +02:00