Store activations in `Doc`s when `save_activations` is enabled (#11002)
* Store activations in Doc when `store_activations` is enabled
This change adds the new `activations` attribute to `Doc`. This
attribute can be used by trainable pipes to store their activations,
probabilities, and guesses for downstream users.
As an example, this change modifies the `tagger` and `senter` pipes to
add a `store_activations` option. When this option is enabled, the
probabilities and guesses are stored in `set_annotations`.
* Change type of `store_activations` to `Union[bool, List[str]]`
When the value is:
- A bool: all activations are stored when set to `True`.
- A List[str]: the activations named in the list are stored
* Formatting fixes in Tagger
* Support store_activations in spancat and morphologizer
* Make Doc.activations type visible to MyPy
* textcat/textcat_multilabel: add store_activations option
* trainable_lemmatizer/entity_linker: add store_activations option
* parser/ner: do not currently support returning activations
* Extend tagger and senter tests
So that they, like the other tests, also check that we get no
activations if no activations were requested.
* Document `Doc.activations` and `store_activations` in the relevant pipes
* Start errors/warnings at higher numbers to avoid merge conflicts
Between the master and v4 branches.
* Add `store_activations` to docstrings.
* Replace store_activations setter by set_store_activations method
Setters that take a different type than what the getter returns are still
problematic for MyPy. Replace the setter by a method, so that type inference
works everywhere.
* Use dict comprehension suggested by @svlandeg
* Revert "Use dict comprehension suggested by @svlandeg"
This reverts commit 6e7b958f7060397965176c69649e5414f1f24988.
* EntityLinker: add type annotations to _add_activations
* _store_activations: make kwarg-only, remove doc_scores_lens arg
* set_annotations: add type annotations
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* TextCat.predict: return dict
* Make the `TrainablePipe.store_activations` property a bool
This means that we can also bring back `store_activations` setter.
* Remove `TrainablePipe.activations`
We do not need to enumerate the activations anymore since `store_activations` is
`bool`.
* Add type annotations for activations in predict/set_annotations
* Rename `TrainablePipe.store_activations` to `save_activations`
* Error E1400 is not used anymore
This error was used when activations were still `Union[bool, List[str]]`.
* Change wording in API docs after store -> save change
* docs: tag (save_)activations as new in spaCy 4.0
* Fix copied line in morphologizer activations test
* Don't train in any test_save_activations test
* Rename activations
- "probs" -> "probabilities"
- "guesses" -> "label_ids", except in the edit tree lemmatizer, where
"guesses" -> "tree_ids".
* Remove unused W400 warning.
This warning was used when we still allowed the user to specify
which activations to save.
* Formatting fixes
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Replace "kb_ids" by a constant
* spancat: replace a cast by an assertion
* Fix EOF spacing
* Fix comments in test_save_activations tests
* Do not set RNG seed in activation saving tests
* Revert "spancat: replace a cast by an assertion"
This reverts commit 0bd5730d16432443a2b247316928d4f789ad8741.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-13 10:51:12 +03:00
from typing import cast
import pytest
from numpy.testing import assert_equal, assert_almost_equal

from thinc.api import get_current_ops

from spacy import util
from spacy.training import Example
from spacy.lang.en import English
from spacy.language import Language
from spacy.tests.util import make_tempdir
from spacy.morphology import Morphology
from spacy.pipeline import TrainablePipe
from spacy.attrs import MORPH
Add overwrite settings for more components (#9050)
* Add overwrite settings for more components
For pipeline components where it's relevant and not already implemented,
add an explicit `overwrite` setting that controls whether
`set_annotations` overwrites existing annotation.
For the `morphologizer`, add an additional setting `extend`, which
controls whether the existing features are preserved.
* +overwrite, +extend: overwrite values of existing features, add any new
features
* +overwrite, -extend: overwrite completely, removing any existing
features
* -overwrite, +extend: keep values of existing features, add any new
features
* -overwrite, -extend: do not modify the existing value if set
In all cases an unset value will be set by `set_annotations`.
Preserve current overwrite defaults:
* True: morphologizer, entity linker
* False: tagger, sentencizer, senter
* Add backwards compat overwrite settings
* Put empty line back
Removed by accident in last commit
* Set backwards-compatible defaults in __init__
Because the `TrainablePipe` serialization methods update `cfg`, there's
no straightforward way to detect whether models serialized with a
previous version are missing the overwrite settings.
It would be possible in the sentencizer due to its separate
serialization methods, however to keep the changes parallel, this also
sets the default in `__init__`.
* Remove traces
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2021-09-30 16:35:55 +03:00
from spacy.tokens import Doc


def test_label_types():
    nlp = Language()
    morphologizer = nlp.add_pipe("morphologizer")
    morphologizer.add_label("Feat=A")
    with pytest.raises(ValueError):
        morphologizer.add_label(9)


TAGS = ["Feat=N", "Feat=V", "Feat=J"]

TRAIN_DATA = [
    (
        "I like green eggs",
        {
            "morphs": ["Feat=N", "Feat=V", "Feat=J", "Feat=N"],
            "pos": ["NOUN", "VERB", "ADJ", "NOUN"],
        },
    ),
    # test combinations of morph+POS
    ("Eat blue ham", {"morphs": ["Feat=V", "", ""], "pos": ["", "ADJ", ""]}),
]


def test_label_smoothing():
    nlp = Language()
    morph_no_ls = nlp.add_pipe("morphologizer", "no_label_smoothing")
    morph_ls = nlp.add_pipe(
        "morphologizer", "label_smoothing", config=dict(label_smoothing=0.05)
    )
    train_examples = []
    losses = {}
    for tag in TAGS:
        morph_no_ls.add_label(tag)
        morph_ls.add_label(tag)
    for t in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))

    nlp.initialize(get_examples=lambda: train_examples)
    tag_scores, bp_tag_scores = morph_ls.model.begin_update(
        [eg.predicted for eg in train_examples]
    )
    ops = get_current_ops()
    no_ls_grads = ops.to_numpy(morph_no_ls.get_loss(train_examples, tag_scores)[1][0])
    ls_grads = ops.to_numpy(morph_ls.get_loss(train_examples, tag_scores)[1][0])
    assert_almost_equal(ls_grads / no_ls_grads, 0.94285715)


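The constant 0.94285715 asserted in test_label_smoothing can be reproduced by hand under a few assumptions that the test itself does not state: that the freshly initialized model predicts a uniform distribution, that the component ends up with 8 labels, and that smoothing gives the true class 1 - eps and each other class eps / (n - 1). The helper names below are hypothetical and this is only a pure-Python sketch of that arithmetic, not spaCy's or thinc's implementation.

```python
from typing import List


def smooth_one_hot(target: List[float], smoothing: float) -> List[float]:
    """Smooth a one-hot row: the true class gets 1 - smoothing,
    every other class gets smoothing / (n - 1). Assumed scheme."""
    n = len(target)
    return [1.0 - smoothing if t == 1 else smoothing / (n - 1) for t in target]


def gradient(probs: List[float], target: List[float]) -> List[float]:
    """Gradient of cross-entropy w.r.t. the logits: probs - target."""
    return [p - t for p, t in zip(probs, target)]


n_labels = 8  # assumption: label count after initialization
probs = [1.0 / n_labels] * n_labels  # assumption: uniform initial predictions
one_hot = [1.0] + [0.0] * (n_labels - 1)

plain = gradient(probs, one_hot)
smoothed = gradient(probs, smooth_one_hot(one_hot, 0.05))

# Every element has the same ratio, 1 - eps * n / (n - 1).
ratios = [s / p for s, p in zip(smoothed, plain)]
print(ratios)
```

Under these assumptions the ratio is the same for every class, which is why the test can compare a whole gradient array against a single scalar.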
def test_no_label():
    nlp = Language()
    nlp.add_pipe("morphologizer")
    with pytest.raises(ValueError):
        nlp.initialize()


def test_implicit_label():
    nlp = Language()
    nlp.add_pipe("morphologizer")
    train_examples = []
    for t in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
    nlp.initialize(get_examples=lambda: train_examples)


def test_is_distillable():
    nlp = English()
    morphologizer = nlp.add_pipe("morphologizer")
    assert morphologizer.is_distillable


def test_no_resize():
    nlp = Language()
    morphologizer = nlp.add_pipe("morphologizer")
    morphologizer.add_label("POS" + Morphology.FIELD_SEP + "NOUN")
    morphologizer.add_label("POS" + Morphology.FIELD_SEP + "VERB")
    nlp.initialize()
    # this throws an error because the morphologizer can't be resized after initialization
    with pytest.raises(ValueError):
        morphologizer.add_label("POS" + Morphology.FIELD_SEP + "ADJ")


def test_initialize_examples():
    nlp = Language()
    morphologizer = nlp.add_pipe("morphologizer")
    morphologizer.add_label("POS" + Morphology.FIELD_SEP + "NOUN")
    train_examples = []
    for t in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
    # you shouldn't really call this more than once, but for testing it should be fine
    nlp.initialize()
    nlp.initialize(get_examples=lambda: train_examples)
    with pytest.raises(TypeError):
        nlp.initialize(get_examples=lambda: None)
    with pytest.raises(TypeError):
        nlp.initialize(get_examples=train_examples)


def test_overfitting_IO():
    # Simple test to try and quickly overfit the morphologizer - ensuring the ML models work correctly
    nlp = English()
    nlp.add_pipe("morphologizer")
    train_examples = []
    for inst in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(inst[0]), inst[1]))
    optimizer = nlp.initialize(get_examples=lambda: train_examples)

    for i in range(50):
        losses = {}
        nlp.update(train_examples, sgd=optimizer, losses=losses)
    assert losses["morphologizer"] < 0.00001

    # test the trained model
    test_text = "I like blue ham"
    doc = nlp(test_text)
    gold_morphs = ["Feat=N", "Feat=V", "", ""]
    gold_pos_tags = ["NOUN", "VERB", "ADJ", ""]
    assert [str(t.morph) for t in doc] == gold_morphs
    assert [t.pos_ for t in doc] == gold_pos_tags

    # Also test the results are still the same after IO
    with make_tempdir() as tmp_dir:
        nlp.to_disk(tmp_dir)
        nlp2 = util.load_model_from_path(tmp_dir)
        doc2 = nlp2(test_text)
        assert [str(t.morph) for t in doc2] == gold_morphs
        assert [t.pos_ for t in doc2] == gold_pos_tags

    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
    texts = [
        "Just a sentence.",
        "Then one more sentence about London.",
        "Here is another one.",
        "I like London.",
    ]
    batch_deps_1 = [doc.to_array([MORPH]) for doc in nlp.pipe(texts)]
    batch_deps_2 = [doc.to_array([MORPH]) for doc in nlp.pipe(texts)]
    no_batch_deps = [doc.to_array([MORPH]) for doc in [nlp(text) for text in texts]]
    assert_equal(batch_deps_1, batch_deps_2)
    assert_equal(batch_deps_1, no_batch_deps)

    # Test without POS
    nlp.remove_pipe("morphologizer")
    nlp.add_pipe("morphologizer")
    for example in train_examples:
        for token in example.reference:
            token.pos_ = ""
    optimizer = nlp.initialize(get_examples=lambda: train_examples)
    for i in range(50):
        losses = {}
        nlp.update(train_examples, sgd=optimizer, losses=losses)
    assert losses["morphologizer"] < 0.00001

    # Test the trained model
    test_text = "I like blue ham"
    doc = nlp(test_text)
    gold_morphs = ["Feat=N", "Feat=V", "", ""]
    gold_pos_tags = ["", "", "", ""]
    assert [str(t.morph) for t in doc] == gold_morphs
    assert [t.pos_ for t in doc] == gold_pos_tags

    # Test overwrite+extend settings
    # (note that "" is unset, "_" is set and empty)
    morphs = ["Feat=V", "Feat=N", "_"]
    doc = Doc(nlp.vocab, words=["blue", "ham", "like"], morphs=morphs)
    orig_morphs = [str(t.morph) for t in doc]
    orig_pos_tags = [t.pos_ for t in doc]
    morphologizer = nlp.get_pipe("morphologizer")

    # don't overwrite or extend
    morphologizer.cfg["overwrite"] = False
    doc = morphologizer(doc)
    assert [str(t.morph) for t in doc] == orig_morphs
    assert [t.pos_ for t in doc] == orig_pos_tags

    # overwrite and extend
    morphologizer.cfg["overwrite"] = True
    morphologizer.cfg["extend"] = True
    doc = Doc(nlp.vocab, words=["I", "like"], morphs=["Feat=A|That=A|This=A", ""])
    doc = morphologizer(doc)
    assert [str(t.morph) for t in doc] == ["Feat=N|That=A|This=A", "Feat=V"]

    # extend without overwriting
    morphologizer.cfg["overwrite"] = False
    morphologizer.cfg["extend"] = True
    doc = Doc(nlp.vocab, words=["I", "like"], morphs=["Feat=A|That=A|This=A", "That=B"])
    doc = morphologizer(doc)
    assert [str(t.morph) for t in doc] == ["Feat=A|That=A|This=A", "Feat=V|That=B"]

    # overwrite without extending
    morphologizer.cfg["overwrite"] = True
    morphologizer.cfg["extend"] = False
    doc = Doc(nlp.vocab, words=["I", "like"], morphs=["Feat=A|That=A|This=A", ""])
    doc = morphologizer(doc)
    assert [str(t.morph) for t in doc] == ["Feat=N", "Feat=V"]

    # Test with unset morph and partial POS
    nlp.remove_pipe("morphologizer")
    nlp.add_pipe("morphologizer")
    for example in train_examples:
        for token in example.reference:
            if token.text == "ham":
                token.pos_ = "NOUN"
            else:
                token.pos_ = ""
            token.set_morph(None)
    optimizer = nlp.initialize(get_examples=lambda: train_examples)
    assert nlp.get_pipe("morphologizer").labels is not None
    for i in range(50):
        losses = {}
        nlp.update(train_examples, sgd=optimizer, losses=losses)
    assert losses["morphologizer"] < 0.00001

    # Test the trained model
    test_text = "I like blue ham"
    doc = nlp(test_text)
    gold_morphs = ["", "", "", ""]
    gold_pos_tags = ["NOUN", "NOUN", "NOUN", "NOUN"]
    assert [str(t.morph) for t in doc] == gold_morphs
    assert [t.pos_ for t in doc] == gold_pos_tags


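The four `overwrite`/`extend` combinations exercised in test_overfitting_IO can be summarized with a small stand-alone sketch. It models morphological features as plain dicts and uses a hypothetical `merge_morph` helper. It is an illustration of the documented behaviour under those assumptions, not the morphologizer's actual `set_annotations` code.

```python
from typing import Dict, Optional

Feats = Dict[str, str]


def merge_morph(
    existing: Optional[Feats], predicted: Feats, overwrite: bool, extend: bool
) -> Feats:
    """Combine existing and predicted features (hypothetical helper).

    `existing` is None when the annotation is unset; in that case the
    prediction is always used, whatever the settings are.
    """
    if existing is None:
        return dict(predicted)
    if extend:
        # Keep the union of features; `overwrite` decides who wins on clashes.
        base, winner = (existing, predicted) if overwrite else (predicted, existing)
        merged = dict(base)
        merged.update(winner)
        return merged
    # Without `extend` one side replaces the other completely.
    return dict(predicted) if overwrite else dict(existing)


def fmt(feats: Feats) -> str:
    """Render features in sorted `Key=Value|Key=Value` form."""
    return "|".join(f"{k}={v}" for k, v in sorted(feats.items()))


old = {"Feat": "A", "That": "A", "This": "A"}
print(fmt(merge_morph(old, {"Feat": "N"}, overwrite=True, extend=True)))
print(fmt(merge_morph(old, {"Feat": "N"}, overwrite=False, extend=True)))
print(fmt(merge_morph(old, {"Feat": "N"}, overwrite=True, extend=False)))
print(fmt(merge_morph(old, {"Feat": "N"}, overwrite=False, extend=False)))
```

The inputs mirror the values used in the test, so the printed strings can be compared directly with its assertions.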
def test_save_activations():
    nlp = English()
    morphologizer = cast(TrainablePipe, nlp.add_pipe("morphologizer"))
    train_examples = []
    for inst in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(inst[0]), inst[1]))
    nlp.initialize(get_examples=lambda: train_examples)

    doc = nlp("This is a test.")
    assert "morphologizer" not in doc.activations

    morphologizer.save_activations = True
    doc = nlp("This is a test.")
    assert "morphologizer" in doc.activations
    assert set(doc.activations["morphologizer"].keys()) == {
        "label_ids",
        "probabilities",
    }
    assert doc.activations["morphologizer"]["probabilities"].shape == (5, 6)
    assert doc.activations["morphologizer"]["label_ids"].shape == (5,)
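The two shapes checked in test_save_activations, (5, 6) for "probabilities" and (5,) for "label_ids", fit a layout with one row of label probabilities per token and one chosen label index per token. The sketch below assumes that the label IDs are the per-row argmax of the probabilities, which the test does not itself verify. The probability values and the `argmax_rows` helper are made up for illustration.

```python
from typing import List


def argmax_rows(probabilities: List[List[float]]) -> List[int]:
    """Return, for each row, the index of its largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


# Hypothetical probabilities: 5 tokens x 6 labels, mirroring the (5, 6) shape.
probabilities = [
    [0.50, 0.10, 0.10, 0.10, 0.10, 0.10],
    [0.10, 0.50, 0.10, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.50, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.10, 0.10, 0.50],
    [0.10, 0.10, 0.50, 0.10, 0.10, 0.10],
]

label_ids = argmax_rows(probabilities)
shape = (len(probabilities), len(probabilities[0]))
print(shape, label_ids)
```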