spaCy/spacy/tests/vocab_vectors/test_vectors.py
Sofie Van Landeghem 569cc98982
Update spaCy for thinc 8.0.0 (#4920)
* Add load_from_config function

* Add train_from_config script

* Merge configs and expose via spacy.config

* Fix script

* Suggest create_evaluation_callback

* Hard-code for NER

* Fix errors

* Register command

* Add TODO

* Update train-from-config todos

* Fix imports

* Allow delayed setting of parser model nr_class

* Get train-from-config working

* Tidy up and fix scores and printing

* Hide traceback if cancelled

* Fix weighted score formatting

* Fix score formatting

* Make output_path optional

* Add Tok2Vec component

* Tidy up and add tok2vec_tensors

* Add option to copy docs in nlp.update

* Copy docs in nlp.update

* Adjust nlp.update() for set_annotations

* Don't shuffle pipes in nlp.update, decruft

* Support set_annotations arg in component update

* Support set_annotations in parser update

* Add get_gradients method

* Add get_gradients to parser

* Update errors.py

* Fix problems caused by merge

* Add _link_components method in nlp

* Add concept of 'listeners' and ControlledModel

* Support optional attributes arg in ControlledModel

* Try having tok2vec component in pipeline

* Fix tok2vec component

* Fix config

* Fix tok2vec

* Update for Example

* Update for Example

* Update config

* Add eg2doc util

* Update and add schemas/types

* Update schemas

* Fix nlp.update

* Fix tagger

* Remove hacks from train-from-config

* Remove hard-coded config str

* Calculate loss in tok2vec component

* Tidy up and use function signatures instead of models

* Support union types for registry models

* Minor cleaning in Language.update

* Make ControlledModel specifically Tok2VecListener

* Fix train_from_config

* Fix tok2vec

* Tidy up

* Add function for bilstm tok2vec

* Fix type

* Fix syntax

* Fix pytorch optimizer

* Add example configs

* Update for thinc describe changes

* Update for Thinc changes

* Update for dropout/sgd changes

* Update for dropout/sgd changes

* Unhack gradient update

* Work on refactoring _ml

* Remove _ml.py module

* WIP upgrade cli scripts for thinc

* Move some _ml stuff to util

* Import link_vectors from util

* Update train_from_config

* Import from util

* Import from util

* Temporarily add ml.component_models module

* Move ml methods

* Move typedefs

* Update load vectors

* Update gitignore

* Move imports

* Add PrecomputableAffine

* Fix imports

* Fix imports

* Fix imports

* Fix missing imports

* Update CLI scripts

* Update spacy.language

* Add stubs for building the models

* Update model definition

* Update create_default_optimizer

* Fix import

* Fix comment

* Update imports in tests

* Update imports in spacy.cli

* Fix import

* fix obsolete thinc imports

* update srsly pin

* from thinc to ml_datasets for example data such as imdb

* update ml_datasets pin

* using STATE.vectors

* small fix

* fix Sentencizer.pipe

* black formatting

* rename Affine to Linear as in thinc

* set validate explicitely to True

* rename with_square_sequences to with_list2padded

* rename with_flatten to with_list2array

* chaining layernorm

* small fixes

* revert Optimizer import

* build_nel_encoder with new thinc style

* fixes using model's get and set methods

* Tok2Vec in component models, various fixes

* fix up legacy tok2vec code

* add model initialize calls

* add in build_tagger_model

* small fixes

* setting model dims

* fixes for ParserModel

* various small fixes

* initialize thinc Models

* fixes

* consistent naming of window_size

* fixes, removing set_dropout

* work around Iterable issue

* remove legacy tok2vec

* util fix

* fix forward function of tok2vec listener

* more fixes

* trying to fix PrecomputableAffine (not succesful yet)

* alloc instead of allocate

* add morphologizer

* rename residual

* rename fixes

* Fix predict function

* Update parser and parser model

* fixing few more tests

* Fix precomputable affine

* Update component model

* Update parser model

* Move backprop padding to own function, for test

* Update test

* Fix p. affine

* Update NEL

* build_bow_text_classifier and extract_ngrams

* Fix parser init

* Fix test add label

* add build_simple_cnn_text_classifier

* Fix parser init

* Set gpu off by default in example

* Fix tok2vec listener

* Fix parser model

* Small fixes

* small fix for PyTorchLSTM parameters

* revert my_compounding hack (iterable fixed now)

* fix biLSTM

* Fix uniqued

* PyTorchRNNWrapper fix

* small fixes

* use helper function to calculate cosine loss

* small fixes for build_simple_cnn_text_classifier

* putting dropout default at 0.0 to ensure the layer gets built

* using thinc util's set_dropout_rate

* moving layer normalization inside of maxout definition to optimize dropout

* temp debugging in NEL

* fixed NEL model by using init defaults !

* fixing after set_dropout_rate refactor

* proper fix

* fix test_update_doc after refactoring optimizers in thinc

* Add CharacterEmbed layer

* Construct tagger Model

* Add missing import

* Remove unused stuff

* Work on textcat

* fix test (again :)) after optimizer refactor

* fixes to allow reading Tagger from_disk without overwriting dimensions

* don't build the tok2vec prematuraly

* fix CharachterEmbed init

* CharacterEmbed fixes

* Fix CharacterEmbed architecture

* fix imports

* renames from latest thinc update

* one more rename

* add initialize calls where appropriate

* fix parser initialization

* Update Thinc version

* Fix errors, auto-format and tidy up imports

* Fix validation

* fix if bias is cupy array

* revert for now

* ensure it's a numpy array before running bp in ParserStepModel

* no reason to call require_gpu twice

* use CupyOps.to_numpy instead of cupy directly

* fix initialize of ParserModel

* remove unnecessary import

* fixes for CosineDistance

* fix device renaming

* use refactored loss functions (Thinc PR 251)

* overfitting test for tagger

* experimental settings for the tagger: avoid zero-init and subword normalization

* clean up tagger overfitting test

* use previous default value for nP

* remove toy config

* bringing layernorm back (had a bug - fixed in thinc)

* revert setting nP explicitly

* remove setting default in constructor

* restore values as they used to be

* add overfitting test for NER

* add overfitting test for dep parser

* add overfitting test for textcat

* fixing init for linear (previously affine)

* larger eps window for textcat

* ensure doc is not None

* Require newer thinc

* Make float check vaguer

* Slop the textcat overfit test more

* Fix textcat test

* Fix exclusive classes for textcat

* fix after renaming of alloc methods

* fixing renames and mandatory arguments (staticvectors WIP)

* upgrade to thinc==8.0.0.dev3

* refer to vocab.vectors directly instead of its name

* rename alpha to learn_rate

* adding hashembed and staticvectors dropout

* upgrade to thinc 8.0.0.dev4

* add name back to avoid warning W020

* thinc dev4

* update srsly

* using thinc 8.0.0a0 !

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2020-01-29 17:06:46 +01:00

314 lines
9.6 KiB
Python

import pytest
import numpy
from numpy.testing import assert_allclose
from spacy.vocab import Vocab
from spacy.vectors import Vectors
from spacy.tokenizer import Tokenizer
from spacy.strings import hash_string
from spacy.tokens import Doc
from ..util import add_vecs_to_vocab, get_cosine
@pytest.fixture
def strings():
return ["apple", "orange"]
@pytest.fixture
def vectors():
return [
("apple", [1, 2, 3]),
("orange", [-1, -2, -3]),
("and", [-1, -1, -1]),
("juice", [5, 5, 10]),
("pie", [7, 6.3, 8.9]),
]
@pytest.fixture
def ngrams_vectors():
return [
("apple", [1, 2, 3]),
("app", [-0.1, -0.2, -0.3]),
("ppl", [-0.2, -0.3, -0.4]),
("pl", [0.7, 0.8, 0.9]),
]
@pytest.fixture()
def ngrams_vocab(en_vocab, ngrams_vectors):
add_vecs_to_vocab(en_vocab, ngrams_vectors)
return en_vocab
@pytest.fixture
def data():
return numpy.asarray([[0.0, 1.0, 2.0], [3.0, -2.0, 4.0]], dtype="f")
@pytest.fixture
def most_similar_vectors_data():
return numpy.asarray(
[[0.0, 1.0, 2.0], [1.0, -2.0, 4.0], [1.0, 1.0, -1.0], [2.0, 3.0, 1.0]],
dtype="f",
)
@pytest.fixture
def resize_data():
return numpy.asarray([[0.0, 1.0], [2.0, 3.0]], dtype="f")
@pytest.fixture()
def vocab(en_vocab, vectors):
add_vecs_to_vocab(en_vocab, vectors)
return en_vocab
@pytest.fixture()
def tokenizer_v(vocab):
return Tokenizer(vocab, {}, None, None, None)
def test_init_vectors_with_resize_shape(strings, resize_data):
v = Vectors(shape=(len(strings), 3))
v.resize(shape=resize_data.shape)
assert v.shape == resize_data.shape
assert v.shape != (len(strings), 3)
def test_init_vectors_with_resize_data(data, resize_data):
v = Vectors(data=data)
v.resize(shape=resize_data.shape)
assert v.shape == resize_data.shape
assert v.shape != data.shape
def test_get_vector_resize(strings, data, resize_data):
v = Vectors(data=data)
v.resize(shape=resize_data.shape)
strings = [hash_string(s) for s in strings]
for i, string in enumerate(strings):
v.add(string, row=i)
assert list(v[strings[0]]) == list(resize_data[0])
assert list(v[strings[0]]) != list(resize_data[1])
assert list(v[strings[1]]) != list(resize_data[0])
assert list(v[strings[1]]) == list(resize_data[1])
def test_init_vectors_with_data(strings, data):
v = Vectors(data=data)
assert v.shape == data.shape
def test_init_vectors_with_shape(strings):
v = Vectors(shape=(len(strings), 3))
assert v.shape == (len(strings), 3)
def test_get_vector(strings, data):
v = Vectors(data=data)
strings = [hash_string(s) for s in strings]
for i, string in enumerate(strings):
v.add(string, row=i)
assert list(v[strings[0]]) == list(data[0])
assert list(v[strings[0]]) != list(data[1])
assert list(v[strings[1]]) != list(data[0])
def test_set_vector(strings, data):
orig = data.copy()
v = Vectors(data=data)
strings = [hash_string(s) for s in strings]
for i, string in enumerate(strings):
v.add(string, row=i)
assert list(v[strings[0]]) == list(orig[0])
assert list(v[strings[0]]) != list(orig[1])
v[strings[0]] = data[1]
assert list(v[strings[0]]) == list(orig[1])
assert list(v[strings[0]]) != list(orig[0])
def test_vectors_most_similar(most_similar_vectors_data):
v = Vectors(data=most_similar_vectors_data)
_, best_rows, _ = v.most_similar(v.data, batch_size=2, n=2, sort=True)
assert all(row[0] == i for i, row in enumerate(best_rows))
def test_vectors_most_similar_identical():
"""Test that most similar identical vectors are assigned a score of 1.0."""
data = numpy.asarray([[4, 2, 2, 2], [4, 2, 2, 2], [1, 1, 1, 1]], dtype="f")
v = Vectors(data=data, keys=["A", "B", "C"])
keys, _, scores = v.most_similar(numpy.asarray([[4, 2, 2, 2]], dtype="f"))
assert scores[0][0] == 1.0 # not 1.0000002
data = numpy.asarray([[1, 2, 3], [1, 2, 3], [1, 1, 1]], dtype="f")
v = Vectors(data=data, keys=["A", "B", "C"])
keys, _, scores = v.most_similar(numpy.asarray([[1, 2, 3]], dtype="f"))
assert scores[0][0] == 1.0 # not 0.9999999
@pytest.mark.parametrize("text", ["apple and orange"])
def test_vectors_token_vector(tokenizer_v, vectors, text):
doc = tokenizer_v(text)
assert vectors[0] == (doc[0].text, list(doc[0].vector))
assert vectors[1] == (doc[2].text, list(doc[2].vector))
@pytest.mark.parametrize("text", ["apple"])
def test_vectors__ngrams_word(ngrams_vocab, ngrams_vectors, text):
assert list(ngrams_vocab.get_vector(text)) == list(ngrams_vectors[0][1])
@pytest.mark.parametrize("text", ["applpie"])
def test_vectors__ngrams_subword(ngrams_vocab, ngrams_vectors, text):
truth = list(ngrams_vocab.get_vector(text, 1, 6))
test = list(
[
(
ngrams_vectors[1][1][i]
+ ngrams_vectors[2][1][i]
+ ngrams_vectors[3][1][i]
)
/ 3
for i in range(len(ngrams_vectors[1][1]))
]
)
eps = [abs(truth[i] - test[i]) for i in range(len(truth))]
for i in eps:
assert i < 1e-6
@pytest.mark.parametrize("text", ["apple", "orange"])
def test_vectors_lexeme_vector(vocab, text):
lex = vocab[text]
assert list(lex.vector)
assert lex.vector_norm
@pytest.mark.parametrize("text", [["apple", "and", "orange"]])
def test_vectors_doc_vector(vocab, text):
doc = Doc(vocab, words=text)
assert list(doc.vector)
assert doc.vector_norm
@pytest.mark.parametrize("text", [["apple", "and", "orange"]])
def test_vectors_span_vector(vocab, text):
span = Doc(vocab, words=text)[0:2]
assert list(span.vector)
assert span.vector_norm
@pytest.mark.parametrize("text", ["apple orange"])
def test_vectors_token_token_similarity(tokenizer_v, text):
doc = tokenizer_v(text)
assert doc[0].similarity(doc[1]) == doc[1].similarity(doc[0])
assert -1.0 < doc[0].similarity(doc[1]) < 1.0
@pytest.mark.parametrize("text1,text2", [("apple", "orange")])
def test_vectors_token_lexeme_similarity(tokenizer_v, vocab, text1, text2):
token = tokenizer_v(text1)
lex = vocab[text2]
assert token.similarity(lex) == lex.similarity(token)
assert -1.0 < token.similarity(lex) < 1.0
@pytest.mark.parametrize("text", [["apple", "orange", "juice"]])
def test_vectors_token_span_similarity(vocab, text):
doc = Doc(vocab, words=text)
assert doc[0].similarity(doc[1:3]) == doc[1:3].similarity(doc[0])
assert -1.0 < doc[0].similarity(doc[1:3]) < 1.0
@pytest.mark.parametrize("text", [["apple", "orange", "juice"]])
def test_vectors_token_doc_similarity(vocab, text):
doc = Doc(vocab, words=text)
assert doc[0].similarity(doc) == doc.similarity(doc[0])
assert -1.0 < doc[0].similarity(doc) < 1.0
@pytest.mark.parametrize("text", [["apple", "orange", "juice"]])
def test_vectors_lexeme_span_similarity(vocab, text):
doc = Doc(vocab, words=text)
lex = vocab[text[0]]
assert lex.similarity(doc[1:3]) == doc[1:3].similarity(lex)
assert -1.0 < doc.similarity(doc[1:3]) < 1.0
@pytest.mark.parametrize("text1,text2", [("apple", "orange")])
def test_vectors_lexeme_lexeme_similarity(vocab, text1, text2):
lex1 = vocab[text1]
lex2 = vocab[text2]
assert lex1.similarity(lex2) == lex2.similarity(lex1)
assert -1.0 < lex1.similarity(lex2) < 1.0
@pytest.mark.parametrize("text", [["apple", "orange", "juice"]])
def test_vectors_lexeme_doc_similarity(vocab, text):
doc = Doc(vocab, words=text)
lex = vocab[text[0]]
assert lex.similarity(doc) == doc.similarity(lex)
assert -1.0 < lex.similarity(doc) < 1.0
@pytest.mark.parametrize("text", [["apple", "orange", "juice"]])
def test_vectors_span_span_similarity(vocab, text):
doc = Doc(vocab, words=text)
with pytest.warns(UserWarning):
assert doc[0:2].similarity(doc[1:3]) == doc[1:3].similarity(doc[0:2])
assert -1.0 < doc[0:2].similarity(doc[1:3]) < 1.0
@pytest.mark.parametrize("text", [["apple", "orange", "juice"]])
def test_vectors_span_doc_similarity(vocab, text):
doc = Doc(vocab, words=text)
with pytest.warns(UserWarning):
assert doc[0:2].similarity(doc) == doc.similarity(doc[0:2])
assert -1.0 < doc[0:2].similarity(doc) < 1.0
@pytest.mark.parametrize(
"text1,text2", [(["apple", "and", "apple", "pie"], ["orange", "juice"])]
)
def test_vectors_doc_doc_similarity(vocab, text1, text2):
doc1 = Doc(vocab, words=text1)
doc2 = Doc(vocab, words=text2)
assert doc1.similarity(doc2) == doc2.similarity(doc1)
assert -1.0 < doc1.similarity(doc2) < 1.0
def test_vocab_add_vector():
vocab = Vocab(vectors_name="test_vocab_add_vector")
data = numpy.ndarray((5, 3), dtype="f")
data[0] = 1.0
data[1] = 2.0
vocab.set_vector("cat", data[0])
vocab.set_vector("dog", data[1])
cat = vocab["cat"]
assert list(cat.vector) == [1.0, 1.0, 1.0]
dog = vocab["dog"]
assert list(dog.vector) == [2.0, 2.0, 2.0]
def test_vocab_prune_vectors():
vocab = Vocab(vectors_name="test_vocab_prune_vectors")
_ = vocab["cat"] # noqa: F841
_ = vocab["dog"] # noqa: F841
_ = vocab["kitten"] # noqa: F841
data = numpy.ndarray((5, 3), dtype="f")
data[0] = [1.0, 1.2, 1.1]
data[1] = [0.3, 1.3, 1.0]
data[2] = [0.9, 1.22, 1.05]
vocab.set_vector("cat", data[0])
vocab.set_vector("dog", data[1])
vocab.set_vector("kitten", data[2])
remap = vocab.prune_vectors(2, batch_size=2)
assert list(remap.keys()) == ["kitten"]
neighbour, similarity = list(remap.values())[0]
assert neighbour == "cat", remap
assert_allclose(similarity, get_cosine(data[0], data[2]), atol=1e-4, rtol=1e-3)