Update Keras Example for (Parikh et al, 2016) implementation (#2803)

* bug fixes in keras example

* created contributor agreement

* baseline for Parikh model

* initial version of parikh 2016 implemented

* tested asymmetric models

* fixed grevious error in normalization

* use standard SNLI test file

* begin to rework parikh example

* initial version of running example

* start to document the new version

* start to document the new version

* Update Decompositional Attention.ipynb

* fixed calls to similarity

* updated the README

* import sys package duh

* simplified indexing on mapping word to IDs

* stupid python indent error

* added code from https://github.com/tensorflow/tensorflow/issues/3388 for tf bug workaround
This commit is contained in:
John Stewart 2018-10-01 04:28:45 -04:00 committed by Matthew Honnibal
parent 405a826436
commit 9faea3ff10
5 changed files with 1243 additions and 354 deletions

View File

@ -2,11 +2,7 @@
# A decomposable attention model for Natural Language Inference # A decomposable attention model for Natural Language Inference
**by Matthew Honnibal, [@honnibal](https://github.com/honnibal)** **by Matthew Honnibal, [@honnibal](https://github.com/honnibal)**
**Updated for spaCy 2.0+ and Keras 2.2.2+ by John Stewart, [@free-variation](https://github.com/free-variation)**
> ⚠️ **IMPORTANT NOTE:** This example is currently only compatible with spaCy
> v1.x. We're working on porting the example over to Keras v2.x and spaCy v2.x.
> See [#1445](https://github.com/explosion/spaCy/issues/1445) for details
> contributions welcome!
This directory contains an implementation of the entailment prediction model described This directory contains an implementation of the entailment prediction model described
by [Parikh et al. (2016)](https://arxiv.org/pdf/1606.01933.pdf). The model is notable by [Parikh et al. (2016)](https://arxiv.org/pdf/1606.01933.pdf). The model is notable
@ -21,19 +17,25 @@ hook is installed to customise the `.similarity()` method of spaCy's `Doc`
and `Span` objects: and `Span` objects:
```python ```python
def demo(model_dir): def demo(shape):
nlp = spacy.load('en', path=model_dir, nlp = spacy.load('en_vectors_web_lg')
create_pipeline=create_similarity_pipeline) nlp.add_pipe(KerasSimilarityShim.load(nlp.path / 'similarity', nlp, shape[0]))
doc1 = nlp(u'Worst fries ever! Greasy and horrible...')
doc2 = nlp(u'The milkshakes are good. The fries are bad.') doc1 = nlp(u'The king of France is bald.')
print(doc1.similarity(doc2)) doc2 = nlp(u'France has no king.')
sent1a, sent1b = doc1.sents
print(sent1a.similarity(sent1b)) print("Sentence 1:", doc1)
print(sent1a.similarity(doc2)) print("Sentence 2:", doc2)
print(sent1b.similarity(doc2))
entailment_type, confidence = doc1.similarity(doc2)
print("Entailment type:", entailment_type, "(Confidence:", confidence, ")")
``` ```
Which gives the output `Entailment type: contradiction (Confidence: 0.60604566)`, showing that
the system has definite opinions about Betrand Russell's [famous conundrum](https://users.drew.edu/jlenz/br-on-denoting.html)!
I'm working on a blog post to explain Parikh et al.'s model in more detail. I'm working on a blog post to explain Parikh et al.'s model in more detail.
A [notebook](https://github.com/free-variation/spaCy/blob/master/examples/notebooks/Decompositional%20Attention.ipynb) is available that briefly explains this implementation.
I think it is a very interesting example of the attention mechanism, which I think it is a very interesting example of the attention mechanism, which
I didn't understand very well before working through this paper. There are I didn't understand very well before working through this paper. There are
lots of ways to extend the model. lots of ways to extend the model.
@ -43,7 +45,7 @@ lots of ways to extend the model.
| File | Description | | File | Description |
| --- | --- | | --- | --- |
| `__main__.py` | The script that will be executed. Defines the CLI, the data reading, etc — all the boring stuff. | | `__main__.py` | The script that will be executed. Defines the CLI, the data reading, etc — all the boring stuff. |
| `spacy_hook.py` | Provides a class `SimilarityShim` that lets you use an arbitrary function to customize spaCy's `doc.similarity()` method. Instead of the default average-of-vectors algorithm, when you call `doc1.similarity(doc2)`, you'll get the result of `your_model(doc1, doc2)`. | | `spacy_hook.py` | Provides a class `KerasSimilarityShim` that lets you use an arbitrary function to customize spaCy's `doc.similarity()` method. Instead of the default average-of-vectors algorithm, when you call `doc1.similarity(doc2)`, you'll get the result of `your_model(doc1, doc2)`. |
| `keras_decomposable_attention.py` | Defines the neural network model. | | `keras_decomposable_attention.py` | Defines the neural network model. |
## Setting up ## Setting up
@ -52,17 +54,13 @@ First, install [Keras](https://keras.io/), [spaCy](https://spacy.io) and the spa
English models (about 1GB of data): English models (about 1GB of data):
```bash ```bash
pip install https://github.com/fchollet/keras/archive/1.2.2.zip pip install keras
pip install spacy pip install spacy
python -m spacy.en.download python -m spacy download en_vectors_web_lg
``` ```
⚠️ **Important:** In order for the example to run, you'll need to install Keras from You'll also want to get Keras working on your GPU, and you will need a backend, such as TensorFlow or Theano.
the 1.2.2 release (and not via `pip install keras`). For more info on this, see This will depend on your set up, so you're mostly on your own for this step. If you're using AWS, try the
[#727](https://github.com/explosion/spaCy/issues/727).
You'll also want to get Keras working on your GPU. This will depend on your
set up, so you're mostly on your own for this step. If you're using AWS, try the
[NVidia AMI](https://aws.amazon.com/marketplace/pp/B00FYCDDTE). It made things pretty easy. [NVidia AMI](https://aws.amazon.com/marketplace/pp/B00FYCDDTE). It made things pretty easy.
Once you've installed the dependencies, you can run a small preliminary test of Once you've installed the dependencies, you can run a small preliminary test of
@ -80,22 +78,35 @@ Finally, download the [Stanford Natural Language Inference corpus](http://nlp.st
## Running the example ## Running the example
You can run the `keras_parikh_entailment/` directory as a script, which executes the file You can run the `keras_parikh_entailment/` directory as a script, which executes the file
[`keras_parikh_entailment/__main__.py`](__main__.py). The first thing you'll want to do is train the model: [`keras_parikh_entailment/__main__.py`](__main__.py). If you run the script without arguments
the usage is shown. Running it with `-h` explains the command line arguments.
The first thing you'll want to do is train the model:
```bash ```bash
python keras_parikh_entailment/ train <train_directory> <dev_directory> python keras_parikh_entailment/ train -t <path to SNLI train JSON> -s <path to SNLI dev JSON>
``` ```
Training takes about 300 epochs for full accuracy, and I haven't rerun the full Training takes about 300 epochs for full accuracy, and I haven't rerun the full
experiment since refactoring things to publish this example — please let me experiment since refactoring things to publish this example — please let me
know if I've broken something. You should get to at least 85% on the development data. know if I've broken something. You should get to at least 85% on the development data even after 10-15 epochs.
The other two modes demonstrate run-time usage. I never like relying on the accuracy printed The other two modes demonstrate run-time usage. I never like relying on the accuracy printed
by `.fit()` methods. I never really feel confident until I've run a new process that loads by `.fit()` methods. I never really feel confident until I've run a new process that loads
the model and starts making predictions, without access to the gold labels. I've therefore the model and starts making predictions, without access to the gold labels. I've therefore
included an `evaluate` mode. Finally, there's also a little demo, which mostly exists to show included an `evaluate` mode.
```bash
python keras_parikh_entailment/ evaluate -s <path to SNLI train JSON>
```
Finally, there's also a little demo, which mostly exists to show
you how run-time usage will eventually look. you how run-time usage will eventually look.
```bash
python keras_parikh_entailment/ demo
```
## Getting updates ## Getting updates
We should have the blog post explaining the model ready before the end of the week. To get We should have the blog post explaining the model ready before the end of the week. To get

View File

@ -1,82 +1,104 @@
from __future__ import division, unicode_literals, print_function import numpy as np
import spacy
import plac
from pathlib import Path
import ujson as json import ujson as json
import numpy from keras.utils import to_categorical
from keras.utils.np_utils import to_categorical import plac
import sys
from spacy_hook import get_embeddings, get_word_ids
from spacy_hook import create_similarity_pipeline
from keras_decomposable_attention import build_model from keras_decomposable_attention import build_model
from spacy_hook import get_embeddings, KerasSimilarityShim
try: try:
import cPickle as pickle import cPickle as pickle
except ImportError: except ImportError:
import pickle import pickle
import spacy
# workaround for keras/tensorflow bug
# see https://github.com/tensorflow/tensorflow/issues/3388
import os
import importlib
from keras import backend as K
def set_keras_backend(backend):
if K.backend() != backend:
os.environ['KERAS_BACKEND'] = backend
importlib.reload(K)
assert K.backend() == backend
if backend == "tensorflow":
K.get_session().close()
cfg = K.tf.ConfigProto()
cfg.gpu_options.allow_growth = True
K.set_session(K.tf.Session(config=cfg))
K.clear_session()
set_keras_backend("tensorflow")
def train(train_loc, dev_loc, shape, settings): def train(train_loc, dev_loc, shape, settings):
train_texts1, train_texts2, train_labels = read_snli(train_loc) train_texts1, train_texts2, train_labels = read_snli(train_loc)
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc) dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
print("Loading spaCy") print("Loading spaCy")
nlp = spacy.load('en') nlp = spacy.load('en_vectors_web_lg')
assert nlp.path is not None assert nlp.path is not None
print("Processing texts...")
train_X = create_dataset(nlp, train_texts1, train_texts2, 100, shape[0])
dev_X = create_dataset(nlp, dev_texts1, dev_texts2, 100, shape[0])
print("Compiling network") print("Compiling network")
model = build_model(get_embeddings(nlp.vocab), shape, settings) model = build_model(get_embeddings(nlp.vocab), shape, settings)
print("Processing texts...")
Xs = []
for texts in (train_texts1, train_texts2, dev_texts1, dev_texts2):
Xs.append(get_word_ids(list(nlp.pipe(texts, n_threads=20, batch_size=20000)),
max_length=shape[0],
rnn_encode=settings['gru_encode'],
tree_truncate=settings['tree_truncate']))
train_X1, train_X2, dev_X1, dev_X2 = Xs
print(settings) print(settings)
model.fit( model.fit(
[train_X1, train_X2], train_X,
train_labels, train_labels,
validation_data=([dev_X1, dev_X2], dev_labels), validation_data = (dev_X, dev_labels),
nb_epoch=settings['nr_epoch'], epochs = settings['nr_epoch'],
batch_size=settings['batch_size']) batch_size = settings['batch_size'])
if not (nlp.path / 'similarity').exists(): if not (nlp.path / 'similarity').exists():
(nlp.path / 'similarity').mkdir() (nlp.path / 'similarity').mkdir()
print("Saving to", nlp.path / 'similarity') print("Saving to", nlp.path / 'similarity')
weights = model.get_weights() weights = model.get_weights()
# remove the embedding matrix. We can reconstruct it.
del weights[1]
with (nlp.path / 'similarity' / 'model').open('wb') as file_: with (nlp.path / 'similarity' / 'model').open('wb') as file_:
pickle.dump(weights[1:], file_) pickle.dump(weights, file_)
with (nlp.path / 'similarity' / 'config.json').open('wb') as file_: with (nlp.path / 'similarity' / 'config.json').open('w') as file_:
file_.write(model.to_json()) file_.write(model.to_json())
def evaluate(dev_loc): def evaluate(dev_loc, shape):
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc) dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
nlp = spacy.load('en', nlp = spacy.load('en_vectors_web_lg')
create_pipeline=create_similarity_pipeline) nlp.add_pipe(KerasSimilarityShim.load(nlp.path / 'similarity', nlp, shape[0]))
total = 0. total = 0.
correct = 0. correct = 0.
for text1, text2, label in zip(dev_texts1, dev_texts2, dev_labels): for text1, text2, label in zip(dev_texts1, dev_texts2, dev_labels):
doc1 = nlp(text1) doc1 = nlp(text1)
doc2 = nlp(text2) doc2 = nlp(text2)
sim = doc1.similarity(doc2) sim, _ = doc1.similarity(doc2)
if sim.argmax() == label.argmax(): if sim == KerasSimilarityShim.entailment_types[label.argmax()]:
correct += 1 correct += 1
total += 1 total += 1
return correct, total return correct, total
def demo(): def demo(shape):
nlp = spacy.load('en', nlp = spacy.load('en_vectors_web_lg')
create_pipeline=create_similarity_pipeline) nlp.add_pipe(KerasSimilarityShim.load(nlp.path / 'similarity', nlp, shape[0]))
doc1 = nlp(u'What were the best crime fiction books in 2016?')
doc2 = nlp( doc1 = nlp(u'The king of France is bald.')
u'What should I read that was published last year? I like crime stories.') doc2 = nlp(u'France has no king.')
print(doc1)
print(doc2) print("Sentence 1:", doc1)
print("Similarity", doc1.similarity(doc2)) print("Sentence 2:", doc2)
entailment_type, confidence = doc1.similarity(doc2)
print("Entailment type:", entailment_type, "(Confidence:", confidence, ")")
LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2} LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
@ -84,56 +106,92 @@ def read_snli(path):
texts1 = [] texts1 = []
texts2 = [] texts2 = []
labels = [] labels = []
with path.open() as file_: with open(path, 'r') as file_:
for line in file_: for line in file_:
eg = json.loads(line) eg = json.loads(line)
label = eg['gold_label'] label = eg['gold_label']
if label == '-': if label == '-': # per Parikh, ignore - SNLI entries
continue continue
texts1.append(eg['sentence1']) texts1.append(eg['sentence1'])
texts2.append(eg['sentence2']) texts2.append(eg['sentence2'])
labels.append(LABELS[label]) labels.append(LABELS[label])
return texts1, texts2, to_categorical(numpy.asarray(labels, dtype='int32')) return texts1, texts2, to_categorical(np.asarray(labels, dtype='int32'))
def create_dataset(nlp, texts, hypotheses, num_unk, max_length):
sents = texts + hypotheses
sents_as_ids = []
for sent in sents:
doc = nlp(sent)
word_ids = []
for i, token in enumerate(doc):
# skip odd spaces from tokenizer
if token.has_vector and token.vector_norm == 0:
continue
if i > max_length:
break
if token.has_vector:
word_ids.append(token.rank + num_unk + 1)
else:
# if we don't have a vector, pick an OOV entry
word_ids.append(token.rank % num_unk + 1)
# there must be a simpler way of generating padded arrays from lists...
word_id_vec = np.zeros((max_length), dtype='int')
clipped_len = min(max_length, len(word_ids))
word_id_vec[:clipped_len] = word_ids[:clipped_len]
sents_as_ids.append(word_id_vec)
return [np.array(sents_as_ids[:len(texts)]), np.array(sents_as_ids[len(texts):])]
@plac.annotations( @plac.annotations(
mode=("Mode to execute", "positional", None, str, ["train", "evaluate", "demo"]), mode=("Mode to execute", "positional", None, str, ["train", "evaluate", "demo"]),
train_loc=("Path to training data", "positional", None, Path), train_loc=("Path to training data", "option", "t", str),
dev_loc=("Path to development data", "positional", None, Path), dev_loc=("Path to development or test data", "option", "s", str),
max_length=("Length to truncate sentences", "option", "L", int), max_length=("Length to truncate sentences", "option", "L", int),
nr_hidden=("Number of hidden units", "option", "H", int), nr_hidden=("Number of hidden units", "option", "H", int),
dropout=("Dropout level", "option", "d", float), dropout=("Dropout level", "option", "d", float),
learn_rate=("Learning rate", "option", "e", float), learn_rate=("Learning rate", "option", "r", float),
batch_size=("Batch size for neural network training", "option", "b", int), batch_size=("Batch size for neural network training", "option", "b", int),
nr_epoch=("Number of training epochs", "option", "i", int), nr_epoch=("Number of training epochs", "option", "e", int),
tree_truncate=("Truncate sentences by tree distance", "flag", "T", bool), entail_dir=("Direction of entailment", "option", "D", str, ["both", "left", "right"])
gru_encode=("Encode sentences with bidirectional GRU", "flag", "E", bool),
) )
def main(mode, train_loc, dev_loc, def main(mode, train_loc, dev_loc,
tree_truncate=False, max_length = 50,
gru_encode=False, nr_hidden = 200,
max_length=100, dropout = 0.2,
nr_hidden=100, learn_rate = 0.001,
dropout=0.2, batch_size = 1024,
learn_rate=0.001, nr_epoch = 10,
batch_size=100, entail_dir="both"):
nr_epoch=5):
shape = (max_length, nr_hidden, 3) shape = (max_length, nr_hidden, 3)
settings = { settings = {
'lr': learn_rate, 'lr': learn_rate,
'dropout': dropout, 'dropout': dropout,
'batch_size': batch_size, 'batch_size': batch_size,
'nr_epoch': nr_epoch, 'nr_epoch': nr_epoch,
'tree_truncate': tree_truncate, 'entail_dir': entail_dir
'gru_encode': gru_encode
} }
if mode == 'train': if mode == 'train':
if train_loc == None or dev_loc == None:
print("Train mode requires paths to training and development data sets.")
sys.exit(1)
train(train_loc, dev_loc, shape, settings) train(train_loc, dev_loc, shape, settings)
elif mode == 'evaluate': elif mode == 'evaluate':
correct, total = evaluate(dev_loc) if dev_loc == None:
print("Evaluate mode requires paths to test data set.")
sys.exit(1)
correct, total = evaluate(dev_loc, shape)
print(correct, '/', total, correct / total) print(correct, '/', total, correct / total)
else: else:
demo() demo(shape)
if __name__ == '__main__': if __name__ == '__main__':
plac.call(main) plac.call(main)

View File

@ -1,259 +1,137 @@
# Semantic similarity with decomposable attention (using spaCy and Keras) # Semantic entailment/similarity with decomposable attention (using spaCy and Keras)
# Practical state-of-the-art text similarity with spaCy and Keras # Practical state-of-the-art textual entailment with spaCy and Keras
import numpy
from keras.layers import InputSpec, Layer, Input, Dense, merge
from keras.layers import Lambda, Activation, Dropout, Embedding, TimeDistributed
from keras.layers import Bidirectional, GRU, LSTM
from keras.layers.noise import GaussianNoise
from keras.layers.advanced_activations import ELU
import keras.backend as K
from keras.models import Sequential, Model, model_from_json
from keras.regularizers import l2
from keras.optimizers import Adam
from keras.layers.normalization import BatchNormalization
from keras.layers.pooling import GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.layers import Merge
import numpy as np
from keras import layers, Model, models, optimizers
from keras import backend as K
def build_model(vectors, shape, settings): def build_model(vectors, shape, settings):
'''Compile the model.'''
max_length, nr_hidden, nr_class = shape max_length, nr_hidden, nr_class = shape
# Declare inputs.
ids1 = Input(shape=(max_length,), dtype='int32', name='words1')
ids2 = Input(shape=(max_length,), dtype='int32', name='words2')
# Construct operations, which we'll chain together. input1 = layers.Input(shape=(max_length,), dtype='int32', name='words1')
embed = _StaticEmbedding(vectors, max_length, nr_hidden, dropout=0.2, nr_tune=5000) input2 = layers.Input(shape=(max_length,), dtype='int32', name='words2')
if settings['gru_encode']:
encode = _BiRNNEncoding(max_length, nr_hidden, dropout=settings['dropout']) # embeddings (projected)
attend = _Attention(max_length, nr_hidden, dropout=settings['dropout']) embed = create_embedding(vectors, max_length, nr_hidden)
align = _SoftAlignment(max_length, nr_hidden)
compare = _Comparison(max_length, nr_hidden, dropout=settings['dropout']) a = embed(input1)
entail = _Entailment(nr_hidden, nr_class, dropout=settings['dropout']) b = embed(input2)
# step 1: attend
F = create_feedforward(nr_hidden)
att_weights = layers.dot([F(a), F(b)], axes=-1)
G = create_feedforward(nr_hidden)
if settings['entail_dir'] == 'both':
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
alpha = layers.dot([norm_weights_a, a], axes=1)
beta = layers.dot([norm_weights_b, b], axes=1)
# Declare the model as a computational graph. # step 2: compare
sent1 = embed(ids1) # Shape: (i, n) comp1 = layers.concatenate([a, beta])
sent2 = embed(ids2) # Shape: (j, n) comp2 = layers.concatenate([b, alpha])
v1 = layers.TimeDistributed(G)(comp1)
v2 = layers.TimeDistributed(G)(comp2)
if settings['gru_encode']: # step 3: aggregate
sent1 = encode(sent1) v1_sum = layers.Lambda(sum_word)(v1)
sent2 = encode(sent2) v2_sum = layers.Lambda(sum_word)(v2)
concat = layers.concatenate([v1_sum, v2_sum])
attention = attend(sent1, sent2) # Shape: (i, j) elif settings['entail_dir'] == 'left':
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
alpha = layers.dot([norm_weights_a, a], axes=1)
comp2 = layers.concatenate([b, alpha])
v2 = layers.TimeDistributed(G)(comp2)
v2_sum = layers.Lambda(sum_word)(v2)
concat = v2_sum
align1 = align(sent2, attention) else:
align2 = align(sent1, attention, transpose=True) norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
beta = layers.dot([norm_weights_b, b], axes=1)
feats1 = compare(sent1, align1) comp1 = layers.concatenate([a, beta])
feats2 = compare(sent2, align2) v1 = layers.TimeDistributed(G)(comp1)
v1_sum = layers.Lambda(sum_word)(v1)
scores = entail(feats1, feats2) concat = v1_sum
# Now that we have the input/output, we can construct the Model object... H = create_feedforward(nr_hidden)
model = Model(input=[ids1, ids2], output=[scores]) out = H(concat)
out = layers.Dense(nr_class, activation='softmax')(out)
# ...Compile it...
model = Model([input1, input2], out)
model.compile( model.compile(
optimizer=Adam(lr=settings['lr']), optimizer=optimizers.Adam(lr=settings['lr']),
loss='categorical_crossentropy', loss='categorical_crossentropy',
metrics=['accuracy']) metrics=['accuracy'])
# ...And return it for training.
return model return model
class _StaticEmbedding(object): def create_embedding(vectors, max_length, projected_dim):
def __init__(self, vectors, max_length, nr_out, nr_tune=1000, dropout=0.0): return models.Sequential([
self.nr_out = nr_out layers.Embedding(
self.max_length = max_length vectors.shape[0],
self.embed = Embedding( vectors.shape[1],
vectors.shape[0], input_length=max_length,
vectors.shape[1], weights=[vectors],
input_length=max_length, trainable=False),
weights=[vectors],
name='embed', layers.TimeDistributed(
trainable=False) layers.Dense(projected_dim,
self.tune = Embedding( activation=None,
nr_tune, use_bias=False))
nr_out, ])
input_length=max_length,
weights=None,
name='tune',
trainable=True,
dropout=dropout)
self.mod_ids = Lambda(lambda sent: sent % (nr_tune-1)+1,
output_shape=(self.max_length,))
self.project = TimeDistributed( def create_feedforward(num_units=200, activation='relu', dropout_rate=0.2):
Dense( return models.Sequential([
nr_out, layers.Dense(num_units, activation=activation),
activation=None, layers.Dropout(dropout_rate),
bias=False, layers.Dense(num_units, activation=activation),
name='project')) layers.Dropout(dropout_rate)
])
def __call__(self, sentence):
def get_output_shape(shapes):
print(shapes)
return shapes[0]
mod_sent = self.mod_ids(sentence)
tuning = self.tune(mod_sent)
#tuning = merge([tuning, mod_sent],
# mode=lambda AB: AB[0] * (K.clip(K.cast(AB[1], 'float32'), 0, 1)),
# output_shape=(self.max_length, self.nr_out))
pretrained = self.project(self.embed(sentence))
vectors = merge([pretrained, tuning], mode='sum')
return vectors
class _BiRNNEncoding(object): def normalizer(axis):
def __init__(self, max_length, nr_out, dropout=0.0): def _normalize(att_weights):
self.model = Sequential() exp_weights = K.exp(att_weights)
self.model.add(Bidirectional(LSTM(nr_out, return_sequences=True, sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)
dropout_W=dropout, dropout_U=dropout), return exp_weights/sum_weights
input_shape=(max_length, nr_out))) return _normalize
self.model.add(TimeDistributed(Dense(nr_out, activation='relu', init='he_normal')))
self.model.add(TimeDistributed(Dropout(0.2)))
def __call__(self, sentence): def sum_word(x):
return self.model(sentence) return K.sum(x, axis=1)
class _Attention(object):
def __init__(self, max_length, nr_hidden, dropout=0.0, L2=0.0, activation='relu'):
self.max_length = max_length
self.model = Sequential()
self.model.add(Dropout(dropout, input_shape=(nr_hidden,)))
self.model.add(
Dense(nr_hidden, name='attend1',
init='he_normal', W_regularizer=l2(L2),
input_shape=(nr_hidden,), activation='relu'))
self.model.add(Dropout(dropout))
self.model.add(Dense(nr_hidden, name='attend2',
init='he_normal', W_regularizer=l2(L2), activation='relu'))
self.model = TimeDistributed(self.model)
def __call__(self, sent1, sent2):
def _outer(AB):
att_ji = K.batch_dot(AB[1], K.permute_dimensions(AB[0], (0, 2, 1)))
return K.permute_dimensions(att_ji,(0, 2, 1))
return merge(
[self.model(sent1), self.model(sent2)],
mode=_outer,
output_shape=(self.max_length, self.max_length))
class _SoftAlignment(object):
def __init__(self, max_length, nr_hidden):
self.max_length = max_length
self.nr_hidden = nr_hidden
def __call__(self, sentence, attention, transpose=False):
def _normalize_attention(attmat):
att = attmat[0]
mat = attmat[1]
if transpose:
att = K.permute_dimensions(att,(0, 2, 1))
# 3d softmax
e = K.exp(att - K.max(att, axis=-1, keepdims=True))
s = K.sum(e, axis=-1, keepdims=True)
sm_att = e / s
return K.batch_dot(sm_att, mat)
return merge([attention, sentence], mode=_normalize_attention,
output_shape=(self.max_length, self.nr_hidden)) # Shape: (i, n)
class _Comparison(object):
def __init__(self, words, nr_hidden, L2=0.0, dropout=0.0):
self.words = words
self.model = Sequential()
self.model.add(Dropout(dropout, input_shape=(nr_hidden*2,)))
self.model.add(Dense(nr_hidden, name='compare1',
init='he_normal', W_regularizer=l2(L2)))
self.model.add(Activation('relu'))
self.model.add(Dropout(dropout))
self.model.add(Dense(nr_hidden, name='compare2',
W_regularizer=l2(L2), init='he_normal'))
self.model.add(Activation('relu'))
self.model = TimeDistributed(self.model)
def __call__(self, sent, align, **kwargs):
result = self.model(merge([sent, align], mode='concat')) # Shape: (i, n)
avged = GlobalAveragePooling1D()(result, mask=self.words)
maxed = GlobalMaxPooling1D()(result, mask=self.words)
merged = merge([avged, maxed])
result = BatchNormalization()(merged)
return result
class _Entailment(object):
def __init__(self, nr_hidden, nr_out, dropout=0.0, L2=0.0):
self.model = Sequential()
self.model.add(Dropout(dropout, input_shape=(nr_hidden*2,)))
self.model.add(Dense(nr_hidden, name='entail1',
init='he_normal', W_regularizer=l2(L2)))
self.model.add(Activation('relu'))
self.model.add(Dropout(dropout))
self.model.add(Dense(nr_hidden, name='entail2',
init='he_normal', W_regularizer=l2(L2)))
self.model.add(Activation('relu'))
self.model.add(Dense(nr_out, name='entail_out', activation='softmax',
W_regularizer=l2(L2), init='zero'))
def __call__(self, feats1, feats2):
features = merge([feats1, feats2], mode='concat')
return self.model(features)
class _GlobalSumPooling1D(Layer):
'''Global sum pooling operation for temporal data.
# Input shape
3D tensor with shape: `(samples, steps, features)`.
# Output shape
2D tensor with shape: `(samples, features)`.
'''
def __init__(self, **kwargs):
super(_GlobalSumPooling1D, self).__init__(**kwargs)
self.input_spec = [InputSpec(ndim=3)]
def get_output_shape_for(self, input_shape):
return (input_shape[0], input_shape[2])
def call(self, x, mask=None):
if mask is not None:
return K.sum(x * K.clip(mask, 0, 1), axis=1)
else:
return K.sum(x, axis=1)
def test_build_model(): def test_build_model():
vectors = numpy.ndarray((100, 8), dtype='float32') vectors = np.ndarray((100, 8), dtype='float32')
shape = (10, 16, 3) shape = (10, 16, 3)
settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True} settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True, 'entail_dir':'both'}
model = build_model(vectors, shape, settings) model = build_model(vectors, shape, settings)
def test_fit_model(): def test_fit_model():
def _generate_X(nr_example, length, nr_vector): def _generate_X(nr_example, length, nr_vector):
X1 = numpy.ndarray((nr_example, length), dtype='int32') X1 = np.ndarray((nr_example, length), dtype='int32')
X1 *= X1 < nr_vector X1 *= X1 < nr_vector
X1 *= 0 <= X1 X1 *= 0 <= X1
X2 = numpy.ndarray((nr_example, length), dtype='int32') X2 = np.ndarray((nr_example, length), dtype='int32')
X2 *= X2 < nr_vector X2 *= X2 < nr_vector
X2 *= 0 <= X2 X2 *= 0 <= X2
return [X1, X2] return [X1, X2]
def _generate_Y(nr_example, nr_class): def _generate_Y(nr_example, nr_class):
ys = numpy.zeros((nr_example, nr_class), dtype='int32') ys = np.zeros((nr_example, nr_class), dtype='int32')
for i in range(nr_example): for i in range(nr_example):
ys[i, i % nr_class] = 1 ys[i, i % nr_class] = 1
return ys return ys
vectors = numpy.ndarray((100, 8), dtype='float32') vectors = np.ndarray((100, 8), dtype='float32')
shape = (10, 16, 3) shape = (10, 16, 3)
settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True} settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True, 'entail_dir':'both'}
model = build_model(vectors, shape, settings) model = build_model(vectors, shape, settings)
train_X = _generate_X(20, shape[0], vectors.shape[0]) train_X = _generate_X(20, shape[0], vectors.shape[0])
@ -261,8 +139,7 @@ def test_fit_model():
dev_X = _generate_X(15, shape[0], vectors.shape[0]) dev_X = _generate_X(15, shape[0], vectors.shape[0])
dev_Y = _generate_Y(15, shape[2]) dev_Y = _generate_Y(15, shape[2])
model.fit(train_X, train_Y, validation_data=(dev_X, dev_Y), nb_epoch=5, model.fit(train_X, train_Y, validation_data=(dev_X, dev_Y), epochs=5, batch_size=4)
batch_size=4)
__all__ = [build_model] __all__ = [build_model]

View File

@ -1,8 +1,5 @@
import numpy as np
from keras.models import model_from_json from keras.models import model_from_json
import numpy
import numpy.random
import json
from spacy.tokens.span import Span
try: try:
import cPickle as pickle import cPickle as pickle
@ -11,16 +8,23 @@ except ImportError:
class KerasSimilarityShim(object): class KerasSimilarityShim(object):
entailment_types = ["entailment", "contradiction", "neutral"]
@classmethod @classmethod
def load(cls, path, nlp, get_features=None, max_length=100): def load(cls, path, nlp, max_length=100, get_features=None):
if get_features is None: if get_features is None:
get_features = get_word_ids get_features = get_word_ids
with (path / 'config.json').open() as file_: with (path / 'config.json').open() as file_:
model = model_from_json(file_.read()) model = model_from_json(file_.read())
with (path / 'model').open('rb') as file_: with (path / 'model').open('rb') as file_:
weights = pickle.load(file_) weights = pickle.load(file_)
embeddings = get_embeddings(nlp.vocab) embeddings = get_embeddings(nlp.vocab)
model.set_weights([embeddings] + weights) weights.insert(1, embeddings)
model.set_weights(weights)
return cls(model, get_features=get_features, max_length=max_length) return cls(model, get_features=get_features, max_length=max_length)
def __init__(self, model, get_features=None, max_length=100): def __init__(self, model, get_features=None, max_length=100):
@ -32,58 +36,42 @@ class KerasSimilarityShim(object):
doc.user_hooks['similarity'] = self.predict doc.user_hooks['similarity'] = self.predict
doc.user_span_hooks['similarity'] = self.predict doc.user_span_hooks['similarity'] = self.predict
return doc
def predict(self, doc1, doc2): def predict(self, doc1, doc2):
x1 = self.get_features([doc1], max_length=self.max_length, tree_truncate=True) x1 = self.get_features([doc1], max_length=self.max_length)
x2 = self.get_features([doc2], max_length=self.max_length, tree_truncate=True) x2 = self.get_features([doc2], max_length=self.max_length)
scores = self.model.predict([x1, x2]) scores = self.model.predict([x1, x2])
return scores[0]
return self.entailment_types[scores.argmax()], scores.max()
def get_embeddings(vocab, nr_unk=100): def get_embeddings(vocab, nr_unk=100):
nr_vector = max(lex.rank for lex in vocab) + 1 # the extra +1 is for a zero vector representing sentence-final padding
vectors = numpy.zeros((nr_vector+nr_unk+2, vocab.vectors_length), dtype='float32') num_vectors = max(lex.rank for lex in vocab) + 2
# create random vectors for OOV tokens
oov = np.random.normal(size=(nr_unk, vocab.vectors_length))
oov = oov / oov.sum(axis=1, keepdims=True)
vectors = np.zeros((num_vectors + nr_unk, vocab.vectors_length), dtype='float32')
vectors[1:(nr_unk + 1), ] = oov
for lex in vocab: for lex in vocab:
if lex.has_vector: if lex.has_vector and lex.vector_norm > 0:
vectors[lex.rank+1] = lex.vector / lex.vector_norm vectors[nr_unk + lex.rank + 1] = lex.vector / lex.vector_norm
return vectors return vectors
def get_word_ids(docs, rnn_encode=False, tree_truncate=False, max_length=100, nr_unk=100): def get_word_ids(docs, max_length=100, nr_unk=100):
Xs = numpy.zeros((len(docs), max_length), dtype='int32') Xs = np.zeros((len(docs), max_length), dtype='int32')
for i, doc in enumerate(docs): for i, doc in enumerate(docs):
if tree_truncate: for j, token in enumerate(doc):
if isinstance(doc, Span): if j == max_length:
queue = [doc.root]
else:
queue = [sent.root for sent in doc.sents]
else:
queue = list(doc)
words = []
while len(words) <= max_length and queue:
word = queue.pop(0)
if rnn_encode or (not word.is_punct and not word.is_space):
words.append(word)
if tree_truncate:
queue.extend(list(word.lefts))
queue.extend(list(word.rights))
words.sort()
for j, token in enumerate(words):
if token.has_vector:
Xs[i, j] = token.rank+1
else:
Xs[i, j] = (token.shape % (nr_unk-1))+2
j += 1
if j >= max_length:
break break
else: if token.has_vector:
Xs[i, len(words)] = 1 Xs[i, j] = token.rank + nr_unk + 1
else:
Xs[i, j] = token.rank % nr_unk + 1
return Xs return Xs
def create_similarity_pipeline(nlp, max_length=100):
return [
nlp.tagger,
nlp.entity,
nlp.parser,
KerasSimilarityShim.load(nlp.path / 'similarity', nlp, max_length)
]

View File

@ -0,0 +1,955 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Natural language inference using spaCy and Keras"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook details an implementation of the natural language inference model presented in [(Parikh et al, 2016)](https://arxiv.org/abs/1606.01933). The model is notable for the small number of paramaters *and hyperparameters* it specifices, while still yielding good performance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Constructing the dataset"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import spacy\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We only need the GloVe vectors from spaCy, not a full NLP pipeline."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"nlp = spacy.load('en_vectors_web_lg')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Function to load the SNLI dataset. The categories are converted to one-shot representation. The function comes from an example in spaCy."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/jds/tensorflow-gpu/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
" from ._conv import register_converters as _register_converters\n",
"Using TensorFlow backend.\n"
]
}
],
"source": [
"import ujson as json\n",
"from keras.utils import to_categorical\n",
"\n",
"LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}\n",
"def read_snli(path):\n",
" texts1 = []\n",
" texts2 = []\n",
" labels = []\n",
" with open(path, 'r') as file_:\n",
" for line in file_:\n",
" eg = json.loads(line)\n",
" label = eg['gold_label']\n",
" if label == '-': # per Parikh, ignore - SNLI entries\n",
" continue\n",
" texts1.append(eg['sentence1'])\n",
" texts2.append(eg['sentence2'])\n",
" labels.append(LABELS[label])\n",
" return texts1, texts2, to_categorical(np.asarray(labels, dtype='int32'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because Keras can do the train/test split for us, we'll load *all* SNLI triples from one file."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"texts,hypotheses,labels = read_snli('snli/snli_1.0_train.jsonl')"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def create_dataset(nlp, texts, hypotheses, num_oov, max_length, norm_vectors = True):\n",
" sents = texts + hypotheses\n",
" \n",
" # the extra +1 is for a zero vector represting NULL for padding\n",
" num_vectors = max(lex.rank for lex in nlp.vocab) + 2 \n",
" \n",
" # create random vectors for OOV tokens\n",
" oov = np.random.normal(size=(num_oov, nlp.vocab.vectors_length))\n",
" oov = oov / oov.sum(axis=1, keepdims=True)\n",
" \n",
" vectors = np.zeros((num_vectors + num_oov, nlp.vocab.vectors_length), dtype='float32')\n",
" vectors[num_vectors:, ] = oov\n",
" for lex in nlp.vocab:\n",
" if lex.has_vector and lex.vector_norm > 0:\n",
" vectors[lex.rank + 1] = lex.vector / lex.vector_norm if norm_vectors == True else lex.vector\n",
" \n",
" sents_as_ids = []\n",
" for sent in sents:\n",
" doc = nlp(sent)\n",
" word_ids = []\n",
" \n",
" for i, token in enumerate(doc):\n",
" # skip odd spaces from tokenizer\n",
" if token.has_vector and token.vector_norm == 0:\n",
" continue\n",
" \n",
" if i > max_length:\n",
" break\n",
" \n",
" if token.has_vector:\n",
" word_ids.append(token.rank + 1)\n",
" else:\n",
" # if we don't have a vector, pick an OOV entry\n",
" word_ids.append(token.rank % num_oov + num_vectors) \n",
" \n",
" # there must be a simpler way of generating padded arrays from lists...\n",
" word_id_vec = np.zeros((max_length), dtype='int')\n",
" clipped_len = min(max_length, len(word_ids))\n",
" word_id_vec[:clipped_len] = word_ids[:clipped_len]\n",
" sents_as_ids.append(word_id_vec)\n",
" \n",
" \n",
" return vectors, np.array(sents_as_ids[:len(texts)]), np.array(sents_as_ids[len(texts):])"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"sem_vectors, text_vectors, hypothesis_vectors = create_dataset(nlp, texts, hypotheses, 100, 50, True)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"texts_test,hypotheses_test,labels_test = read_snli('snli/snli_1.0_test.jsonl')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"_, text_vectors_test, hypothesis_vectors_test = create_dataset(nlp, texts_test, hypotheses_test, 100, 50, True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use spaCy to tokenize the sentences and return, when available, a semantic vector for each token. \n",
"\n",
"OOV terms (tokens for which no semantic vector is available) are assigned to one of a set of randomly-generated OOV vectors, per (Parikh et al, 2016).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we will clip sentences to 50 words maximum."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"from keras import layers, Model, models\n",
"from keras import backend as K"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Building the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The embedding layer copies the 300-dimensional GloVe vectors into GPU memory. Per (Parikh et al, 2016), the vectors, which are not adapted during training, are projected down to lower-dimensional vectors using a trained projection matrix."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"def create_embedding(vectors, max_length, projected_dim):\n",
" return models.Sequential([\n",
" layers.Embedding(\n",
" vectors.shape[0],\n",
" vectors.shape[1],\n",
" input_length=max_length,\n",
" weights=[vectors],\n",
" trainable=False),\n",
" \n",
" layers.TimeDistributed(\n",
" layers.Dense(projected_dim,\n",
" activation=None,\n",
" use_bias=False))\n",
" ])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Parikh model makes use of three feedforward blocks that construct nonlinear combinations of their input. Each block contains two ReLU layers and two dropout layers."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"def create_feedforward(num_units=200, activation='relu', dropout_rate=0.2):\n",
" return models.Sequential([\n",
" layers.Dense(num_units, activation=activation),\n",
" layers.Dropout(dropout_rate),\n",
" layers.Dense(num_units, activation=activation),\n",
" layers.Dropout(dropout_rate)\n",
" ])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The basic idea of the (Parikh et al, 2016) model is to:\n",
"\n",
"1. *Align*: Construct an alignment of subphrases in the text and hypothesis using an attention-like mechanism, called \"decompositional\" because the layer is applied to each of the two sentences individually rather than to their product. The dot product of the nonlinear transformations of the inputs is then normalized vertically and horizontally to yield a pair of \"soft\" alignment structures, from text->hypothesis and hypothesis->text. Concretely, for each word in one sentence, a multinomial distribution is computed over the words of the other sentence, by learning a multinomial logistic with softmax target.\n",
"2. *Compare*: Each word is now compared to its aligned phrase using a function modeled as a two-layer feedforward ReLU network. The output is a high-dimensional representation of the strength of association between word and aligned phrase.\n",
"3. *Aggregate*: The comparison vectors are summed, separately, for the text and the hypothesis. The result is two vectors: one that describes the degree of association of the text to the hypothesis, and the second, of the hypothesis to the text.\n",
"4. Finally, these two vectors are processed by a dense layer followed by a softmax classifier, as usual.\n",
"\n",
"Note that because in entailment the truth conditions of the consequent must be a subset of those of the antecedent, it is not obvious that we need both vectors in step (3). Entailment is not symmetric. It may be enough to just use the hypothesis->text vector. We will explore this possibility later."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need a couple of little functions for Lambda layers to normalize and aggregate weights:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"def normalizer(axis):\n",
" def _normalize(att_weights):\n",
" exp_weights = K.exp(att_weights)\n",
" sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)\n",
" return exp_weights/sum_weights\n",
" return _normalize\n",
"\n",
"def sum_word(x):\n",
" return K.sum(x, axis=1)\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"def build_model(vectors, max_length, num_hidden, num_classes, projected_dim, entail_dir='both'):\n",
" input1 = layers.Input(shape=(max_length,), dtype='int32', name='words1')\n",
" input2 = layers.Input(shape=(max_length,), dtype='int32', name='words2')\n",
" \n",
" # embeddings (projected)\n",
" embed = create_embedding(vectors, max_length, projected_dim)\n",
" \n",
" a = embed(input1)\n",
" b = embed(input2)\n",
" \n",
" # step 1: attend\n",
" F = create_feedforward(num_hidden)\n",
" att_weights = layers.dot([F(a), F(b)], axes=-1)\n",
" \n",
" G = create_feedforward(num_hidden)\n",
" \n",
" if entail_dir == 'both':\n",
" norm_weights_a = layers.Lambda(normalizer(1))(att_weights)\n",
" norm_weights_b = layers.Lambda(normalizer(2))(att_weights)\n",
" alpha = layers.dot([norm_weights_a, a], axes=1)\n",
" beta = layers.dot([norm_weights_b, b], axes=1)\n",
"\n",
" # step 2: compare\n",
" comp1 = layers.concatenate([a, beta])\n",
" comp2 = layers.concatenate([b, alpha])\n",
" v1 = layers.TimeDistributed(G)(comp1)\n",
" v2 = layers.TimeDistributed(G)(comp2)\n",
"\n",
" # step 3: aggregate\n",
" v1_sum = layers.Lambda(sum_word)(v1)\n",
" v2_sum = layers.Lambda(sum_word)(v2)\n",
" concat = layers.concatenate([v1_sum, v2_sum])\n",
" elif entail_dir == 'left':\n",
" norm_weights_a = layers.Lambda(normalizer(1))(att_weights)\n",
" alpha = layers.dot([norm_weights_a, a], axes=1)\n",
" comp2 = layers.concatenate([b, alpha])\n",
" v2 = layers.TimeDistributed(G)(comp2)\n",
" v2_sum = layers.Lambda(sum_word)(v2)\n",
" concat = v2_sum\n",
" else:\n",
" norm_weights_b = layers.Lambda(normalizer(2))(att_weights)\n",
" beta = layers.dot([norm_weights_b, b], axes=1)\n",
" comp1 = layers.concatenate([a, beta])\n",
" v1 = layers.TimeDistributed(G)(comp1)\n",
" v1_sum = layers.Lambda(sum_word)(v1)\n",
" concat = v1_sum\n",
" \n",
" H = create_feedforward(num_hidden)\n",
" out = H(concat)\n",
" out = layers.Dense(num_classes, activation='softmax')(out)\n",
" \n",
" model = Model([input1, input2], out)\n",
" \n",
" model.compile(optimizer='adam',\n",
" loss='categorical_crossentropy',\n",
" metrics=['accuracy'])\n",
" return model\n",
" \n",
" \n",
" "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"__________________________________________________________________________________________________\n",
"Layer (type) Output Shape Param # Connected to \n",
"==================================================================================================\n",
"words1 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"words2 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"sequential_1 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
" words2[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_2 (Sequential) (None, 50, 200) 80400 sequential_1[1][0] \n",
" sequential_1[2][0] \n",
"__________________________________________________________________________________________________\n",
"dot_1 (Dot) (None, 50, 50) 0 sequential_2[1][0] \n",
" sequential_2[2][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_2 (Lambda) (None, 50, 50) 0 dot_1[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_1 (Lambda) (None, 50, 50) 0 dot_1[0][0] \n",
"__________________________________________________________________________________________________\n",
"dot_3 (Dot) (None, 50, 200) 0 lambda_2[0][0] \n",
" sequential_1[2][0] \n",
"__________________________________________________________________________________________________\n",
"dot_2 (Dot) (None, 50, 200) 0 lambda_1[0][0] \n",
" sequential_1[1][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_1 (Concatenate) (None, 50, 400) 0 sequential_1[1][0] \n",
" dot_3[0][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_2 (Concatenate) (None, 50, 400) 0 sequential_1[2][0] \n",
" dot_2[0][0] \n",
"__________________________________________________________________________________________________\n",
"time_distributed_2 (TimeDistrib (None, 50, 200) 120400 concatenate_1[0][0] \n",
"__________________________________________________________________________________________________\n",
"time_distributed_3 (TimeDistrib (None, 50, 200) 120400 concatenate_2[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_3 (Lambda) (None, 200) 0 time_distributed_2[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_4 (Lambda) (None, 200) 0 time_distributed_3[0][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_3 (Concatenate) (None, 400) 0 lambda_3[0][0] \n",
" lambda_4[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_4 (Sequential) (None, 200) 120400 concatenate_3[0][0] \n",
"__________________________________________________________________________________________________\n",
"dense_8 (Dense) (None, 3) 603 sequential_4[1][0] \n",
"==================================================================================================\n",
"Total params: 321,703,403\n",
"Trainable params: 381,803\n",
"Non-trainable params: 321,321,600\n",
"__________________________________________________________________________________________________\n"
]
}
],
"source": [
"K.clear_session()\n",
"m = build_model(sem_vectors, 50, 200, 3, 200)\n",
"m.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The number of trainable parameters, ~381k, is the number given by Parikh et al, so we're on the right track."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parikh et al use tiny batches of 4, training for 50MM batches, which amounts to around 500 epochs. Here we'll use large batches to better use the GPU, and train for fewer epochs -- for purposes of this experiment."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train on 549367 samples, validate on 9824 samples\n",
"Epoch 1/50\n",
"549367/549367 [==============================] - 34s 62us/step - loss: 0.7599 - acc: 0.6617 - val_loss: 0.5396 - val_acc: 0.7861\n",
"Epoch 2/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.5611 - acc: 0.7763 - val_loss: 0.4892 - val_acc: 0.8085\n",
"Epoch 3/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.5212 - acc: 0.7948 - val_loss: 0.4574 - val_acc: 0.8261\n",
"Epoch 4/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4986 - acc: 0.8045 - val_loss: 0.4410 - val_acc: 0.8274\n",
"Epoch 5/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4819 - acc: 0.8114 - val_loss: 0.4224 - val_acc: 0.8383\n",
"Epoch 6/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4714 - acc: 0.8166 - val_loss: 0.4200 - val_acc: 0.8379\n",
"Epoch 7/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4633 - acc: 0.8203 - val_loss: 0.4098 - val_acc: 0.8457\n",
"Epoch 8/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4558 - acc: 0.8232 - val_loss: 0.4114 - val_acc: 0.8415\n",
"Epoch 9/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4508 - acc: 0.8250 - val_loss: 0.4062 - val_acc: 0.8477\n",
"Epoch 10/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4433 - acc: 0.8286 - val_loss: 0.3982 - val_acc: 0.8486\n",
"Epoch 11/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4388 - acc: 0.8307 - val_loss: 0.3953 - val_acc: 0.8497\n",
"Epoch 12/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4351 - acc: 0.8321 - val_loss: 0.3973 - val_acc: 0.8522\n",
"Epoch 13/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4309 - acc: 0.8342 - val_loss: 0.3939 - val_acc: 0.8539\n",
"Epoch 14/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4269 - acc: 0.8355 - val_loss: 0.3932 - val_acc: 0.8517\n",
"Epoch 15/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4247 - acc: 0.8369 - val_loss: 0.3938 - val_acc: 0.8515\n",
"Epoch 16/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4208 - acc: 0.8379 - val_loss: 0.3936 - val_acc: 0.8504\n",
"Epoch 17/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4194 - acc: 0.8390 - val_loss: 0.3885 - val_acc: 0.8560\n",
"Epoch 18/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4162 - acc: 0.8402 - val_loss: 0.3874 - val_acc: 0.8561\n",
"Epoch 19/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4140 - acc: 0.8409 - val_loss: 0.3889 - val_acc: 0.8545\n",
"Epoch 20/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4114 - acc: 0.8426 - val_loss: 0.3864 - val_acc: 0.8583\n",
"Epoch 21/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4092 - acc: 0.8430 - val_loss: 0.3870 - val_acc: 0.8561\n",
"Epoch 22/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4062 - acc: 0.8442 - val_loss: 0.3852 - val_acc: 0.8577\n",
"Epoch 23/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4050 - acc: 0.8450 - val_loss: 0.3850 - val_acc: 0.8578\n",
"Epoch 24/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4035 - acc: 0.8455 - val_loss: 0.3825 - val_acc: 0.8555\n",
"Epoch 25/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4018 - acc: 0.8460 - val_loss: 0.3837 - val_acc: 0.8573\n",
"Epoch 26/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3989 - acc: 0.8476 - val_loss: 0.3843 - val_acc: 0.8599\n",
"Epoch 27/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3979 - acc: 0.8481 - val_loss: 0.3841 - val_acc: 0.8589\n",
"Epoch 28/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3967 - acc: 0.8484 - val_loss: 0.3811 - val_acc: 0.8575\n",
"Epoch 29/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3956 - acc: 0.8492 - val_loss: 0.3829 - val_acc: 0.8589\n",
"Epoch 30/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3938 - acc: 0.8499 - val_loss: 0.3859 - val_acc: 0.8562\n",
"Epoch 31/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3925 - acc: 0.8500 - val_loss: 0.3798 - val_acc: 0.8587\n",
"Epoch 32/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3906 - acc: 0.8509 - val_loss: 0.3834 - val_acc: 0.8569\n",
"Epoch 33/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3893 - acc: 0.8511 - val_loss: 0.3806 - val_acc: 0.8588\n",
"Epoch 34/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3885 - acc: 0.8515 - val_loss: 0.3828 - val_acc: 0.8603\n",
"Epoch 35/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3879 - acc: 0.8520 - val_loss: 0.3800 - val_acc: 0.8594\n",
"Epoch 36/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3860 - acc: 0.8530 - val_loss: 0.3796 - val_acc: 0.8577\n",
"Epoch 37/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3856 - acc: 0.8532 - val_loss: 0.3857 - val_acc: 0.8591\n",
"Epoch 38/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3838 - acc: 0.8535 - val_loss: 0.3835 - val_acc: 0.8603\n",
"Epoch 39/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3830 - acc: 0.8543 - val_loss: 0.3830 - val_acc: 0.8599\n",
"Epoch 40/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3818 - acc: 0.8548 - val_loss: 0.3832 - val_acc: 0.8559\n",
"Epoch 41/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3806 - acc: 0.8551 - val_loss: 0.3845 - val_acc: 0.8553\n",
"Epoch 42/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3803 - acc: 0.8550 - val_loss: 0.3789 - val_acc: 0.8617\n",
"Epoch 43/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3791 - acc: 0.8556 - val_loss: 0.3835 - val_acc: 0.8580\n",
"Epoch 44/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3778 - acc: 0.8565 - val_loss: 0.3799 - val_acc: 0.8580\n",
"Epoch 45/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3766 - acc: 0.8571 - val_loss: 0.3790 - val_acc: 0.8625\n",
"Epoch 46/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3770 - acc: 0.8569 - val_loss: 0.3820 - val_acc: 0.8590\n",
"Epoch 47/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3761 - acc: 0.8573 - val_loss: 0.3831 - val_acc: 0.8581\n",
"Epoch 48/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3739 - acc: 0.8579 - val_loss: 0.3828 - val_acc: 0.8599\n",
"Epoch 49/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3738 - acc: 0.8577 - val_loss: 0.3785 - val_acc: 0.8590\n",
"Epoch 50/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3726 - acc: 0.8580 - val_loss: 0.3820 - val_acc: 0.8585\n"
]
},
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x7f5c9f49c438>"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50,validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is broadly in the region reported by Parikh et al: ~86 vs 86.3%. The small difference might be accounted by differences in `max_length` (here set at 50), in the training regime, and that here we use Keras' built-in validation splitting rather than the SNLI test set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Experiment: the asymmetric model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It was suggested earlier that, based on the semantics of entailment, the vector representing the strength of association between the hypothesis to the text is all that is needed for classifying the entailment.\n",
"\n",
"The following model removes consideration of the complementary vector (text to hypothesis) from the computation. This will decrease the paramater count slightly, because the final dense layers will be smaller, and speed up the forward pass when predicting, because fewer calculations will be needed."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"__________________________________________________________________________________________________\n",
"Layer (type) Output Shape Param # Connected to \n",
"==================================================================================================\n",
"words2 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"words1 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"sequential_5 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
" words2[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_6 (Sequential) (None, 50, 200) 80400 sequential_5[1][0] \n",
" sequential_5[2][0] \n",
"__________________________________________________________________________________________________\n",
"dot_4 (Dot) (None, 50, 50) 0 sequential_6[1][0] \n",
" sequential_6[2][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_5 (Lambda) (None, 50, 50) 0 dot_4[0][0] \n",
"__________________________________________________________________________________________________\n",
"dot_5 (Dot) (None, 50, 200) 0 lambda_5[0][0] \n",
" sequential_5[1][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_4 (Concatenate) (None, 50, 400) 0 sequential_5[2][0] \n",
" dot_5[0][0] \n",
"__________________________________________________________________________________________________\n",
"time_distributed_5 (TimeDistrib (None, 50, 200) 120400 concatenate_4[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_6 (Lambda) (None, 200) 0 time_distributed_5[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_8 (Sequential) (None, 200) 80400 lambda_6[0][0] \n",
"__________________________________________________________________________________________________\n",
"dense_16 (Dense) (None, 3) 603 sequential_8[1][0] \n",
"==================================================================================================\n",
"Total params: 321,663,403\n",
"Trainable params: 341,803\n",
"Non-trainable params: 321,321,600\n",
"__________________________________________________________________________________________________\n"
]
}
],
"source": [
"m1 = build_model(sem_vectors, 50, 200, 3, 200, 'left')\n",
"m1.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parameter count has indeed decreased by 40,000, corresponding to the 200x200 smaller H function."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train on 549367 samples, validate on 9824 samples\n",
"Epoch 1/50\n",
"549367/549367 [==============================] - 25s 46us/step - loss: 0.7331 - acc: 0.6770 - val_loss: 0.5257 - val_acc: 0.7936\n",
"Epoch 2/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.5518 - acc: 0.7799 - val_loss: 0.4717 - val_acc: 0.8159\n",
"Epoch 3/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.5147 - acc: 0.7967 - val_loss: 0.4449 - val_acc: 0.8278\n",
"Epoch 4/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4948 - acc: 0.8060 - val_loss: 0.4326 - val_acc: 0.8344\n",
"Epoch 5/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4814 - acc: 0.8122 - val_loss: 0.4247 - val_acc: 0.8359\n",
"Epoch 6/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4712 - acc: 0.8162 - val_loss: 0.4143 - val_acc: 0.8430\n",
"Epoch 7/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4635 - acc: 0.8205 - val_loss: 0.4172 - val_acc: 0.8401\n",
"Epoch 8/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4570 - acc: 0.8223 - val_loss: 0.4106 - val_acc: 0.8422\n",
"Epoch 9/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4505 - acc: 0.8259 - val_loss: 0.4043 - val_acc: 0.8451\n",
"Epoch 10/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4459 - acc: 0.8280 - val_loss: 0.4050 - val_acc: 0.8467\n",
"Epoch 11/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4405 - acc: 0.8300 - val_loss: 0.3975 - val_acc: 0.8481\n",
"Epoch 12/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4360 - acc: 0.8324 - val_loss: 0.4026 - val_acc: 0.8496\n",
"Epoch 13/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4327 - acc: 0.8334 - val_loss: 0.4024 - val_acc: 0.8471\n",
"Epoch 14/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4293 - acc: 0.8350 - val_loss: 0.3955 - val_acc: 0.8496\n",
"Epoch 15/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4263 - acc: 0.8369 - val_loss: 0.3980 - val_acc: 0.8490\n",
"Epoch 16/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4236 - acc: 0.8377 - val_loss: 0.3958 - val_acc: 0.8496\n",
"Epoch 17/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4213 - acc: 0.8384 - val_loss: 0.3954 - val_acc: 0.8496\n",
"Epoch 18/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4187 - acc: 0.8394 - val_loss: 0.3929 - val_acc: 0.8514\n",
"Epoch 19/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4157 - acc: 0.8409 - val_loss: 0.3939 - val_acc: 0.8507\n",
"Epoch 20/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4135 - acc: 0.8417 - val_loss: 0.3953 - val_acc: 0.8522\n",
"Epoch 21/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4122 - acc: 0.8424 - val_loss: 0.3974 - val_acc: 0.8506\n",
"Epoch 22/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4099 - acc: 0.8435 - val_loss: 0.3918 - val_acc: 0.8522\n",
"Epoch 23/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4075 - acc: 0.8443 - val_loss: 0.3901 - val_acc: 0.8513\n",
"Epoch 24/50\n",
"549367/549367 [==============================] - 24s 44us/step - loss: 0.4067 - acc: 0.8447 - val_loss: 0.3885 - val_acc: 0.8543\n",
"Epoch 25/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4047 - acc: 0.8454 - val_loss: 0.3846 - val_acc: 0.8531\n",
"Epoch 26/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4031 - acc: 0.8461 - val_loss: 0.3864 - val_acc: 0.8562\n",
"Epoch 27/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4020 - acc: 0.8467 - val_loss: 0.3874 - val_acc: 0.8546\n",
"Epoch 28/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4001 - acc: 0.8473 - val_loss: 0.3848 - val_acc: 0.8534\n",
"Epoch 29/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3991 - acc: 0.8479 - val_loss: 0.3865 - val_acc: 0.8562\n",
"Epoch 30/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3976 - acc: 0.8484 - val_loss: 0.3833 - val_acc: 0.8574\n",
"Epoch 31/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3961 - acc: 0.8487 - val_loss: 0.3846 - val_acc: 0.8585\n",
"Epoch 32/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3942 - acc: 0.8498 - val_loss: 0.3805 - val_acc: 0.8573\n",
"Epoch 33/50\n",
"549367/549367 [==============================] - 24s 44us/step - loss: 0.3935 - acc: 0.8503 - val_loss: 0.3856 - val_acc: 0.8579\n",
"Epoch 34/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3923 - acc: 0.8507 - val_loss: 0.3829 - val_acc: 0.8560\n",
"Epoch 35/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3920 - acc: 0.8508 - val_loss: 0.3864 - val_acc: 0.8575\n",
"Epoch 36/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3907 - acc: 0.8516 - val_loss: 0.3873 - val_acc: 0.8563\n",
"Epoch 37/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3891 - acc: 0.8519 - val_loss: 0.3850 - val_acc: 0.8570\n",
"Epoch 38/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3872 - acc: 0.8522 - val_loss: 0.3815 - val_acc: 0.8591\n",
"Epoch 39/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3887 - acc: 0.8520 - val_loss: 0.3829 - val_acc: 0.8590\n",
"Epoch 40/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3868 - acc: 0.8531 - val_loss: 0.3807 - val_acc: 0.8600\n",
"Epoch 41/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3859 - acc: 0.8537 - val_loss: 0.3832 - val_acc: 0.8574\n",
"Epoch 42/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3849 - acc: 0.8537 - val_loss: 0.3850 - val_acc: 0.8576\n",
"Epoch 43/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3834 - acc: 0.8541 - val_loss: 0.3825 - val_acc: 0.8563\n",
"Epoch 44/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3829 - acc: 0.8548 - val_loss: 0.3844 - val_acc: 0.8540\n",
"Epoch 45/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3816 - acc: 0.8552 - val_loss: 0.3841 - val_acc: 0.8559\n",
"Epoch 46/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3816 - acc: 0.8549 - val_loss: 0.3880 - val_acc: 0.8567\n",
"Epoch 47/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.3799 - acc: 0.8559 - val_loss: 0.3767 - val_acc: 0.8635\n",
"Epoch 48/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3800 - acc: 0.8560 - val_loss: 0.3786 - val_acc: 0.8563\n",
"Epoch 49/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3781 - acc: 0.8563 - val_loss: 0.3812 - val_acc: 0.8596\n",
"Epoch 50/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3788 - acc: 0.8560 - val_loss: 0.3782 - val_acc: 0.8601\n"
]
},
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x7f5ca1bf3e48>"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m1.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50,validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This model performs the same as the slightly more complex model that evaluates alignments in both directions. Note also that processing time is improved, from 64 down to 48 microseconds per step. \n",
"\n",
"Let's now look at an asymmetric model that evaluates text to hypothesis comparisons. The prediction is that such a model will correctly classify a decent proportion of the exemplars, but not as accurately as the previous two.\n",
"\n",
"We'll just use 10 epochs for expediency."
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"__________________________________________________________________________________________________\n",
"Layer (type) Output Shape Param # Connected to \n",
"==================================================================================================\n",
"words1 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"words2 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"sequential_13 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
" words2[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_14 (Sequential) (None, 50, 200) 80400 sequential_13[1][0] \n",
" sequential_13[2][0] \n",
"__________________________________________________________________________________________________\n",
"dot_8 (Dot) (None, 50, 50) 0 sequential_14[1][0] \n",
" sequential_14[2][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_9 (Lambda) (None, 50, 50) 0 dot_8[0][0] \n",
"__________________________________________________________________________________________________\n",
"dot_9 (Dot) (None, 50, 200) 0 lambda_9[0][0] \n",
" sequential_13[2][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_6 (Concatenate) (None, 50, 400) 0 sequential_13[1][0] \n",
" dot_9[0][0] \n",
"__________________________________________________________________________________________________\n",
"time_distributed_9 (TimeDistrib (None, 50, 200) 120400 concatenate_6[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_10 (Lambda) (None, 200) 0 time_distributed_9[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_16 (Sequential) (None, 200) 80400 lambda_10[0][0] \n",
"__________________________________________________________________________________________________\n",
"dense_32 (Dense) (None, 3) 603 sequential_16[1][0] \n",
"==================================================================================================\n",
"Total params: 321,663,403\n",
"Trainable params: 341,803\n",
"Non-trainable params: 321,321,600\n",
"__________________________________________________________________________________________________\n"
]
}
],
"source": [
"m2 = build_model(sem_vectors, 50, 200, 3, 200, 'right')\n",
"m2.summary()"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train on 455226 samples, validate on 113807 samples\n",
"Epoch 1/10\n",
"455226/455226 [==============================] - 22s 49us/step - loss: 0.8920 - acc: 0.5771 - val_loss: 0.8001 - val_acc: 0.6435\n",
"Epoch 2/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7808 - acc: 0.6553 - val_loss: 0.7267 - val_acc: 0.6855\n",
"Epoch 3/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7329 - acc: 0.6825 - val_loss: 0.6966 - val_acc: 0.7006\n",
"Epoch 4/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7055 - acc: 0.6978 - val_loss: 0.6713 - val_acc: 0.7150\n",
"Epoch 5/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6862 - acc: 0.7081 - val_loss: 0.6533 - val_acc: 0.7253\n",
"Epoch 6/10\n",
"455226/455226 [==============================] - 21s 47us/step - loss: 0.6694 - acc: 0.7179 - val_loss: 0.6472 - val_acc: 0.7277\n",
"Epoch 7/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6555 - acc: 0.7252 - val_loss: 0.6338 - val_acc: 0.7347\n",
"Epoch 8/10\n",
"455226/455226 [==============================] - 22s 48us/step - loss: 0.6434 - acc: 0.7310 - val_loss: 0.6246 - val_acc: 0.7385\n",
"Epoch 9/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6325 - acc: 0.7367 - val_loss: 0.6164 - val_acc: 0.7424\n",
"Epoch 10/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6216 - acc: 0.7426 - val_loss: 0.6082 - val_acc: 0.7478\n"
]
},
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x7fa6850cf080>"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m2.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=10,validation_split=.2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Comparing this fit to the validation accuracy of the previous two models after 10 epochs, we observe that its accuracy is roughly 10% lower.\n",
"\n",
"It is reassuring that the neural modeling here reproduces what we know from the semantics of natural language!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}