Commit 9e4079ddb2 by Matthew Honnibal, 2018-10-02 19:44:43 +02:00
13 changed files with 1326 additions and 394 deletions

keras_parikh_entailment/README.md

@ -2,11 +2,7 @@
# A decomposable attention model for Natural Language Inference
**by Matthew Honnibal, [@honnibal](https://github.com/honnibal)**
> ⚠️ **IMPORTANT NOTE:** This example is currently only compatible with spaCy
> v1.x. We're working on porting the example over to Keras v2.x and spaCy v2.x.
> See [#1445](https://github.com/explosion/spaCy/issues/1445) for details
> contributions welcome!
**Updated for spaCy 2.0+ and Keras 2.2.2+ by John Stewart, [@free-variation](https://github.com/free-variation)**
This directory contains an implementation of the entailment prediction model described
by [Parikh et al. (2016)](https://arxiv.org/pdf/1606.01933.pdf). The model is notable
@ -21,19 +17,25 @@ hook is installed to customise the `.similarity()` method of spaCy's `Doc`
and `Span` objects:
```python
def demo(model_dir):
nlp = spacy.load('en', path=model_dir,
create_pipeline=create_similarity_pipeline)
doc1 = nlp(u'Worst fries ever! Greasy and horrible...')
doc2 = nlp(u'The milkshakes are good. The fries are bad.')
print(doc1.similarity(doc2))
sent1a, sent1b = doc1.sents
print(sent1a.similarity(sent1b))
print(sent1a.similarity(doc2))
print(sent1b.similarity(doc2))
def demo(shape):
nlp = spacy.load('en_vectors_web_lg')
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / 'similarity', nlp, shape[0]))
doc1 = nlp(u'The king of France is bald.')
doc2 = nlp(u'France has no king.')
print("Sentence 1:", doc1)
print("Sentence 2:", doc2)
entailment_type, confidence = doc1.similarity(doc2)
print("Entailment type:", entailment_type, "(Confidence:", confidence, ")")
```
This gives the output `Entailment type: contradiction (Confidence: 0.60604566)`, showing that
the system has definite opinions about Bertrand Russell's [famous conundrum](https://users.drew.edu/jlenz/br-on-denoting.html)!
I'm working on a blog post to explain Parikh et al.'s model in more detail.
A [notebook](https://github.com/free-variation/spaCy/blob/master/examples/notebooks/Decompositional%20Attention.ipynb) is available that briefly explains this implementation.
I think it is a very interesting example of the attention mechanism, which
I didn't understand very well before working through this paper. There are
lots of ways to extend the model.
@ -43,7 +45,7 @@ lots of ways to extend the model.
| File | Description |
| --- | --- |
| `__main__.py` | The script that will be executed. Defines the CLI, the data reading, etc — all the boring stuff. |
| `spacy_hook.py` | Provides a class `SimilarityShim` that lets you use an arbitrary function to customize spaCy's `doc.similarity()` method. Instead of the default average-of-vectors algorithm, when you call `doc1.similarity(doc2)`, you'll get the result of `your_model(doc1, doc2)`. |
| `spacy_hook.py` | Provides a class `KerasSimilarityShim` that lets you use an arbitrary function to customize spaCy's `doc.similarity()` method. Instead of the default average-of-vectors algorithm, when you call `doc1.similarity(doc2)`, you'll get the result of `your_model(doc1, doc2)` (see the sketch below the table). |
| `keras_decomposable_attention.py` | Defines the neural network model. |
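As noted in the table, `spacy_hook.py` wires the model in through spaCy's `user_hooks` mechanism. The sketch below shows that mechanism in isolation, assuming spaCy v2.x and the `en_vectors_web_lg` model; `toy_model` and `SimpleSimilarityHook` are made-up placeholders for illustration, not the `KerasSimilarityShim` class this example actually provides.

```python
# Minimal sketch of the similarity-hook mechanism (spaCy v2.x assumed).
# `toy_model` is a hypothetical stand-in for the Keras network.
import spacy

def toy_model(doc1, doc2):
    # Any callable that scores a pair of Docs will do here.
    return doc1.vector.dot(doc2.vector)

class SimpleSimilarityHook(object):
    def __call__(self, doc):
        # Route .similarity() calls for this Doc (and its Spans) to our model.
        doc.user_hooks['similarity'] = self.predict
        doc.user_span_hooks['similarity'] = self.predict
        return doc

    def predict(self, doc1, doc2):
        return toy_model(doc1, doc2)

nlp = spacy.load('en_vectors_web_lg')
nlp.add_pipe(SimpleSimilarityHook())

doc1 = nlp(u'The fries were great.')
doc2 = nlp(u'The fries were terrible.')
print(doc1.similarity(doc2))  # now calls toy_model instead of the default
```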
## Setting up
@ -52,17 +54,13 @@ First, install [Keras](https://keras.io/), [spaCy](https://spacy.io) and the spa
English models (about 1GB of data):
```bash
pip install https://github.com/fchollet/keras/archive/1.2.2.zip
pip install keras
pip install spacy
python -m spacy.en.download
python -m spacy download en_vectors_web_lg
```
⚠️ **Important:** In order for the example to run, you'll need to install Keras from
the 1.2.2 release (and not via `pip install keras`). For more info on this, see
[#727](https://github.com/explosion/spaCy/issues/727).
You'll also want to get Keras working on your GPU. This will depend on your
set up, so you're mostly on your own for this step. If you're using AWS, try the
You'll also want to get Keras working on your GPU, and you will need a backend, such as TensorFlow or Theano.
This will depend on your setup, so you're mostly on your own for this step. If you're using AWS, try the
[NVidia AMI](https://aws.amazon.com/marketplace/pp/B00FYCDDTE). It made things pretty easy.
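Before moving on, a quick sanity check (a minimal sketch, assuming Keras is installed with one of the backends above) that Keras imports and reports the backend you expect:

```python
# Check which backend Keras picked up; expect 'tensorflow' or 'theano',
# depending on your installation.
from keras import backend as K

print(K.backend())
```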
Once you've installed the dependencies, you can run a small preliminary test of
@ -80,22 +78,35 @@ Finally, download the [Stanford Natural Language Inference corpus](http://nlp.st
## Running the example
You can run the `keras_parikh_entailment/` directory as a script, which executes the file
[`keras_parikh_entailment/__main__.py`](__main__.py). The first thing you'll want to do is train the model:
[`keras_parikh_entailment/__main__.py`](__main__.py). If you run the script without arguments
the usage is shown. Running it with `-h` explains the command line arguments.
The first thing you'll want to do is train the model:
```bash
python keras_parikh_entailment/ train <train_directory> <dev_directory>
python keras_parikh_entailment/ train -t <path to SNLI train JSON> -s <path to SNLI dev JSON>
```
Training takes about 300 epochs for full accuracy, and I haven't rerun the full
experiment since refactoring things to publish this example — please let me
know if I've broken something. You should get to at least 85% on the development data.
know if I've broken something. You should get to at least 85% accuracy on the development data after only 10-15 epochs.
The other two modes demonstrate run-time usage. I never like relying on the accuracy printed
by `.fit()` methods. I never really feel confident until I've run a new process that loads
the model and starts making predictions, without access to the gold labels. I've therefore
included an `evaluate` mode. Finally, there's also a little demo, which mostly exists to show
included an `evaluate` mode.
```bash
python keras_parikh_entailment/ evaluate -s <path to SNLI train JSON>
```
Finally, there's also a little demo, which mostly exists to show
you how run-time usage will eventually look.
```bash
python keras_parikh_entailment/ demo
```
## Getting updates
We should have the blog post explaining the model ready before the end of the week. To get

keras_parikh_entailment/__main__.py

@ -1,82 +1,104 @@
from __future__ import division, unicode_literals, print_function
import spacy
import plac
from pathlib import Path
import numpy as np
import ujson as json
import numpy
from keras.utils.np_utils import to_categorical
from spacy_hook import get_embeddings, get_word_ids
from spacy_hook import create_similarity_pipeline
from keras.utils import to_categorical
import plac
import sys
from keras_decomposable_attention import build_model
from spacy_hook import get_embeddings, KerasSimilarityShim
try:
import cPickle as pickle
except ImportError:
import pickle
import spacy
# workaround for keras/tensorflow bug
# see https://github.com/tensorflow/tensorflow/issues/3388
import os
import importlib
from keras import backend as K
def set_keras_backend(backend):
if K.backend() != backend:
os.environ['KERAS_BACKEND'] = backend
importlib.reload(K)
assert K.backend() == backend
if backend == "tensorflow":
K.get_session().close()
cfg = K.tf.ConfigProto()
cfg.gpu_options.allow_growth = True
K.set_session(K.tf.Session(config=cfg))
K.clear_session()
set_keras_backend("tensorflow")
def train(train_loc, dev_loc, shape, settings):
train_texts1, train_texts2, train_labels = read_snli(train_loc)
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
print("Loading spaCy")
nlp = spacy.load('en')
nlp = spacy.load('en_vectors_web_lg')
assert nlp.path is not None
print("Processing texts...")
train_X = create_dataset(nlp, train_texts1, train_texts2, 100, shape[0])
dev_X = create_dataset(nlp, dev_texts1, dev_texts2, 100, shape[0])
print("Compiling network")
model = build_model(get_embeddings(nlp.vocab), shape, settings)
print("Processing texts...")
Xs = []
for texts in (train_texts1, train_texts2, dev_texts1, dev_texts2):
Xs.append(get_word_ids(list(nlp.pipe(texts, n_threads=20, batch_size=20000)),
max_length=shape[0],
rnn_encode=settings['gru_encode'],
tree_truncate=settings['tree_truncate']))
train_X1, train_X2, dev_X1, dev_X2 = Xs
print(settings)
model.fit(
[train_X1, train_X2],
train_X,
train_labels,
validation_data=([dev_X1, dev_X2], dev_labels),
nb_epoch=settings['nr_epoch'],
batch_size=settings['batch_size'])
validation_data = (dev_X, dev_labels),
epochs = settings['nr_epoch'],
batch_size = settings['batch_size'])
if not (nlp.path / 'similarity').exists():
(nlp.path / 'similarity').mkdir()
print("Saving to", nlp.path / 'similarity')
weights = model.get_weights()
# remove the embedding matrix. We can reconstruct it.
del weights[1]
with (nlp.path / 'similarity' / 'model').open('wb') as file_:
pickle.dump(weights[1:], file_)
with (nlp.path / 'similarity' / 'config.json').open('wb') as file_:
pickle.dump(weights, file_)
with (nlp.path / 'similarity' / 'config.json').open('w') as file_:
file_.write(model.to_json())
def evaluate(dev_loc):
def evaluate(dev_loc, shape):
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
nlp = spacy.load('en',
create_pipeline=create_similarity_pipeline)
nlp = spacy.load('en_vectors_web_lg')
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / 'similarity', nlp, shape[0]))
total = 0.
correct = 0.
for text1, text2, label in zip(dev_texts1, dev_texts2, dev_labels):
doc1 = nlp(text1)
doc2 = nlp(text2)
sim = doc1.similarity(doc2)
if sim.argmax() == label.argmax():
sim, _ = doc1.similarity(doc2)
if sim == KerasSimilarityShim.entailment_types[label.argmax()]:
correct += 1
total += 1
return correct, total
def demo():
nlp = spacy.load('en',
create_pipeline=create_similarity_pipeline)
doc1 = nlp(u'What were the best crime fiction books in 2016?')
doc2 = nlp(
u'What should I read that was published last year? I like crime stories.')
print(doc1)
print(doc2)
print("Similarity", doc1.similarity(doc2))
def demo(shape):
nlp = spacy.load('en_vectors_web_lg')
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / 'similarity', nlp, shape[0]))
doc1 = nlp(u'The king of France is bald.')
doc2 = nlp(u'France has no king.')
print("Sentence 1:", doc1)
print("Sentence 2:", doc2)
entailment_type, confidence = doc1.similarity(doc2)
print("Entailment type:", entailment_type, "(Confidence:", confidence, ")")
LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
@ -84,56 +106,92 @@ def read_snli(path):
texts1 = []
texts2 = []
labels = []
with path.open() as file_:
with open(path, 'r') as file_:
for line in file_:
eg = json.loads(line)
label = eg['gold_label']
if label == '-':
if label == '-': # per Parikh, ignore - SNLI entries
continue
texts1.append(eg['sentence1'])
texts2.append(eg['sentence2'])
labels.append(LABELS[label])
return texts1, texts2, to_categorical(numpy.asarray(labels, dtype='int32'))
return texts1, texts2, to_categorical(np.asarray(labels, dtype='int32'))
def create_dataset(nlp, texts, hypotheses, num_unk, max_length):
sents = texts + hypotheses
sents_as_ids = []
for sent in sents:
doc = nlp(sent)
word_ids = []
for i, token in enumerate(doc):
# skip odd spaces from tokenizer
if token.has_vector and token.vector_norm == 0:
continue
if i > max_length:
break
if token.has_vector:
word_ids.append(token.rank + num_unk + 1)
else:
# if we don't have a vector, pick an OOV entry
word_ids.append(token.rank % num_unk + 1)
# there must be a simpler way of generating padded arrays from lists...
word_id_vec = np.zeros((max_length), dtype='int')
clipped_len = min(max_length, len(word_ids))
word_id_vec[:clipped_len] = word_ids[:clipped_len]
sents_as_ids.append(word_id_vec)
return [np.array(sents_as_ids[:len(texts)]), np.array(sents_as_ids[len(texts):])]
@plac.annotations(
mode=("Mode to execute", "positional", None, str, ["train", "evaluate", "demo"]),
train_loc=("Path to training data", "positional", None, Path),
dev_loc=("Path to development data", "positional", None, Path),
train_loc=("Path to training data", "option", "t", str),
dev_loc=("Path to development or test data", "option", "s", str),
max_length=("Length to truncate sentences", "option", "L", int),
nr_hidden=("Number of hidden units", "option", "H", int),
dropout=("Dropout level", "option", "d", float),
learn_rate=("Learning rate", "option", "e", float),
learn_rate=("Learning rate", "option", "r", float),
batch_size=("Batch size for neural network training", "option", "b", int),
nr_epoch=("Number of training epochs", "option", "i", int),
tree_truncate=("Truncate sentences by tree distance", "flag", "T", bool),
gru_encode=("Encode sentences with bidirectional GRU", "flag", "E", bool),
nr_epoch=("Number of training epochs", "option", "e", int),
entail_dir=("Direction of entailment", "option", "D", str, ["both", "left", "right"])
)
def main(mode, train_loc, dev_loc,
tree_truncate=False,
gru_encode=False,
max_length=100,
nr_hidden=100,
dropout=0.2,
learn_rate=0.001,
batch_size=100,
nr_epoch=5):
max_length = 50,
nr_hidden = 200,
dropout = 0.2,
learn_rate = 0.001,
batch_size = 1024,
nr_epoch = 10,
entail_dir="both"):
shape = (max_length, nr_hidden, 3)
settings = {
'lr': learn_rate,
'dropout': dropout,
'batch_size': batch_size,
'nr_epoch': nr_epoch,
'tree_truncate': tree_truncate,
'gru_encode': gru_encode
'entail_dir': entail_dir
}
if mode == 'train':
if train_loc == None or dev_loc == None:
print("Train mode requires paths to training and development data sets.")
sys.exit(1)
train(train_loc, dev_loc, shape, settings)
elif mode == 'evaluate':
correct, total = evaluate(dev_loc)
if dev_loc == None:
print("Evaluate mode requires paths to test data set.")
sys.exit(1)
correct, total = evaluate(dev_loc, shape)
print(correct, '/', total, correct / total)
else:
demo()
demo(shape)
if __name__ == '__main__':
plac.call(main)

keras_parikh_entailment/keras_decomposable_attention.py

@ -1,259 +1,137 @@
# Semantic similarity with decomposable attention (using spaCy and Keras)
# Practical state-of-the-art text similarity with spaCy and Keras
import numpy
from keras.layers import InputSpec, Layer, Input, Dense, merge
from keras.layers import Lambda, Activation, Dropout, Embedding, TimeDistributed
from keras.layers import Bidirectional, GRU, LSTM
from keras.layers.noise import GaussianNoise
from keras.layers.advanced_activations import ELU
import keras.backend as K
from keras.models import Sequential, Model, model_from_json
from keras.regularizers import l2
from keras.optimizers import Adam
from keras.layers.normalization import BatchNormalization
from keras.layers.pooling import GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.layers import Merge
# Semantic entailment/similarity with decomposable attention (using spaCy and Keras)
# Practical state-of-the-art textual entailment with spaCy and Keras
import numpy as np
from keras import layers, Model, models, optimizers
from keras import backend as K
def build_model(vectors, shape, settings):
'''Compile the model.'''
max_length, nr_hidden, nr_class = shape
# Declare inputs.
ids1 = Input(shape=(max_length,), dtype='int32', name='words1')
ids2 = Input(shape=(max_length,), dtype='int32', name='words2')
# Construct operations, which we'll chain together.
embed = _StaticEmbedding(vectors, max_length, nr_hidden, dropout=0.2, nr_tune=5000)
if settings['gru_encode']:
encode = _BiRNNEncoding(max_length, nr_hidden, dropout=settings['dropout'])
attend = _Attention(max_length, nr_hidden, dropout=settings['dropout'])
align = _SoftAlignment(max_length, nr_hidden)
compare = _Comparison(max_length, nr_hidden, dropout=settings['dropout'])
entail = _Entailment(nr_hidden, nr_class, dropout=settings['dropout'])
input1 = layers.Input(shape=(max_length,), dtype='int32', name='words1')
input2 = layers.Input(shape=(max_length,), dtype='int32', name='words2')
# Declare the model as a computational graph.
sent1 = embed(ids1) # Shape: (i, n)
sent2 = embed(ids2) # Shape: (j, n)
# embeddings (projected)
embed = create_embedding(vectors, max_length, nr_hidden)
if settings['gru_encode']:
sent1 = encode(sent1)
sent2 = encode(sent2)
a = embed(input1)
b = embed(input2)
attention = attend(sent1, sent2) # Shape: (i, j)
# step 1: attend
F = create_feedforward(nr_hidden)
att_weights = layers.dot([F(a), F(b)], axes=-1)
align1 = align(sent2, attention)
align2 = align(sent1, attention, transpose=True)
G = create_feedforward(nr_hidden)
feats1 = compare(sent1, align1)
feats2 = compare(sent2, align2)
if settings['entail_dir'] == 'both':
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
alpha = layers.dot([norm_weights_a, a], axes=1)
beta = layers.dot([norm_weights_b, b], axes=1)
scores = entail(feats1, feats2)
# step 2: compare
comp1 = layers.concatenate([a, beta])
comp2 = layers.concatenate([b, alpha])
v1 = layers.TimeDistributed(G)(comp1)
v2 = layers.TimeDistributed(G)(comp2)
# Now that we have the input/output, we can construct the Model object...
model = Model(input=[ids1, ids2], output=[scores])
# step 3: aggregate
v1_sum = layers.Lambda(sum_word)(v1)
v2_sum = layers.Lambda(sum_word)(v2)
concat = layers.concatenate([v1_sum, v2_sum])
elif settings['entail_dir'] == 'left':
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
alpha = layers.dot([norm_weights_a, a], axes=1)
comp2 = layers.concatenate([b, alpha])
v2 = layers.TimeDistributed(G)(comp2)
v2_sum = layers.Lambda(sum_word)(v2)
concat = v2_sum
else:
norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
beta = layers.dot([norm_weights_b, b], axes=1)
comp1 = layers.concatenate([a, beta])
v1 = layers.TimeDistributed(G)(comp1)
v1_sum = layers.Lambda(sum_word)(v1)
concat = v1_sum
H = create_feedforward(nr_hidden)
out = H(concat)
out = layers.Dense(nr_class, activation='softmax')(out)
model = Model([input1, input2], out)
# ...Compile it...
model.compile(
optimizer=Adam(lr=settings['lr']),
optimizer=optimizers.Adam(lr=settings['lr']),
loss='categorical_crossentropy',
metrics=['accuracy'])
# ...And return it for training.
return model
class _StaticEmbedding(object):
def __init__(self, vectors, max_length, nr_out, nr_tune=1000, dropout=0.0):
self.nr_out = nr_out
self.max_length = max_length
self.embed = Embedding(
def create_embedding(vectors, max_length, projected_dim):
return models.Sequential([
layers.Embedding(
vectors.shape[0],
vectors.shape[1],
input_length=max_length,
weights=[vectors],
name='embed',
trainable=False)
self.tune = Embedding(
nr_tune,
nr_out,
input_length=max_length,
weights=None,
name='tune',
trainable=True,
dropout=dropout)
self.mod_ids = Lambda(lambda sent: sent % (nr_tune-1)+1,
output_shape=(self.max_length,))
trainable=False),
self.project = TimeDistributed(
Dense(
nr_out,
layers.TimeDistributed(
layers.Dense(projected_dim,
activation=None,
bias=False,
name='project'))
use_bias=False))
])
def __call__(self, sentence):
def get_output_shape(shapes):
print(shapes)
return shapes[0]
mod_sent = self.mod_ids(sentence)
tuning = self.tune(mod_sent)
#tuning = merge([tuning, mod_sent],
# mode=lambda AB: AB[0] * (K.clip(K.cast(AB[1], 'float32'), 0, 1)),
# output_shape=(self.max_length, self.nr_out))
pretrained = self.project(self.embed(sentence))
vectors = merge([pretrained, tuning], mode='sum')
return vectors
def create_feedforward(num_units=200, activation='relu', dropout_rate=0.2):
return models.Sequential([
layers.Dense(num_units, activation=activation),
layers.Dropout(dropout_rate),
layers.Dense(num_units, activation=activation),
layers.Dropout(dropout_rate)
])
class _BiRNNEncoding(object):
def __init__(self, max_length, nr_out, dropout=0.0):
self.model = Sequential()
self.model.add(Bidirectional(LSTM(nr_out, return_sequences=True,
dropout_W=dropout, dropout_U=dropout),
input_shape=(max_length, nr_out)))
self.model.add(TimeDistributed(Dense(nr_out, activation='relu', init='he_normal')))
self.model.add(TimeDistributed(Dropout(0.2)))
def normalizer(axis):
def _normalize(att_weights):
exp_weights = K.exp(att_weights)
sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)
return exp_weights/sum_weights
return _normalize
def __call__(self, sentence):
return self.model(sentence)
class _Attention(object):
def __init__(self, max_length, nr_hidden, dropout=0.0, L2=0.0, activation='relu'):
self.max_length = max_length
self.model = Sequential()
self.model.add(Dropout(dropout, input_shape=(nr_hidden,)))
self.model.add(
Dense(nr_hidden, name='attend1',
init='he_normal', W_regularizer=l2(L2),
input_shape=(nr_hidden,), activation='relu'))
self.model.add(Dropout(dropout))
self.model.add(Dense(nr_hidden, name='attend2',
init='he_normal', W_regularizer=l2(L2), activation='relu'))
self.model = TimeDistributed(self.model)
def __call__(self, sent1, sent2):
def _outer(AB):
att_ji = K.batch_dot(AB[1], K.permute_dimensions(AB[0], (0, 2, 1)))
return K.permute_dimensions(att_ji,(0, 2, 1))
return merge(
[self.model(sent1), self.model(sent2)],
mode=_outer,
output_shape=(self.max_length, self.max_length))
class _SoftAlignment(object):
def __init__(self, max_length, nr_hidden):
self.max_length = max_length
self.nr_hidden = nr_hidden
def __call__(self, sentence, attention, transpose=False):
def _normalize_attention(attmat):
att = attmat[0]
mat = attmat[1]
if transpose:
att = K.permute_dimensions(att,(0, 2, 1))
# 3d softmax
e = K.exp(att - K.max(att, axis=-1, keepdims=True))
s = K.sum(e, axis=-1, keepdims=True)
sm_att = e / s
return K.batch_dot(sm_att, mat)
return merge([attention, sentence], mode=_normalize_attention,
output_shape=(self.max_length, self.nr_hidden)) # Shape: (i, n)
class _Comparison(object):
def __init__(self, words, nr_hidden, L2=0.0, dropout=0.0):
self.words = words
self.model = Sequential()
self.model.add(Dropout(dropout, input_shape=(nr_hidden*2,)))
self.model.add(Dense(nr_hidden, name='compare1',
init='he_normal', W_regularizer=l2(L2)))
self.model.add(Activation('relu'))
self.model.add(Dropout(dropout))
self.model.add(Dense(nr_hidden, name='compare2',
W_regularizer=l2(L2), init='he_normal'))
self.model.add(Activation('relu'))
self.model = TimeDistributed(self.model)
def __call__(self, sent, align, **kwargs):
result = self.model(merge([sent, align], mode='concat')) # Shape: (i, n)
avged = GlobalAveragePooling1D()(result, mask=self.words)
maxed = GlobalMaxPooling1D()(result, mask=self.words)
merged = merge([avged, maxed])
result = BatchNormalization()(merged)
return result
class _Entailment(object):
def __init__(self, nr_hidden, nr_out, dropout=0.0, L2=0.0):
self.model = Sequential()
self.model.add(Dropout(dropout, input_shape=(nr_hidden*2,)))
self.model.add(Dense(nr_hidden, name='entail1',
init='he_normal', W_regularizer=l2(L2)))
self.model.add(Activation('relu'))
self.model.add(Dropout(dropout))
self.model.add(Dense(nr_hidden, name='entail2',
init='he_normal', W_regularizer=l2(L2)))
self.model.add(Activation('relu'))
self.model.add(Dense(nr_out, name='entail_out', activation='softmax',
W_regularizer=l2(L2), init='zero'))
def __call__(self, feats1, feats2):
features = merge([feats1, feats2], mode='concat')
return self.model(features)
class _GlobalSumPooling1D(Layer):
'''Global sum pooling operation for temporal data.
# Input shape
3D tensor with shape: `(samples, steps, features)`.
# Output shape
2D tensor with shape: `(samples, features)`.
'''
def __init__(self, **kwargs):
super(_GlobalSumPooling1D, self).__init__(**kwargs)
self.input_spec = [InputSpec(ndim=3)]
def get_output_shape_for(self, input_shape):
return (input_shape[0], input_shape[2])
def call(self, x, mask=None):
if mask is not None:
return K.sum(x * K.clip(mask, 0, 1), axis=1)
else:
def sum_word(x):
return K.sum(x, axis=1)
def test_build_model():
vectors = numpy.ndarray((100, 8), dtype='float32')
vectors = np.ndarray((100, 8), dtype='float32')
shape = (10, 16, 3)
settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True}
settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True, 'entail_dir':'both'}
model = build_model(vectors, shape, settings)
def test_fit_model():
def _generate_X(nr_example, length, nr_vector):
X1 = numpy.ndarray((nr_example, length), dtype='int32')
X1 = np.ndarray((nr_example, length), dtype='int32')
X1 *= X1 < nr_vector
X1 *= 0 <= X1
X2 = numpy.ndarray((nr_example, length), dtype='int32')
X2 = np.ndarray((nr_example, length), dtype='int32')
X2 *= X2 < nr_vector
X2 *= 0 <= X2
return [X1, X2]
def _generate_Y(nr_example, nr_class):
ys = numpy.zeros((nr_example, nr_class), dtype='int32')
ys = np.zeros((nr_example, nr_class), dtype='int32')
for i in range(nr_example):
ys[i, i % nr_class] = 1
return ys
vectors = numpy.ndarray((100, 8), dtype='float32')
vectors = np.ndarray((100, 8), dtype='float32')
shape = (10, 16, 3)
settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True}
settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True, 'entail_dir':'both'}
model = build_model(vectors, shape, settings)
train_X = _generate_X(20, shape[0], vectors.shape[0])
@ -261,8 +139,7 @@ def test_fit_model():
dev_X = _generate_X(15, shape[0], vectors.shape[0])
dev_Y = _generate_Y(15, shape[2])
model.fit(train_X, train_Y, validation_data=(dev_X, dev_Y), nb_epoch=5,
batch_size=4)
model.fit(train_X, train_Y, validation_data=(dev_X, dev_Y), epochs=5, batch_size=4)
__all__ = [build_model]

keras_parikh_entailment/spacy_hook.py

@ -1,8 +1,5 @@
import numpy as np
from keras.models import model_from_json
import numpy
import numpy.random
import json
from spacy.tokens.span import Span
try:
import cPickle as pickle
@ -11,16 +8,23 @@ except ImportError:
class KerasSimilarityShim(object):
entailment_types = ["entailment", "contradiction", "neutral"]
@classmethod
def load(cls, path, nlp, get_features=None, max_length=100):
def load(cls, path, nlp, max_length=100, get_features=None):
if get_features is None:
get_features = get_word_ids
with (path / 'config.json').open() as file_:
model = model_from_json(file_.read())
with (path / 'model').open('rb') as file_:
weights = pickle.load(file_)
embeddings = get_embeddings(nlp.vocab)
model.set_weights([embeddings] + weights)
weights.insert(1, embeddings)
model.set_weights(weights)
return cls(model, get_features=get_features, max_length=max_length)
def __init__(self, model, get_features=None, max_length=100):
@ -32,58 +36,42 @@ class KerasSimilarityShim(object):
doc.user_hooks['similarity'] = self.predict
doc.user_span_hooks['similarity'] = self.predict
return doc
def predict(self, doc1, doc2):
x1 = self.get_features([doc1], max_length=self.max_length, tree_truncate=True)
x2 = self.get_features([doc2], max_length=self.max_length, tree_truncate=True)
x1 = self.get_features([doc1], max_length=self.max_length)
x2 = self.get_features([doc2], max_length=self.max_length)
scores = self.model.predict([x1, x2])
return scores[0]
return self.entailment_types[scores.argmax()], scores.max()
def get_embeddings(vocab, nr_unk=100):
nr_vector = max(lex.rank for lex in vocab) + 1
vectors = numpy.zeros((nr_vector+nr_unk+2, vocab.vectors_length), dtype='float32')
# the extra +1 is for a zero vector representing sentence-final padding
num_vectors = max(lex.rank for lex in vocab) + 2
# create random vectors for OOV tokens
oov = np.random.normal(size=(nr_unk, vocab.vectors_length))
oov = oov / oov.sum(axis=1, keepdims=True)
vectors = np.zeros((num_vectors + nr_unk, vocab.vectors_length), dtype='float32')
vectors[1:(nr_unk + 1), ] = oov
for lex in vocab:
if lex.has_vector:
vectors[lex.rank+1] = lex.vector / lex.vector_norm
if lex.has_vector and lex.vector_norm > 0:
vectors[nr_unk + lex.rank + 1] = lex.vector / lex.vector_norm
return vectors
def get_word_ids(docs, rnn_encode=False, tree_truncate=False, max_length=100, nr_unk=100):
Xs = numpy.zeros((len(docs), max_length), dtype='int32')
def get_word_ids(docs, max_length=100, nr_unk=100):
Xs = np.zeros((len(docs), max_length), dtype='int32')
for i, doc in enumerate(docs):
if tree_truncate:
if isinstance(doc, Span):
queue = [doc.root]
else:
queue = [sent.root for sent in doc.sents]
else:
queue = list(doc)
words = []
while len(words) <= max_length and queue:
word = queue.pop(0)
if rnn_encode or (not word.is_punct and not word.is_space):
words.append(word)
if tree_truncate:
queue.extend(list(word.lefts))
queue.extend(list(word.rights))
words.sort()
for j, token in enumerate(words):
if token.has_vector:
Xs[i, j] = token.rank+1
else:
Xs[i, j] = (token.shape % (nr_unk-1))+2
j += 1
if j >= max_length:
for j, token in enumerate(doc):
if j == max_length:
break
if token.has_vector:
Xs[i, j] = token.rank + nr_unk + 1
else:
Xs[i, len(words)] = 1
Xs[i, j] = token.rank % nr_unk + 1
return Xs
def create_similarity_pipeline(nlp, max_length=100):
return [
nlp.tagger,
nlp.entity,
nlp.parser,
KerasSimilarityShim.load(nlp.path / 'similarity', nlp, max_length)
]

examples/notebooks/Decompositional Attention.ipynb

@ -0,0 +1,955 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Natural language inference using spaCy and Keras"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook details an implementation of the natural language inference model presented in [(Parikh et al, 2016)](https://arxiv.org/abs/1606.01933). The model is notable for the small number of paramaters *and hyperparameters* it specifices, while still yielding good performance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Constructing the dataset"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import spacy\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We only need the GloVe vectors from spaCy, not a full NLP pipeline."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"nlp = spacy.load('en_vectors_web_lg')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Function to load the SNLI dataset. The categories are converted to one-shot representation. The function comes from an example in spaCy."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/jds/tensorflow-gpu/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
" from ._conv import register_converters as _register_converters\n",
"Using TensorFlow backend.\n"
]
}
],
"source": [
"import ujson as json\n",
"from keras.utils import to_categorical\n",
"\n",
"LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}\n",
"def read_snli(path):\n",
" texts1 = []\n",
" texts2 = []\n",
" labels = []\n",
" with open(path, 'r') as file_:\n",
" for line in file_:\n",
" eg = json.loads(line)\n",
" label = eg['gold_label']\n",
" if label == '-': # per Parikh, ignore - SNLI entries\n",
" continue\n",
" texts1.append(eg['sentence1'])\n",
" texts2.append(eg['sentence2'])\n",
" labels.append(LABELS[label])\n",
" return texts1, texts2, to_categorical(np.asarray(labels, dtype='int32'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because Keras can do the train/test split for us, we'll load *all* SNLI triples from one file."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"texts,hypotheses,labels = read_snli('snli/snli_1.0_train.jsonl')"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def create_dataset(nlp, texts, hypotheses, num_oov, max_length, norm_vectors = True):\n",
" sents = texts + hypotheses\n",
" \n",
" # the extra +1 is for a zero vector represting NULL for padding\n",
" num_vectors = max(lex.rank for lex in nlp.vocab) + 2 \n",
" \n",
" # create random vectors for OOV tokens\n",
" oov = np.random.normal(size=(num_oov, nlp.vocab.vectors_length))\n",
" oov = oov / oov.sum(axis=1, keepdims=True)\n",
" \n",
" vectors = np.zeros((num_vectors + num_oov, nlp.vocab.vectors_length), dtype='float32')\n",
" vectors[num_vectors:, ] = oov\n",
" for lex in nlp.vocab:\n",
" if lex.has_vector and lex.vector_norm > 0:\n",
" vectors[lex.rank + 1] = lex.vector / lex.vector_norm if norm_vectors == True else lex.vector\n",
" \n",
" sents_as_ids = []\n",
" for sent in sents:\n",
" doc = nlp(sent)\n",
" word_ids = []\n",
" \n",
" for i, token in enumerate(doc):\n",
" # skip odd spaces from tokenizer\n",
" if token.has_vector and token.vector_norm == 0:\n",
" continue\n",
" \n",
" if i > max_length:\n",
" break\n",
" \n",
" if token.has_vector:\n",
" word_ids.append(token.rank + 1)\n",
" else:\n",
" # if we don't have a vector, pick an OOV entry\n",
" word_ids.append(token.rank % num_oov + num_vectors) \n",
" \n",
" # there must be a simpler way of generating padded arrays from lists...\n",
" word_id_vec = np.zeros((max_length), dtype='int')\n",
" clipped_len = min(max_length, len(word_ids))\n",
" word_id_vec[:clipped_len] = word_ids[:clipped_len]\n",
" sents_as_ids.append(word_id_vec)\n",
" \n",
" \n",
" return vectors, np.array(sents_as_ids[:len(texts)]), np.array(sents_as_ids[len(texts):])"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"sem_vectors, text_vectors, hypothesis_vectors = create_dataset(nlp, texts, hypotheses, 100, 50, True)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"texts_test,hypotheses_test,labels_test = read_snli('snli/snli_1.0_test.jsonl')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"_, text_vectors_test, hypothesis_vectors_test = create_dataset(nlp, texts_test, hypotheses_test, 100, 50, True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use spaCy to tokenize the sentences and return, when available, a semantic vector for each token. \n",
"\n",
"OOV terms (tokens for which no semantic vector is available) are assigned to one of a set of randomly-generated OOV vectors, per (Parikh et al, 2016).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we will clip sentences to 50 words maximum."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"from keras import layers, Model, models\n",
"from keras import backend as K"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Building the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The embedding layer copies the 300-dimensional GloVe vectors into GPU memory. Per (Parikh et al, 2016), the vectors, which are not adapted during training, are projected down to lower-dimensional vectors using a trained projection matrix."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"def create_embedding(vectors, max_length, projected_dim):\n",
" return models.Sequential([\n",
" layers.Embedding(\n",
" vectors.shape[0],\n",
" vectors.shape[1],\n",
" input_length=max_length,\n",
" weights=[vectors],\n",
" trainable=False),\n",
" \n",
" layers.TimeDistributed(\n",
" layers.Dense(projected_dim,\n",
" activation=None,\n",
" use_bias=False))\n",
" ])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Parikh model makes use of three feedforward blocks that construct nonlinear combinations of their input. Each block contains two ReLU layers and two dropout layers."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"def create_feedforward(num_units=200, activation='relu', dropout_rate=0.2):\n",
" return models.Sequential([\n",
" layers.Dense(num_units, activation=activation),\n",
" layers.Dropout(dropout_rate),\n",
" layers.Dense(num_units, activation=activation),\n",
" layers.Dropout(dropout_rate)\n",
" ])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The basic idea of the (Parikh et al, 2016) model is to:\n",
"\n",
"1. *Align*: Construct an alignment of subphrases in the text and hypothesis using an attention-like mechanism, called \"decompositional\" because the layer is applied to each of the two sentences individually rather than to their product. The dot product of the nonlinear transformations of the inputs is then normalized vertically and horizontally to yield a pair of \"soft\" alignment structures, from text->hypothesis and hypothesis->text. Concretely, for each word in one sentence, a multinomial distribution is computed over the words of the other sentence, by learning a multinomial logistic with softmax target.\n",
"2. *Compare*: Each word is now compared to its aligned phrase using a function modeled as a two-layer feedforward ReLU network. The output is a high-dimensional representation of the strength of association between word and aligned phrase.\n",
"3. *Aggregate*: The comparison vectors are summed, separately, for the text and the hypothesis. The result is two vectors: one that describes the degree of association of the text to the hypothesis, and the second, of the hypothesis to the text.\n",
"4. Finally, these two vectors are processed by a dense layer followed by a softmax classifier, as usual.\n",
"\n",
"Note that because in entailment the truth conditions of the consequent must be a subset of those of the antecedent, it is not obvious that we need both vectors in step (3). Entailment is not symmetric. It may be enough to just use the hypothesis->text vector. We will explore this possibility later."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need a couple of little functions for Lambda layers to normalize and aggregate weights:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"def normalizer(axis):\n",
" def _normalize(att_weights):\n",
" exp_weights = K.exp(att_weights)\n",
" sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)\n",
" return exp_weights/sum_weights\n",
" return _normalize\n",
"\n",
"def sum_word(x):\n",
" return K.sum(x, axis=1)\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"def build_model(vectors, max_length, num_hidden, num_classes, projected_dim, entail_dir='both'):\n",
" input1 = layers.Input(shape=(max_length,), dtype='int32', name='words1')\n",
" input2 = layers.Input(shape=(max_length,), dtype='int32', name='words2')\n",
" \n",
" # embeddings (projected)\n",
" embed = create_embedding(vectors, max_length, projected_dim)\n",
" \n",
" a = embed(input1)\n",
" b = embed(input2)\n",
" \n",
" # step 1: attend\n",
" F = create_feedforward(num_hidden)\n",
" att_weights = layers.dot([F(a), F(b)], axes=-1)\n",
" \n",
" G = create_feedforward(num_hidden)\n",
" \n",
" if entail_dir == 'both':\n",
" norm_weights_a = layers.Lambda(normalizer(1))(att_weights)\n",
" norm_weights_b = layers.Lambda(normalizer(2))(att_weights)\n",
" alpha = layers.dot([norm_weights_a, a], axes=1)\n",
" beta = layers.dot([norm_weights_b, b], axes=1)\n",
"\n",
" # step 2: compare\n",
" comp1 = layers.concatenate([a, beta])\n",
" comp2 = layers.concatenate([b, alpha])\n",
" v1 = layers.TimeDistributed(G)(comp1)\n",
" v2 = layers.TimeDistributed(G)(comp2)\n",
"\n",
" # step 3: aggregate\n",
" v1_sum = layers.Lambda(sum_word)(v1)\n",
" v2_sum = layers.Lambda(sum_word)(v2)\n",
" concat = layers.concatenate([v1_sum, v2_sum])\n",
" elif entail_dir == 'left':\n",
" norm_weights_a = layers.Lambda(normalizer(1))(att_weights)\n",
" alpha = layers.dot([norm_weights_a, a], axes=1)\n",
" comp2 = layers.concatenate([b, alpha])\n",
" v2 = layers.TimeDistributed(G)(comp2)\n",
" v2_sum = layers.Lambda(sum_word)(v2)\n",
" concat = v2_sum\n",
" else:\n",
" norm_weights_b = layers.Lambda(normalizer(2))(att_weights)\n",
" beta = layers.dot([norm_weights_b, b], axes=1)\n",
" comp1 = layers.concatenate([a, beta])\n",
" v1 = layers.TimeDistributed(G)(comp1)\n",
" v1_sum = layers.Lambda(sum_word)(v1)\n",
" concat = v1_sum\n",
" \n",
" H = create_feedforward(num_hidden)\n",
" out = H(concat)\n",
" out = layers.Dense(num_classes, activation='softmax')(out)\n",
" \n",
" model = Model([input1, input2], out)\n",
" \n",
" model.compile(optimizer='adam',\n",
" loss='categorical_crossentropy',\n",
" metrics=['accuracy'])\n",
" return model\n",
" \n",
" \n",
" "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"__________________________________________________________________________________________________\n",
"Layer (type) Output Shape Param # Connected to \n",
"==================================================================================================\n",
"words1 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"words2 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"sequential_1 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
" words2[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_2 (Sequential) (None, 50, 200) 80400 sequential_1[1][0] \n",
" sequential_1[2][0] \n",
"__________________________________________________________________________________________________\n",
"dot_1 (Dot) (None, 50, 50) 0 sequential_2[1][0] \n",
" sequential_2[2][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_2 (Lambda) (None, 50, 50) 0 dot_1[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_1 (Lambda) (None, 50, 50) 0 dot_1[0][0] \n",
"__________________________________________________________________________________________________\n",
"dot_3 (Dot) (None, 50, 200) 0 lambda_2[0][0] \n",
" sequential_1[2][0] \n",
"__________________________________________________________________________________________________\n",
"dot_2 (Dot) (None, 50, 200) 0 lambda_1[0][0] \n",
" sequential_1[1][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_1 (Concatenate) (None, 50, 400) 0 sequential_1[1][0] \n",
" dot_3[0][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_2 (Concatenate) (None, 50, 400) 0 sequential_1[2][0] \n",
" dot_2[0][0] \n",
"__________________________________________________________________________________________________\n",
"time_distributed_2 (TimeDistrib (None, 50, 200) 120400 concatenate_1[0][0] \n",
"__________________________________________________________________________________________________\n",
"time_distributed_3 (TimeDistrib (None, 50, 200) 120400 concatenate_2[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_3 (Lambda) (None, 200) 0 time_distributed_2[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_4 (Lambda) (None, 200) 0 time_distributed_3[0][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_3 (Concatenate) (None, 400) 0 lambda_3[0][0] \n",
" lambda_4[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_4 (Sequential) (None, 200) 120400 concatenate_3[0][0] \n",
"__________________________________________________________________________________________________\n",
"dense_8 (Dense) (None, 3) 603 sequential_4[1][0] \n",
"==================================================================================================\n",
"Total params: 321,703,403\n",
"Trainable params: 381,803\n",
"Non-trainable params: 321,321,600\n",
"__________________________________________________________________________________________________\n"
]
}
],
"source": [
"K.clear_session()\n",
"m = build_model(sem_vectors, 50, 200, 3, 200)\n",
"m.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The number of trainable parameters, ~381k, is the number given by Parikh et al, so we're on the right track."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parikh et al use tiny batches of 4, training for 50MM batches, which amounts to around 500 epochs. Here we'll use large batches to better use the GPU, and train for fewer epochs -- for purposes of this experiment."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train on 549367 samples, validate on 9824 samples\n",
"Epoch 1/50\n",
"549367/549367 [==============================] - 34s 62us/step - loss: 0.7599 - acc: 0.6617 - val_loss: 0.5396 - val_acc: 0.7861\n",
"Epoch 2/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.5611 - acc: 0.7763 - val_loss: 0.4892 - val_acc: 0.8085\n",
"Epoch 3/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.5212 - acc: 0.7948 - val_loss: 0.4574 - val_acc: 0.8261\n",
"Epoch 4/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4986 - acc: 0.8045 - val_loss: 0.4410 - val_acc: 0.8274\n",
"Epoch 5/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4819 - acc: 0.8114 - val_loss: 0.4224 - val_acc: 0.8383\n",
"Epoch 6/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4714 - acc: 0.8166 - val_loss: 0.4200 - val_acc: 0.8379\n",
"Epoch 7/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4633 - acc: 0.8203 - val_loss: 0.4098 - val_acc: 0.8457\n",
"Epoch 8/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4558 - acc: 0.8232 - val_loss: 0.4114 - val_acc: 0.8415\n",
"Epoch 9/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4508 - acc: 0.8250 - val_loss: 0.4062 - val_acc: 0.8477\n",
"Epoch 10/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4433 - acc: 0.8286 - val_loss: 0.3982 - val_acc: 0.8486\n",
"Epoch 11/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4388 - acc: 0.8307 - val_loss: 0.3953 - val_acc: 0.8497\n",
"Epoch 12/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4351 - acc: 0.8321 - val_loss: 0.3973 - val_acc: 0.8522\n",
"Epoch 13/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4309 - acc: 0.8342 - val_loss: 0.3939 - val_acc: 0.8539\n",
"Epoch 14/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4269 - acc: 0.8355 - val_loss: 0.3932 - val_acc: 0.8517\n",
"Epoch 15/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4247 - acc: 0.8369 - val_loss: 0.3938 - val_acc: 0.8515\n",
"Epoch 16/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4208 - acc: 0.8379 - val_loss: 0.3936 - val_acc: 0.8504\n",
"Epoch 17/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4194 - acc: 0.8390 - val_loss: 0.3885 - val_acc: 0.8560\n",
"Epoch 18/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4162 - acc: 0.8402 - val_loss: 0.3874 - val_acc: 0.8561\n",
"Epoch 19/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4140 - acc: 0.8409 - val_loss: 0.3889 - val_acc: 0.8545\n",
"Epoch 20/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4114 - acc: 0.8426 - val_loss: 0.3864 - val_acc: 0.8583\n",
"Epoch 21/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4092 - acc: 0.8430 - val_loss: 0.3870 - val_acc: 0.8561\n",
"Epoch 22/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4062 - acc: 0.8442 - val_loss: 0.3852 - val_acc: 0.8577\n",
"Epoch 23/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4050 - acc: 0.8450 - val_loss: 0.3850 - val_acc: 0.8578\n",
"Epoch 24/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4035 - acc: 0.8455 - val_loss: 0.3825 - val_acc: 0.8555\n",
"Epoch 25/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4018 - acc: 0.8460 - val_loss: 0.3837 - val_acc: 0.8573\n",
"Epoch 26/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3989 - acc: 0.8476 - val_loss: 0.3843 - val_acc: 0.8599\n",
"Epoch 27/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3979 - acc: 0.8481 - val_loss: 0.3841 - val_acc: 0.8589\n",
"Epoch 28/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3967 - acc: 0.8484 - val_loss: 0.3811 - val_acc: 0.8575\n",
"Epoch 29/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3956 - acc: 0.8492 - val_loss: 0.3829 - val_acc: 0.8589\n",
"Epoch 30/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3938 - acc: 0.8499 - val_loss: 0.3859 - val_acc: 0.8562\n",
"Epoch 31/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3925 - acc: 0.8500 - val_loss: 0.3798 - val_acc: 0.8587\n",
"Epoch 32/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3906 - acc: 0.8509 - val_loss: 0.3834 - val_acc: 0.8569\n",
"Epoch 33/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3893 - acc: 0.8511 - val_loss: 0.3806 - val_acc: 0.8588\n",
"Epoch 34/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3885 - acc: 0.8515 - val_loss: 0.3828 - val_acc: 0.8603\n",
"Epoch 35/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3879 - acc: 0.8520 - val_loss: 0.3800 - val_acc: 0.8594\n",
"Epoch 36/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3860 - acc: 0.8530 - val_loss: 0.3796 - val_acc: 0.8577\n",
"Epoch 37/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3856 - acc: 0.8532 - val_loss: 0.3857 - val_acc: 0.8591\n",
"Epoch 38/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3838 - acc: 0.8535 - val_loss: 0.3835 - val_acc: 0.8603\n",
"Epoch 39/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3830 - acc: 0.8543 - val_loss: 0.3830 - val_acc: 0.8599\n",
"Epoch 40/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3818 - acc: 0.8548 - val_loss: 0.3832 - val_acc: 0.8559\n",
"Epoch 41/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3806 - acc: 0.8551 - val_loss: 0.3845 - val_acc: 0.8553\n",
"Epoch 42/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3803 - acc: 0.8550 - val_loss: 0.3789 - val_acc: 0.8617\n",
"Epoch 43/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3791 - acc: 0.8556 - val_loss: 0.3835 - val_acc: 0.8580\n",
"Epoch 44/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3778 - acc: 0.8565 - val_loss: 0.3799 - val_acc: 0.8580\n",
"Epoch 45/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3766 - acc: 0.8571 - val_loss: 0.3790 - val_acc: 0.8625\n",
"Epoch 46/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3770 - acc: 0.8569 - val_loss: 0.3820 - val_acc: 0.8590\n",
"Epoch 47/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3761 - acc: 0.8573 - val_loss: 0.3831 - val_acc: 0.8581\n",
"Epoch 48/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3739 - acc: 0.8579 - val_loss: 0.3828 - val_acc: 0.8599\n",
"Epoch 49/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3738 - acc: 0.8577 - val_loss: 0.3785 - val_acc: 0.8590\n",
"Epoch 50/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3726 - acc: 0.8580 - val_loss: 0.3820 - val_acc: 0.8585\n"
]
},
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x7f5c9f49c438>"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50,validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is broadly in the region reported by Parikh et al: ~86 vs 86.3%. The small difference might be accounted by differences in `max_length` (here set at 50), in the training regime, and that here we use Keras' built-in validation splitting rather than the SNLI test set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Experiment: the asymmetric model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It was suggested earlier that, based on the semantics of entailment, the vector representing the strength of association between the hypothesis to the text is all that is needed for classifying the entailment.\n",
"\n",
"The following model removes consideration of the complementary vector (text to hypothesis) from the computation. This will decrease the paramater count slightly, because the final dense layers will be smaller, and speed up the forward pass when predicting, because fewer calculations will be needed."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"__________________________________________________________________________________________________\n",
"Layer (type) Output Shape Param # Connected to \n",
"==================================================================================================\n",
"words2 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"words1 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"sequential_5 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
" words2[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_6 (Sequential) (None, 50, 200) 80400 sequential_5[1][0] \n",
" sequential_5[2][0] \n",
"__________________________________________________________________________________________________\n",
"dot_4 (Dot) (None, 50, 50) 0 sequential_6[1][0] \n",
" sequential_6[2][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_5 (Lambda) (None, 50, 50) 0 dot_4[0][0] \n",
"__________________________________________________________________________________________________\n",
"dot_5 (Dot) (None, 50, 200) 0 lambda_5[0][0] \n",
" sequential_5[1][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_4 (Concatenate) (None, 50, 400) 0 sequential_5[2][0] \n",
" dot_5[0][0] \n",
"__________________________________________________________________________________________________\n",
"time_distributed_5 (TimeDistrib (None, 50, 200) 120400 concatenate_4[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_6 (Lambda) (None, 200) 0 time_distributed_5[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_8 (Sequential) (None, 200) 80400 lambda_6[0][0] \n",
"__________________________________________________________________________________________________\n",
"dense_16 (Dense) (None, 3) 603 sequential_8[1][0] \n",
"==================================================================================================\n",
"Total params: 321,663,403\n",
"Trainable params: 341,803\n",
"Non-trainable params: 321,321,600\n",
"__________________________________________________________________________________________________\n"
]
}
],
"source": [
"m1 = build_model(sem_vectors, 50, 200, 3, 200, 'left')\n",
"m1.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parameter count has indeed decreased by 40,000, corresponding to the 200x200 smaller H function."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train on 549367 samples, validate on 9824 samples\n",
"Epoch 1/50\n",
"549367/549367 [==============================] - 25s 46us/step - loss: 0.7331 - acc: 0.6770 - val_loss: 0.5257 - val_acc: 0.7936\n",
"Epoch 2/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.5518 - acc: 0.7799 - val_loss: 0.4717 - val_acc: 0.8159\n",
"Epoch 3/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.5147 - acc: 0.7967 - val_loss: 0.4449 - val_acc: 0.8278\n",
"Epoch 4/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4948 - acc: 0.8060 - val_loss: 0.4326 - val_acc: 0.8344\n",
"Epoch 5/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4814 - acc: 0.8122 - val_loss: 0.4247 - val_acc: 0.8359\n",
"Epoch 6/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4712 - acc: 0.8162 - val_loss: 0.4143 - val_acc: 0.8430\n",
"Epoch 7/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4635 - acc: 0.8205 - val_loss: 0.4172 - val_acc: 0.8401\n",
"Epoch 8/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4570 - acc: 0.8223 - val_loss: 0.4106 - val_acc: 0.8422\n",
"Epoch 9/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4505 - acc: 0.8259 - val_loss: 0.4043 - val_acc: 0.8451\n",
"Epoch 10/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4459 - acc: 0.8280 - val_loss: 0.4050 - val_acc: 0.8467\n",
"Epoch 11/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4405 - acc: 0.8300 - val_loss: 0.3975 - val_acc: 0.8481\n",
"Epoch 12/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4360 - acc: 0.8324 - val_loss: 0.4026 - val_acc: 0.8496\n",
"Epoch 13/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4327 - acc: 0.8334 - val_loss: 0.4024 - val_acc: 0.8471\n",
"Epoch 14/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4293 - acc: 0.8350 - val_loss: 0.3955 - val_acc: 0.8496\n",
"Epoch 15/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4263 - acc: 0.8369 - val_loss: 0.3980 - val_acc: 0.8490\n",
"Epoch 16/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4236 - acc: 0.8377 - val_loss: 0.3958 - val_acc: 0.8496\n",
"Epoch 17/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4213 - acc: 0.8384 - val_loss: 0.3954 - val_acc: 0.8496\n",
"Epoch 18/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4187 - acc: 0.8394 - val_loss: 0.3929 - val_acc: 0.8514\n",
"Epoch 19/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4157 - acc: 0.8409 - val_loss: 0.3939 - val_acc: 0.8507\n",
"Epoch 20/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4135 - acc: 0.8417 - val_loss: 0.3953 - val_acc: 0.8522\n",
"Epoch 21/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4122 - acc: 0.8424 - val_loss: 0.3974 - val_acc: 0.8506\n",
"Epoch 22/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4099 - acc: 0.8435 - val_loss: 0.3918 - val_acc: 0.8522\n",
"Epoch 23/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4075 - acc: 0.8443 - val_loss: 0.3901 - val_acc: 0.8513\n",
"Epoch 24/50\n",
"549367/549367 [==============================] - 24s 44us/step - loss: 0.4067 - acc: 0.8447 - val_loss: 0.3885 - val_acc: 0.8543\n",
"Epoch 25/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4047 - acc: 0.8454 - val_loss: 0.3846 - val_acc: 0.8531\n",
"Epoch 26/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4031 - acc: 0.8461 - val_loss: 0.3864 - val_acc: 0.8562\n",
"Epoch 27/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4020 - acc: 0.8467 - val_loss: 0.3874 - val_acc: 0.8546\n",
"Epoch 28/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4001 - acc: 0.8473 - val_loss: 0.3848 - val_acc: 0.8534\n",
"Epoch 29/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3991 - acc: 0.8479 - val_loss: 0.3865 - val_acc: 0.8562\n",
"Epoch 30/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3976 - acc: 0.8484 - val_loss: 0.3833 - val_acc: 0.8574\n",
"Epoch 31/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3961 - acc: 0.8487 - val_loss: 0.3846 - val_acc: 0.8585\n",
"Epoch 32/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3942 - acc: 0.8498 - val_loss: 0.3805 - val_acc: 0.8573\n",
"Epoch 33/50\n",
"549367/549367 [==============================] - 24s 44us/step - loss: 0.3935 - acc: 0.8503 - val_loss: 0.3856 - val_acc: 0.8579\n",
"Epoch 34/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3923 - acc: 0.8507 - val_loss: 0.3829 - val_acc: 0.8560\n",
"Epoch 35/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3920 - acc: 0.8508 - val_loss: 0.3864 - val_acc: 0.8575\n",
"Epoch 36/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3907 - acc: 0.8516 - val_loss: 0.3873 - val_acc: 0.8563\n",
"Epoch 37/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3891 - acc: 0.8519 - val_loss: 0.3850 - val_acc: 0.8570\n",
"Epoch 38/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3872 - acc: 0.8522 - val_loss: 0.3815 - val_acc: 0.8591\n",
"Epoch 39/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3887 - acc: 0.8520 - val_loss: 0.3829 - val_acc: 0.8590\n",
"Epoch 40/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3868 - acc: 0.8531 - val_loss: 0.3807 - val_acc: 0.8600\n",
"Epoch 41/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3859 - acc: 0.8537 - val_loss: 0.3832 - val_acc: 0.8574\n",
"Epoch 42/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3849 - acc: 0.8537 - val_loss: 0.3850 - val_acc: 0.8576\n",
"Epoch 43/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3834 - acc: 0.8541 - val_loss: 0.3825 - val_acc: 0.8563\n",
"Epoch 44/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3829 - acc: 0.8548 - val_loss: 0.3844 - val_acc: 0.8540\n",
"Epoch 45/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3816 - acc: 0.8552 - val_loss: 0.3841 - val_acc: 0.8559\n",
"Epoch 46/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3816 - acc: 0.8549 - val_loss: 0.3880 - val_acc: 0.8567\n",
"Epoch 47/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.3799 - acc: 0.8559 - val_loss: 0.3767 - val_acc: 0.8635\n",
"Epoch 48/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3800 - acc: 0.8560 - val_loss: 0.3786 - val_acc: 0.8563\n",
"Epoch 49/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3781 - acc: 0.8563 - val_loss: 0.3812 - val_acc: 0.8596\n",
"Epoch 50/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3788 - acc: 0.8560 - val_loss: 0.3782 - val_acc: 0.8601\n"
]
},
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x7f5ca1bf3e48>"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m1.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50,validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This model performs the same as the slightly more complex model that evaluates alignments in both directions. Note also that processing time is improved, from 64 down to 48 microseconds per step. \n",
"\n",
"Let's now look at an asymmetric model that evaluates text to hypothesis comparisons. The prediction is that such a model will correctly classify a decent proportion of the exemplars, but not as accurately as the previous two.\n",
"\n",
"We'll just use 10 epochs for expediency."
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"__________________________________________________________________________________________________\n",
"Layer (type) Output Shape Param # Connected to \n",
"==================================================================================================\n",
"words1 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"words2 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"sequential_13 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
" words2[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_14 (Sequential) (None, 50, 200) 80400 sequential_13[1][0] \n",
" sequential_13[2][0] \n",
"__________________________________________________________________________________________________\n",
"dot_8 (Dot) (None, 50, 50) 0 sequential_14[1][0] \n",
" sequential_14[2][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_9 (Lambda) (None, 50, 50) 0 dot_8[0][0] \n",
"__________________________________________________________________________________________________\n",
"dot_9 (Dot) (None, 50, 200) 0 lambda_9[0][0] \n",
" sequential_13[2][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_6 (Concatenate) (None, 50, 400) 0 sequential_13[1][0] \n",
" dot_9[0][0] \n",
"__________________________________________________________________________________________________\n",
"time_distributed_9 (TimeDistrib (None, 50, 200) 120400 concatenate_6[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_10 (Lambda) (None, 200) 0 time_distributed_9[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_16 (Sequential) (None, 200) 80400 lambda_10[0][0] \n",
"__________________________________________________________________________________________________\n",
"dense_32 (Dense) (None, 3) 603 sequential_16[1][0] \n",
"==================================================================================================\n",
"Total params: 321,663,403\n",
"Trainable params: 341,803\n",
"Non-trainable params: 321,321,600\n",
"__________________________________________________________________________________________________\n"
]
}
],
"source": [
"m2 = build_model(sem_vectors, 50, 200, 3, 200, 'right')\n",
"m2.summary()"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train on 455226 samples, validate on 113807 samples\n",
"Epoch 1/10\n",
"455226/455226 [==============================] - 22s 49us/step - loss: 0.8920 - acc: 0.5771 - val_loss: 0.8001 - val_acc: 0.6435\n",
"Epoch 2/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7808 - acc: 0.6553 - val_loss: 0.7267 - val_acc: 0.6855\n",
"Epoch 3/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7329 - acc: 0.6825 - val_loss: 0.6966 - val_acc: 0.7006\n",
"Epoch 4/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7055 - acc: 0.6978 - val_loss: 0.6713 - val_acc: 0.7150\n",
"Epoch 5/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6862 - acc: 0.7081 - val_loss: 0.6533 - val_acc: 0.7253\n",
"Epoch 6/10\n",
"455226/455226 [==============================] - 21s 47us/step - loss: 0.6694 - acc: 0.7179 - val_loss: 0.6472 - val_acc: 0.7277\n",
"Epoch 7/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6555 - acc: 0.7252 - val_loss: 0.6338 - val_acc: 0.7347\n",
"Epoch 8/10\n",
"455226/455226 [==============================] - 22s 48us/step - loss: 0.6434 - acc: 0.7310 - val_loss: 0.6246 - val_acc: 0.7385\n",
"Epoch 9/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6325 - acc: 0.7367 - val_loss: 0.6164 - val_acc: 0.7424\n",
"Epoch 10/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6216 - acc: 0.7426 - val_loss: 0.6082 - val_acc: 0.7478\n"
]
},
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x7fa6850cf080>"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m2.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=10,validation_split=.2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Comparing this fit to the validation accuracy of the previous two models after 10 epochs, we observe that its accuracy is roughly 10% lower.\n",
"\n",
"It is reassuring that the neural modeling here reproduces what we know from the semantics of natural language!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@ -6,8 +6,10 @@ from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .lemmatizer import LOOKUP
from .tag_map import TAG_MAP
from .norm_exceptions import NORM_EXCEPTIONS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
@ -17,13 +19,14 @@ from ...util import update_exc, add_lookups
class PortugueseDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'pt'
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS)
lex_attr_getters.update(LEX_ATTRS)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
lemma_lookup = LOOKUP
tag_map = TAG_MAP
infixes = TOKENIZER_INFIXES
prefixes = TOKENIZER_PREFIXES
class Portuguese(Language):
lang = 'pt'

View File

@ -23,7 +23,7 @@ _ordinal_words = ['primeiro', 'segundo', 'terceiro', 'quarto', 'quinto', 'sexto'
def like_num(text):
text = text.replace(',', '').replace('.', '')
text = text.replace(',', '').replace('.', '').replace('º','').replace('ª','')
if text.isdigit():
return True
if text.count('/') == 1:

View File

@ -0,0 +1,23 @@
# coding: utf8
from __future__ import unicode_literals
# These exceptions are used to add NORM values based on a token's ORTH value.
# Individual languages can also add their own exceptions and overwrite them -
# for example, British vs. American spelling in English.
# Norms are only set if no alternative is provided in the tokenizer exceptions.
# Note that this does not change any other token attributes. Its main purpose
# is to normalise the word representations so that equivalent tokens receive
# similar representations. For example: $ and € are very different, but they're
# both currency symbols. By normalising currency symbols to $, all symbols are
# seen as similar, no matter how common they are in the training data.
NORM_EXCEPTIONS = {
"R$": "$", # Real
"r$": "$", # Real
"Cz$": "$", # Cruzado
"cz$": "$", # Cruzado
"NCz$": "$", # Cruzado Novo
"ncz$": "$" # Cruzado Novo
}
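
For context, here is a minimal usage sketch (hypothetical sentence; assumes spaCy v2.x with the Portuguese language data from this commit) of how these norm exceptions are meant to surface: the currency token's `ORTH` stays as written, while its `NORM` collapses to `$`.

```python
# coding: utf8
from __future__ import unicode_literals

from spacy.lang.pt import Portuguese

nlp = Portuguese()
doc = nlp(u'O livro custa R$ 30.')
# the R$ token should report a normalised form of '$',
# while its surface text is left unchanged
print([(t.orth_, t.norm_) for t in doc])
```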

View File

@ -0,0 +1,18 @@
# coding: utf8
from __future__ import unicode_literals
from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES
from ..punctuation import TOKENIZER_SUFFIXES as BASE_TOKENIZER_SUFFIXES
from ..punctuation import TOKENIZER_INFIXES as BASE_TOKENIZER_INFIXES
_prefixes = ([r'\w{1,3}\$'] + BASE_TOKENIZER_PREFIXES)
_suffixes = (BASE_TOKENIZER_SUFFIXES)
_infixes = ([r'(\w+-\w+(-\w+)*)'] +
BASE_TOKENIZER_INFIXES
)
TOKENIZER_PREFIXES = _prefixes
TOKENIZER_SUFFIXES = _suffixes
TOKENIZER_INFIXES = _infixes

View File

@ -3,67 +3,66 @@ from __future__ import unicode_literals
STOP_WORDS = set("""
à às acerca adeus agora ainda algo algumas alguns ali além ambas ambos ano
anos antes ao aos apenas apoio apoia apontar após aquela aquelas aquele aqueles
aqui aquilo área as assim através atrás até
à às área acerca ademais adeus agora ainda algo algumas alguns ali além ambas ambos antes
ao aos apenas apoia apoio apontar após aquela aquelas aquele aqueles aqui aquilo
as assim através atrás até
baixo bastante bem boa bom breve
cada caminho catorze cedo cento certamente certeza cima cinco coisa com como
comprido comprida conhecida conhecido conselho contra corrente custa
comprida comprido conhecida conhecido conselho contra contudo corrente cuja
cujo custa
da daquela daquele dar das de debaixo demais dentro depois desde desligada
desligado dessa desse desta deste deve devem deverá dez dezanove dezasseis
dezassete dezoito dia diante direita diz dizem dizer do dois dos doze duas
dão dúvida
da daquela daquele dar das de debaixo demais dentro depois des desde dessa desse
desta deste deve devem deverá dez dezanove dezasseis dezassete dezoito diante
direita disso diz dizem dizer do dois dos doze duas dão
é ela elas ele eles em embora enquanto entre então era és essa essas esse esses
esta estado estar estará estas estava este estes esteve estive estivemos
estiveram estiveste estivestes estou está estás estão eu exemplo
é és ela elas ele eles em embora enquanto entre então era essa essas esse esses esta
estado estar estará estas estava este estes esteve estive estivemos estiveram
estiveste estivestes estou está estás estão eu eventual exemplo
falta fará favor faz fazeis fazem fazemos fazer fazes fazia faço fez fim final
foi fomos for fora foram forma foste fostes fui
geral grande grandes grupo
hoje horas
inclusive iniciar inicio ir irá isso isto
iniciar inicio ir irá isso isto
lado ligado local logo longe lugar
lado lhe ligado local logo longe lugar
maior maioria maiorias mais mal mas me meio menor menos meses mesmo meu meus
mil minha minhas momento muito muitos máximo mês
maior maioria maiorias mais mal mas me meio menor menos meses mesmo meu meus mil
minha minhas momento muito muitos máximo mês
na nada naquela naquele nas nem nenhuma nessa nesse nesta neste no noite nome
nos nossa nossas nosso nossos nova novas nove novo novos num numa nunca nuns
não nível nós número números
na nada naquela naquele nas nem nenhuma nessa nesse nesta neste no nos nossa
nossas nosso nossos nova novas nove novo novos num numa nunca nuns não nível nós
número números
obra obrigada obrigado oitava oitavo oito onde ontem onze os ou outra outras
outro outros
obrigada obrigado oitava oitavo oito onde ontem onze ora os ou outra outras outros
para parece parte partir pegar pela pelas pelo pelos perto pessoas pode podem
poder poderá podia ponto pontos por porque porquê posição possivelmente posso
possível pouca pouco povo primeira primeiro próprio próxima próximo puderam pôde
põe põem
para parece parte partir pegar pela pelas pelo pelos perto pode podem poder poderá
podia pois ponto pontos por porquanto porque porquê portanto porém posição
possivelmente posso possível pouca pouco povo primeira primeiro próprio próxima
próximo puderam pôde põe põem
qual qualquer quando quanto quarta quarto quatro que quem quer querem quero
quais qual qualquer quando quanto quarta quarto quatro que quem quer querem quero
questão quieta quieto quinta quinto quinze quê
relação
sabe saber se segunda segundo sei seis sem sempre ser seria sete seu seus sexta
sexto sim sistema sob sobre sois somente somos sou sua suas são sétima sétimo
sexto sim sistema sob sobre sois somente somos sou sua suas são sétima sétimo
tal talvez também tanta tanto tarde te tem temos tempo tendes tenho tens tentar
tentaram tente tentei ter terceira terceiro teu teus teve tipo tive tivemos
tiveram tiveste tivestes toda todas todo todos trabalhar trabalho treze três tu
tua tuas tudo tão têm
tais tal talvez também tanta tanto tarde te tem temos tempo tendes tenho tens
tentar tentaram tente tentei ter terceira terceiro teu teus teve tipo tive
tivemos tiveram tiveste tivestes toda todas todo todos treze três tu tua tuas
tudo tão têm
último um uma umas uns usa usar
um uma umas uns usa usar último
vai vais valor veja vem vens ver verdade verdadeira verdadeiro vez vezes viagem
vinda vindo vinte você vocês vos vossa vossas vosso vossos vários vão vêm vós
vai vais valor veja vem vens ver vez vezes vinda vindo vinte você vocês vos vossa
vossas vosso vossos vários vão vêm vós
zero
""".split())

View File

@ -67,7 +67,7 @@ for orth in _per_pron + _dem_pron + _und_pron:
for orth in [
"Adm.", "Dr.", "e.g.", "E.g.", "E.G.", "Gen.", "Gov.", "i.e.", "I.e.",
"I.E.", "Jr.", "Ltd.", "p.m.", "Ph.D.", "Rep.", "Rev.", "Sen.", "Sr.",
"Sra.", "vs."]:
"Sra.", "vs.", "tel.", "pág.", "pag."]:
_exc[orth] = [{ORTH: orth}]

View File

@ -6,7 +6,7 @@ p
| but somewhat ugly in Python. Logic that deals with Python or platform
| compatibility only lives in #[code spacy.compat]. To distinguish them from
| the builtin functions, replacement functions are suffixed with an
| undersocre, e.e #[code unicode_].
| underscore, e.e #[code unicode_].
+aside-code("Example").
from spacy.compat import unicode_, json_dumps

View File

@ -184,7 +184,7 @@
"from spacy_lookup import Entity",
"",
"nlp = spacy.load('en')",
"entity = Entity(nlp, keywords_list=['python', 'java platform'])",
"entity = Entity(keywords_list=['python', 'java platform'])",
"nlp.add_pipe(entity, last=True)",
"",
"doc = nlp(u\"I am a product manager for a java and python.\")",