CSE527: Neural Module Networks for Visual Question Answering

Harsh Trivedi
[email protected]

Punit Mehta
[email protected]


Visual question answering (VQA) is inherently compositional in nature [1]. A question like "where is the dog?" shares substructure, and hence the computation required to answer it, with "where is the cat?". The authors who proposed Neural Module Networks (NMN) showed that instead of training a huge, static, monolithic network architecture to perform a task like this, one can define elementary neural computation modules that are composed into a computation graph specific to each question. These modules share parameters and are jointly trained end to end. At test time as well, a computation graph consisting of neural modules is dynamically constructed based on the syntactic parse tree of the question. This leads to complex parameter tying at many levels, which is quite non-trivial to implement, let alone train well.

We have implemented a Neural Module Network (NMN) for VQA [1] and a simple LSTM [2] based baseline to compare the results. The entire code (3K+ lines) is available at https://github.com/HarshTrivedi/nmn-pytorch with detailed function-level documentation and thorough instructions on how to reproduce the results. The demo notebook loading our trained model (on GPU) and predicting visual outputs is available at https:

1. Introduction

Given an image and a question about the image (e.g. "where are the sunglasses?"), we intend to answer the question based on the visual information present in the image. Questions like "where is the dog?" share their substructure, and hence the computation required to answer them, with "where is the cat?". Naively, for both questions one needs to find where the cat / dog is present in the image and describe that location to get an answer like "on the couch" or perhaps "in the park". The NMN approach leverages this overlapping linguistic substructure and attempts to replace the monolithic network (static for all questions) with a network that is dynamically assembled from a collection of jointly-learned modules. These modules are computational units determined from the linguistic structure of the sentence.

At the beginning, each question is analyzed using the Stanford parser. This determines the composition of computational units (attention, classification, etc.) required to answer the question, including soft unit operations like finding the subject, finding what is around it, describing it, and so on. Using a deterministic process, we then assemble these computational units into a tree-like computation graph that encodes the hierarchical operations necessary to compute the answer. Outside this tree-like computation graph sits a simple LSTM-based question encoder. The encoded question representation is composed with the dynamically generated computation graph at the pre-final layer. Finally, the last layer predicts over the answer labels and gives a distribution of scores using a softmax activation.
Figure 1 shows an overview of the architecture. Section 2 describes the implemented approach in detail: subsection 2.1 covers the individual architectures of the neural modules, subsection 2.2 describes how a textual question is converted to an assembled layout of modules, subsection 2.3 shows how we deal with noisy layout prediction, and subsection 2.4 briefly discusses training details. Section 3 presents our experiments on the VQA dataset and demo results from the accompanying software package. Section 4 discusses the challenges faced and the learning outcomes of this project, and concludes.


2. Approach

Each training instance for VQA is a 3-tuple (w, x, y), where w is the question text, x is the image and y is the answer. We want a model that encodes the probability distribution p(y | w, x; θ), where θ are the parameters of the model. The model is fully specified by a collection of modules m with corresponding module parameters θ_m. Given a question w, a network layout predictor P(w) determines the composition of module units for that question.
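As an illustration of this dynamic assembly, a nested layout can be evaluated recursively. The sketch below is our own simplification, not the released code: the module functions are dummies and the layout encoding as nested tuples is an assumption.

```python
import torch

# Layouts are nested tuples such as ("describe", "where", ("find", "dog")).
D, H, W = 512, 14, 14          # image feature shape used in this project

def find(x_vis, word):
    # Dummy FIND: collapse channels into a single unnormalized attention map.
    return x_vis.mean(dim=0, keepdim=True)        # (1, H, W)

def describe(x_vis, word, attention):
    # Dummy DESCRIBE: attention-weighted pooling, then a fake label score vector.
    pooled = (x_vis * attention).sum(dim=(1, 2))  # (D,)
    return pooled[:10]                            # pretend 10 answer labels

MODULES = {"find": find, "describe": describe}

def assemble(layout, x_vis):
    """Recursively evaluate a (module_name, word, *children) layout tree."""
    name, word, *children = layout
    outputs = [assemble(child, x_vis) for child in children]
    return MODULES[name](x_vis, word, *outputs)

x_vis = torch.randn(D, H, W)
scores = assemble(("describe", "where", ("find", "dog")), x_vis)
```

In the real model each dictionary entry is a parameterized `nn.Module` whose weights are shared across all questions that use it.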

2.1. Modules

We need to select an inventory of modules sufficient to represent the questions in the task. Different module functions have different input domains and output ranges, but all of them operate on the following basic data types: image features, unnormalized attentions and answer labels. We can think of this end-to-end model as an attention model in which lower modules learn to pass messages, in the form of attention, to the modules above them, so as to jointly learn to predict the correct answer.

We have limited our inventory of modules to FIND, TRANSFORM, AND, OR and DESCRIBE, described below. These types refer to network structure, but there can be many instantiations of each module. For example, FIND[dog] returns a heat map (unnormalized attention) of where the dog is present in the image, and FIND[cat] finds the heat map for the cat. The network parameters of all occurrences of FIND[x] in any question are tied. Here cat and dog are the lexicon vectors used to instantiate these FIND modules. These vector embeddings are trained and tied parametrically as well. This complex dynamic parameter tying at the network and embedding level, across all questions, has the end effect that each module learns to perform its task locally well even though the entire model is trained end to end. This parameter tying also warrants on-the-fly composition of the network from modules at prediction time, since each module has learned to behave locally so as to give a good result globally with its tree companions!

Let x_vis be the image features, x_txt the lexicon vector, a_out the attention output, ⊙ the element-wise multiplication operator, and sum the operator that sums its argument over the spatial dimensions. The module network types are as follows.


FIND : Image → Attention

FIND[table] looks at the image features and the lexicon embedding of table to generate an unnormalized attention map of where the table is in the image.

a_out = conv2( conv1(x_vis) ⊙ W x_txt )

Here, W x_txt is a one-layer feed-forward mapping of the lexicon feature to the map dimension. conv1 is a 1×1 convolution that reduces the depth (channels) of the image features to the map dimension. W x_txt and conv1(x_vis) are then composed with an element-wise product. conv2 is again a 1×1 convolution, but it reduces the number of channels to 1, which yields the unnormalized attention map.
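Under our reading of the equation above, FIND can be sketched in PyTorch as follows; the map dimension and input sizes here are illustrative assumptions, not the exact values from the paper.

```python
import torch
import torch.nn as nn

class Find(nn.Module):
    """Sketch of FIND: a_out = conv2(conv1(x_vis) * W(x_txt))."""
    def __init__(self, image_dim=512, text_dim=300, map_dim=500):
        super().__init__()
        self.conv1 = nn.Conv2d(image_dim, map_dim, kernel_size=1)  # depth -> map_dim
        self.W = nn.Linear(text_dim, map_dim)                      # lexicon -> map_dim
        self.conv2 = nn.Conv2d(map_dim, 1, kernel_size=1)          # map_dim -> 1 channel

    def forward(self, x_vis, x_txt):
        # x_vis: (B, image_dim, H, W), x_txt: (B, text_dim)
        mapped_img = self.conv1(x_vis)                    # (B, map_dim, H, W)
        mapped_txt = self.W(x_txt)[:, :, None, None]      # (B, map_dim, 1, 1), broadcast
        return self.conv2(mapped_img * mapped_txt)        # (B, 1, H, W) attention map

find = Find()
a_out = find(torch.randn(1, 512, 14, 14), torch.randn(1, 300))
```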

TRANSFORM : Image × Attention → Attention

TRANSFORM[above] looks at the attention map passed up from the previous layer, the image features and the lexicon embedding of above, and transforms the attention as required by the vector of above. Other examples are TRANSFORM[below], TRANSFORM[out-of], TRANSFORM[on-top-of], etc.

a_out = conv2( conv1(x_vis) ⊙ W1 sum(a_in ⊙ x_vis) ⊙ W2 x_txt )

Here, W2 x_txt is a one-layer feed-forward mapping of the lexicon feature to the map dimension. a_in ⊙ x_vis is a spatially attended read of the image features, which is summed over the spatial dimensions to get a vector representation; this is mapped to the map dimension by a one-layer feed-forward network (W1). The two vector representations, the attended image representation W1 sum(a_in ⊙ x_vis) and the text representation W2 x_txt, are composed with an element-wise product. The result is then multiplied element-wise with the image feature representation at each spatial coordinate, and conv2 reduces the channeled output to a depth-1 attention map.
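Following the same conventions (and the same assumed sizes) as the FIND sketch, TRANSFORM can be written as:

```python
import torch
import torch.nn as nn

class Transform(nn.Module):
    """Sketch: a_out = conv2(conv1(x_vis) * W1 sum(a_in * x_vis) * W2 x_txt)."""
    def __init__(self, image_dim=512, text_dim=300, map_dim=500):
        super().__init__()
        self.conv1 = nn.Conv2d(image_dim, map_dim, kernel_size=1)
        self.W1 = nn.Linear(image_dim, map_dim)   # attended image vector -> map_dim
        self.W2 = nn.Linear(text_dim, map_dim)    # lexicon vector -> map_dim
        self.conv2 = nn.Conv2d(map_dim, 1, kernel_size=1)

    def forward(self, x_vis, a_in, x_txt):
        # x_vis: (B, image_dim, H, W), a_in: (B, 1, H, W), x_txt: (B, text_dim)
        attended = (a_in * x_vis).sum(dim=(2, 3))            # attended read: (B, image_dim)
        gate = self.W1(attended) * self.W2(x_txt)            # (B, map_dim)
        mapped = self.conv1(x_vis) * gate[:, :, None, None]  # (B, map_dim, H, W)
        return self.conv2(mapped)                            # (B, 1, H, W)

transform = Transform()
a_out = transform(torch.randn(1, 512, 14, 14),
                  torch.randn(1, 1, 14, 14),
                  torch.randn(1, 300))
```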

AND : Attention × Attention → Attention

AND looks at the two attention maps passed to it from below and generates an attention map where both of them hold. For example, AND( FIND[cat], FIND[dog] ) returns an unnormalized attention map of where both a cat and a dog are present. AND does not take a lexicon vector for instantiation.

a_out = minimum(a1, a2)
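Since AND reduces to an element-wise minimum (and OR, below, to the analogous element-wise maximum), these two modules need no parameters at all:

```python
import torch

# Two toy unnormalized attention maps.
a1 = torch.tensor([[0.2, 0.9], [0.5, -0.1]])
a2 = torch.tensor([[0.4, 0.1], [0.5, 0.7]])

a_and = torch.minimum(a1, a2)  # high only where both inputs are high
a_or = torch.maximum(a1, a2)   # high where either input is high
```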

OR : Attention × Attention → Attention

Similar to the AND module, OR maps the two attention maps from below to one attention map where either holds. For example, OR( FIND[cat], FIND[dog] ) returns an unnormalized attention map of where either a cat or a dog is present. OR does not take a lexicon vector for instantiation.

a_out = maximum(a1, a2)

DESCRIBE : Image × Attention → AnswerLabel

The DESCRIBE module is the topmost module in the computation tree. It looks at the final attention map, generated and transformed at the lower levels, together with the image features, and predicts a score distribution over the answer labels. For example, DESCRIBE[where]( FIND[laptop] ) takes the attention map for laptop and the lexicon vector of where and uses them to predict answer scores, presumably with a high score for an answer like "table".

y = W1^T ( W2 sum(a ⊙ x_vis) ⊙ W3 x_txt )

Here, W2 sum(a ⊙ x_vis) ⊙ W3 x_txt is similar to the TRANSFORM module, and W1 is a feed-forward layer mapping to the answer label vocabulary.

Finally, below are some examples of questions and their layouts:

• What is above the table?
DESCRIBE[what]( TRANSFORM[above]( FIND[table] ) )

• Where is the laptop?
DESCRIBE[where]( FIND[laptop] )

• Where are children playing?
DESCRIBE[where]( AND( FIND[children], FIND[playing] ) )

2.2. Text to Modules

We have covered the nuts and bolts of the individual modules. What remains is how to determine a layout that says how to assemble these modules based on the question text.

Each question is processed with the Stanford parser [3] to obtain a dependency representation. The dependencies connected to the wh-word in the question are filtered out, and some further lightweight processing on them gives a primary query-like representation of the question. For example, "what is standing in the field?" becomes what(stand), and "Is TV on the table?" becomes is(TV, on(table)). The details of this procedure are not given in the original paper, so we use the script provided by the authors for this primary transformation. We then need to convert these primary layouts to module layouts consisting of the module operations described above: for example, where(car) becomes DESCRIBE[where]( FIND[car] ), and is(TV, on(table)) becomes DESCRIBE[is]( TRANSFORM[on]( FIND[table] ) ). This transformation is done using prior knowledge about the module domains and ranges: a DESCRIBE module should be at the root and FIND modules at the leaves, while the middle ones can be AND, OR or TRANSFORM, nested as needed in between.

2.3. Overcoming Oversimplification

By transforming the question text to a very coarse layout of the required computation, we often discard much of the important information in the question. Layout generation is somewhat noisy, and sometimes important cues are missing from the generated layout. For example, "where are the children playing in the park" might get the noisy layout DESCRIBE[where]( FIND[children] ), discarding the park information, which is not as expected. Hence, we use an LSTM encoder to encode the question into a vector representation. This representation is composed (element-wise product) with the final representation vector generated by the DESCRIBE module, just before the softmax layer. We use GloVe embeddings (size 300) [4] for the word sequences fed to the LSTM; a simple LSTM with 1024 hidden units is used. It is important to note that the word embeddings here are different from the label embeddings used to instantiate the modules: word embeddings for the LSTM-based question encoding are initialized with GloVe and are non-trainable, while label embeddings for the modules are randomly initialized and trainable by back-propagation.
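The composition of the LSTM question encoding with the pre-softmax module output can be sketched as below. The 300-d embeddings and 1024 hidden units are the sizes stated above; the vocabulary size and projection dimension are our assumptions, and in the real model the embedding matrix would be initialized from GloVe rather than at random.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """LSTM question encoder whose output gates the DESCRIBE representation."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=1024, out_dim=500):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.embed.weight.requires_grad = False   # frozen (GloVe-initialized in practice)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, out_dim)

    def forward(self, token_ids):                 # token_ids: (B, T)
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return self.proj(h_n[-1])                 # final hidden state -> (B, out_dim)

encoder = QuestionEncoder()
q_vec = encoder(torch.randint(0, 10000, (1, 6)))  # a 6-token question
describe_vec = torch.randn(1, 500)                # stand-in pre-softmax DESCRIBE output
fused = q_vec * describe_vec                      # element-wise composition before softmax
```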

2.4. Training Details

The dynamic module network, along with the LSTM, is jointly trained end-to-end. The parameters of the model described above are trained with the AdaDelta optimizer with standard momentum and decay rate. The batch size is 1, since each question datum has a different computation graph determined by its corresponding layout. The data is shuffled for each epoch, which is quite important for the optimizer to converge. We build the answer label vocabulary from the training data and treat answering as a classification problem, using the standard cross-entropy loss. Each question is paired with 10 hand-annotated answers (possibly repeating); we select the one with the highest confidence and use that answer for training.
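A single training step under this setup might look as follows. The stand-in linear model replaces the per-question module network purely for illustration; the answer index and dimensions are arbitrary.

```python
import torch
import torch.nn as nn

# Batch size 1: one question-datum (and hence one computation graph) per step.
model = nn.Linear(500, 2000)                  # stand-in: pre-softmax features -> 2000 labels
optimizer = torch.optim.Adadelta(model.parameters())
criterion = nn.CrossEntropyLoss()             # standard cross-entropy over answer labels

features = torch.randn(1, 500)                # features for a single question-datum
answer = torch.tensor([42])                   # index of the highest-confidence annotation

optimizer.zero_grad()
loss = criterion(model(features), answer)
loss.backward()
optimizer.step()
```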

All images were preprocessed with a forward pass of the trained Oxford VGGNet [5], using the Caffe config and model files published with VGGNet. We do a forward pass up to the conv5 layer of VGG in order to map each image to a compact 512×14×14 representation. These features are normalized on the training set, and the training-set mean / standard deviation are used to normalize the validation and test sets to maintain the distribution. Finally, these input image features are used everywhere in the modules described above.
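The normalization scheme (training-set statistics reused for validation and test) can be sketched as below; the random tensors stand in for the actual VGG conv5 features.

```python
import torch

# Stand-ins for precomputed 512x14x14 VGG conv5 features.
train_feats = torch.randn(100, 512, 14, 14)
val_feats = torch.randn(20, 512, 14, 14)

# Statistics come from the training features only...
mean = train_feats.mean()
std = train_feats.std()

# ...and are reused for validation / test so all splits share one distribution.
train_norm = (train_feats - mean) / std
val_norm = (val_feats - mean) / std
```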

3. Experiments and Results

We use the VQA dataset, built on MSCOCO images, for our experiments. It has more than 200,000 images, each coupled with 3 questions, and each question has 10 human-labeled annotations. This is an extremely large dataset, which is perhaps necessary for a neural model to work well. However, due to constrained computational resources, we scoped the problem down to a specific question type: we consider only "where" questions, giving about 13,924 questions and 1,302 images. Annotations for the test data are not released by the challenge organizers, so we make a random split of the validation set into two parts, yielding a train:val:test distribution of 4:1:1. It is important to note that even on an Nvidia P100 GPU, it took about a day to finish training our models. The lack of processing power for such a huge dataset over enough time puts our model at a severe disadvantage compared to the model of the original authors, and hence direct comparisons cannot be made. Accuracy evaluation is done using the standard script provided by the task organizers.

Table 1. Accuracies: Top1 uses the standard evaluation provided by the VQA challenge organizers; Top3 considers the top 3 predictions.

Figure 3 shows the loss curve of our training. It follows the standard pattern in which the training loss keeps decreasing while the validation loss starts increasing, hinting at over-fitting. We run for 50 epochs, but the model is selected automatically based on the best validation performance across the epochs.

Figure 1. Train vs Validation Loss plot

The results are documented in Table 1. Top1 is the standard evaluation used by the VQA challenge and was computed using the script provided by the task organizers. For the Top3 evaluation, we tweaked the script to consider the top 3 predictions instead of just 1. It should be noted that with random model weights (before epoch 0 of training), the accuracy is 0.009%; clearly, predicting randomly over an answer vocabulary of 2000 labels has a < 0.01% chance of being correct. In this regard, our model learns quite well. The performance is lower than state-of-the-art results, but this comparison is not fair, since we could use only about 5% of the actual data for a deep learning model, which is generally very data hungry. The original authors did not report per-question-type results, but from the results table of the task-overview paper [6] it can be clearly seen that "where" questions are much more difficult than ones like "is this a …" or "what color is …".

Baseline: We have also implemented a very simple baseline model that uses only the LSTM-based question encoder and 2 ReLU feed-forward layers to predict the answer. The comparison is shown in Table 1.
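A sketch of this baseline follows; apart from the 1024 LSTM hidden units stated above, the sizes (vocabulary, embedding, hidden and answer dimensions) are our assumptions.

```python
import torch
import torch.nn as nn

class LstmBaseline(nn.Module):
    """Text-only baseline: LSTM question encoder + 2 ReLU feed-forward layers."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=1024,
                 num_answers=2000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, 1024), nn.ReLU(),   # feed-forward layer 1
            nn.Linear(1024, num_answers),             # feed-forward layer 2 -> answers
        )

    def forward(self, token_ids):                     # token_ids: (B, T)
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return self.classifier(h_n[-1])               # (B, num_answers) answer scores

baseline = LstmBaseline()
scores = baseline(torch.randint(0, 10000, (2, 7)))    # batch of two 7-token questions
```

Note that the baseline never sees the image, which is exactly why it is a useful lower bound for the module network.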

Finally, Figures 2, 3 and 4 show the visual output predictions of our model. The Jupyter notebook is attached as a demo (and is also on GitHub) to show extended results of our model; we show the layout used for each image and the top 3 answers predicted by it.

Figure 2. Demo-1

Figure 3. Demo-2

Figure 4. Demo-3

4. Learning Outcomes and Challenges

As part of this project, we learnt many things while implementing this non-trivial dynamic neural model and battling to debug its training. The complex parameter tying present in the tree-like dynamic computation graphs (DCGs) of this model, described in section 2, was very difficult to implement from scratch. We have learnt to use many deep learning building blocks (LSTM encoders / decoders, convolutional networks, feed-forward networks) and about the flexibility of dynamic computation graphs. We have learnt to build jointly learnt module networks, a general concept that can be applied to many scenarios other than visual QA. We have learnt to use Caffe and to leverage transfer learning via VGG image representations. We leveraged a GPU to do all the work, which would otherwise have been impossible! At times we struggled to fit the data in GPU memory and hence used asynchronous, lazy background batch loading. Debugging training is always difficult when the model is this dynamic: we spent about 2 weeks just trying to figure out why the loss was not converging. We have also learnt to use GloVe word embeddings. The entire pipeline requires a large amount of preprocessing, and converting text sentences to valid module layout representations was quite time-consuming as well. We have learnt many specifics of PyTorch, which now allow us to experiment with the crazy neural models that PyTorch's DCGs support. Finally, we have a working and reproducible end model of a complex neural architecture written from scratch; this alone signifies that we faced many challenges but iteratively solved them to reach a good result.

References

[1] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016.

[2] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[3] Danqi Chen and Christopher Manning. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, 2014.

[4] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[5] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[6] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.