
CSE527: Neural Module Networks for Visual Question Answering

Harsh Trivedi ([email protected])
Punit Mehta ([email protected])

Abstract

Visual question answering (VQA) is inherently compositional in nature [1]. A question like "where is the dog" shares its substructure, and thus much of the computation required to answer it, with "where is the cat?". The authors who proposed Neural Module Networks (NMN) showed that instead of training a huge, static, monolithic network architecture for such a task, one can define elementary neural computation modules that are composed into a computation graph specific to each question. These modules share parameters and are jointly trained end to end. At test time as well, a computation graph of neural modules is dynamically constructed based on the syntactic parse tree of the question. This leads to very complex parameter tying at many levels, which is quite non-trivial to implement, let alone to train well. We have implemented the Neural Module Network (NMN) [1] for VQA and a simple LSTM-based [2] baseline to compare the results. The entire code (3K+ lines) is available at https://github.com/HarshTrivedi/nmn-pytorch with detailed function-level documentation and thorough instructions on how to reproduce the results. A demo notebook that loads our trained model (on GPU) and predicts visual outputs is available at https://github.com/HarshTrivedi/nmn-pytorch/blob/master/visualize_model.ipynb.

1. Introduction

Given an image and a question about the image (e.g. "where are the sunglasses"), we intend to answer the question based on the visual information present in the image. A question like "where is the dog" shares its substructure, and thus the computation required to answer it, with "where is the cat?". Naively, for both questions one needs to find where the cat / dog is in the image and describe that location to get an answer like "on the couch" or maybe "in the park". The NMN approach leverages this overlapping linguistic substructure and replaces the monolithic network (static for all questions) with a network which is dynamically assembled from a collection of jointly-learned modules. These modules are computational units determined from the linguistic structure of the sentence.

At the beginning, each question is analyzed using the Stanford parser. The parse is used to determine the composition of computational units (attention, classification, etc.) required to answer the question. These include soft unit operations like finding the subject, finding what is around it, describing it, and so on.

Using a deterministic process, we then assemble these computational units into a tree-like computation graph that encodes the hierarchical operations necessary to compute the answer. Alongside this tree-like computation graph sits a simple LSTM-based question encoder. The encoded question representation is composed with the dynamically generated computation graph at the pre-final layer. Finally, the last layer predicts over the answer labels and gives a distribution of scores using a softmax activation.
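As a purely illustrative sketch of this assembly step (the function name and tuple encoding below are our own inventions, not the authors' code, and TRANSFORM cases are omitted), a primary query representation can be turned deterministically into a module layout tree with DESCRIBE at the root and FIND at the leaves:

```python
# Hypothetical stand-in for the deterministic layout assembly described above.
# A primary query like where(laptop) becomes a tree of module instantiations;
# multiple arguments are joined by AND.
def query_to_layout(head, *args):
    leaves = [("FIND", a) for a in args]
    child = leaves[0] if len(leaves) == 1 else ("AND", *leaves)
    return ("DESCRIBE", head, child)

print(query_to_layout("where", "laptop"))
# ('DESCRIBE', 'where', ('FIND', 'laptop'))
print(query_to_layout("where", "children", "playing"))
# ('DESCRIBE', 'where', ('AND', ('FIND', 'children'), ('FIND', 'playing')))
```

The real system builds such a tree of neural modules and evaluates it bottom-up, passing attention maps upward.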

Figure 1 shows an overview of the architecture. Section 2 describes the implemented approach in detail: subsection 2.1 covers the individual architectures of the neural modules, subsection 2.2 describes how a textual question is converted to an assembled layout of modules, subsection 2.3 shows how we deal with noisy layout prediction, and subsection 2.4 briefly discusses training details.

Section 3 presents our experiments on the VQA dataset and demo results from the accompanying software package. Section 4 discusses the challenges faced and the learning outcomes of this project, and concludes.

2. Approach

Each training instance for VQA is a 3-tuple (w, x, y), where w is the question text, x is the image and y is the answer. We want a model that encodes the probability distribution p(y | w, x; θ), where θ are the parameters of the model.

The model is fully specified by a collection of modules m with corresponding module parameters θ_m. Given a question w, a network layout predictor P(w) determines the composition of module units for that question.

2.1. Modules

We need to select an inventory of modules sufficient for representing the questions in the task.

Different module functions have different input domains and output ranges, but all of them operate on the following basic data types: image features, unnormalized attentions and answer labels. We can think of this end-to-end model as an attention model in which lower modules learn to pass messages, in the form of attention, to the modules above them, so that jointly they learn to predict the correct answer. We have limited our inventory of modules to FIND, TRANSFORM, AND, OR and DESCRIBE. These module types refer to network structure, but each type can have many instantiations. For example, FIND[dog] returns a heat map (unnormalized attention) of where the dog is present in the image, while FIND[cat] finds the heat map for the cat.

The network parameters of all occurrences of FIND[x], in any question, are tied. Here cat and dog are the lexicon vectors used to instantiate these FIND modules. These vector embeddings are trained, and tied parametrically, as well. This complex dynamic parameter tying at both the network and embedding level, across all questions, has the end effect that each module learns to perform its task locally well even though the entire model is trained end to end.
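To make the tying concrete, here is a hedged PyTorch sketch (a hypothetical class with made-up sizes, not the authors' code) of a single shared Find network, anticipating the FIND equation given below; only the lexicon embedding differs between FIND[dog] and FIND[cat]:

```python
import torch
import torch.nn as nn

class Find(nn.Module):
    # One Find network is shared by every FIND[x] instantiation;
    # only the lexicon embedding x_txt varies per instantiation.
    def __init__(self, num_words=100, embed_dim=64, image_channels=512, map_dim=64):
        super().__init__()
        self.lexicon = nn.Embedding(num_words, embed_dim)   # trainable lexicon vectors
        self.W = nn.Linear(embed_dim, map_dim)              # map lexicon vector to map dims
        self.conv1 = nn.Conv2d(image_channels, map_dim, 1)  # 1x1 conv: reduce image depth
        self.conv2 = nn.Conv2d(map_dim, 1, 1)               # 1x1 conv: collapse to attention

    def forward(self, x_vis, word_idx):
        # a_out = conv2(conv1(x_vis) (*) W(x_txt)), (*) = element-wise product
        txt = self.W(self.lexicon(word_idx))                # (map_dim,)
        mapped = self.conv1(x_vis)                          # (1, map_dim, H, W)
        return self.conv2(mapped * txt.view(1, -1, 1, 1))   # (1, 1, H, W) unnormalized

find = Find()
x_vis = torch.randn(1, 512, 14, 14)        # stand-in VGG conv5 features
att_dog = find(x_vis, torch.tensor(3))     # FIND[dog]: same weights,
att_cat = find(x_vis, torch.tensor(7))     # FIND[cat]: different embedding row
```

Because `find` is a single `nn.Module`, gradients from every question that uses any FIND[x] update the same convolution weights, which is exactly the tying described above.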

This parameter tying also enables on-the-fly composition of a network from modules at prediction time, since each module has learned to behave well locally so as to give a good result globally with its tree companions.

Let x_vis be the image features, x_txt the lexicon vector, a_out the attention output, ⊙ the element-wise multiplication operator, and sum the operator that sums over the spatial dimensions. The module network types are as follows.

FIND: Image → Attention

FIND[table] looks at the image features and the lexicon embedding of table to generate an unnormalized attention map of where the table is in the image.

    a_out = conv2(conv1(x_vis) ⊙ W(x_txt))

Here, W(x_txt) is a one-layer feedforward map from the lexicon feature to the map dimension, and conv1 is a 1×1 convolution that reduces the depth (channels) of the image features to the map dimension.

Finally, W(x_txt) and conv1(x_vis) are composed with an element-wise product; conv2, again a 1×1 convolution, reduces the channels to 1, which is equivalent to an unnormalized attention map.

TRANSFORM: Image × Attention → Attention

TRANSFORM[above] looks at the attention map passed to it from the previous layer, the image features, and the lexicon embedding of above, and transforms the attention as required by the vector of above. Other examples are TRANSFORM[below], TRANSFORM[out-of], TRANSFORM[on-top-of], etc.

    a_out = conv2(conv1(x_vis) ⊙ W1 sum(a_in ⊙ x_vis) ⊙ W2 x_txt)

Here, W2(x_txt) is a one-layer feedforward map from the lexicon feature to the map dimension. a_in ⊙ x_vis is a spatially attended read of the image features, which is then summed over the spatial dimensions to get a vector representation. This is mapped to the map dimension by a one-layer feedforward map (W1).

Now, the two vector representations, (1) the attended image vector representation W1 sum(a_in ⊙ x_vis) and (2) the text vector representation W2 x_txt, are composed with an element-wise product. The result is multiplied element-wise with the mapped image features conv1(x_vis) at each spatial coordinate, and conv2 maps the channeled output to an attention map by reducing the depth to 1.

AND: Attention × Attention → Attention

AND looks at the two attention maps passed to it from below and generates an attention map where both of them hold. For example, AND(FIND[cat], FIND[dog]) returns an unnormalized attention map of where a cat and a dog are both present. AND does not take a lexicon vector for instantiation.

    a_out = minimum(a1, a2)

OR: Attention × Attention → Attention

Similar to the AND module, OR maps the two attentions from below to one attention where either holds. For example, OR(FIND[cat], FIND[dog]) returns an unnormalized attention map of where either a cat or a dog is present. OR does not take a lexicon vector for instantiation.

    a_out = maximum(a1, a2)

DESCRIBE: Image × Attention → AnswerLabel

The DESCRIBE module is the topmost module in the computation tree. It looks at the final attention map, generated and transformed at the lower levels, together with the image features, and predicts a score distribution over the answer labels. For example, DESCRIBE[where](FIND[laptop]) takes the attention map for laptop and the lexicon vector of where and uses them to predict answer scores, probably with the score of table as the highest.

    y = W1^T (W2 sum(a ⊙ x_vis) ⊙ W3 x_txt)

Here, W2 sum(a ⊙ x_vis) ⊙ W3 x_txt is similar to the TRANSFORM module, and W1 is a feedforward layer mapping to the answer label vocabulary.

Finally, here are some examples of questions and their layouts:

• What is above the table?  DESCRIBE[what](TRANSFORM[above](FIND[table]))
• Where is the laptop?  DESCRIBE[where](FIND[laptop])
• Where are children playing?  DESCRIBE[where](AND(FIND[children], FIND[playing]))

2.2. Text to Modules

We have covered the nuts and bolts of the individual modules. What remains is how to determine a layout that says how to assemble these modules based on the question text. Each question is processed with the Stanford parser [3] to obtain a dependency representation. The dependencies connected to the wh-word in the question are kept, and some further lightweight processing of these dependencies gives a primary query-like representation of the question. For example, "what is standing in the field?" becomes what(stand), and "Is the TV on the table?" becomes is(TV, on(table)). The details of this procedure are not given in the original paper, so we use the script provided by the authors to do this very primary transformation. We then need to convert these primary layouts to module layouts consisting of the module operations described above, e.g. where(car) to DESCRIBE[where](FIND[car]), and is(TV, on(table)) to DESCRIBE[is](TRANSFORM[on](FIND[table])). This transformation uses prior knowledge about module domains and ranges: the DESCRIBE module should be at the root and FIND modules should be at the leaves. The middle ones can be AND, OR or TRANSFORM, which can be determined accordingly.

2.3. Overcoming Oversimplification

By transforming the question text to a very coarse layout of the required computation, we often discard much of the important information in the question. Layout generation is somewhat noisy, and sometimes important cues are not present in the generated layout. For example, "where are the children playing in the park" might get the noisy layout DESCRIBE[where](FIND[children]), discarding the park information, which is not as expected. Hence, we use an LSTM encoder to encode the question into a vector representation. This representation is composed (element-wise product) with the final representation vector generated by the DESCRIBE module, just before the softmax layer. We use GloVe embeddings (size 300) [4] for the word sequences fed to the LSTM. A simple LSTM with 1024 hidden units is used. It is important to note that the word embeddings here are different from the label embeddings that we use to instantiate the modules. Word embeddings for the LSTM-based question encoding are initialized with GloVe embeddings and are non-trainable.

The label embeddings for the modules, however, are randomly initialized and trained by back-propagation.

2.4. Training Details

The dynamic module network, along with the LSTM, is jointly trained end-to-end. The parameters of the model described above are trained with the AdaDelta optimizer with standard momentum and decay rates. The batch size is 1, since each question datum has a different computation graph determined by its corresponding layout. The data is shuffled for each epoch, which is quite important for the optimizer to converge. We build the answer label vocabulary from the training data and treat the task as a classification problem.

The loss used is standard cross-entropy loss. Each question is paired with 10 hand-annotated answers (possibly repeating). We select the one with the highest confidence and use that answer for training.
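A hedged sketch of this training setup (with a trivial stand-in model and made-up data; `torch.optim.Adadelta` is the PyTorch AdaDelta optimizer) might look like:

```python
import torch
import torch.nn as nn
from collections import Counter

# Stand-in model: the real network is assembled per question, so batch size is 1.
model = nn.Linear(8, 4)  # pretend final layer over a 4-label answer vocabulary
optimizer = torch.optim.Adadelta(model.parameters())
loss_fn = nn.CrossEntropyLoss()

def pick_training_answer(annotations):
    # Each question comes with 10 human answers (possibly repeating); one
    # interpretation of "highest confidence" is the most frequent answer.
    return Counter(annotations).most_common(1)[0][0]

answer_vocab = {"couch": 0, "park": 1, "table": 2, "floor": 3}
label = pick_training_answer(["couch"] * 6 + ["park"] * 4)

# One optimization step on a single (question, image, answer) instance.
x = torch.randn(1, 8)                        # stand-in features for one question
target = torch.tensor([answer_vocab[label]])
optimizer.zero_grad()
loss = loss_fn(model(x), target)
loss.backward()
optimizer.step()
```

Since the computation graph differs per question, instances are processed one at a time, shuffled anew every epoch.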

All the images were preprocessed with a forward pass of the trained Oxford VGGNet [5], using the Caffe config and model files published with VGGNet. We run the forward pass up to the conv5 layer of VGG, mapping each image to a compact representation of size 512×14×14. These features are normalized on the training set, and the training mean / standard deviation are used to normalize the validation and test sets so as to maintain the distribution. These input image features are used everywhere in the modules described above.

3. Experiments and Results

We use the VQA dataset, built on MSCOCO images, for our experiments. This dataset has more than 200,000 images, each coupled with 3 questions, and for each question we have 10 human-labeled annotations. This is an extremely large dataset, which is perhaps necessary for a neural model to work. However, due to constrained computational resources, we had to scope the problem down to a specific question type: we consider only "where"-type questions, giving about 13,924 questions and 1,302 images. Annotations for the test data are not released by the challenge organizers, so we randomly split the validation set into two parts, yielding a train:val:test distribution of 4:1:1.

It is important to note that even on an Nvidia P100 GPU, it took about a day to finish training our models. The lack of processing power to train on such a large dataset for long enough puts our model at a severe disadvantage compared to the model of the original authors, so direct comparisons cannot be made. Accuracy evaluation is done using the standard script provided by the task organizers.

Table 1. Accuracies. Top1 uses the standard evaluation provided by the VQA challenge organizers; Top3 considers the top 3 predictions.

Figure 1 shows the loss curve of our training. It follows the standard pattern in which the training loss keeps decreasing while the validation loss starts increasing, hinting at over-fitting. We run for 50 epochs, but the model is selected automatically based on the best validation performance across the epochs.

Figure 1. Train vs Validation Loss plot

The results are documented in Table 1. Top1 is the standard evaluation used by the VQA challenge and was computed using the script provided by the task organizers. For the Top3 evaluation, we tweaked the script to consider the top 3 predictions instead of just 1. It should be noted that with random model weights (before epoch 0 of training), the accuracy is 0.009%. Clearly, predicting uniformly at random over an answer vocabulary of 2000 labels gives at best a 0.05% (1/2000) chance of being correct.

In this regard, our model learns quite well. The performance is lower than state-of-the-art results, but this comparison is not fair, since we could use only about 5% of the actual data for a deep learning model that is generally very data-hungry. The original authors did not report per-question-type results, but from the results table of the task overview paper [6] it can clearly be seen that "where"-type questions are much more difficult than ones like "is this a ...", "what color is ..." and many others.

Baseline: We have also implemented a very simple baseline model that uses only the LSTM-based question encoder and 2 ReLU feed-forward layers to predict the answer. The comparison is shown in Table 1. Finally, Figures 2, 3 and 4 show the visual output predictions of our model.
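The baseline described above could be sketched as follows (sizes follow the report: GloVe-300 word vectors and a 1024-unit LSTM; the widths of the two ReLU feed-forward layers are our own assumption):

```python
import torch
import torch.nn as nn

class LSTMBaseline(nn.Module):
    # Question-only baseline: an LSTM encoder followed by two ReLU
    # feed-forward layers and a linear map onto the answer vocabulary.
    def __init__(self, embed_dim=300, hidden=1024, num_answers=2000):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, word_vecs):            # (batch, seq_len, embed_dim)
        _, (h, _) = self.lstm(word_vecs)
        return self.ff(h[-1])                # (batch, num_answers) answer scores

scores = LSTMBaseline()(torch.randn(1, 5, 300))  # one 5-word question
```

Note that, unlike the NMN, this baseline never looks at the image, which is part of why it trails the NMN in Table 1.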

The Jupyter notebook is attached as a demo (and is also on GitHub) to show extended results of our model. We show the layout used for each image and the top 3 answers it predicts.

                Validation        Test
    Model      Top1   Top3    Top1   Top3
    NMN       18.78  23.65   19.63  25.18
    LSTM      10.82  15.58   12.78  17.71

Figure 2. Demo-1
Figure 3. Demo-2
Figure 4. Demo-3

4. Learning Outcomes and Challenges

As part of this project, we learnt many things while implementing this non-trivial dynamic neural model and battling to debug its training.

The complex parameter tying present in the tree-like dynamic computation graphs (DCGs) of this model, as described in Section 2, was very difficult to implement from scratch. We learnt to use many deep learning components, such as LSTM encoders / decoders, convolutional networks and feed-forward networks, and about the flexibility of dynamic computation graphs. We learnt to build jointly-learned module networks, a general concept that can be applied to many scenarios beyond visual QA. We learnt to use Caffe and to leverage transfer learning via VGG image representations. We leveraged a GPU to do all the work, which would otherwise have been impossible. At times we struggled to fit the data in GPU memory, and hence used asynchronous, lazy background batch loading.

Debugging the training is always difficult when the model is this dynamic; we spent about 2 weeks just trying to figure out why the loss was not converging. We also learnt to use GloVe word embeddings.

The entire pipeline requires a substantial amount of preprocessing, and converting text sentences to valid module layout representations was quite time-consuming as well. We learnt many specifics of PyTorch, which now allow us to experiment with unusual neural models that PyTorch's DCG supports. We have a working and reproducible end model of a complex neural architecture written from scratch. This alone signifies that we faced many challenges but iteratively solved them to reach a good result.

References

[1] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016.
[2] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[3] Danqi Chen and Christopher Manning. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, 2014.
[4] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[5] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[6] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.


