Learning Representations from Explainable and Connectionist Approaches for Visual Question Answering

By Aakansha Mishra Samsung R&D Institute India-Bangalore
By Miriyala Srinivas Soumitri Samsung R&D Institute India-Bangalore
By Vikram N Rajendiran Samsung R&D Institute India-Bangalore


Visual Question Answering (VQA) [1,2] is the field of research that aims to develop methods for answering natural language questions based on the information provided in corresponding images. VQA is increasingly applied in many areas of significance, including human-machine interaction, robotics, educational and pedagogical applications, healthcare, visual aids, and the metaverse [3]. In most of the literature [4,2,5], VQA is solved as a supervised classification problem.

Most VQA methods learn to answer questions based on the statistical correlations between the multimodal inputs and the answer [6,4]. To perform multimodal reasoning, recent studies have leveraged Neural Module Networks (NMNs) [6,7] that explicitly model the multi-step reasoning process [8,5,9]. These methods parse the input question into a functional program and dynamically assemble a network of explainable neural modules to execute the program. Models that rely on superficial correlations between questions and answers are limited in making inferences or generalizations; recent studies address this issue by representing the visual information as a structured scene graph [10] or by converting the question into a program of executable neural modules.

In this work, a method is proposed to learn the importance of explainable and connectionist modules from the visual and linguistic modalities for answering. Given a natural language question, an image, and the true answer, a semantic parser converts the question into latent embeddings. Similarly, the image is converted into a scene graph through visual parsing. The scene graph is then used together with the structured parse of the question to determine the layout of the explainable modular network. The resulting embeddings are processed through a dense network to generate a prediction, which is compared with the ground truth to compute the explainable loss (LaN). In parallel, the embeddings obtained from visual and semantic parsing are also processed through a cross-modality multi-head attention transformer [15] network to predict the answer, yielding the connectionist loss (LaT). Finally, the outputs of the explainable (EEM) and connectionist (ECM) configurations are weighted by the learned question-guided attention (ad). The resulting joint embedding is used to predict the answer, giving the combined loss. The highlights of the proposed work (see Figure 1) include the following:

A novel architecture that leverages question-guided attention to combine explainable and connectionist approaches for robust visual reasoning and question answering.
A method for training and learning the contributions of the explainable and connectionist networks for better visual question answering.
A comprehensive empirical analysis on the VQAv2.0 and GQA datasets to justify the need for combining explainable and connectionist configurations for VQA.

Proposed Approach

VQA systems are trained on a comprehensive set of images I, questions Q associated with these images, and all possible answers A. In the proposed work, a model is designed to extract features from an explainable Neuro-Symbolic network and a connectionist transformer-based network. These features are then fused using weights learned from the question encoding, which determine which model should be given the higher weight. The fusion is achieved through a guided attention mechanism that selects relevant features from each network. The overall framework of the proposed method is shown in Figure 1.

Multi-modal Feature Extraction

In the proposed VQA model, a given image is represented as a set of visual features V = {vi; ∀ i = 1, …, k}, where vi ∈ R^dv is the ResNet-101 [16] feature extracted from the ith region proposal obtained from Faster-RCNN [17].

All the questions are padded or truncated to a fixed length w, and GloVe embeddings [18] are used to obtain an ordered sequence of word embeddings for each question, given by T = {ti; ∀ i = 1, …, w}, where ti ∈ R^dw. In the next section, the multi-source feature extraction is described in detail.
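The padding and truncation step can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing: the `<pad>` token, the choice to pad on the right, and the helper name `pad_or_truncate` are assumptions, and a small w is used for readability (the experiments fix w at 20 words).

```python
PAD = "<pad>"  # hypothetical padding token; the source does not name one


def pad_or_truncate(tokens, w, pad=PAD):
    """Return exactly w tokens: cut long questions, right-pad short ones."""
    return tokens[:w] + [pad] * max(0, w - len(tokens))


question = "what color is the umbrella".split()
fixed = pad_or_truncate(question, w=8)           # shorter than w: padded
clipped = pad_or_truncate(question, w=3)         # longer than w: truncated
```

Each of the w tokens would then be looked up in a GloVe table to give the ordered sequence T.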

Figure 1. Framework of the proposed model. The NS-VQA module outputs the explainable multimodal features (EEM = IN⊙TN). Cross-modal attention and self-attention output the connectionist features based on the attended features (ECM = ZV⊙ZT). The approach learns question-guided attention to weight the connectionist and conceptual modalities (ad[0], ad[1]). Here, LaN, LaT, and LaF are the losses and aN, aT, and aF are the predicted answers corresponding to the explainable (EEM), connectionist (ECM), and weighted-combination (E) features, respectively.

Multi-modal Feature Learning

In the proposed model the input features (V and T) are processed through two modules, namely, a) the explicit and explainable network [5] based on Neuro-Symbolic AI and b) a deep connectionist multi-head attention network based on transformers [19].

Neuro-Symbolic Explainable Module: The explicit and explainable representation (EEM) is obtained by parsing the input image into a scene graph. The nodes of the scene graph are the ResNet-101 embeddings of salient region proposals ({vi; ∀ i = 1, …, k}) extracted from a pre-trained Faster-RCNN. The edges between two nodes in the scene graph are a fusion of their features, E = {eij; ∀ i, j = 1, …, k}, where eij = [vi, vj]. The input question is parsed using the module program generator [5] to generate the queries from the question, which are then encoded into a vector q through an LSTM network. The module program is executed on the scene graph through attention-based meta-types to resolve the task of VQA.
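The edge construction eij = [vi, vj] amounts to concatenating the two node feature vectors. The sketch below illustrates this on toy 3-dimensional node features (the real features are ResNet-101 vectors); the function name `build_edges` and the dictionary layout are assumptions made for illustration.

```python
def build_edges(nodes):
    """Edge e_ij is the concatenation [v_i, v_j] of two node feature vectors."""
    k = len(nodes)
    return {(i, j): nodes[i] + nodes[j] for i in range(k) for j in range(k)}


# Toy scene graph with k = 3 region proposals (hypothetical values).
nodes = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]
edges = build_edges(nodes)
```

With k nodes of dimension dv, this yields k × k edge features of dimension 2·dv, since the source defines eij for all i and j.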

MHA-based Connectionist Module: Here the objective is to learn a joint embedding of the question representation T and the image features V for answer prediction through a multi-head attention (MHA) mechanism. The proposed model leverages the multi-head self-attention module of the transformer [15] architecture to encode the text features. Both feature matrices, T and V, are first projected to a common d-dimensional feature space. Let Qi, Ki and Vi be the Query, Key, and Value obtained from the projected question embedding through linear transformations with the weights WQi, WKi, and WVi, respectively, where i indexes the head in the multi-head attention. Each head is computed with the scaled dot-product attention of [15].
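For reference, the per-head computation can be written out using the standard multi-head attention formulation of [15]. The notation here is an assumption borrowed from that formulation rather than confirmed by the source: \tilde{T} denotes the projected question embedding, W^{O} the output projection, n_h the number of heads, and d_k = d / n_h the per-head dimension.

```latex
\begin{aligned}
Q_i &= \tilde{T}\, W_i^{Q}, \qquad K_i = \tilde{T}\, W_i^{K}, \qquad V_i = \tilde{T}\, W_i^{V}, \\
\mathrm{head}_i &= \mathrm{softmax}\!\left( \frac{Q_i K_i^{\top}}{\sqrt{d_k}} \right) V_i, \\
\mathrm{MHA}(\tilde{T}) &= \left[ \mathrm{head}_1 ; \dots ; \mathrm{head}_{n_h} \right] W^{O}.
\end{aligned}
```

Note that V_i here is the Value matrix of head i, not the visual feature v_i.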

The cross-modality interaction guides the question to attend to relevant region-based features of the image. Finally, the unified joint embedding from the connectionist module, ECM, is obtained by elementwise multiplication.

Multiple heads are concatenated and a linear transformation is applied. The multi-head self-attention block is followed by a feed-forward network, completing the encoder block [15], which is repeated N times. Let ZT be the feature map obtained from the multi-head self-attention-based encoder blocks applied on the text T. To incorporate cross-modality attention, ZT is used to generate the Value for the multi-head attention mechanism on the visual features V. These attention blocks are then followed by a feed-forward network, completing the cross-modality attention encoder, which is also repeated N times to generate ZV. ZV and ZT are fused through elementwise multiplication (⊙) to obtain the fused embedding ECM.
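The flow from self-attention to cross-modality attention to elementwise fusion can be sketched in a few lines of plain Python. This is a heavily simplified illustration, not the authors' network: it uses a single head, one layer (N = 1), no learned projections, and no feed-forward sublayer. The source describes the roles of queries, keys, and values only briefly, so the assignment below (question-side features as queries, image regions as keys and values) is an assumption chosen so that ZT and ZV have matching shapes.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    d = len(keys[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended.append([sum(w * v[j] for w, v in zip(weights, values))
                         for j in range(len(values[0]))])
    return attended


# Toy inputs already in a shared d = 2 space (assumption: projections omitted).
T = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # w = 3 question tokens
V = [[0.5, 0.5], [1.0, -1.0]]              # k = 2 image regions

Z_T = attention(T, T, T)      # self-attention over the question
Z_V = attention(Z_T, V, V)    # question-side queries attend to image regions
# Elementwise product of Z_T and Z_V gives the fused connectionist embedding.
E_CM = [[a * b for a, b in zip(zt, zv)] for zt, zv in zip(Z_T, Z_V)]
```

Because the queries come from the question side in this sketch, Z_V has one row per question token, which is what makes the elementwise product with Z_T well defined.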

Model Learning

Not all questions require the same kind of processing to answer. Some require complex reasoning, while others can be answered directly by looking at the image. To account for this, attention weights are learned from the text modality to decide whether a question should be answered through complex reasoning-based features or through connectionist features.

The LSTM encoding of the question is used to learn the dynamic attention (ad) over the type of feature (explainable or connectionist) required to answer the question.

Since two sources are considered for feature processing in the proposed model, ad ∈ R^2. With this attention ad, the features obtained from the two sources are combined as a weighted sum to generate the fused embedding E of the proposed model.
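A minimal sketch of this question-guided weighting is given below. Several details are assumptions, since the source does not spell them out: the gate is modelled as a single linear layer followed by a softmax, ad[0] is taken to weight the explainable embedding and ad[1] the connectionist one, and all names and values (`question_gate`, `fuse`, `W`, `b`) are hypothetical. Under these assumptions the fused embedding is E = ad[0]·EEM + ad[1]·ECM.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def question_gate(q, W, b):
    """Map a question encoding q to two weights that sum to 1."""
    logits = [sum(w_ij * q_j for w_ij, q_j in zip(row, q)) + b_i
              for row, b_i in zip(W, b)]
    return softmax(logits)


def fuse(a_d, e_em, e_cm):
    """Weighted sum of the explainable and connectionist embeddings."""
    return [a_d[0] * x + a_d[1] * y for x, y in zip(e_em, e_cm)]


q = [0.2, -0.1, 0.4]                       # toy question encoding
W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]    # toy gate weights, shape 2 x len(q)
b = [0.0, 0.0]

a_d = question_gate(q, W, b)
E = fuse(a_d, e_em=[1.0, 2.0], e_cm=[3.0, 4.0])
```

In a trained model the gate parameters would be learned jointly with the rest of the network, so the weighting can differ from question to question.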

Three embeddings are therefore used to train the model: the explainable module embedding EEM, the connectionist module embedding ECM, and the multi-source weighted embedding E. Each of these embeddings is processed through a fully connected network with sigmoid activation and binary cross-entropy loss to predict the answers (aN, aT, aF). The total loss (LT) comprises the losses from the explainable (LaN), connectionist (LaT), and fused (LaF) models, allowing end-to-end training of the network.
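The three-headed training objective can be illustrated with a small sketch. The source states that the total loss comprises the three losses but does not give their relative weighting, so the unweighted sum used here (LT = LaN + LaT + LaF) is an assumption, as are the function names and toy values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over answer labels, applied to sigmoid outputs."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(logits)


def total_loss(logits_n, logits_t, logits_f, targets):
    """Assumed unweighted sum of explainable, connectionist, and fused losses."""
    return bce(logits_n, targets) + bce(logits_t, targets) + bce(logits_f, targets)


# Toy example with 3 candidate answers; the second one is correct.
targets = [0.0, 1.0, 0.0]
loss = total_loss(
    logits_n=[-1.0, 2.0, -0.5],   # explainable head (a_N)
    logits_t=[-2.0, 1.0, 0.0],    # connectionist head (a_T)
    logits_f=[-1.5, 2.5, -1.0],   # fused head (a_F)
    targets=targets,
)
```

Only the fused head is used at inference time; the other two heads act as auxiliary supervision during training.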

During inference, the prediction is made only with the fused embedding E, which is obtained as the weighted sum of the multi-source feature embeddings.


Dataset & Experimental Details

To validate the proposed model, experiments are performed on two widely used real image-based VQA datasets: GQA [8] and VQA2.0 [20]. The balanced version of the GQA dataset is used. In this section, a comparative analysis of performance on the GQA and VQA2.0 datasets is presented. To further analyze the efficacy of the proposed model, an ablation analysis of its components is also presented. Here, dv = 2048, w = 20 words, and dw = 300. The sizes of the hidden and output layers of the LSTM are both set to 1024. Question embeddings are of size 1024. The dimension of the shared space for MHA is d = 1024, and the number of attention heads nh is 8. The number of labels (answers) for VQA2.0 and GQA are 3129 and 1842, respectively.

Comparison with State-of-the-Art Methods

The baseline methods primarily consist of attention-based and explainable models. We did not compare our results with Large Language Models that are pre-trained on vast datasets; our approach is generic and could be further improved with pre-training. The comparison is presented in Table 1 for the VQA2.0 and GQA datasets. For VQA2.0, we compared the proposed method in two settings: one starting with 36 bounding boxes (val-36) and the other starting with 100 bounding boxes (val-100). In both cases, the proposed approach performed better than the state-of-the-art baselines. On val-36, our method outperforms LXMERT (a purely connectionist approach) by approximately 0.3%, and on val-100 it outperforms XNM [5] (a purely explainable approach) by 1.82%. For the GQA dataset, results are presented for the validation and test splits, where our method shows a relative improvement of around 3.08% over the top-performing XNM [5]. The proposed method thus achieves performance comparable to or better than existing state-of-the-art methods. To the best of our knowledge, this is the first approach to combine connectionist and explainable modules for the task of VQA. The proposed method has significant scope, and its performance can be further improved by improving both the connectionist and explainable modules.

Table 1. Performance comparison on the VQA2.0 and GQA datasets.

Table 2. Effect of different components on the validation splits of VQA2.0 and GQA.

Figure 2. NS and TR show the frequency counts of α > β and α < β, respectively.

Ablation Studies

In the ablation analysis, the model was evaluated to assess the effectiveness of incorporating learnable weights for multisource features. The results are summarized in Table 2, which presents three different configurations: In row I, the model utilizes only explainable features. In row II, the model employs deep transformer-based features. In the last row, the model demonstrates the outcomes when using a weighted combination of features from multiple sources.

By incorporating learnable weights guided by the question, the model aims to extract features that are more relevant for answering the given question. This weighted combination of features allows the model to focus on the most informative aspects of the input, enhancing its ability to provide accurate answers. We can observe from Table 2 that the question-guided attention over the multisource feature helps to significantly improve the model performance.

To examine the distribution of question types in the dataset, an additional analysis was conducted. The aim was to determine the prevalence of different types of questions, such as those requiring reasoning or those related to attributes and object detection. To accomplish this, the number of samples with α > β and with α < β was counted, where α and β are the learned attention weights on the explainable (NS) and connectionist (TR) features, respectively. The underlying assumption was that datasets containing more reasoning-based questions would exhibit a higher count of samples with α > β, whereas datasets primarily composed of general deep feature-based questions would mostly have α < β. The results in Figure 2 indicate that approximately 62.04% of the total samples consist of questions related to deep features, attributes, and object presence, and the remaining 37.95% are reasoning-based questions. This is consistent with the VQA2.0 dataset, which predominantly comprises questions focused on attributes and object presence rather than reasoning. This finding suggests that the proposed model adapts its feature weighting to the type of question at hand.
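The counting behind this analysis is straightforward and can be sketched as follows. The gate values are made-up toy numbers, not results from the paper, and the function name `gate_preference_share` is hypothetical; α is taken as the weight on the explainable (NS) features and β as the weight on the connectionist (TR) features.

```python
def gate_preference_share(gates):
    """Given (alpha, beta) pairs, return the percentage of samples with
    alpha > beta (NS-dominant) and with alpha < beta (TR-dominant)."""
    n = len(gates)
    ns = sum(1 for alpha, beta in gates if alpha > beta)
    tr = sum(1 for alpha, beta in gates if alpha < beta)
    return 100.0 * ns / n, 100.0 * tr / n


# Toy per-sample gate values (hypothetical).
gates = [(0.7, 0.3), (0.2, 0.8), (0.4, 0.6), (0.1, 0.9)]
ns_share, tr_share = gate_preference_share(gates)
```

Samples with α exactly equal to β fall into neither bucket, which is one possible reason for two reported shares not summing to exactly 100%.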


While connectionist models in VQA have shown high accuracy, they often struggle with contextual understanding. To overcome this, symbolic models incorporating explainable visual reasoning through scene graphs were proposed. However, real-world VQA scenarios involve the interplay between connectionist and conceptual modalities. To address this, a framework is proposed that combines explainable visual reasoning with cross-modality-based multi-head attention mechanisms for accurate VQA performance.



[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh, “Vqa: Visual question answering,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425–2433.

[2] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh, “Making the v in vqa matter: Elevating the role of image understanding in visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6904–6913.

[3] Silvio Barra, Carmen Bisogni, Maria De Marsico, and Stefano Ricciardi, “Visual question answering: Which investigated applications?,” Pattern Recognition Letters, vol. 151, pp. 325–331, 2021.

[4] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077–6086.

[5] Jiaxin Shi, Hanwang Zhang, and Juanzi Li, “Explainable and explicit visual reasoning over scene graphs,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8376–8384.

[6] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein, “Neural module networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 39–48.

[7] Jingying Gao, Alan Blair, and Maurice Pagnucco, “A symbolic-neural reasoning model for visual question answering,” in 2023 International Joint Conference on Neural Networks (IJCNN). IEEE, 2023, pp. 1–9.

[8] Drew Hudson and Christopher D Manning, “Learning by abstraction: The neural state machine,” Advances in Neural Information Processing Systems, vol. 32, 2019.

[9] Abdulganiyu Abdu Yusuf, Feng Chong, and Mao Xianling, “An analysis of graph convolutional networks and recent datasets for visual question answering,” Artificial Intelligence Review, vol. 55, no. 8, pp. 6277–6300, 2022.

[10] Qi Wu, Chunhua Shen, Peng Wang, Anthony Dick, and Anton Van Den Hengel, “Image captioning and visual question answering based on attributes and external knowledge,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 6, pp. 1367–1381, 2017.

[11] Yuke Zhu, Joseph J Lim, and Li Fei-Fei, “Knowledge acquisition for visual question answering via iterative querying,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1154–1163.

[12] Somak Aditya, Yezhou Yang, and Chitta Baral, “Explicit reasoning over end-to-end neural architectures for visual question answering,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018, vol. 32.

[13] Jiuxiang Gu, Handong Zhao, Zhe Lin, Sheng Li, Jianfei Cai, and Mingyang Ling, “Scene graph generation with external knowledge and image reconstruction,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1969–1978.

[14] Alireza Zareian, Svebor Karaman, and Shih-Fu Chang, “Bridging knowledge graphs to generate scene graphs,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16. Springer, 2020, pp. 606–623.

[15] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.

[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[17] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.

[18] Jeffrey Pennington, Richard Socher, and Christopher Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.

[19] Hao Tan and Mohit Bansal, “Lxmert: Learning cross-modality encoder representations from transformers,” arXiv preprint arXiv:1908.07490, 2019.

[20] Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra, “Vqa: Visual question answering,” International Journal of Computer Vision, vol. 123, no. 1, pp. 4–31, May 2017.

[21] Yifeng Zhang, Ming Jiang, and Qi Zhao, “Explicit knowledge incorporation for visual reasoning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1356–1365.

[22] Liangjun Zhu, Li Peng, Weinan Zhou, and Jielong Yang, “Dual-decoder transformer network for answer grounding in visual question answering,” Pattern Recognition Letters, vol. 171, pp. 53–60, 2023.

[23] Dingbang Li, Xin Lin, Haibin Cai, and Wenzhou Chen, “Visual graph reasoning network,” in 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.

[24] Thao Minh Le, Vuong Le, Sunil Gupta, Svetha Venkatesh, and Truyen Tran, “Guiding visual question answering with attention priors,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 4381–4390.