Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion

Authors

  • Peiyuan Chen, Oregon State University
  • Zecheng Zhang
  • Yiping Dong
  • Li Zhou
  • Han Wang

Keywords

Visual Question Answering, Rank VQA, Faster R-CNN, BERT, Multimodal Fusion, Ranking Learning, Hybrid Training Strategy

Abstract

Visual Question Answering (VQA) is a challenging task that requires systems to provide accurate answers to questions based on image content. Current VQA models struggle with complex questions due to limitations in capturing and integrating multimodal information effectively. To address these challenges, we propose the Rank VQA model, which leverages a ranking-inspired hybrid training strategy to enhance VQA performance. The Rank VQA model integrates high-quality visual features extracted using the Faster R-CNN model and rich semantic text features obtained from a pre-trained BERT model. These features are fused through a multimodal fusion technique employing multi-head self-attention mechanisms. Additionally, a ranking learning module is incorporated to optimize the relative ranking of answers, thus improving answer accuracy. The hybrid training strategy combines classification and ranking losses, enhancing the model's generalization ability and robustness across diverse datasets. Experimental results demonstrate that Rank VQA significantly outperforms existing state-of-the-art models on standard VQA datasets, achieving an accuracy of 71.5% and a Mean Reciprocal Rank (MRR) of 0.75 on the VQA v2.0 dataset, and an accuracy of 72.3% and an MRR of 0.76 on the COCO-QA dataset. The main contribution of this work is a ranking-based hybrid training strategy that enhances the model's ability to handle complex questions by effectively integrating high-quality visual and semantic text features, improving VQA performance and laying the groundwork for further multimodal learning research.
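The abstract names two technical ingredients: a multi-head self-attention fusion of Faster R-CNN region features with BERT question features, and a hybrid objective combining a classification loss with a ranking loss. The PyTorch sketch below illustrates one plausible shape for these components under those assumptions; the class names, function signatures, and hyperparameters (`MultimodalFusion`, `hybrid_loss`, `margin`, `alpha`) are illustrative and not taken from the paper.

```python
# Illustrative sketch of the two components described in the abstract.
# All names and hyperparameters here are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalFusion(nn.Module):
    """Fuse image region features (e.g. from Faster R-CNN) with question
    token features (e.g. from BERT) via multi-head self-attention over
    the concatenated token sequence."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_feats, text_feats):
        # visual_feats: (B, R, dim) region features; text_feats: (B, T, dim)
        tokens = torch.cat([visual_feats, text_feats], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)  # joint self-attention
        return fused.mean(dim=1)                      # (B, dim) pooled summary

def hybrid_loss(logits, labels, margin=0.2, alpha=0.5):
    """Combine cross-entropy over candidate answers with a margin ranking
    term that pushes the ground-truth answer's score above the highest
    scoring incorrect answer."""
    ce = F.cross_entropy(logits, labels)
    pos = logits.gather(1, labels.unsqueeze(1)).squeeze(1)        # gold score
    masked = logits.scatter(1, labels.unsqueeze(1), float("-inf"))
    neg = masked.max(dim=1).values                                # hardest negative
    rank = F.relu(margin - (pos - neg)).mean()                    # hinge on the gap
    return alpha * ce + (1 - alpha) * rank
```

In a training loop of this shape, `hybrid_loss` would replace a plain cross-entropy objective, with `alpha` balancing answer classification against the pairwise ranking term that the abstract credits for the accuracy and MRR gains.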

Published

2024-10-21

How to Cite

Chen, P., Zhang, Z., Dong, Y., Zhou, L., & Wang, H. (2024). Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion. Journal of Intelligence Technology and Innovation, 2(3), 19-46. https://itip-submit.com/index.php/JITI/article/view/65