Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion
Keywords:
Visual Question Answering, Rank VQA, Faster R-CNN, BERT, Multimodal Fusion, Ranking Learning, Hybrid Training Strategy

Abstract
Visual Question Answering (VQA) is a challenging task that requires systems to answer questions accurately based on image content. Current VQA models struggle with complex questions because they capture and integrate multimodal information ineffectively. To address these challenges, we propose the Rank VQA model, which leverages a ranking-inspired hybrid training strategy to enhance VQA performance. Rank VQA integrates high-quality visual features extracted with the Faster R-CNN model and rich semantic text features obtained from a pre-trained BERT model; these features are fused through a multimodal fusion technique built on multi-head self-attention. A ranking learning module optimizes the relative ranking of candidate answers, improving answer accuracy, while the hybrid training strategy combines classification and ranking losses, strengthening the model's generalization ability and robustness across diverse datasets. Experimental results demonstrate that Rank VQA significantly outperforms existing state-of-the-art models on standard VQA datasets, achieving an accuracy of 71.5% and a Mean Reciprocal Rank (MRR) of 0.75 on the VQA v2.0 dataset, and an accuracy of 72.3% and an MRR of 0.76 on the COCO-QA dataset. The main contribution of this work is a ranking-based hybrid training strategy that enhances the model's ability to handle complex questions by effectively integrating high-quality visual and semantic text features, improving VQA performance and laying the groundwork for further multimodal learning research.
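To make the hybrid objective concrete, the following is a minimal sketch (not the paper's implementation) of how a classification loss and a pairwise margin ranking loss over answer scores might be combined. The `margin` and `alpha` hyperparameters are assumptions introduced for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def hybrid_loss(scores, correct_idx, margin=0.5, alpha=0.5):
    """Hybrid training objective sketch.

    Combines a cross-entropy classification term with a pairwise
    margin ranking term that pushes the correct answer's score above
    every other candidate by at least `margin`. `alpha` (assumed)
    weights the two terms.
    """
    probs = softmax(scores)
    ce = -np.log(probs[correct_idx])  # classification loss
    # Ranking loss: hinge on the score gap to each wrong candidate.
    gaps = scores[correct_idx] - np.delete(scores, correct_idx)
    rank = np.maximum(0.0, margin - gaps).mean()
    return alpha * ce + (1.0 - alpha) * rank
```

A model whose correct answer already dominates the candidate list incurs a small loss; one that ranks a wrong answer first is penalized by both terms, which is the intuition behind optimizing relative answer ranking alongside classification.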
License
Copyright (c) 2024 Journal of Intelligence Technology and Innovation
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.