Visual Image Retrieval Based on Multimodal Information Fusion

Authors

Keywords:

Multimodal information fusion, Visual image retrieval, Feature extraction, Transformer model, Retrieval performance optimization

Abstract

This study proposes a multimodal information fusion approach for visual image retrieval. The model comprises three core components: a multimodal feature extraction module (MFEM), a multimodal feature fusion module (MFFM), and a unified feature retrieval module (UFRM), which together process and integrate input data from different modalities. We design a Transformer-based multimodal fusion framework that combines image and text features through multi-head self-attention and cross-modal attention mechanisms, yielding joint feature representations with enhanced expressiveness and precision. Unlike existing methods that rely on simple concatenation or weighted fusion, the proposed approach learns fine-grained inter-modal interactions, thereby improving retrieval accuracy. Experimental evaluations on three public benchmarks—FashionIQ, CIRR, and Fashion200K—show that the proposed method outperforms current state-of-the-art approaches across multiple metrics. The method exhibits robust performance in both accuracy and generalization across diverse retrieval scenarios, confirming its effectiveness for complex image retrieval tasks.
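The abstract does not provide implementation details, but the cross-modal attention underlying the fusion module can be illustrated, purely as a sketch, by scaled dot-product attention in which image features act as queries over text features. All function names, shapes, and values below are assumptions for illustration, not the paper's actual code:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_modal_attention(queries, keys, values):
    """Illustrative scaled dot-product cross-attention:
    image feature vectors (queries) attend to text feature
    vectors (keys/values); each argument is a list of
    equal-length feature vectors."""
    d = len(keys[0])  # feature dimension, used for the 1/sqrt(d) scaling
    fused = []
    for q in queries:
        # similarity of this image feature to every text feature
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # attention-weighted combination of text value vectors
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

# Hypothetical toy inputs: one image feature, two text features.
joint = cross_modal_attention(
    queries=[[1.0, 0.0]],
    keys=[[10.0, 0.0], [0.0, 10.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
)
```

In this toy example the image query aligns strongly with the first text feature, so the fused vector is dominated by the first value vector; a full model would apply this per attention head with learned projection matrices, which this sketch omits.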

Published

2025-10-31

Issue

Section

Articles

How to Cite

Visual Image Retrieval Based on Multimodal Information Fusion. (2025). Journal of Intelligence Technology and Innovation, 3(3), 53-69. https://itip-submit.com/index.php/JITI/article/view/191