Visual Image Retrieval Based on Multimodal Information Fusion
Keywords:
Multimodal information fusion, Visual image retrieval, Feature extraction, Transformer model, Retrieval performance optimization
Abstract
This study proposes a multimodal information fusion approach for visual image retrieval. The model comprises three core components: a multimodal feature extraction module (MFEM), a multimodal feature fusion module (MFFM), and a unified feature retrieval module (UFRM), which together process and integrate input data from different modalities. We design a Transformer-based multimodal fusion framework that combines image and text features through multi-head self-attention and cross-modal attention mechanisms, yielding joint feature representations with enhanced expressiveness and precision. Unlike existing methods that rely on simple concatenation or weighted fusion, the proposed approach learns fine-grained inter-modal interactions, thereby improving retrieval accuracy. Experimental evaluations on three public benchmarks—FashionIQ, CIRR, and Fashion200K—show that the proposed method outperforms current state-of-the-art approaches across multiple metrics. The method exhibits robust performance in both accuracy and generalization across diverse retrieval scenarios, confirming its effectiveness for complex image retrieval tasks.
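To make the cross-modal attention idea in the abstract concrete, the following is a minimal single-head sketch in NumPy, not the paper's implementation: image features act as queries attending over text features, and the attended output is concatenated with the original image features to form a joint representation. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(img_feats, txt_feats):
    """Single-head cross-attention: image tokens (queries) attend to text tokens.

    img_feats: (n_img, d) image patch/region features
    txt_feats: (n_txt, d) text token features
    Returns:   (n_img, d) text information aggregated per image token
    """
    d = txt_feats.shape[-1]
    scores = img_feats @ txt_feats.T / np.sqrt(d)   # (n_img, n_txt) similarity
    weights = softmax(scores, axis=-1)               # rows sum to 1
    return weights @ txt_feats                       # attended text features

# Toy example: 4 image tokens and 6 text tokens, embedding dim 8.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = rng.normal(size=(6, 8))

# Joint representation: original image features concatenated with
# the cross-modally attended text features.
fused = np.concatenate([img, cross_modal_attention(img, txt)], axis=-1)
print(fused.shape)  # (4, 16)
```

In a full Transformer-based fusion module, this block would use learned query/key/value projections, multiple heads, and stacked layers; the sketch keeps only the attention-then-fuse pattern the abstract describes.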
License
Copyright (c) 2025 Journal of Intelligence Technology and Innovation

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.