Visual Image Retrieval Based on Multimodal Information Fusion
Keywords:
Multimodal information fusion, Visual image retrieval, Feature extraction, Transformer model, Retrieval performance optimization
Abstract
This study proposes a multimodal information fusion approach for visual image retrieval. The model comprises three core components: a multimodal feature extraction module (MFEM), a multimodal feature fusion module (MFFM), and a unified feature retrieval module (UFRM), which together process and integrate input data from different modalities. We design a Transformer-based multimodal fusion framework that combines image and text features through multi-head self-attention and cross-modal attention mechanisms, yielding joint feature representations with enhanced expressiveness and precision. Unlike existing methods that rely on simple concatenation or weighted fusion, the proposed approach learns fine-grained inter-modal interactions, thereby improving retrieval accuracy. Experimental evaluations on three public benchmarks—FashionIQ, CIRR, and Fashion200K—show that the proposed method outperforms current state-of-the-art approaches across multiple metrics. The method exhibits robust performance in both accuracy and generalization across diverse retrieval scenarios, confirming its effectiveness for complex image retrieval tasks.
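To make the cross-modal attention idea in the abstract concrete, the following is a minimal single-head sketch in NumPy, not the paper's implementation: image features act as queries attending over text features, and the attended output is concatenated with the original image features to form a joint representation. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(img_feats, txt_feats):
    """Single-head cross-attention: image tokens (queries) attend to text tokens.

    img_feats: (n_img, d) image patch/region features
    txt_feats: (n_txt, d) text token features
    Returns:   (n_img, d) text information aggregated per image token
    """
    d = txt_feats.shape[-1]
    scores = img_feats @ txt_feats.T / np.sqrt(d)   # (n_img, n_txt) similarity
    weights = softmax(scores, axis=-1)               # rows sum to 1
    return weights @ txt_feats                       # attended text features

# Toy example: 4 image tokens and 6 text tokens, embedding dim 8.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = rng.normal(size=(6, 8))

# Joint representation: original image features concatenated with
# the cross-modally attended text features.
fused = np.concatenate([img, cross_modal_attention(img, txt)], axis=-1)
print(fused.shape)  # (4, 16)
```

In a full Transformer-based fusion module, this block would use learned query/key/value projections, multiple heads, and stacked layers; the sketch keeps only the attention-then-fuse pattern the abstract describes.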
License
Copyright (c) 2025 Journal of Intelligence Technology and Innovation

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.