Research on Cross-modal Image Retrieval and Image-Text Matching Based on Visual-Language Pre-trained Models

Xia Jiayue, Long Yanbin

Academic Journal of Engineering and Technology Science, 2026, 9(3); doi: 10.25236/AJETS.2026.090302.

Corresponding Author:

Long Yanbin

Affiliation(s)

University of Science and Technology Liaoning, Anshan, China

Abstract

Cross-modal image retrieval and image-text matching are core tasks connecting computer vision and natural language processing; both aim to bridge the heterogeneity gap between the visual and text modalities. Visual-language pre-trained models, which learn cross-modal alignment from large-scale data, have become the dominant technical paradigm for these tasks. This paper systematically reviews the development of visual-language pre-trained models for cross-modal retrieval and classifies existing methods along three dimensions: model architecture, pre-training objectives, and downstream adaptation. It focuses on the architectural differences between dual encoders and fusion encoders, the evolution of pre-training task design, and adaptation techniques such as parameter-efficient fine-tuning. On this basis, we summarize mainstream datasets and evaluation metrics, compare the performance of representative models, and analyze in depth three key challenges: fine-grained alignment, noise robustness, and inference efficiency. Finally, we outline future research directions, including few-shot generalization, unified multi-task frameworks, and interpretability, in the hope of providing a reference for further research in this field.
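
As a rough illustration of the dual-encoder setting mentioned in the abstract, the following minimal Python sketch ranks images for a text query by cosine similarity in a shared embedding space and scores the ranking with Recall@K. It is not taken from the paper: the embedding values are hand-made toy vectors standing in for encoder outputs, the function names (`cosine`, `rank_images`, `recall_at_k`) are hypothetical, and Recall@K is assumed here as a typical retrieval metric rather than one the abstract names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_images(text_vec, image_vecs):
    """Return image indices ordered from most to least similar to the text."""
    scores = [cosine(text_vec, img) for img in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)

def recall_at_k(text_vecs, image_vecs, k):
    """Fraction of texts whose paired image (same index) appears in the top k."""
    hits = sum(
        1 for i, text in enumerate(text_vecs)
        if i in rank_images(text, image_vecs)[:k]
    )
    return hits / len(text_vecs)

# Toy embeddings (assumption): in a real dual encoder these would come from
# separate image and text encoders trained to share one embedding space.
IMAGE_VECS = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
TEXT_VECS = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.1], [0.6, 0.0, 0.5]]

if __name__ == "__main__":
    for k in (1, 2):
        print(f"Recall@{k} = {recall_at_k(TEXT_VECS, IMAGE_VECS, k):.2f}")
```

Because each modality is embedded independently, image vectors can be computed once and reused across queries, which is the usual efficiency argument for dual encoders; a fusion encoder would instead score each image-text pair jointly.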

Keywords

Visual-Language Pre-Training; Cross-modal Retrieval; Image-Text Matching; Multimodal Learning; Feature Alignment

Cite This Paper

Xia Jiayue, Long Yanbin. Research on Cross-modal Image Retrieval and Image-Text Matching Based on Visual-Language Pre-trained Models. Academic Journal of Engineering and Technology Science (2026), Vol. 9, Issue 3: 15-21. https://doi.org/10.25236/AJETS.2026.090302.
