Frontiers in Medical Science Research, 2025, 7(3); doi: 10.25236/FMSR.2025.070313.
Siqi Chen
Fuzhou University, Fuzhou, Fujian, 350100, China
Current medical image diagnosis is constrained by the subjectivity and limited throughput of manual analysis, which caps further gains in diagnostic accuracy. Multimodal medical report generation technology addresses this by constructing cross-modal models that intelligently transform medical images into structured diagnostic reports, showing breakthrough value for both diagnostic accuracy and clinical efficiency. This review focuses on three core dimensions: model architecture, dataset optimisation, and evaluation system construction. Multi-scale feature extraction based on cross-modal contrastive learning effectively captures image-text associations; an attention-guided hierarchical fusion mechanism realises dynamic interaction between radiological images and clinical data; and a retrieval-augmented generation (RAG) framework enforces the professional standardisation of reports through medical knowledge graph constraints. Despite this progress, the technology still faces clinical translation bottlenecks, including insufficient validation of model reliability, pronounced heterogeneity across multi-centre data, and stringent compliance requirements for medical-grade deployment. Looking ahead, developing 3D spatio-temporal fusion modelling methods, establishing an end-to-end diagnostic and therapeutic assessment system, constructing adaptive medical model architectures, and promoting a global multimodal data collaboration platform will accelerate the technology's transition from laboratory validation to clinical utility, ultimately advancing precision medicine for the benefit of all.
multimodal large models, medical imaging report generation, deep learning, feature extraction, cross-modal alignment
Siqi Chen. A Review of Multimodal Large Model Based Medical Image Report Generation. Frontiers in Medical Science Research (2025), Vol. 7, Issue 3: 92-100. https://doi.org/10.25236/FMSR.2025.070313.
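The abstract credits cross-modal contrastive learning with capturing image-text associations. As a minimal illustrative sketch only, the following PyTorch snippet shows the symmetric InfoNCE objective commonly used for this alignment (as in CLIP-style pre-training); the function name, embedding dimension, and temperature value are assumptions for illustration, not the implementation of any specific model surveyed here.

```python
# Illustrative sketch of cross-modal contrastive alignment (InfoNCE loss).
# Assumes image and report embeddings have already been produced by two
# encoders and projected to a shared space; all names here are hypothetical.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/report embeddings.

    image_emb, text_emb: (batch, dim) tensors. Matching image-report pairs
    lie on the diagonal of the similarity matrix; the loss pulls them
    together and pushes mismatched pairs apart in the shared space.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> report direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # report -> image direction
    return (loss_i2t + loss_t2i) / 2

# Usage: a batch of 8 paired chest X-ray and report embeddings (dim 256).
img = torch.randn(8, 256)
txt = torch.randn(8, 256)
print(contrastive_alignment_loss(img, txt))
```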