Automated Chest X-Ray Captioning Using Pretrained Vision Transformer with LSTM and Multi-Head Attention

Rafy Aulia Akbar
Ricky Eka Putra
Wiyli Yustanti

Abstract

Radiology report generation is a complex and error-prone task, especially for radiologists with limited experience. To address this, this study develops an automated system that generates text-based radiology reports from chest X-ray images. The proposed approach combines computer vision and natural language processing in an encoder-decoder architecture. The encoder is a Vision Transformer (ViT) pretrained on the CheXpert dataset, which extracts visual features from X-ray images after Gamma Correction is applied to improve image quality. In the decoder, word embeddings of the report text are processed by a Long Short-Term Memory (LSTM) network to capture word-order dependencies and enriched with Multi-Head Attention (MHA) to focus on important parts of the text. The visual and textual features are then fused and passed through a dense layer to generate the report. In evaluation, the proposed model achieves a ROUGE-L score of 0.385, outperforming previous models, and a competitive BLEU-1 score of 0.427. These results show that a pretrained ViT combined with an LSTM-MHA decoder captures both the visual context and the semantic context of the text, improving accuracy and efficiency in radiology report automation.
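As an illustration of the Gamma Correction preprocessing step mentioned above, the sketch below applies the standard power-law transform to an 8-bit grayscale X-ray image. This is a minimal illustrative implementation only; the abstract does not state the gamma value the study uses, so the `gamma` parameter and default here are hypothetical.

```python
import numpy as np

def gamma_correct(image: np.ndarray, gamma: float = 1.5) -> np.ndarray:
    """Power-law (gamma) correction for an 8-bit grayscale image.

    Pixel values are normalized to [0, 1], raised to the power 1/gamma,
    and rescaled to [0, 255]. With gamma > 1 the exponent is < 1, so
    dark regions are brightened; gamma < 1 darkens the image.
    (Sketch only: the study's actual gamma value is an assumption here.)
    """
    norm = image.astype(np.float64) / 255.0
    corrected = np.power(norm, 1.0 / gamma)
    return (corrected * 255.0).round().clip(0, 255).astype(np.uint8)
```

In a pipeline like the one described, this transform would run on each X-ray before it is resized and fed to the ViT encoder, lifting low-intensity lung-field detail that would otherwise be compressed near black.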
