Automated Chest X-Ray Captioning Using Pretrained Vision Transformer with LSTM and Multi-Head Attention

Rafy Aulia Akbar
Ricky Eka Putra
Wiyli Yustanti

Abstract

Radiology report generation is a complex and error-prone task, especially for radiologists with limited experience. To address this, this study develops an automated system that generates text-based radiology reports from chest X-ray images. The proposed approach combines computer vision and natural language processing in an encoder-decoder architecture. The encoder is a Vision Transformer (ViT) pretrained on the CheXpert dataset, which extracts visual features from X-ray images after Gamma Correction is applied to improve image quality. In the decoder, word embeddings of the report text are processed by a Long Short-Term Memory (LSTM) network to capture word-order dependencies and enriched with Multi-Head Attention (MHA) to focus on important parts of the text. The visual and textual features are then fused and passed through a dense layer to generate the report. In evaluation, the proposed model achieves a ROUGE-L score of 0.385, outperforming previous models, and a competitive BLEU-1 score of 0.427. These results show that a pretrained ViT combined with an LSTM-MHA decoder captures both the visual context and the semantic context of the text, improving accuracy and efficiency in radiology report automation.
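As an illustration of the Gamma Correction preprocessing step mentioned above, the sketch below applies the standard power-law transform to an 8-bit grayscale X-ray image. This is a minimal illustrative implementation only; the abstract does not state the gamma value the study uses, so the `gamma` parameter and default here are hypothetical.

```python
import numpy as np

def gamma_correct(image: np.ndarray, gamma: float = 1.5) -> np.ndarray:
    """Power-law (gamma) correction for an 8-bit grayscale image.

    Pixel values are normalized to [0, 1], raised to the power 1/gamma,
    and rescaled to [0, 255]. With gamma > 1 the exponent is < 1, so
    dark regions are brightened; gamma < 1 darkens the image.
    (Sketch only: the study's actual gamma value is an assumption here.)
    """
    norm = image.astype(np.float64) / 255.0
    corrected = np.power(norm, 1.0 / gamma)
    return (corrected * 255.0).round().clip(0, 255).astype(np.uint8)
```

In a pipeline like the one described, this transform would run on each X-ray before it is resized and fed to the ViT encoder, lifting low-intensity lung-field detail that would otherwise be compressed near black.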
