MD-ViT: Multidomain Vision Transformer Fusion for Fair Demographic Attribute Recognition

Rezky Arisanti Putri
Ricky Eka Putra
Yuni Yamasari

Abstract

Demographic attribute recognition, particularly race and gender classification from facial images, plays a critical role in applications ranging from precision healthcare to digital identity systems. However, existing deep learning approaches often suffer from algorithmic bias and limited robustness, especially when trained on imbalanced or non-representative data. To address these challenges, this study proposes MD-ViT, a novel framework that leverages multidomain Vision Transformer (ViT) fusion to enhance both accuracy and fairness in demographic classification. Specifically, we integrate embeddings from two task-specific pretrained ViTs: ViT-VGGFace (fine-tuned on VGGFace2 for structural identity features) and ViT-Face Age (trained on UTKFace and IMDB-WIKI for age-related morphological cues), followed by classification using XGBoost to model complex feature interactions while mitigating overfitting. Evaluated on the balanced DemogPairs dataset (10,800 images across six intersectional subgroups), our approach achieves 89.07% accuracy and 89.06% F1-score, outperforming single-domain baselines (ViT-VGGFace: 88.61%; ViT-Age: 78.94%). Crucially, fairness analysis reveals minimal performance disparity across subgroups (F1-score range: 87.38%–91.03%; σ = 1.33), indicating effective mitigation of intersectional bias. These results demonstrate that cross-task feature fusion can yield representations that are not only more discriminative but also more equitable. We conclude that MD-ViT offers a principled, modular, and ethically grounded pathway toward fairer soft biometric systems, particularly in high-stakes domains such as digital health and inclusive access control.
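
Implementation Sketch (illustrative)

To make the fusion workflow concrete, the sketch below extracts embeddings from the two task-specific ViTs, concatenates them into a single feature vector, and trains an XGBoost classifier on the fused features. The Hugging Face model IDs follow the cited repositories (skutaada/VIT-VGGFace and dima806/facial_age_image_detection), but the assumption that both checkpoints load via AutoModel, the use of the [CLS] token as the embedding, and the XGBoost hyperparameters are illustrative choices, not details taken from the paper.

import numpy as np
import torch
from transformers import AutoImageProcessor, AutoModel
from xgboost import XGBClassifier
from sklearn.metrics import f1_score

def load_backbone(model_id):
    # Load a pretrained ViT backbone and its matching image processor.
    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()
    return processor, model

@torch.no_grad()
def cls_embeddings(processor, model, images):
    # Return the [CLS] token embedding for a batch of PIL images.
    inputs = processor(images=images, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state   # (batch, tokens, dim)
    return hidden[:, 0, :].cpu().numpy()         # [CLS] token per image

# Two task-specific backbones: identity-oriented and age-oriented ViTs
# (model IDs taken from the cited Hugging Face pages).
proc_face, vit_face = load_backbone("skutaada/VIT-VGGFace")
proc_age, vit_age = load_backbone("dima806/facial_age_image_detection")

def fused_features(images):
    # Concatenate embeddings from both domains into one feature vector.
    return np.concatenate(
        [cls_embeddings(proc_face, vit_face, images),
         cls_embeddings(proc_age, vit_age, images)],
        axis=1,
    )

# Usage (train_images / test_images are lists of face images; train_y /
# test_y are labels 0..5 for the six intersectional DemogPairs subgroups;
# the hyperparameters below are illustrative, not those from the paper):
#   clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
#   clf.fit(fused_features(train_images), train_y)
#   pred_y = clf.predict(fused_features(test_images))
#   print("macro F1:", f1_score(test_y, pred_y, average="macro"))

Per-subgroup fairness can then be assessed by computing the F1-score separately for each of the six subgroups and reporting the range and standard deviation, as in the abstract.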

References

Al-Otaiby, N., & El-Alfy, E. S. M. (2023). Effects of Face Image Degradation on Recognition with Vision Transformers: Review and Case Study. 2023 3rd International Conference on Computing and Information Technology, ICCIT 2023. https://doi.org/10.1109/ICCIT58132.2023.10273970

Bonner, S. N., Thumma, J. R., Valbuena, V. S. M., Stewart, J. W., Combs, M., Lyu, D., Chang, A., Lin, J., & Wakeam, E. (2023). The intersection of race and ethnicity, gender, and primary diagnosis on lung transplantation outcomes. Journal of Heart and Lung Transplantation, 42(7). https://doi.org/10.1016/j.healun.2023.02.1496

Bulat, A., Cheng, S., Yang, J., Garbett, A., Sanchez, E., & Tzimiropoulos, G. (2022). Pre-training Strategies and Datasets for Facial Representation Learning. Lecture Notes in Computer Science, 13673 LNCS. https://doi.org/10.1007/978-3-031-19778-9_7

dima806. (2023). dima806/facial_age_image_detection · Hugging Face. https://huggingface.co/dima806/facial_age_image_detection

Ding, Y., Bu, F., Zhai, H., Hou, Z., & Wang, Y. (2024). Multi-feature fusion based face forgery detection with local and global characteristics. PLoS ONE, 19(10). https://doi.org/10.1371/journal.pone.0311720

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. https://doi.org/10.48550/arxiv.2010.11929

Gao, W., Li, L., & Zhao, H. (2022). Facial Expression Recognition Method Based on SpResNet-ViT. Proceedings - 2022 2nd Asia-Pacific Conference on Communications Technology and Computer Science, ACCTCS 2022. https://doi.org/10.1109/ACCTCS53867.2022.00046

Greco, A., Percannella, G., Vento, M., & Vigilante, V. (2020). Benchmarking deep network architectures for ethnicity recognition using a new large face dataset. Machine Vision and Applications, 31(7), 67. https://doi.org/10.1007/s00138-020-01123-z

Ha, F., John, A., & Zumwalt, M. (2021). Gender/sex, race/ethnicity, similarities/differences among SARS-CoV, MERS-CoV, and COVID-19 patients. The Southwest Respiratory and Critical Care Chronicles, 9(37). https://doi.org/10.12746/swrccc.v9i37.795

Hupont, I., & Fernández, C. (2019). DemogPairs: Quantifying the Impact of Demographic Imbalance in Deep Face Recognition. 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), 1–7. https://doi.org/10.1109/FG.2019.8756625

Iloanusi, O., Flynn, P. J., & Tinsley, P. (2022). Similarities in African Ethnic Faces from the Biometric Recognition Viewpoint. Proceedings - 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops, WACVW 2022. https://doi.org/10.1109/WACVW54805.2022.00048

Jatain, R., & Jailia, D. M. (2023). Automatic Human Face Detection and Recognition Based On Facial Features Using Deep Learning Approach. International Journal on Recent and Innovation Trends in Computing and Communication, 11(2). https://doi.org/10.17762/ijritcc.v11i2s.6146

Karkkainen, K., & Joo, J. (2021). FairFace: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021. https://doi.org/10.1109/WACV48630.2021.00159

Kotwal, K., & Marcel, S. (2024). Demographic Fairness Transformer for Bias Mitigation in Face Recognition. 2024 IEEE International Joint Conference on Biometrics (IJCB), 1–10. https://doi.org/10.1109/IJCB62174.2024.10744457

Nixon, S., Ruiu, P., Cadoni, M., Lagorio, A., & Tistarelli, M. (2025). Assessing bias and computational efficiency in vision transformers using early exits. EURASIP Journal on Image and Video Processing, 2025(1). https://doi.org/10.1186/s13640-024-00658-9

Pardede, J., & Kleb, S. S. (2024). Face Race Classification using ResNet-152 and DenseNet-121. ELKOMIKA: Jurnal Teknik Energi Elektrik, Teknik Telekomunikasi, & Teknik Elektronika, 12(3), 798. https://doi.org/10.26760/elkomika.v12i3.798

Ranftl, R., Bochkovskiy, A., & Koltun, V. (2021). Vision Transformers for Dense Prediction. Proceedings of the IEEE International Conference on Computer Vision. https://doi.org/10.1109/ICCV48922.2021.01196

Robinson, J. P., Livitz, G., Henon, Y., Qin, C., Fu, Y., & Timoner, S. (2020). Face recognition: Too bias, or not too bias? IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2020-June. https://doi.org/10.1109/CVPRW50498.2020.00008

Scheuerman, M. K., Wade, K., Lustig, C., & Brubaker, J. R. (2020). How We’ve Taught Algorithms to See Identity: Constructing Race and Gender in Image Databases for Facial Analysis. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW1). https://doi.org/10.1145/3392866

Sehrawat, J. S., & Ali, M. (2023). Morpho-facial variations in physical features of two tribal populations of Kargil (Ladakh, India): A bio-anthropological investigation. Anthropological Review, 86(3). https://doi.org/10.18778/1898-6773.86.3.01

Singh, S., & Chauhan, A. S. (2023). Attendance Compilation by Facial Recognition Methods of Image Processing: A Review. International Journal for Research in Applied Science and Engineering Technology, 11(5). https://doi.org/10.22214/ijraset.2023.51708

skutaada. (2024). skutaada/VIT-VGGFace at main. https://huggingface.co/skutaada/VIT-VGGFace/tree/main

Wiens, M., Verone-Boyle, A., Henscheid, N., Podichetty, J. T., & Burton, J. (2025). A Tutorial and Use Case Example of the eXtreme Gradient Boosting (XGBoost) Artificial Intelligence Algorithm for Drug Development Applications. Clinical and Translational Science, 18(3). https://doi.org/10.1111/cts.70172

Xue, M., Duan, X., & Liu, W. (2019). Eliminating other-race effect for multi-ethnic facial expression recognition. Mathematical Foundations of Computing, 2, 43–53. https://doi.org/10.3934/mfc.2019004