MD-ViT: Multidomain Vision Transformer Fusion for Fair Demographic Attribute Recognition
Abstract
Demographic attribute recognition, particularly race and gender classification from facial images, plays a critical role in applications ranging from precision healthcare to digital identity systems. However, existing deep learning approaches often suffer from algorithmic bias and limited robustness, especially when trained on imbalanced or non-representative data. To address these challenges, this study proposes MD-ViT, a novel framework that leverages multidomain Vision Transformer (ViT) fusion to enhance both accuracy and fairness in demographic classification. Specifically, we integrate embeddings from two task-specific pretrained ViTs: ViT-VGGFace (fine-tuned on VGGFace2 for structural identity features) and ViT-Face Age (trained on UTKFace and IMDB-WIKI for age-related morphological cues), followed by classification using XGBoost to model complex feature interactions while mitigating overfitting. Evaluated on the balanced DemogPairs dataset (10,800 images across six intersectional subgroups), our approach achieves 89.07% accuracy and an 89.06% F1-score, outperforming single-domain baselines (ViT-VGGFace: 88.61%; ViT-Age: 78.94%). Crucially, fairness analysis reveals minimal performance disparity across subgroups (F1-score range: 87.38%–91.03%; σ = 1.33), indicating effective mitigation of intersectional bias. These results demonstrate that cross-task feature fusion can yield representations that are not only more discriminative but also more equitable. We conclude that MD-ViT offers a principled, modular, and ethically grounded pathway toward fairer soft biometric systems, particularly in high-stakes domains such as digital health and inclusive access control.
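The fusion pipeline described in the abstract — concatenating embeddings from two task-specific ViTs and feeding the joint vector to a gradient-boosted classifier — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings are random stand-ins for the ViT-VGGFace and ViT-Face Age CLS vectors (assumed here to be 768-dimensional, the ViT-Base default), and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so the sketch stays dependency-light.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 300, 768  # hypothetical: 768-dim embedding per ViT (ViT-Base CLS token)

# Stand-ins for the two domain-specific embedding streams;
# in the paper these come from ViT-VGGFace and ViT-Face Age.
emb_identity = rng.normal(size=(n, d))
emb_age = rng.normal(size=(n, d))

# Labels for the six intersectional race-gender subgroups of DemogPairs.
labels = rng.integers(0, 6, size=n)

# Multidomain fusion: concatenate the two embeddings per image.
fused = np.concatenate([emb_identity, emb_age], axis=1)  # shape (n, 2*d)

# Gradient-boosted trees over the fused features
# (stand-in for the XGBoost classifier used in the paper).
X_tr, X_te, y_tr, y_te = train_test_split(
    fused, labels, test_size=0.2, random_state=0
)
clf = GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

With real embeddings the classifier would exploit complementary structure across the identity and age feature spaces; on the random stand-ins above the accuracy is only chance-level, which is expected.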
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
References
Al-Otaiby, N., & El-Alfy, E. S. M. (2023). Effects of Face Image Degradation on Recognition with Vision Transformers: Review and Case Study. 2023 3rd International Conference on Computing and Information Technology, ICCIT 2023. https://doi.org/10.1109/ICCIT58132.2023.10273970
Bonner, S. N., Thumma, J. R., Valbuena, V. S. M., Stewart, J. W., Combs, M., Lyu, D., Chang, A., Lin, J., & Wakeam, E. (2023). The intersection of race and ethnicity, gender, and primary diagnosis on lung transplantation outcomes. Journal of Heart and Lung Transplantation, 42(7). https://doi.org/10.1016/j.healun.2023.02.1496
Bulat, A., Cheng, S., Yang, J., Garbett, A., Sanchez, E., & Tzimiropoulos, G. (2022). Pre-training Strategies and Datasets for Facial Representation Learning. Lecture Notes in Computer Science, 13673 LNCS. https://doi.org/10.1007/978-3-031-19778-9_7
dima806. (2023). dima806/facial_age_image_detection · Hugging Face. https://huggingface.co/dima806/facial_age_image_detection
Ding, Y., Bu, F., Zhai, H., Hou, Z., & Wang, Y. (2024). Multi-feature fusion based face forgery detection with local and global characteristics. PLoS ONE, 19(10). https://doi.org/10.1371/journal.pone.0311720
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. https://doi.org/10.48550/arxiv.2010.11929
Gao, W., Li, L., & Zhao, H. (2022). Facial Expression Recognition Method Based on SpResNet-ViT. Proceedings - 2022 2nd Asia-Pacific Conference on Communications Technology and Computer Science, ACCTCS 2022. https://doi.org/10.1109/ACCTCS53867.2022.00046
Greco, A., Percannella, G., Vento, M., & Vigilante, V. (2020). Benchmarking deep network architectures for ethnicity recognition using a new large face dataset. Machine Vision and Applications, 31(7), 67. https://doi.org/10.1007/s00138-020-01123-z
Ha, F., John, A., & Zumwalt, M. (2021). Gender/sex, race/ethnicity, similarities/differences among SARS-CoV, MERS-CoV, and COVID-19 patients. The Southwest Respiratory and Critical Care Chronicles, 9(37). https://doi.org/10.12746/swrccc.v9i37.795
Hupont, I., & Fernández, C. (2019). DemogPairs: Quantifying the Impact of Demographic Imbalance in Deep Face Recognition. 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), 1–7. https://doi.org/10.1109/FG.2019.8756625
Iloanusi, O., Flynn, P. J., & Tinsley, P. (2022). Similarities in African Ethnic Faces from the Biometric Recognition Viewpoint. Proceedings - 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops, WACVW 2022. https://doi.org/10.1109/WACVW54805.2022.00048
Jatain, R., & Jailia, D. M. (2023). Automatic Human Face Detection and Recognition Based On Facial Features Using Deep Learning Approach. International Journal on Recent and Innovation Trends in Computing and Communication, 11(2). https://doi.org/10.17762/ijritcc.v11i2s.6146
Karkkainen, K., & Joo, J. (2021). FairFace: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021. https://doi.org/10.1109/WACV48630.2021.00159
Kotwal, K., & Marcel, S. (2024). Demographic Fairness Transformer for Bias Mitigation in Face Recognition. 2024 IEEE International Joint Conference on Biometrics (IJCB), 1–10. https://doi.org/10.1109/IJCB62174.2024.10744457
Nixon, S., Ruiu, P., Cadoni, M., Lagorio, A., & Tistarelli, M. (2025). Assessing bias and computational efficiency in vision transformers using early exits. Eurasip Journal on Image and Video Processing, 2025(1). https://doi.org/10.1186/s13640-024-00658-9
Pardede, J., & Kleb, S. S. (2024). Face Race Classification using ResNet-152 and DenseNet-121. ELKOMIKA: Jurnal Teknik Energi Elektrik, Teknik Telekomunikasi, & Teknik Elektronika, 12(3), 798. https://doi.org/10.26760/elkomika.v12i3.798
Ranftl, R., Bochkovskiy, A., & Koltun, V. (2021). Vision Transformers for Dense Prediction. Proceedings of the IEEE International Conference on Computer Vision. https://doi.org/10.1109/ICCV48922.2021.01196
Robinson, J. P., Livitz, G., Henon, Y., Qin, C., Fu, Y., & Timoner, S. (2020). Face recognition: Too bias, or not too bias? IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. https://doi.org/10.1109/CVPRW50498.2020.00008
Scheuerman, M. K., Wade, K., Lustig, C., & Brubaker, J. R. (2020). How We’ve Taught Algorithms to See Identity: Constructing Race and Gender in Image Databases for Facial Analysis. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW1). https://doi.org/10.1145/3392866
Sehrawat, J. S., & Ali, M. (2023). Morpho-facial variations in physical features of two tribal populations of Kargil (Ladakh, India): A bio-anthropological investigation. Anthropological Review, 86(3). https://doi.org/10.18778/1898-6773.86.3.01
Singh, S., & Chauhan, A. S. (2023). Attendance Compilation by Facial Recognition Methods of Image Processing: A Review. International Journal for Research in Applied Science and Engineering Technology, 11(5). https://doi.org/10.22214/ijraset.2023.51708
skutaada. (2024). skutaada/VIT-VGGFace at main. https://huggingface.co/skutaada/VIT-VGGFace/tree/main
Wiens, M., Verone-Boyle, A., Henscheid, N., Podichetty, J. T., & Burton, J. (2025). A Tutorial and Use Case Example of the eXtreme Gradient Boosting (XGBoost) Artificial Intelligence Algorithm for Drug Development Applications. Clinical and Translational Science, 18(3). https://doi.org/10.1111/cts.70172
Xue, M., Duan, X., & Liu, W. (2019). Eliminating other-race effect for multi-ethnic facial expression recognition. Mathematical Foundations of Computing, 2, 43–53. https://doi.org/10.3934/mfc.2019004