Multi-Modal Medical Image Segmentation using Vision Transformers (ViTs)
Keywords:
Medical Image Segmentation, Vision Transformers, Multi-modal Fusion, Deep Learning, Self-Attention

Abstract
Multi-modal medical image segmentation plays a pivotal role in modern diagnostic and therapeutic procedures by leveraging complementary anatomical and functional information from different imaging modalities, such as Magnetic Resonance Imaging (MRI), Computed Tomography (CT), and Positron Emission Tomography (PET). Traditional convolutional neural networks (CNNs), while effective for single-modality tasks, often struggle to capture the complex cross-modal dependencies and global contextual features essential for accurate segmentation in multi-modal scenarios. In this paper, we propose a novel Vision Transformer (ViT)-based architecture specifically designed for multi-modal medical image segmentation. Our approach integrates modality-specific encoders with a shared cross-modal attention mechanism to effectively learn both intra- and inter-modality relationships. By harnessing the self-attention mechanism inherent in transformers, our model captures long-range dependencies and global semantic context that are often missed by conventional CNN-based methods. We conduct extensive experiments on publicly available multi-modal medical imaging datasets, including BraTS (for brain tumor segmentation) and CHAOS (for abdominal organ segmentation), to evaluate the performance of our method. Results demonstrate that our ViT-based framework significantly outperforms state-of-the-art CNN models in terms of Dice coefficient, Intersection over Union (IoU), and boundary accuracy. Furthermore, ablation studies confirm the contribution of each architectural component, particularly the cross-modal attention module, to overall performance improvements. Our findings highlight the potential of transformer-based models as a unifying solution for complex multi-modal medical image segmentation tasks, paving the way for more accurate and clinically applicable automated diagnosis systems.
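The abstract describes per-modality encoders whose token sequences are fused through shared cross-modal attention. The authors' implementation is not reproduced on this page; the PyTorch sketch below is a minimal illustration of that fusion pattern, in which each modality is patch-embedded and self-attended independently, and one modality's tokens then attend to the other's. All class names and hyperparameters (ModalityEncoder, CrossModalAttention, embed_dim=256, four encoder layers) are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of modality-specific ViT encoders with a shared
# cross-modal attention block, following the pattern the abstract
# describes. Names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Patch-embed one modality (e.g. MRI or CT) and self-attend over it."""

    def __init__(self, in_channels=1, embed_dim=256, patch_size=16):
        super().__init__()
        # Non-overlapping patch embedding, as in a standard ViT.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                       batch_first=True),
            num_layers=4)

    def forward(self, x):                                 # x: (B, C, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, D)
        return self.encoder(tokens)      # intra-modality self-attention


class CrossModalAttention(nn.Module):
    """Tokens of one modality attend to tokens of another modality."""

    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, query_tokens, context_tokens):
        fused, _ = self.attn(query_tokens, context_tokens, context_tokens)
        return self.norm(query_tokens + fused)            # residual fusion


# Example: fuse an MRI stream with a CT stream.
mri_enc, ct_enc = ModalityEncoder(), ModalityEncoder()
cross = CrossModalAttention()
mri, ct = torch.randn(2, 1, 224, 224), torch.randn(2, 1, 224, 224)
mri_tokens, ct_tokens = mri_enc(mri), ct_enc(ct)
fused = cross(mri_tokens, ct_tokens)                      # (B, N, D)
```

In practice such cross-attention would typically be applied in both directions (MRI→CT and CT→MRI), with the fused token maps reshaped and passed to a segmentation decoder.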
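The reported overlap metrics follow their standard definitions: Dice = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B| for a predicted mask A and ground-truth mask B. A reference computation over binary masks (function names are ours, not the paper's):

```python
# Standard Dice and IoU over binary masks of shape (B, H, W).
import torch


def dice_coefficient(pred, target, eps=1e-6):
    """Dice = 2|A∩B| / (|A| + |B|), averaged over the batch."""
    inter = (pred * target).sum(dim=(1, 2))
    total = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return ((2 * inter + eps) / (total + eps)).mean()


def iou(pred, target, eps=1e-6):
    """IoU = |A∩B| / |A∪B|, averaged over the batch."""
    inter = (pred * target).sum(dim=(1, 2))
    union = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2)) - inter
    return ((inter + eps) / (union + eps)).mean()
```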
License
Copyright (c) 2025 Sidra Fareed, Yi Ding, Babar Hussain, Subhan Uddin

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.