Multi-Modal Medical Image Segmentation using Vision Transformers (ViTs)

Authors

  • Sidra Fareed, School of Information and Software Engineering, University of Electronic Science and Technology of China, Jianshe North Road, Chengdu, Sichuan, China. ORCID: https://orcid.org/0009-0009-5611-7519
  • Ding Yi, School of Information and Software Engineering, University of Electronic Science and Technology of China, Jianshe North Road, Chengdu, Sichuan, China. ORCID: https://orcid.org/0000-0003-3406-9770
  • Babar Hussain, School of Information and Software Engineering, University of Electronic Science and Technology of China, Jianshe North Road, Chengdu, Sichuan, China. ORCID: https://orcid.org/0009-0002-9101-1268
  • Subhan Uddin, School of Information and Software Engineering, University of Electronic Science and Technology of China, Jianshe North Road, Chengdu, Sichuan, China. ORCID: https://orcid.org/0009-0006-1390-0923

Keywords:

Medical Image Segmentation, Vision Transformers, Multi-modal Fusion, Deep Learning, Self-Attention

Abstract

Multi-modal medical image segmentation plays a pivotal role in modern diagnostic and therapeutic procedures by leveraging complementary anatomical and functional information from different imaging modalities, such as Magnetic Resonance Imaging (MRI), Computed Tomography (CT), and Positron Emission Tomography (PET). Traditional convolutional neural networks (CNNs), while effective for single-modality tasks, often struggle to capture complex cross-modal dependencies and global contextual features essential for accurate segmentation in multi-modal scenarios. In this paper, we propose a novel Vision Transformer (ViT)-based architecture specifically designed for multi-modal medical image segmentation. Our approach integrates modality-specific encoders with a shared cross-modal attention mechanism to effectively learn both intra- and inter-modality relationships. By harnessing the self-attention mechanism inherent in transformers, our model captures long-range dependencies and global semantic context that are often missed by conventional CNN-based methods. We conduct extensive experiments on publicly available multi-modal medical imaging datasets, including BraTS (for brain tumor segmentation) and CHAOS (for abdominal organ segmentation), to evaluate the performance of our method. Results demonstrate that our ViT-based framework significantly outperforms state-of-the-art CNN models in terms of Dice coefficient, IoU, and boundary accuracy. Furthermore, ablation studies confirm the contribution of each architectural component, particularly the cross-modal attention module, to overall performance improvements. Our findings highlight the potential of transformer-based models as a unifying solution for complex multi-modal medical image segmentation tasks, paving the way for more accurate and clinically applicable automated diagnosis systems.
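The abstract describes modality-specific encoders whose patch tokens are combined through a shared cross-modal attention mechanism. The authors' exact implementation is not reproduced on this page; the PyTorch sketch below is only a minimal illustration of that general idea, in which tokens from one modality attend to tokens from another. The module name `CrossModalAttentionFusion`, the embedding size, the head count, and the patch counts are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Illustrative cross-modal attention block: tokens from one modality
    (e.g. MRI) attend to tokens from another (e.g. CT) and are refined
    by a small feed-forward network. Not the authors' implementation."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        # Queries come from modality A; keys and values come from modality B,
        # so modality A's patch tokens are enriched with complementary context.
        fused, _ = self.cross_attn(query=tokens_a, key=tokens_b, value=tokens_b)
        x = self.norm1(tokens_a + fused)    # residual connection + layer norm
        return self.norm2(x + self.mlp(x))  # feed-forward refinement


# Usage with dummy patch-token sequences from two modality-specific encoders
# (batch size 2, 196 patches, 256-dimensional embeddings are assumed values).
mri_tokens = torch.randn(2, 196, 256)
ct_tokens = torch.randn(2, 196, 256)
fused = CrossModalAttentionFusion()(mri_tokens, ct_tokens)
print(fused.shape)  # torch.Size([2, 196, 256])
```

In this sketch the fusion is one-directional (MRI attends to CT); a symmetric design would apply a second block in the opposite direction before the decoder.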

Published

2025-07-24

Section

Articles