Multi-Modal Medical Image Segmentation using Vision Transformers (ViTs)
Keywords:
Medical Image Segmentation, Vision Transformers, Multi-modal Fusion, Deep Learning, Self-Attention

Abstract
Multi-modal medical image segmentation plays a pivotal role in modern diagnostic and therapeutic procedures by leveraging complementary anatomical and functional information from different imaging modalities, such as Magnetic Resonance Imaging (MRI), Computed Tomography (CT), and Positron Emission Tomography (PET). Traditional convolutional neural networks (CNNs), while effective for single-modality tasks, often struggle to capture complex cross-modal dependencies and global contextual features essential for accurate segmentation in multi-modal scenarios. In this paper, we propose a novel Vision Transformer (ViT)-based architecture specifically designed for multi-modal medical image segmentation. Our approach integrates modality-specific encoders with a shared cross-modal attention mechanism to effectively learn both intra- and inter-modality relationships. By harnessing the self-attention mechanism inherent in transformers, our model captures long-range dependencies and global semantic context that are often missed by conventional CNN-based methods. We conduct extensive experiments on publicly available multi-modal medical imaging datasets, including BraTS (for brain tumor segmentation) and CHAOS (for abdominal organ segmentation), to evaluate the performance of our method. Results demonstrate that our ViT-based framework significantly outperforms state-of-the-art CNN models in terms of Dice coefficient, IoU, and boundary accuracy. Furthermore, ablation studies confirm the contribution of each architectural component, particularly the cross-modal attention module, to overall performance improvements. Our findings highlight the potential of transformer-based models as a unifying solution for complex multi-modal medical image segmentation tasks, paving the way for more accurate and clinically applicable automated diagnosis systems.
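The shared cross-modal attention described above can be sketched as scaled dot-product attention in which queries come from one modality's patch tokens and keys/values come from another's, so each token in modality A aggregates context from all tokens in modality B. The sketch below is a minimal NumPy illustration under assumed shapes, not the authors' implementation; the projection matrices, token counts, and embedding dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(tokens_a, tokens_b, w_q, w_k, w_v):
    """Attend from modality A (e.g. MRI patch tokens) to modality B (e.g. CT).

    tokens_a: (n_a, d) patch embeddings from modality A
    tokens_b: (n_b, d) patch embeddings from modality B
    w_q, w_k, w_v: (d, d_k) projection matrices (learned in practice, random here)
    """
    q = tokens_a @ w_q                          # queries from modality A
    k = tokens_b @ w_k                          # keys from modality B
    v = tokens_b @ w_v                          # values from modality B
    scores = q @ k.T / np.sqrt(w_k.shape[1])    # scaled dot-product scores (n_a, n_b)
    weights = softmax(scores, axis=-1)          # each A-token attends over all B-tokens
    return weights @ v                          # fused features, shape (n_a, d_k)

# Toy usage with hypothetical sizes: 64 patch tokens per modality, d=32, d_k=16.
rng = np.random.default_rng(0)
d, d_k = 32, 16
mri_tokens = rng.standard_normal((64, d))
ct_tokens = rng.standard_normal((64, d))
w_q, w_k, w_v = (rng.standard_normal((d, d_k)) for _ in range(3))
fused = cross_modal_attention(mri_tokens, ct_tokens, w_q, w_k, w_v)
```

In a full model this operation would run in both directions (MRI→CT and CT→MRI) inside each transformer block, alongside the per-modality self-attention layers.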
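The evaluation metrics named in the abstract, the Dice coefficient and IoU (Jaccard index), are standard overlap measures between a predicted binary mask P and a ground-truth mask T: Dice = 2|P∩T| / (|P|+|T|) and IoU = |P∩T| / |P∪T|. A short reference implementation (with a small epsilon to avoid division by zero on empty masks, a common convention):

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    # Dice = 2|P ∩ T| / (|P| + |T|), computed on boolean masks.
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def iou(pred, target, eps=1e-7):
    # IoU (Jaccard) = |P ∩ T| / |P ∪ T|.
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)

# Example: 2 overlapping pixels out of 4 predicted and 4 true.
pred = np.array([1, 1, 1, 1, 0, 0], dtype=bool)
target = np.array([1, 1, 0, 0, 1, 1], dtype=bool)
d = dice_coefficient(pred, target)   # 2*2 / (4+4) = 0.5
j = iou(pred, target)                # 2 / 6 ≈ 0.333
```

Note that Dice ≥ IoU for any pair of masks (Dice = 2·IoU / (1 + IoU)), which is why the two metrics are usually reported together rather than interchangeably.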
License
Copyright (c) 2025 Sidra Fareed, Yi Ding, Babar Hussain, Subhan Uddin

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.