A deep learning framework for breast cancer diagnosis using Swin Transformer and Dual-Attention Multi-scale Fusion Network.

Aldawsari MA, Aldosari SJ, Ismail A, Emam MM


Scientific Reports, 16(1), 2026. https://doi.org/10.1038/s41598-026-37969-y · PMID: 41813686

Abstract

Breast cancer is among the most prevalent cancers affecting women worldwide, and early detection through mammography is critical to reducing mortality rates. Convolutional neural networks (CNNs) have demonstrated notable effectiveness in classifying mammograms. However, they are constrained in their ability to capture long-range contextual dependencies. On the other hand, transformer-based models excel at handling global relationships; however, they often require large datasets and substantial computing power, which limits their direct application in medical imaging. To overcome these limitations, we propose Swin-DAMFN, a novel dual-branch hybrid architecture for breast cancer classification from mammograms. The first branch utilizes a Swin Transformer to model global dependencies through shifted window self-attention, while the second branch employs a CNN-based Dual-Attention Multi-scale Fusion Network (DAMFN) to capture fine-grained local features, such as microcalcifications and structural distortions. The CNN branch incorporates two custom modules, Multi Separable Attention (MSA) and Tri-Shuffle Convolution Attention (TSCA), for multi-scale discriminative feature extraction. An attention-guided fusion mechanism integrates global and local features into a unified representation. To address dataset limitations, we adopt an advanced augmentation strategy that combines Generative Adversarial Networks (GANs) to synthesize realistic mammograms and photometric augmentation to introduce appearance variability, thereby mitigating class imbalance and enhancing model generalization. A lightweight classification head based on global average pooling and fully connected layers ensures both efficiency and diagnostic accuracy. Extensive experiments on the MIAS and CBIS-DDSM datasets demonstrate that Swin-DAMFN achieves superior results, reaching 99.30% accuracy, 99.14% sensitivity, and 99.15% F1-score on CBIS-DDSM, while maintaining 98.75% accuracy, 98.37% sensitivity, and 98.42% F1-score on MIAS.


Introduction

As the second most common cause of cancer-related mortality among women globally, breast cancer presents a serious threat to global health1. Abnormal cells in the breast tissue grow uncontrollably, forming tumors that can be classified as benign or malignant, each with distinct morphological characteristics in terms of shape, size, and texture. Clinical diagnosis often relies on evaluating these characteristics to distinguish between different types of cancer. According to American Cancer Society statistics, approximately 287,850 breast cancer cases are reported annually in women and 2,710 in men2. Early detection and timely treatment of malignant tumors have significantly improved the five-year survival rate of patients3. The most common imaging modalities recommended for breast cancer screening are mammography, ultrasound, and magnetic resonance imaging (MRI)4. Among these, mammography is the most widely used tool for early breast cancer detection, capable of revealing cancer even before a palpable lump develops5. However, the complexity of mammographic interpretation often leads to misdiagnoses, requiring highly trained radiologists for accurate assessment. Given the difficulty of reliably distinguishing tumors, even for medical experts, there is a pressing need for automated diagnostic systems6. Moreover, as case numbers grow and image interpretation becomes more complex, healthcare systems come under increasing strain; manual analysis of breast images is both costly and time-consuming, even for experienced radiologists7.
Visual inspection of mammographic images is challenging and time-consuming, requiring significant radiological expertise. Furthermore, considerable inter-observer variability arises when the same mammograms are interpreted by different radiologists. Therefore, developing automated computer-aided diagnosis (CAD) systems powered by advanced artificial intelligence (AI) techniques is crucial for enabling the timely and accurate detection of breast cancer. These systems improve diagnostic consistency and reduce radiologists’ workload8,9.
A recent transformative shift in the field of medical diagnostics has been marked by the integration of artificial intelligence (AI), particularly in the area of breast cancer detection3. AI-powered systems demonstrate a remarkable capacity to identify subtle imaging patterns that may elude even experienced radiologists, thereby enhancing the potential for timely and accurate clinical decision-making. While traditional Machine Learning (ML) algorithms, including Support Vector Machines (SVM), Long Short-Term Memory networks (LSTM), and K-Nearest Neighbors (KNN), have been employed for breast tumor classification, their diagnostic accuracy is often limited. These constraints primarily stem from inherent weaknesses in feature extraction and a pronounced susceptibility to overfitting, especially when dealing with the small, imbalanced datasets common in medical imaging10,11.
Deep learning (DL) has emerged as a powerful alternative in medical image analysis, demonstrating superior performance. Specifically, convolutional neural networks (CNNs) have demonstrated exceptional capability in interpreting mammograms by automatically learning hierarchical, spatially relevant features directly from raw image data. However, conventional CNNs face significant challenges: they rely on limited receptive fields, struggle to capture global contextual information, and often fail to model subtle lesion structures such as microcalcifications and tissue distortions. Moreover, CNN-based methods are sensitive to class imbalance and sometimes lack full end-to-end learning pipelines, which restricts their generalization ability and real-time applicability12.
Transformer-based architectures have recently garnered considerable interest in visual recognition tasks, owing to their capacity to model global contextual relationships and long-range dependencies through self-attention mechanisms, which enhances both feature representation and interpretability13. However, transformers typically lack the strong local inductive biases that CNNs possess, making them less effective at modeling fine-grained medical details such as nodules and tissue distortions. In addition, their reliance on large-scale datasets and high computational requirements hinders their direct use in medical imaging domains where annotated data are scarce. To address these complementary limitations, hybrid CNN-Transformer models have been proposed to leverage the local detail extraction capability of CNNs alongside the transformers' global dependency modeling power. With their hierarchical design and shifted-window attention, architectures such as the Swin Transformer have demonstrated promise in striking a balance between efficiency and global modeling capacity14.
Despite these architectural advances, dataset scarcity and class imbalance challenges remain a major bottleneck in mammographic analysis. Public datasets such as MIAS and CBIS-DDSM often suffer from a low number of malignant cases compared to benign or normal samples, leading to biased classifiers. Traditional augmentation strategies, such as flipping, rotation, and scaling, increase the dataset size but do not introduce sufficient variability, as they generate repetitive patterns of the same samples. This limits the models’ ability to generalize effectively. To mitigate these issues, recent works have employed Generative Adversarial Networks (GANs), which synthesize realistic medical images that enrich minority classes and reduce imbalance, as well as photometric augmentations that introduce additional appearance variations while preserving diagnostic structures7.
Leveraging the complementary strengths of these architectures, this work introduces a novel dual-branch hybrid framework for mammography breast cancer classification. The first branch is built upon a Swin Transformer to capture global anatomical context and model long-range dependencies across the mammogram. In parallel, the second branch utilizes a custom-designed CNN-based Dual-Attention Multi-scale Fusion Network (DAMFN), specifically designed to extract fine-grained, discriminative local features, such as microcalcifications and tissue distortions. This synergistic combination is designed to achieve a more comprehensive and robust image representation for accurate diagnosis and treatment. These complementary feature representations are fused using an attention-based mechanism, enhancing robustness and interpretability for clinical use. Furthermore, to overcome dataset limitations, our framework integrates GAN-based data augmentation and photometric augmentation, which collectively address class imbalance and improve model generalization.

Motivation
Breast cancer continues to be one of the most common and deadly cancers affecting women worldwide, underscoring the critical need for early and accurate identification through mammography screening. Although deep learning has greatly improved medical image processing, three major obstacles remain for mammography-based breast cancer detection: (1) CNNs are good at extracting local features but struggle to model long-range relationships, which are essential for capturing the global structure of the breast. (2) Vision transformers, such as the Swin Transformer, capture global context well but frequently require huge annotated datasets, which are uncommon in medical imaging. (3) Public mammography datasets are small and highly imbalanced, with malignant cases underrepresented, which results in biased and less generalizable models. To address these challenges, this research proposes Swin-DAMFN, a new feature fusion-based dual-branch model that processes mammographic images through two parallel tracks. The first track employs a Swin Transformer to capture global contextual dependencies using shifted windows and self-attention mechanisms. The second track uses a Dual-Attention Multi-scale Fusion Network (DAMFN) to extract fine-grained local features; within it, two specialized modules, Multi-Separable Attention (MSA) and Tri-Shuffle Convolution Attention (TSCA), operate across multiple receptive fields to capture refined multi-scale information. A Triplet Attention mechanism is used to integrate the characteristics from both branches, improving the model's emphasis on clinically informative regions and reducing irrelevant noise.
Additionally, to overcome dataset restrictions, GANs are used to enrich minority classes with realistic synthetic mammograms, and photometric augmentation introduces further appearance variety. This combined strategy enhances the robustness and generalizability of the proposed framework while mitigating class imbalance. In conclusion, our methodology leverages the complementary advantages of transformer-based global context modeling and CNN-based local feature extraction, further enhanced by sophisticated data augmentation techniques. This design addresses the shortcomings of current methods for mammographic breast cancer detection, providing a robust and comprehensive solution.

Contributions
The primary contributions of this work can be outlined as follows:

We propose Swin-DAMFN, a novel dual-branch hybrid architecture that combines a Swin Transformer branch for global context modeling with a CNN-based Dual-Attention Multi-scale Fusion Network (DAMFN) branch for capturing fine-grained local features. This hybrid design leverages the strengths of both paradigms.

The DAMFN branch integrates two specialized modules, Multi-Separable Attention (MSA) and Tri-Shuffle Convolution Attention (TSCA), to effectively capture multi-scale discriminative features such as microcalcifications and subtle tissue distortions.

A lightweight Triplet Attention (TA) block is employed to enhance feature maps by modeling cross-dimensional dependencies, enabling the model to focus on clinically relevant regions.

We propose an attention-based fusion strategy to integrate complementary global (Transformer) and local (CNN) representations, ensuring robust multi-scale feature learning.

To address data scarcity and severe class imbalance, we combine two complementary augmentation methods: (i) GAN-based synthesis of realistic mammograms to enrich minority classes, and (ii) photometric augmentation to introduce appearance variability and improve model generalization.

A global average pooling (GAP)-based classification head with dropout and fully connected layers is adopted, achieving computational efficiency without compromising diagnostic accuracy.

Comprehensive experiments on the MIAS and CBIS-DDSM datasets demonstrate that the proposed Swin-DAMFN framework consistently outperforms current CNN- and Transformer-based methods in terms of accuracy, sensitivity, and F1-score.

Paper structure
This paper is structured as follows: Section 2 reviews the existing literature, Section 3 outlines the proposed methodology for breast cancer classification, Section 4 presents and discusses the experimental results, and Section 5 provides a summary of the conclusions.

Literature review

The classification of breast cancer has constituted a significant research focus for an extended period. Prevailing methodologies can be generally divided into two groups: conventional machine learning techniques and those based on deep learning. This section reviews the most pertinent literature, placing a specific emphasis on contemporary deep learning architectures. In recent work, Oyelade et al.15 addressed the challenge of multimodal breast cancer classification by introducing a twin convolutional neural network combined with a hybrid binary optimization strategy. The framework was designed to learn modality-specific low- and high-level features from mammography and histopathology images, while the optimization step reduced redundancy by discarding non-discriminant features. The experimental evaluation revealed that this fusion model achieved a classification accuracy of 0.977 and an Area Under the Curve (AUC) of 0.913, indicating a significant enhancement in performance. Similarly, Teoh et al.16 created an optimized ensemble deep learning model. This approach integrated the GoogLeNet and ResNet-50 architectures using a weighted averaging technique, which yielded superior precision and greater predictive confidence compared to the individual constituent models. Evaluation on the mini-MIAS, INBreast, and CBIS-DDSM benchmark datasets demonstrated that the ensemble achieved average confidence scores of 93.05% for identifying microcalcifications and 88.59% for classifying normal tissue.
Furthermore, Jabeen et al.17 introduced an intelligent breast cancer diagnosis framework that synergizes custom CNNs with Bayesian optimization and sophisticated feature fusion. The authors developed two specialized CNN models, whose hyperparameters were fine-tuned using a Bayesian optimization strategy. Deep features were extracted and refined using a feature selection technique guided by Simulated Annealing, and were subsequently fused using a Rényi Entropy-based method. This comprehensive approach demonstrated exceptional performance, achieving classification accuracies of 97.7% and 97.3% on standard mammography datasets, thereby confirming the efficacy of the integrated framework for robust diagnostic outcomes. Sahu et al.18 proposed a deep learning framework that integrates multiple pre-trained CNN architectures for breast cancer classification. Prior to training, mammography images were enhanced using a Gaussian-based modified Laplacian high-boosting filter to improve image quality. The processed images were then classified using MobileNetV2, AlexNet, and ResNet, with MobileNetV2 achieving the best performance, yielding an accuracy of 97.50%. Despite the strong results, the study reported a considerable rate of false predictions, indicating limitations in model reliability and robustness. Moreover, Raiaan et al.19 developed a lightweight deep learning framework for mammography image classification, focusing on enhancement and noise removal to improve diagnostic quality. Data augmentation was extensively applied to expand the training samples, followed by training a lightweight CNN model. The approach achieved a classification accuracy of 98.42% on the MIAS dataset. However, the heavy reliance on augmentation to boost accuracy highlights a limitation, as it may reduce the framework’s generalizability to real-world clinical data with limited variability.
Moreover, Kaddes et al.20 proposed a hybrid deep learning architecture that combines CNNs with Long Short-Term Memory (LSTM) networks for binary breast cancer classification. While CNNs were employed to capture spatial hierarchies and malignancy-related patterns from mammographic images, LSTMs were integrated to model sequential dependencies and temporal interactions within the extracted features. The framework was evaluated on two Kaggle datasets and outperformed several baseline models, including standalone CNN, LSTM, GRU, VGG16, and ResNet50, achieving accuracies of 99.17% and 99.90%. Houssein et al.8 developed a hybrid model for breast cancer classification that combines transfer learning and an enhanced metaheuristic optimizer to support radiologists in precise abnormality detection. Their framework comprises four stages: preprocessing and augmentation, hyperparameter optimization, learning, and evaluation. The authors utilized an improved marine predators algorithm (IMPA) to optimize the hyperparameters of the ResNet-50 architecture, achieving robust classification performance on the MIAS and CBIS-DDSM mammography datasets. Nevertheless, the method’s reliance on ResNet-50 limits its generalizability to other pretrained architectures, and its validation on exclusively mammographic images raises questions about cross-dataset robustness. Similarly, Pramanik et al.21 proposed a two-stage framework for breast mass classification in mammograms. In the first stage, an attention-enhanced VGG16 network was employed to extract deep features from mammography images. To address redundancy and select the most discriminative features, the Social Ski-Driver (SSD) meta-heuristic was combined with an Adaptive Beta Hill Climbing local search strategy. The optimized feature subset was then classified using a KNN classifier. Experimental evaluation on the DDSM dataset achieved a classification accuracy of 96.07%. However, the reliance on a relatively shallow backbone (VGG16) and a traditional classifier (KNN) may restrict scalability and generalizability compared to more recent deep learning architectures.
Huang et al.22 presented a hybrid diagnostic framework for breast cancer that integrates a lightweight SqueezeNet architecture with an Improved Chef-Based Optimization (ICBO) algorithm. Their methodology employs a three-stage pipeline comprising noise reduction via median filtering, region of interest extraction using Kapur thresholding, and optimal feature selection through ICBO. The approach further refines SqueezeNet's parameters using ICBO for enhanced classification efficiency. While evaluation on the MIAS dataset demonstrated promising results, the framework's generalizability remains constrained by its exclusive validation on a single dataset, lacking assessment on larger or more diverse mammography collections. In addition, Mahmood et al.23 proposed a hybrid diagnostic framework for breast cancer that integrates radiomics with deep learning. Their method utilizes a novel feature extraction algorithm and employs transfer learning with modified architectures, such as VGGNet and SE-ResNet152. The framework also incorporates hybrid CNN+LSTM and CNN+SVM models for classification, and utilizes Grad-CAM to improve interpretability. Reported results show near-perfect sensitivity and an AUC of 0.99, although the scope of its validation was not specified.
Recently, transformer-based models have gained attention in computer vision for their ability to capture long-range dependencies. For the purpose of classifying breast cancer in histopathology pictures, Zeynali et al.24 presented a hybrid CNN-Transformer architecture in which Transformer modeled global context and Xception extracted local information. On the BreaKHis and IDC datasets, the method demonstrated great accuracy: 91% for binary IDC classification, 96.15%–99% in magnification-dependent situations, and 94.82%–99.62% in magnification-independent scenarios. While effective, this work was limited to histopathology data and did not explore mammography, raising questions about its cross-modality generalizability. Hayat et al.25 proposed a hybrid EffNetV2–ViT framework for breast cancer histopathology classification. EfficientNetV2 (small, medium, and large variants) was used for feature extraction, followed by a Vision Transformer for classification. On the BreakHis dataset, the model achieved state-of-the-art results with 99.83% accuracy for binary classification and 98.10% for eight-class classification. Moreover, Chen et al.26 introduced a two-stage framework for breast cancer detection on screening mammograms. In the first stage, self-supervised learning (SSL) was employed to pretrain a Swin Transformer backbone on a limited set of mammograms. This pretrained model was then incorporated into a hybrid architecture (HybMNet), where the Swin Transformer identified informative regions through local self-attention, while a CNN extracted fine-grained features from these patches. A fusion module subsequently integrated the global and local representations. The method achieved AUCs of 0.864 on the CMMD dataset and 0.889 on the INbreast dataset, highlighting the effectiveness of combining SSL with a hybrid design.
Marwa et al.27 proposed an explainable AI framework for breast cancer classification based on Vision Transformer (ViT) architectures. Unlike traditional CNNs, which often struggle to capture long-range dependencies, the ViT leverages self-attention to model global relationships between pixels in histopathology images. The framework was trained for binary lesion classification (benign vs. malignant) using the BreakHis dataset and achieved superior performance compared to state-of-the-art CNN models. While the method demonstrated strong accuracy and enhanced transparency, it was restricted to histopathology data and did not explore its generalizability to other modalities such as mammography or ultrasound. Shahzad et al.28 proposed a breast cancer prediction model that integrates EfficientNetB0 and ResNet50 for classifying histopathology images into IDC and non-IDC. By combining the strengths of both architectures, the model achieved an accuracy of 94%, a mean absolute error of 0.0628, and a Matthews Correlation Coefficient of 0.8690, demonstrating superior performance compared to prior baselines.
Jahan et al.29 developed a deep learning system for the automated detection and subtyping of breast cancer from whole slide images. Their three-stage framework, which involved patch-level detection, subtype classification, and patient-level aggregation, was evaluated on several models. A fine-tuned Vision Transformer (ViT) model demonstrated superior performance, achieving 96.74% accuracy in detecting cancerous patches and 89.78% in differentiating between ductal and lobular carcinoma. These results highlight the efficacy of transformer-based architectures in analyzing complex histopathological imagery. In addition, Boudouh et al.30 proposed a hybrid deep learning framework to enhance the classification of breast calcifications in mammography images. The classification architecture consists of two parallel branches: a Vision Transformer (ViT++) branch designed to capture contextual features and a CNN branch that leverages transfer learning to extract visual representations. Multiple pretrained CNN models, including Xception, VGG16, and RegNetX002, were explored as feature extractors. Experimental evaluation on the CBIS-DDSM dataset demonstrated that while VGG16 alone achieved only 61.96% accuracy, the hybrid approach combining ViT++ with CNN features significantly improved performance, reaching a maximum accuracy of 99.22%.
The limits of deep learning in mammography have been further stretched by recent developments. Mammo-CLIP, the first vision-language foundation model pre-trained on large-scale mammogram-report pairs, was presented by Ghosh et al.31. Their method highlights the promise of multimodal pre-training by demonstrating exceptional data efficiency and robustness in categorizing and localizing mammographic features essential for breast cancer detection. A multi-scale Vision Transformer with a gated attention fusion module and Harris Hawks optimization for feature selection was proposed by Ahmed et al.32. It outperformed previous models through optimized hierarchical representation learning and achieved 98.2% accuracy on a BI-RADS-classified dataset. Zhou et al.33 introduced an ellipse-constrained pseudo-label refinement and symmetric regularization framework for semi-supervised fetal head segmentation, despite their focus on ultrasound imaging. This framework provided useful information for upcoming semi-supervised extensions in mammography classification.
Recent advances in vision models have addressed the limitations of traditional CNN and Transformer frameworks, offering superior accuracy and computational efficiency. Among these, Mamba-based architectures have gained significant attention for leveraging State Space Models (SSMs) to achieve linear-time sequence modeling. For instance, Vision Mamba (Vim)34 incorporates bidirectional Mamba blocks with positional encodings to improve representational capacity, delivering ImageNet-level performance while reducing computational overhead compared to Transformers. This idea is further expanded in VMamba35, which generalizes visual SSMs into scalable backbones for both image classification and semantic segmentation. Additionally, hierarchical variants such as SparX-Mamba adopt sparse cross-layer connections to enhance feature propagation, and BiFormer36 employs bi-level routing attention to introduce adaptive sparsity in Transformers, improving contextual awareness for detection tasks. The CSWin Transformer37 also surpasses the Swin Transformer on dense prediction benchmarks by applying cross-shaped window attention to balance local and global dependencies. In the CNN domain, UniRepLKNet38 integrates large-kernel convolutions to achieve universal perception across visual and temporal modalities, attaining 88.0% accuracy on ImageNet, while OverLoCK combines top-down attention with dynamic context-mixing kernels to strengthen feature representation in real-world clinical settings. Although these approaches demonstrate remarkable progress in general-purpose vision tasks, our proposed Swin-DAMFN is specifically optimized for mammographic analysis, combining a hybrid CNN-Transformer design with domain-adaptive multi-scale attention (MSA/TSCA) and GAN-based augmentation to address data imbalance and scarcity in medical imaging. As a result, Swin-DAMFN consistently outperforms generic visual backbones on the CBIS-DDSM and MIAS benchmarks.
Moreover, several trends toward hybrid and Mamba-inspired architectures for breast imaging analysis are emerging. Aslam et al.39 proposed HA-Net, a hybrid attention-based segmentation framework that integrates DenseNet-121 with global spatial and scaled dot-product attention, achieving strong generalization on BUS datasets. Bayatmakou et al.40 introduced Mammo-Mamba, a selective state-space model (SSM) combined with Transformer attention through a Sequential Mixture of Experts block for efficient multi-view mammography analysis. Similarly, Laicu-Hausberger and Popa41 presented UltraScanNet, a Mamba-inspired hybrid backbone merging MobileViT and state-space modules for ultrasound breast lesion classification. Nasiri-Sarvi et al.42 explored Vision Mamba encoders for breast ultrasound classification, demonstrating statistically significant performance gains over CNN and ViT baselines. Zhu et al.43 proposed AttmNet, which integrates convolution, self-attention, and Mamba blocks for enhanced lesion segmentation. Hybrid multimodal systems such as TransBreastNet44 and the hybrid CNN–BiLSTM–EfficientNet architecture by Lilhore et al.45 further highlight the potential of combining temporal and spatial feature modeling for improved diagnosis. Additionally, Sun46 presented a LightGBM hybrid integration model incorporating gradient harmonization and Jacobian regularization for stable and interpretable breast cancer diagnosis. Collectively, these 2024–2025 studies confirm the ongoing shift toward hybrid, attention-driven, and state-space transformer frameworks, validating the design rationale behind our proposed Swin-DAMFN architecture.
Furthermore, Ahmed et al.32 introduced a multi-scale Vision Transformer (MAX-ViT) framework for mammogram classification, which uses a gated attention module to fuse features and the Harris Hawks Optimization algorithm for feature selection before final classification with XGBoost. Evaluated on a single institutional dataset, the model achieved exceptional metrics, including 98.2% accuracy and a 98.9% AUC, outperforming contemporary methods. However, the authors note significant limitations, as the model’s generalizability is unproven due to the lack of diverse external validation, and its computational complexity may hinder practical clinical application.
For a concise summary of recent progress in deep learning and transformer-based breast cancer classification, Table 1 provides a comparative overview of state-of-the-art methods. The table highlights the applied methodologies, utilized datasets, performance outcomes, imaging modality, and the main limitations of each study.
Additionally, Lou et al.47 proposed TransXNet, a hybrid CNN-Transformer architecture that uses a lightweight dual dynamic token mixer (D-Mixer) to concurrently model global and local dynamics in visual recognition tasks. By combining self-attention with input-dependent dynamic convolutions, TransXNet achieves state-of-the-art results on generic benchmarks such as ImageNet; however, it lacks specific modules for fine-grained characteristics like microcalcifications in mammograms, and its application to medical imaging remains under-explored. Also, Pan et al.48 introduced ACMix, a method for seamlessly integrating self-attention and convolution through shared parameters; by enabling convolutional kernels to carry out attention-like operations, ACMix improves efficiency. However, it does not include multi-scale attention or domain-specific augmentation, which restricts its direct applicability to imbalanced medical datasets. In contrast, our approach couples hybrid fusion with custom MSA and TSCA modules for multi-scale local extraction, TA for cross-dimensional refinement, and GAN-photometric augmentation for data balancing, achieving superior diagnostic accuracy in breast cancer classification.
A number of recent studies have expanded the methodological landscape of attention-driven and hybrid deep learning frameworks for medical image categorization beyond breast cancer imaging. To illustrate the efficacy of multi-level feature aggregation, Ullah and Kim49 introduced a hierarchical feature fusion ensemble that combines Vision Transformer representations with optimal machine learning classifiers. Similar to this, Ullah et al.50 consistently improved accuracy across various histology and MRI datasets by methodically integrating attention modules, such as SE and CBAM, into traditional CNNs. In addition, Ullah et al.51 introduced a densely connected attention-based network (DAM-Net) for chest X-ray analysis, highlighting the role of dense feature connectivity in capturing subtle image cues. Although these works target other medical domains, they collectively reinforce the significance of hybrid architectures and attention mechanisms, principles that our proposed Swin-DAMFN framework adapts and extends for mammographic breast cancer classification.
In summary, although significant progress has been made in breast cancer classification through conventional and deep learning approaches, several challenges remain unresolved. CNN-based models are often limited in capturing global contextual information, while transformer-based methods demand large-scale datasets and high computational resources. Furthermore, dataset imbalance and lack of generalization across different mammography datasets hinder reliable deployment in clinical practice. These gaps highlight the need for hybrid frameworks that integrate local and global feature learning, supported by advanced augmentation strategies, to achieve robust and clinically applicable breast cancer classification.

Methodology

This study introduces Swin-DAMFN, a novel dual-branch deep learning architecture for breast cancer classification from mammograms, designed to exploit both global contextual information and fine-grained local features. As illustrated in Fig. 1, the overall framework begins with dataset preparation, followed by preprocessing and augmentation strategies to enhance data quality and balance class distribution. The processed images are then fed into the proposed dual-branch model for feature extraction and classification. The network architecture of Swin-DAMFN, depicted in Fig. 2, consists of two complementary branches. The first branch employs a Swin Transformer, which captures long-range global dependencies across breast tissue through its hierarchical structure and shifted-window attention mechanism. The second branch, the Dual-Attention Multi-scale Fusion Network (DAMFN), is a CNN-based pathway designed to extract discriminative local patterns such as microcalcifications and mass margins. DAMFN integrates three specialized blocks: (i) Multi-Separable Attention (MSA), (ii) Tri-Shuffle Convolution Attention (TSCA), and (iii) a Traditional Convolution (TC) block, each operating at multiple receptive fields to capture fine-grained multi-scale information.
Features from both branches are fused using depth-wise concatenation, refined through a Triplet Attention (TA) block, and pooled via Global Average Pooling (GAP). The final classification is performed by fully connected layers with cross-entropy as the loss function. This architecture ensures that both global and local information contribute effectively to the final decision, thereby enhancing the robustness and clinical reliability of the model.
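To make the fusion stage concrete, the following is a minimal PyTorch sketch of depth-wise concatenation, Triplet Attention refinement, GAP, and the fully connected head, assuming both branches already produce feature maps on a common spatial grid; the Triplet Attention module is passed in as a black box, and the dropout rate and channel arguments are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Fusion-and-classification stage: depth-wise concatenation of the global
    (Swin) and local (DAMFN) maps, Triplet Attention refinement, GAP, then a
    dropout + fully connected classifier trained with cross-entropy."""
    def __init__(self, swin_ch, damfn_ch, num_classes, triplet_attention):
        super().__init__()
        self.attn = triplet_attention          # TA block, defined elsewhere
        self.gap = nn.AdaptiveAvgPool2d(1)     # global average pooling
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),                   # illustrative rate
            nn.Linear(swin_ch + damfn_ch, num_classes),
        )

    def forward(self, f_global, f_local):
        # Both inputs are (B, C, H, W) with matching H and W.
        fused = torch.cat([f_global, f_local], dim=1)  # depth-wise concat
        fused = self.attn(fused)                       # cross-dimensional refinement
        pooled = self.gap(fused).flatten(1)            # (B, C_total)
        return self.classifier(pooled)                 # logits for nn.CrossEntropyLoss
```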

Dataset description
This study utilizes two publicly available benchmark datasets, the Curated Breast Imaging Subset of DDSM (CBIS-DDSM) and the Mammographic Image Analysis Society (MIAS) database, to evaluate the performance of the proposed model. Mammography is a critical tool for the early detection of breast cancer, primarily through the identification of masses and calcifications in X-ray images52. To guarantee that no patient appears in more than one split (training/validation/test), the CBIS-DDSM dataset was divided at the patient level using the supplied patient IDs. For the MIAS dataset, which lacks explicit patient identifiers, image-wise splitting was used with rigorous manual verification to prevent any implicit overlap of the same breast or laterality/view.
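Such a patient-level split can be implemented with scikit-learn's GroupShuffleSplit; the tiny DataFrame below is a hypothetical stand-in for the CBIS-DDSM metadata index, not the actual file layout.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the CBIS-DDSM index; real code would read the provided CSVs.
df = pd.DataFrame({
    "image_path": [f"img_{i}.dcm" for i in range(8)],
    "label":      ["benign", "malignant"] * 4,
    "patient_id": ["P1", "P1", "P2", "P2", "P3", "P3", "P4", "P4"],
})

# Grouping by patient ID guarantees that no patient crosses the split boundary.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(gss.split(df, df["label"], groups=df["patient_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
assert set(train_df["patient_id"]).isdisjoint(set(test_df["patient_id"]))
```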

CBIS-DDSM dataset
The CBIS-DDSM is an updated and standardized version of the original Digital Database for Screening Mammography (DDSM)53. This curated dataset offers significant improvements, including images in the standardized DICOM format and a restructured, more accessible metadata organization. It comprises 1,459 mammograms annotated across four diagnostic categories: Benign Calcification (398 images), Benign Mass (417 images), Malignant Calcification (300 images), and Malignant Mass (344 images). All mammograms were preprocessed by resizing them to a uniform pixel resolution.

MIAS dataset
The MIAS database was used to assess the suggested model’s robustness and generalizability54. This dataset, which was assembled by a collection of UK research groups, includes 322 mammograms divided into three classes: benign (63 images), malignant (52 images), and normal (207 images).
A summary of the key characteristics of both datasets is provided in Table 2. Classification on the MIAS dataset is carried out as a three-class problem (Benign, Malignant, and Normal), with the Normal class handled as a separate category. As previously mentioned, the CBIS-DDSM dataset is classified into four classes. The model employs categorical cross-entropy loss in both scenarios, and assessment metrics (such as sensitivity) are macro-averaged over all classes. In MIAS, the Normal class is treated in the same way as the other classes during feature extraction and fusion, which take place before the classification head.
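Macro-averaging of this kind corresponds to scikit-learn's average="macro" option, where sensitivity is macro-averaged recall; the label vectors below are illustrative only.

```python
from sklearn.metrics import f1_score, recall_score

# Illustrative three-class MIAS-style labels: 0=benign, 1=malignant, 2=normal.
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2]

# Macro-averaging weights every class equally, including the Normal class.
sensitivity = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
print(f"macro sensitivity = {sensitivity:.4f}, macro F1 = {f1:.4f}")
```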

Data preprocessing
Accurate identification of breast lesions within the surrounding tissue presents a significant diagnostic challenge, with a non-negligible potential for misdiagnosis55. These lesions are often characterized geometrically by features such as origin, size, and perimeter, and typically appear brighter than the adjacent breast tissue56. Radiologists often compare mammograms of the same breast taken over time to monitor the progression of lesions. This temporal analysis relies on identifying stable control points (e.g., tissue junctions, duct intersections, vascular features) and calculating correlations between these points across sequential images. Adapting deep learning models to this complex environment is a critical challenge in medical image analysis. Preprocessing is a crucial step in enhancing the signal-to-noise ratio, thereby improving the distinction between normal anatomical structures and pathological features. To mitigate the risk of overfitting and improve model generalization, a series of preprocessing steps, including noise removal, image enhancement, scaling, and cropping, is applied prior to training.
In this paper, we employ a combination of techniques to improve mammogram readability and feature prominence. This includes mean and median filtering for noise reduction, morphological operations to refine structures, and Contrast Limited Adaptive Histogram Equalization (CLAHE) to enhance local contrast without amplifying noise.

Image preparation
The acquisition of a large, high-quality dataset of mammography images presents significant practical challenges. Raw mammograms contain sensitive patient information, inherent noise, and variabilities in lighting conditions, all of which can contribute to a higher rate of false positives in automated analysis. To mitigate these issues while preserving critical information, images are converted to a grayscale format. The inherent limitations of mammography, combined with a scarcity of available images, pose a major obstacle to training accurate deep learning models. Data scarcity often leads to overfitting, in which models perform well on training data but fail to generalize to new examples. Furthermore, annotating breast images for supervised algorithm training is labor-intensive and costly57. To prepare the images for analysis, morphological operations, specifically opening and closing, are applied to remove identifying labels and annotation tags. These operations are based on the fundamental processes of erosion and dilation, which reduce and expand the sizes of structures within an image, respectively.
The erosion of a set $E$ by a structuring element $S$ is defined as the set of all points $p$ for which $S$ translated by $p$ is entirely contained within $E$:

$$E \ominus S = \{\, p \mid S_p \subseteq E \,\} \tag{1}$$

The dilation of $E$ by $S$ is defined as the set of all points $p$ for which the reflected structuring element translated by $p$ has a non-empty intersection with $E$:

$$E \oplus S = \{\, p \mid (\hat{S})_p \cap E \neq \emptyset \,\} \tag{2}$$

Opening and closing stem from these fundamental operations. An opening is defined as erosion followed by dilation and is used to remove small, bright artifacts and break thin connections:

$$E \circ S = (E \ominus S) \oplus S \tag{3}$$

Closing is defined as dilation followed by erosion and is used to remove small, dark holes and fuse narrow breaks:

$$E \bullet S = (E \oplus S) \ominus S \tag{4}$$

In this work, the morphological operations defined by Eqs. (3) and (4) are employed to effectively eradicate labeling and annotation markers from the mammography images while preserving the essential features of the underlying breast tissue. Furthermore, let $I$ be the input image with dimensions (height, width, channels). After grayscale conversion, $I \in \mathbb{R}^{H \times W \times 1}$, and the image is normalized to the $[0,1]$ range using $I' = I / 255$.
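As a worked example, the opening and closing of Eqs. (3) and (4), followed by grayscale normalization, can be applied with OpenCV as sketched below; the file name and structuring-element size are hypothetical choices, not the paper's reported settings.

```python
import cv2
import numpy as np

# Load a mammogram as grayscale ("mammogram.png" is a hypothetical file).
img = cv2.imread("mammogram.png", cv2.IMREAD_GRAYSCALE)

# Elliptical structuring element; the size is an illustrative choice.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))

# Opening (erosion then dilation) suppresses small bright artifacts such as
# burned-in labels; closing (dilation then erosion) fills small dark holes.
opened = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)
cleaned = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)

# Normalize intensities to the [0, 1] range, as described above.
normalized = cleaned.astype(np.float32) / 255.0
```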

Image noise reduction
Following acquisition, mammography images undergo processing to eliminate various forms of noise that can degrade image quality and hinder accurate analysis. Common noise types in these images include Gaussian, salt-and-pepper, speckle, and Poisson noise. To address this, we employ an adaptive median filtering approach, which processes the noisy input image to produce a denoised output image. The adaptive median filter is particularly effective at suppressing impulse noise, such as salt-and-pepper noise, while preserving edges and details. The algorithm operates as follows. Initialization: for each pixel, a sliding window $W$ of size $s \times s$ (initially $3 \times 3$) is defined. The minimum and maximum pixel intensities within this window are identified as $z_{\min}$ and $z_{\max}$, respectively, and the median value within the window is $z_{med} = \mathrm{median}(W)$. The parameter $s_{\max}$ represents the maximum allowable window size for the adaptive process.

Impulse detection: The condition $z_{\min} < z_{med} < z_{\max}$ is evaluated. If this condition is satisfied, $z_{med}$ is not considered an impulse, and the algorithm proceeds to the next step. If not, the window size is increased, and the first step is repeated until either the median value is no longer an impulse or the window reaches its maximum allowed size ($s_{\max}$). If the maximum size is reached, the median value is assigned as the filtered output for the pixel.

Pixel replacement: If the median was deemed not to be an impulse in the previous step, the algorithm then checks the original pixel value $z_{ij}$. If $z_{\min} < z_{ij} < z_{\max}$, the pixel is considered uncorrupted, and its value remains unchanged in the filtered image ($\hat{I}(i,j) = z_{ij}$). However, if $z_{ij}$ equals either $z_{\min}$ or $z_{\max}$, it is classified as corrupted and replaced by the median value ($\hat{I}(i,j) = z_{med}$).

Mathematically, for a pixel at position $(i, j)$, the adaptive median is $z_{med} = \mathrm{median}\{\, I(m, n) : (m, n) \in W \,\}$, where $W$ is the window centered at $(i, j)$. The window size starts at $3 \times 3$ and increases by 2 until it reaches $s_{\max}$ or the impulse condition is met. This adaptive process ensures effective noise reduction while minimizing the blurring of fine details and edges, which are crucial for accurate medical image analysis.
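The procedure above can be written as a plain NumPy reference implementation; this sketch favors readability over speed, with the maximum window size s_max as a free parameter.

```python
import numpy as np

def adaptive_median_filter(img: np.ndarray, s_max: int = 7) -> np.ndarray:
    """Adaptive median filter: grow the window until the median is not an
    impulse (z_min < z_med < z_max) or the window reaches s_max x s_max."""
    out = img.copy()
    pad = s_max // 2
    padded = np.pad(img, pad, mode="reflect")
    height, width = img.shape
    for i in range(height):
        for j in range(width):
            size = 3
            while True:
                r = size // 2
                win = padded[i + pad - r:i + pad + r + 1,
                             j + pad - r:j + pad + r + 1]
                z_min, z_med, z_max = win.min(), np.median(win), win.max()
                if z_min < z_med < z_max:       # median is not an impulse
                    z = img[i, j]
                    # Keep uncorrupted pixels; replace impulses by the median.
                    out[i, j] = z if z_min < z < z_max else z_med
                    break
                size += 2                        # enlarge the window
                if size > s_max:                 # fall back to the median
                    out[i, j] = z_med
                    break
    return out
```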

Image enhancement using CLAHE
A widely adopted method in medical image processing for enhancing local contrast is contrast-limited adaptive histogram equalization (CLAHE). It functions by partitioning the input image into smaller, non-overlapping regions, known as tiles. Within each tile, a histogram of pixel intensities is calculated and adjusted using a predefined clip limit, which prevents the over-enhancement of noise. The clipped pixels are then evenly redistributed across all histogram bins, ensuring contrast enhancement while maintaining visual consistency and suppressing artifacts58,59. The total number of tiles, represented by $T$, is computed by the following expression:

$$T = \left(\frac{N}{n}\right)^2 \tag{5}$$

where $N$ is the total image dimension and $n$ is the size of each tile.

To enhance visual quality effectively, the algorithm applies a Normalized Contrast Limit (NCL), while the average number of pixels per grayscale level ($N_{avg}$) is given by:

$$N_{avg} = \frac{N_x \times N_y}{N_g} \tag{6}$$

Here, $N_g$ represents the number of grayscale levels, and $N_x$ and $N_y$ correspond to the pixel dimensions along the x and y axes, respectively.

The mean number of clipped pixels per bin, $N_{acp}$, is derived using:

$$N_{acp} = \frac{N_{cp}}{N_g} \tag{7}$$

where $N_{cp}$ is the total number of pixels exceeding the clip limit.

Any remaining pixels, which are not distributed by this process, are uniformly spread across all grayscale levels using:

$$\mathrm{step} = \frac{N_g}{N_{rp}} \tag{8}$$

where $N_{rp}$ denotes the number of remaining unallocated pixels. For a tile histogram $h(b)$ with bins $b \in \{0, \dots, 255\}$ (for 8-bit grayscale), the clip limit is $\beta = \alpha \, n^2$, where $\alpha$ is the clip factor (typically 0.01 to 0.03). The cumulative distribution function (CDF) used for the intensity mapping is $\mathrm{CDF}(b) = \frac{1}{n^2} \sum_{k=0}^{b} h_c(k)$, where $h_c$ is the clipped histogram.
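In practice, this is available directly in OpenCV; the clip limit and tile grid below are common defaults, not the paper's reported parameter values.

```python
import cv2

# CLAHE: clipLimit bounds per-bin counts (contrast limiting) and tileGridSize
# sets the number of tiles; 8x8 tiles are a common choice for mammograms.
img = cv2.imread("mammogram_denoised.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(img)
```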

Data augmentation
Medical imaging datasets for breast cancer detection are often limited in size and suffer from class imbalance, which can negatively impact the performance and generalization ability of deep learning models. In particular, the MIAS dataset suffers from a small number of samples, while the CBIS-DDSM dataset exhibits class imbalance with malignant lesions being relatively underrepresented. These challenges increase the risk of overfitting and hinder the ability of deep learning models to generalize to unseen data. To overcome these challenges, we adopted a hybrid augmentation strategy that integrates Generative Adversarial Networks (GANs)60 with photometric augmentation techniques. The GAN-based augmentation was employed to generate realistic mammogram samples, particularly for minority classes, thereby addressing the imbalance problem. In parallel, photometric augmentation was applied to enhance data diversity by altering brightness, contrast, color, and sharpness levels of the original images. The combination of these two approaches not only increased the dataset size but also enriched its quality and variability, resulting in a more balanced and comprehensive dataset for training. The overall distribution of the datasets before and after augmentation is summarized in Table 3.

GAN-based augmentation
Class imbalance and the lack of annotated mammograms were addressed using GANs. A GAN framework comprises two neural networks, a Generator and a Discriminator, which undergo adversarial training. The generator learns to create realistic images that mimic the distribution of actual mammograms, while the discriminator attempts to distinguish generated images from real ones. Through this adversarial process, the generator gradually produces realistic, high-quality mammograms that capture fine-grained diagnostic patterns, such as masses and microcalcifications, as well as global tissue features. GAN training is considered stable when the discriminator can no longer consistently distinguish between real and synthetic samples (i.e., when it converges to about 50% accuracy). The GAN architecture overview is shown in Fig. 3.
The objective function of the GAN is defined in Eq. (9):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{9}$$

where $x$ represents a real mammogram, $z$ is a random noise vector, $G(z)$ is the generated synthetic mammogram, $D(x)$ and $D(G(z))$ denote the Discriminator outputs for real and synthetic images, respectively, and $p_{data}$ and $p_z$ represent the real data and noise distributions.
In our study, GAN-based augmentation is applied differently to the two datasets:

MIAS: Due to the small dataset size, GANs are used to significantly expand the number of training samples, particularly for the minority malignant class, thus reducing overfitting and improving model generalization.

CBIS-DDSM: GANs are selectively applied to balance the class distribution by synthesizing malignant cases, thereby reducing bias toward benign classes. This ensures that the classifier does not become skewed by class imbalance.

In this study, GAN augmentation was particularly effective for minority classes such as malignant lesions in the MIAS dataset and malignant calcifications in CBIS-DDSM. By generating realistic mammograms for these underrepresented categories, GANs significantly improved the class balance and enhanced the diversity of the training set.

Conditional deep convolutional GAN (cDCGAN)
To create realistic mammograms for minority classes, a conditional Deep Convolutional GAN (cDCGAN)61 was trained on the training split only. The generator consists of five transposed convolutional layers (4x4 kernels, stride 2) with batch normalization and ReLU activations, while the discriminator consists of five convolutional layers (4x4 kernels, stride 2) with LeakyReLU activations (slope 0.2). Training employed the Adam optimizer with a mini-batch size of 16 for 200 epochs (CBIS-DDSM) and 300 epochs (MIAS). To preserve anatomical structures, the objective combined a non-saturating adversarial loss with an L1 reconstruction term ($\lambda = 100$). Early stopping was applied if discriminator accuracy stabilized around 50% for 10 epochs. The augmented dataset sizes shown in Table 3 were obtained by generating three to five synthetic images for each real minority-class sample, and each mini-batch during training contained a balanced 50:50 mixture of real and synthetic images.
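A minimal PyTorch sketch matching the stated layer counts is given below: five 4x4, stride-2 transposed convolutions with batch normalization and ReLU in the generator, and five 4x4, stride-2 convolutions with LeakyReLU(0.2) in the discriminator. The 64x64 output size, channel widths, and label-embedding conditioning are illustrative assumptions, and label conditioning of the discriminator is omitted for brevity.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """cDCGAN generator: five transposed convs (4x4; stride 2 except the first,
    which projects the noise) with BN + ReLU, conditioned by concatenating a
    learned label embedding to the noise vector."""
    def __init__(self, z_dim=100, n_classes=4, ch=64):
        super().__init__()
        self.embed = nn.Embedding(n_classes, z_dim)

        def up(cin, cout):
            return [nn.ConvTranspose2d(cin, cout, 4, 2, 1, bias=False),
                    nn.BatchNorm2d(cout), nn.ReLU(True)]

        self.net = nn.Sequential(
            nn.ConvTranspose2d(2 * z_dim, ch * 8, 4, 1, 0, bias=False),  # 1 -> 4
            nn.BatchNorm2d(ch * 8), nn.ReLU(True),
            *up(ch * 8, ch * 4),                              # 4 -> 8
            *up(ch * 4, ch * 2),                              # 8 -> 16
            *up(ch * 2, ch),                                  # 16 -> 32
            nn.ConvTranspose2d(ch, 1, 4, 2, 1), nn.Tanh(),    # 32 -> 64, grayscale
        )

    def forward(self, z, labels):
        cond = torch.cat([z, self.embed(labels)], dim=1)      # (B, 2*z_dim)
        return self.net(cond[:, :, None, None])               # (B, 1, 64, 64)

class Discriminator(nn.Module):
    """cDCGAN discriminator: five convs (4x4; stride 2 except the last)
    with LeakyReLU(0.2); outputs one real/fake logit per image."""
    def __init__(self, ch=64):
        super().__init__()

        def down(cin, cout):
            return [nn.Conv2d(cin, cout, 4, 2, 1), nn.LeakyReLU(0.2, True)]

        self.net = nn.Sequential(
            *down(1, ch), *down(ch, ch * 2),                  # 64 -> 32 -> 16
            *down(ch * 2, ch * 4), *down(ch * 4, ch * 8),     # 16 -> 8 -> 4
            nn.Conv2d(ch * 8, 1, 4, 1, 0),                    # 4 -> 1 logit map
        )

    def forward(self, x):
        return self.net(x).flatten(1)                         # (B, 1) logits
```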
The realism of synthetic mammograms was quantitatively evaluated using the Fréchet Inception Distance (FID), as shown in Eq. (10):

$$\mathrm{FID} = \lVert m_r - m_s \rVert_2^2 + \mathrm{Tr}\!\left(C_r + C_s - 2\,(C_r C_s)^{1/2}\right) \tag{10}$$

where $(m_r, C_r)$ and $(m_s, C_s)$ are the means and covariances of the feature embeddings (from a pretrained Inception-v3) of real and synthetic images, respectively. Lower values indicate higher similarity. Our model achieved an average FID of 18.7, indicating strong visual realism.
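Given the Inception-v3 embedding statistics for the two image sets, Eq. (10) can be evaluated directly; embedding extraction is omitted here.

```python
import numpy as np
from scipy import linalg

def fid(mu_r, cov_r, mu_s, cov_s):
    """Frechet Inception Distance from real/synthetic embedding statistics."""
    covmean = linalg.sqrtm(cov_r @ cov_s)      # matrix square root
    if np.iscomplexobj(covmean):               # discard tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_s) ** 2)
                 + np.trace(cov_r + cov_s - 2.0 * covmean))
```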
Qualitative inspection of generated samples confirmed high fidelity and absence of obvious artifacts. Illustrative examples are shown in Fig. 4.
While GAN-based augmentation significantly reduces data imbalance and enhances model generalization, it may also introduce a distribution shift between the real and synthetic mammography domains. To mitigate this risk, synthetic samples were used only during the training phase; the validation and testing subsets contained exclusively real data. Nevertheless, we recognize that a classifier trained on a combination of real and synthetic data may not fully generalize to purely clinical images. To overcome this constraint, future research will employ domain adaptation and adversarial realignment techniques on real multi-institutional datasets.

Photometric augmentation
In addition to GAN-based augmentation, photometric augmentation techniques were employed to further enrich the dataset's diversity. Unlike generative approaches, photometric augmentation modifies the pixel intensity distributions of existing images, producing multiple qualitative variations while preserving the original anatomical structures. Specific transformations are applied at random with a probability of 0.5, with the brightness, contrast, and sharpness factors each drawn from [0.8, 1.2]. Four types of transformations were considered (a minimal implementation sketch follows the list):

Brightness alteration: Adjusts the illumination of the image. Factors greater than 1 increase brightness, while values less than 1 decrease it. A factor of 0 results in a completely black image.

Contrast alteration: Modifies the ratio between light and dark regions. Higher factors amplify differences between bright and dark regions, while lower values reduce contrast.

Color alteration: Alters the intensity of RGB channels. A factor greater than 1 enhances color saturation, values between 0 and 1 desaturate the image, and a factor of 0 converts the image to grayscale.

Sharpness alteration: Enhances or smoothens edges. Larger factors sharpen borders and edges, whereas smaller values produce a blurred effect.
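A minimal sketch of these four transformations with PIL's ImageEnhance, using the stated factor range [0.8, 1.2] and per-transform application probability of 0.5:

```python
import random
from PIL import Image, ImageEnhance

def photometric_augment(img: Image.Image, p: float = 0.5,
                        lo: float = 0.8, hi: float = 1.2) -> Image.Image:
    """Randomly perturb brightness, contrast, color, and sharpness; each
    transform is applied with probability p, with a factor drawn from [lo, hi]."""
    enhancers = (ImageEnhance.Brightness, ImageEnhance.Contrast,
                 ImageEnhance.Color, ImageEnhance.Sharpness)
    for enhancer in enhancers:
        if random.random() < p:
            img = enhancer(img).enhance(random.uniform(lo, hi))
    return img
```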

Although both CBIS-DDSM and MIAS datasets are widely used in breast cancer research, they suffer from severe class imbalance, where malignant cases are notably underrepresented compared to benign and normal samples. This imbalance negatively impacts classification performance, particularly in terms of sensitivity to malignant tumors. To mitigate this issue, GANs were primarily utilized to oversample minority classes by generating realistic synthetic mammograms, thereby achieving a more balanced class distribution. Additionally, photometric augmentation techniques, including adjustments to brightness, contrast, color, and sharpness, were applied to enhance data diversity and improve generalization. These complementary strategies not only expanded the dataset size but also reduced bias between classes, resulting in a nearly balanced distribution across all categories. Table 3 presents the final dataset distribution after augmentation.

Proposed Swin-DAMFN architecture

Swin Transformer
The Swin Transformer branch captures global context and long-range dependencies in mammograms, which is essential for detecting the diffuse patterns and structural asymmetries indicative of breast cancer. As illustrated in Fig. 5, its hierarchical architecture employs a shifted window mechanism to achieve efficient multi-scale feature learning with computational complexity that scales linearly with input size62. The process begins with patch partitioning, where the input mammogram (of dimensions $H \times W$) is divided into non-overlapping patches. Each patch is subsequently embedded into a $C$-dimensional vector through a linear layer. These embedded tokens are then hierarchically refined by passing through four sequential stages of Swin Transformer blocks.
The input mammogram has dimensions $H \times W$. After patch partitioning into $4 \times 4$ patches and linear embedding to $C$ dimensions, the feature map is $\frac{H}{4} \times \frac{W}{4} \times C$. Subsequent patch merging reduces the resolution to $\frac{H}{8} \times \frac{W}{8}$ (Stage 2), $\frac{H}{16} \times \frac{W}{16}$ (Stage 3), and $\frac{H}{32} \times \frac{W}{32}$ (Stage 4).
A key innovation is the Shifted Window-based Self-Attention (SW-MSA) mechanism. Unlike standard Vision Transformers that compute self-attention globally, the Swin Transformer limits computation to non-overlapping local windows. Between consecutive blocks, the window partitioning scheme is shifted, facilitating cross-window communication and significantly enhancing modeling capability without a quadratic increase in computational cost. The complexity for an image with $h \times w$ patches is given by:

$$\Omega(\mathrm{W\text{-}MSA}) = 4hwC^2 + 2M^2hwC$$

where $M$ is the window size and $C$ is the feature dimension.
Each stage reduces the spatial resolution while increasing the feature dimensionality through a Patch Merging layer, which concatenates features from groups of $2 \times 2$ adjacent patches and applies a linear layer. This results in feature maps at resolutions of $\frac{H}{4} \times \frac{W}{4}$, $\frac{H}{8} \times \frac{W}{8}$, $\frac{H}{16} \times \frac{W}{16}$, and $\frac{H}{32} \times \frac{W}{32}$ across the four stages, respectively.
The Swin transformer blocks utilize W-MSA and SW-MSA modules consecutively to extract features and spatial correlations from the input data. Figure 6 depicts the connection between two successive Swin transformer blocks.
A Swin Transformer block is formulated as:

$$\hat{z}^{l} = \mathrm{W\text{-}MSA}\!\left(\mathrm{LN}(z^{l-1})\right) + z^{l-1}$$
$$z^{l} = \mathrm{MLP}\!\left(\mathrm{LN}(\hat{z}^{l})\right) + \hat{z}^{l}$$
$$\hat{z}^{l+1} = \mathrm{SW\text{-}MSA}\!\left(\mathrm{LN}(z^{l})\right) + z^{l}$$
$$z^{l+1} = \mathrm{MLP}\!\left(\mathrm{LN}(\hat{z}^{l+1})\right) + \hat{z}^{l+1}$$

where $\hat{z}^{l}$ and $z^{l}$ denote the outputs of the (S)W-MSA module and the MLP module for block $l$, respectively. Layer normalization (LN) is applied before each module, and residual connections are applied around each module to stabilize the training process.
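The pre-norm residual wiring of these equations translates directly into PyTorch; the window attention (W-MSA for even blocks, SW-MSA for odd ones, including partitioning and cyclic shifting) is treated as a given module, so this sketch shows only the block structure.

```python
import torch.nn as nn

class SwinBlock(nn.Module):
    """One Swin block: LN -> (S)W-MSA -> residual, then LN -> MLP -> residual.
    `attn` is a window-attention module operating on (B, N, dim) token tensors."""
    def __init__(self, dim, attn, mlp_ratio=4):
        super().__init__()
        self.norm1, self.attn = nn.LayerNorm(dim), attn
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, z):
        z = z + self.attn(self.norm1(z))   # (S)W-MSA with residual connection
        z = z + self.mlp(self.norm2(z))    # MLP with residual connection
        return z
```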
The output from the final stage, a rich set of 96 global features, is subsequently forwarded to the fusion stage, where it is integrated with the local features extracted by the DAMFN track.

Dual-Attention Multi-scale Fusion Network (DAMFN) track
The Dual-Attention Multi-scale Fusion Network (DAMFN) track is a pivotal component of our proposed architecture, working in synergy with the Swin Transformer pathway. As illustrated in Fig. 7, the DAMFN track is specifically engineered to process input mammography images at their original resolution, ensuring no loss of the critical, fine-grained details essential for accurate diagnosis.
This branch is dedicated to extracting and refining hierarchical, multi-scale features from the input data. Its core objective is to capture low-level spatial dependencies and intricate local patterns–such as tissue texture, micro-calcifications, and mass margins–that are paramount for distinguishing malignant from benign lesions. The DAMFN incorporates dedicated attention mechanisms within its blocks to dynamically highlight these clinically relevant regions while suppressing irrelevant background information, ensuring the model focuses its computational resources on the most discriminative features.
The DAMFN is composed of three fundamental blocks: the Multi-Scale Attention (MSA) block, the Tri-Shuffle Convolution Attention (TSCA) block, and the Traditional Convolutional (TC) block. In the final stage of the track, the MSA and TSCA blocks operate in parallel. Their outputs are concatenated and subsequently processed by a final TC block. This design enhances the network’s adaptability and efficacy in capturing a diverse spectrum of features across multiple scales and spatial contexts, significantly contributing to the model’s robustness. The cumulative output of this track is a rich, high-dimensional feature map that is seamlessly fused with the global, contextual features extracted by the Swin Transformer track, providing a comprehensive feature set for the final classification stage.
The input tensor is processed sequentially by the initial MSA, TSCA, and TC blocks; in the final stage, the parallel MSA and TSCA outputs are concatenated along the channel dimension and processed by the final TC block to produce the track’s local feature map. The exact tensor shapes at each stage are given in Table 4.
Multi-scale Attention (MSA) Block The MSA block is designed for efficient multi-scale feature extraction. It employs three parallel convolutional pathways: Learned Depth-wise Separable Convolutions by Group Pruning (LdsConv)63, Flexible and Separable Convolution (FSConv)64, and Depth-wise Separable Convolutions (DWSC)65. This parallel structure facilitates comprehensive feature learning at various representational levels. These advanced convolutional techniques decompose standard operations, offering a significant reduction in computational complexity and parameter count compared to traditional convolutions, without compromising representational power. The feature maps from all three pathways are concatenated and subsequently refined using a Convolutional Block Attention Module (CBAM) to amplify salient features crucial for breast cancer detection and suppress less informative noise. The structure of the MSA block is depicted in Fig. 8.
For an input feature map $X \in \mathbb{R}^{H \times W \times C}$, the three pathways produce $F_{\text{Lds}} = \text{LdsConv}(X)$, where pruning removes low-importance groups, $F_{\text{FS}} = \text{FSConv}(X)$, and $F_{\text{DWS}} = \text{DWSC}(X)$. The concatenated features are $F = [F_{\text{Lds}};\, F_{\text{FS}};\, F_{\text{DWS}}]$. CBAM refines them sequentially as $F' = M_c(F) \otimes F$ and $F'' = M_s(F') \otimes F'$, where $M_c$ and $M_s$ are the channel and spatial attention maps and $\otimes$ is element-wise multiplication.
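A compact sketch of the MSA block's topology follows. Because LdsConv63 and FSConv64 have their own learned decompositions, we stand in for them with plain depthwise-separable and grouped convolutions, and for CBAM with its standard channel-then-spatial gating; treat this as an illustration of the structure rather than the authors' exact layers:

```python
import torch
import torch.nn as nn

def dws_conv(c_in, c_out, k=3):
    """Depthwise-separable convolution: per-channel spatial conv + 1x1 pointwise."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in),
        nn.Conv2d(c_in, c_out, 1),
    )

class MSABlock(nn.Module):
    """Three parallel separable pathways, concatenated and gated CBAM-style."""
    def __init__(self, c_in, c_branch):
        super().__init__()
        self.paths = nn.ModuleList([
            dws_conv(c_in, c_branch),                           # stand-in for LdsConv
            nn.Conv2d(c_in, c_branch, 3, padding=1, groups=4),  # stand-in for FSConv
            dws_conv(c_in, c_branch, k=5),                      # DWSC pathway
        ])
        c = 3 * c_branch
        self.channel_gate = nn.Sequential(      # CBAM-style channel attention M_c
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c // 8, 1),
            nn.ReLU(), nn.Conv2d(c // 8, c, 1), nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(      # CBAM-style spatial attention M_s
            nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):
        f = torch.cat([p(x) for p in self.paths], dim=1)
        f = f * self.channel_gate(f)            # F' = M_c(F) * F
        s = torch.cat([f.mean(1, keepdim=True), f.amax(1, keepdim=True)], dim=1)
        return f * self.spatial_gate(s)         # F'' = M_s(F') * F'
```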
Tri-shuffle convolution attention (TSCA) block Feature maps from the MSA block are processed by the TSCA block. This block utilizes three parallel convolutional layers: group convolution, dilated convolution, and a standard convolution with kernel sizes of 3, 2, and 1, respectively. A channel shuffle operation is applied to the input prior to these layers to promote cross-group information exchange and enhance feature diversity. The group convolution reduces computational overhead by dividing the input tensor into subgroups. The incorporated dilated convolution expands the receptive field without increasing the kernel size or parameter count, allowing for the capture of richer contextual information from breast tissue structures. The outputs of these parallel layers are concatenated and processed by a triplet attention mechanism, which further accentuates the most informative feature channels and spatial regions. An overview of the TSCA block is provided in Fig. 9.
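The channel shuffle referenced above is the standard ShuffleNet-style permutation; a minimal version is shown below (a sketch, not the paper's code):

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so grouped convs can exchange information."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # interleave members of each group
    return x.view(b, c, h, w)

x = torch.randn(2, 64, 28, 28)
print(channel_shuffle(x, groups=4).shape)     # torch.Size([2, 64, 28, 28])
```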
Traditional Convolutional (TC) Block The output from the TSCA block is further refined by the TC block. This block performs a series of standard operations to efficiently transform features and prepare them for fusion. The process initiates with batch normalization to enhance training stability, followed by a ReLU activation function to incorporate non-linearity. Feature abstraction and dimensionality reduction are achieved through a 2D max-pooling layer, while a dropout layer is applied as an effective regularization technique to mitigate overfitting. A final convolutional layer ensures the feature map has the appropriate dimensions for subsequent fusion operations.
For an input $x$, batch normalization computes $\hat{x} = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$, where $\mu$ and $\sigma^2$ are the batch mean and variance and $\gamma, \beta$ are learned parameters. ReLU: $f(x) = \max(0, x)$. Max-pool: $y_{i,j} = \max_{(p,q) \in \mathcal{R}_{i,j}} x_{p,q}$ over each pooling region $\mathcal{R}_{i,j}$. Dropout: $y = m \odot x$ with $m \sim \text{Bernoulli}(1-p)$. Final conv: $Y = W * X + b$.
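The TC block's stated operation order (batch norm, ReLU, max-pool, dropout, conv) maps directly onto a few PyTorch layers; the channel counts and dropout rate below are placeholders:

```python
import torch.nn as nn

class TCBlock(nn.Module):
    """Traditional Convolutional block: normalize, activate, downsample, regularize."""
    def __init__(self, c_in: int, c_out: int, p_drop: float = 0.5):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(c_in),                   # stabilize activations
            nn.ReLU(inplace=True),                  # non-linearity
            nn.MaxPool2d(2),                        # halve spatial resolution
            nn.Dropout2d(p_drop),                   # regularization vs. overfitting
            nn.Conv2d(c_in, c_out, 3, padding=1),   # project to the fusion width
        )

    def forward(self, x):
        return self.body(x)
```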

Triplet attention (TA) block
The Triplet Attention (TA) block is a lightweight yet powerful channel and spatial attention mechanism integrated into our model. Its resource-efficient design makes it ideal for complex medical image analysis tasks without introducing significant computational overhead. The schematic of the TA block is depicted in Fig. 10. TA captures cross-dimensional dependencies between channel (C) and spatial (H, W) dimensions through three parallel branches. This approach effectively addresses the limitation of standard channel attention mechanisms by incorporating rich cross-dimensional interactions. For an input tensor $\chi \in \mathbb{R}^{C \times H \times W}$, each branch performs a specific rotational operation followed by a Z-pooling layer and a standard convolutional layer. The Z-pool operation, which concatenates max and average pooling along the depth dimension, is defined as:

$\text{Z-pool}(\chi) = [\text{MaxPool}_{0d}(\chi),\ \text{AvgPool}_{0d}(\chi)]$

The outputs from the three branches are then aggregated using a simple averaging operation to produce the final refined tensor:

$y = \frac{1}{3}\left(\overline{\hat{\chi}_1 \omega_1} + \overline{\hat{\chi}_2 \omega_2} + \chi \omega_3\right)$

where $\omega_1$, $\omega_2$, and $\omega_3$ represent the three cross-dimensional attention weights, $\chi \omega_3$ is the un-rotated weighted tensor, and $\overline{\hat{\chi}_1 \omega_1}$ and $\overline{\hat{\chi}_2 \omega_2}$ denote rotations by 90 degrees to restore the original input shape of $C \times H \times W$.
Branch 1 (channel-spatial H): rotate $\chi$ by 90° along the H axis to $\hat{\chi}_1 \in \mathbb{R}^{W \times H \times C}$, Z-pool to $2 \times H \times C$, and convolve with a $k \times k$ kernel to obtain $\omega_1$. Branch 2 (channel-spatial W): rotate along the W axis to $\hat{\chi}_2 \in \mathbb{R}^{H \times C \times W}$, Z-pool to $2 \times C \times W$, and convolve to obtain $\omega_2$. Branch 3 (spatial): Z-pool $\chi$ to $2 \times H \times W$ and convolve to obtain $\omega_3$. A sigmoid activation on all three branches yields weights in [0, 1].
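Triplet Attention is a published module, so its three-branch structure can be sketched faithfully; the 7×7 branch convolution below follows the original Triplet Attention paper, though the framework here may differ in details:

```python
import torch
import torch.nn as nn

class ZPoolConv(nn.Module):
    """Z-pool (max + mean over dim 1) followed by a 7x7 conv and a sigmoid gate."""
    def __init__(self, k: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, x):  # x: (B, C', H', W')
        z = torch.cat([x.amax(1, keepdim=True), x.mean(1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(z))

class TripletAttention(nn.Module):
    """Average of three cross-dimensional attention branches (C-H, C-W, H-W)."""
    def __init__(self):
        super().__init__()
        self.gate_cw = ZPoolConv()
        self.gate_ch = ZPoolConv()
        self.gate_hw = ZPoolConv()

    def forward(self, x):  # x: (B, C, H, W)
        # Branch interacting C and W: rotate H into the pooled slot.
        x_cw = x.permute(0, 2, 1, 3)                           # (B, H, C, W)
        y1 = (x_cw * self.gate_cw(x_cw)).permute(0, 2, 1, 3)
        # Branch interacting C and H: rotate W into the pooled slot.
        x_ch = x.permute(0, 3, 2, 1)                           # (B, W, H, C)
        y2 = (x_ch * self.gate_ch(x_ch)).permute(0, 3, 2, 1)
        # Plain spatial branch over (H, W).
        y3 = x * self.gate_hw(x)
        return (y1 + y2 + y3) / 3.0
```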
In the context of breast cancer classification, the TA block enhances the model’s ability to identify and emphasize critical diagnostic patterns. It improves feature representation by highlighting salient channels and spatial regions associated with malignant findings (such as spiculated masses or micro-calcifications) while suppressing less informative features from dense breast tissue. This leads to more accurate and robust classification performance.

Fusion model
This section describes the complete layer-by-layer architecture and the fusion mechanism connecting the Swin Transformer and DAMFN branches. The design goal of this fusion is to combine the fine-grained local textures preserved by DAMFN with the high-level semantic representation captured by Swin, achieving complementary feature enrichment prior to final classification. A thorough architectural breakdown of both branches is given in Table 4, which includes input/output dimensions, kernel sizes, strides, and the number of convolutional layers per stage. This structured presentation enables direct implementation and a clear distinction between multi-scale convolutional processing and hierarchical transformer encoding.
Fusion mechanism. The fusion process begins by aligning the spatial dimensions of the Swin Transformer output and the DAMFN output. The Swin features are upsampled via bilinear interpolation to match the DAMFN resolution, ensuring pixel-wise correspondence. The two representations are then concatenated along the channel dimension, forming a composite tensor that encodes both global semantic and local structural cues. To enhance feature selectivity and suppress redundant information, a Triplet Attention module is subsequently applied across the spatial and channel axes.
Finally, a Global Average Pooling (GAP) operation aggregates the refined feature map into a compact representation, which is passed through fully connected layers for classification. The complete process is illustrated in Algorithm 1, which presents the computational flow and parameterization of each step for straightforward reimplementation.
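Under the stated mechanism (bilinear upsampling, channel concatenation, Triplet Attention, then GAP), a minimal fusion sketch looks like the following; the function name is ours and `attention` is assumed to be a module such as the TripletAttention sketched earlier:

```python
import torch
import torch.nn.functional as F

def fuse(swin_feat: torch.Tensor, damfn_feat: torch.Tensor,
         attention: torch.nn.Module) -> torch.Tensor:
    """Fuse global (Swin) and local (DAMFN) features into one pooled vector."""
    # Align spatial resolution: upsample Swin features to the DAMFN grid.
    swin_up = F.interpolate(swin_feat, size=damfn_feat.shape[-2:],
                            mode="bilinear", align_corners=False)
    # Concatenate along channels: global semantics + local structure.
    fused = torch.cat([swin_up, damfn_feat], dim=1)
    # Re-weight with Triplet Attention, then apply Global Average Pooling.
    fused = attention(fused)
    return fused.mean(dim=(2, 3))   # (B, C_swin + C_damfn)
```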

Classification head
The refined feature maps, enriched by the DAMFN and Swin Transformer tracks and processed by the Triplet Attention mechanism, are prepared for the final classification stage. To generate a fixed-length vector representation from these spatial feature maps, a GAP layer is applied. This operation summarizes each feature map by computing its average value, effectively reducing the spatial dimensions to $1 \times 1$ while preserving the channel-wise information. This significantly reduces the number of parameters in the subsequent fully connected layers, acting as a strong regularizer to mitigate overfitting. The resulting vector is then fed into a compact classification module, which consists of:

A Flatten layer to convert the pooled vector into a one-dimensional array.

A Dropout layer for further regularization, randomly disabling a fraction of neurons during training to prevent co-adaptation and enhance generalization.

A fully connected (Linear) layer with ReLU activation, serving as a hidden layer for non-linear feature transformation.

A final fully connected (Linear) layer that projects the features into the output logits space corresponding to the number of target classes (e.g., Benign, Malignant, and optionally Normal).

For the optimization of the classification head and the entire network, the Cross-Entropy loss function ($\mathcal{L}_{CE}$) is employed. This function quantifies the disparity between the predicted probability distribution and the true distribution y over the target classes, and is defined as:

$\mathcal{L}_{CE} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$

where C is the number of classes, $y_c$ is the ground truth label (binary indicator), and $\hat{y}_c$ is the predicted probability for class c after applying the softmax function. The loss function penalizes incorrect predictions and guides the model, via backpropagation, to adjust its weights and biases to align its predictions more closely with the true labels. By minimizing this loss throughout the training process, the model’s classification accuracy is progressively improved.
Let the fused features be $F \in \mathbb{R}^{H \times W \times C}$. GAP computes $v_c = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} F_{i,j,c}$ for each channel c, yielding $v \in \mathbb{R}^{C}$, which is flattened to a 1D vector. Dropout: $v' = m \odot v$, $m \sim \text{Bernoulli}(1-p)$. Hidden FC: $h = \text{ReLU}(W_1 v' + b_1)$. Output FC: $z = W_2 h + b_2$, $z \in \mathbb{R}^{K}$ for K classes. Softmax: $\hat{y}_c = e^{z_c} / \sum_{k=1}^{K} e^{z_k}$.
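These steps correspond one-to-one to a compact PyTorch head; the hidden width below is a placeholder, and the softmax is folded into the loss as is idiomatic:

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    """GAP -> flatten -> dropout -> hidden FC -> logits; trained with cross-entropy."""
    def __init__(self, c_in: int, num_classes: int, hidden: int = 256, p: float = 0.5):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),          # GAP: (B, C, H, W) -> (B, C, 1, 1)
            nn.Flatten(),                     # -> (B, C)
            nn.Dropout(p),                    # regularization
            nn.Linear(c_in, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes),   # logits; softmax lives in the loss
        )

    def forward(self, x):
        return self.head(x)

criterion = nn.CrossEntropyLoss()  # applies log-softmax + NLL internally
```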

Algorithm of the proposed method
To provide a clear overview of the Swin-DAMFN framework, we present the algorithm in pseudo-code form below. This summarizes the key steps for breast cancer classification from mammograms, including preprocessing, augmentation, dual-branch feature extraction, fusion, and classification.

Computational complexity
The computational complexity of the proposed Swin-DAMFN model is analyzed in Big O notation, considering its dual-branch architecture. The Swin Transformer branch, which processes the input mammogram of size $H \times W$ divided into $h \times w$ patches (where $h = H/4$ and $w = W/4$ initially), achieves a complexity of $4hwC^2 + 2M^2hwC$ per block, where C is the feature dimension and M is the fixed window size (typically 7). Due to the hierarchical shifted-window mechanism, the overall complexity scales linearly with image size, $\mathcal{O}(HW)$, as detailed in the original Swin Transformer work66.
This is a significant improvement over the standard ViT, which exhibits quadratic complexity, $\mathcal{O}((HW/P^2)^2)$, with respect to the number of patches (patch size P), making Swin more efficient for high-resolution medical images. The DAMFN branch, a CNN-based network, comprises multiple convolutional blocks (MSA, TSCA, and TC). Each standard convolutional layer has a complexity of $\mathcal{O}(HW \cdot K^2 \cdot C_{in} \cdot C_{out})$, where K is the kernel size. Depthwise separable and grouped variants (e.g., in MSA) reduce this by roughly a factor of the group size g, approximating $\mathcal{O}(HW \cdot C^2)$ across layers. Attention modules like CBAM and Triplet Attention add $\mathcal{O}(HW \cdot C)$, which is negligible compared to the convolutions. Overall, DAMFN maintains $\mathcal{O}(HW)$ complexity, similar to conventional CNNs like ResNet, but with enhanced multi-scale efficiency due to parallel pathways.
The full Swin-DAMFN model, operating the branches in parallel before attention-guided fusion (adding $\mathcal{O}(HW \cdot C)$ via Triplet Attention on concatenated features), has an overall complexity of $\mathcal{O}(HW)$. This is comparable to standalone CNNs but more efficient than pure ViT ($\mathcal{O}((HW)^2)$ scaling), balancing global and local feature extraction without excessive computational overhead. Empirical runtime on a single NVIDIA RTX 3090 GPU confirms this efficiency, with inference times of approximately 12 ms per image.

Results and discussion

This section presents the experimental findings of the proposed model, along with a comparison of its effectiveness against other methods. For clarity, the results are arranged into multiple subsections. Every experiment was carried out on a workstation with an Intel Xeon processor, 128 GB of RAM, and an NVIDIA RTX 3090 GPU with 24 GB of memory. The framework was developed with Python 3.9, PyTorch 2.0, and CUDA 11.7. Section 3.1 details the preprocessing and augmentation procedures used on the MIAS and CBIS-DDSM datasets.

Data splitting
Strict procedures were enforced to prevent bias and data leakage and thereby ensure the validity of the reported performance. The datasets were divided at the patient level into training (80%), validation (10%), and test (10%) sets, guaranteeing that no images from the same patient appear in more than one split. This approach mitigates the implicit associations frequently present in medical imaging. Augmentation techniques, such as photometric modifications for variability and GAN-based synthesis for minority classes, were applied to the training set only. Fréchet Inception Distance (FID) scores were used to verify the realism of GAN-generated images without reproducing test patterns, improving generalization rather than introducing bias. An ablation confirmed augmentation's beneficial effect on robustness: performance declined by 3–4% when augmentation was removed. While class imbalance was addressed, potential limitations include dataset-specific artifacts; future work will validate on larger, multi-institutional cohorts.
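Patient-level splitting of this kind is commonly implemented with a grouped split so that all images sharing a patient ID land in the same partition; a sketch with scikit-learn follows (the function name and argument layout are ours):

```python
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(image_ids, labels, patient_ids, seed=42):
    """80/10/10 split that never separates images of the same patient."""
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, rest_idx = next(outer.split(image_ids, labels, groups=patient_ids))
    # Split the held-out 20% in half -> 10% validation, 10% test.
    inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
    rest_groups = [patient_ids[i] for i in rest_idx]
    val_rel, test_rel = next(inner.split(rest_idx, groups=rest_groups))
    return train_idx, rest_idx[val_rel], rest_idx[test_rel]
```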

Evaluation metrics
To comprehensively evaluate the performance of our proposed framework, we employed several standard classification metrics: accuracy, sensitivity (recall), specificity, precision, and F1-score. These metrics are widely used in medical imaging studies to ensure robust evaluation, particularly in the presence of class imbalance. Table 5 summarizes the definitions of these metrics5.
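For reference, the standard definitions summarized in Table 5 can be computed from a confusion matrix; a sketch for the binary (benign vs. malignant) case:

```python
from sklearn.metrics import confusion_matrix

def clinical_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, precision, F1 from binary predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)          # recall on the malignant class
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return dict(accuracy=accuracy, sensitivity=sensitivity,
                specificity=specificity, precision=precision, f1=f1)
```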

Hyperparameter optimization
The optimization of hyperparameters plays a crucial role in training deep learning models, as it significantly enhances their generalization ability and overall learning capability. For the proposed hybrid CNN–Transformer architecture, we carefully tuned the hyperparameters to balance model accuracy with computational efficiency. The Adam optimizer was employed; its initial learning rate and momentum parameters are listed in Table 6. To prevent the model from stagnating, we adopted a StepLR scheduler with a step size of 10 and a decay factor of 0.1, ensuring gradual reduction of the learning rate during training. The batch size was set to 16, and training was performed for 100 epochs with early stopping (patience of 10) based on validation loss. A dropout rate of 0.5 and weight decay (Table 6) were applied to mitigate overfitting. Cross-Entropy Loss was used as the objective function. The final set of hyperparameters was selected after conducting extensive grid search experiments to achieve the best trade-off between performance and computational cost.
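The stated schedule translates directly into a PyTorch training configuration; the learning rate, Adam betas, and weight decay below are placeholders, since the exact values are those listed in Table 6:

```python
import torch

def build_optim(model: torch.nn.Module):
    """Adam + StepLR schedule matching the described setup (values illustrative)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4,            # placeholder
                           betas=(0.9, 0.999), weight_decay=1e-4)  # placeholders
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.1)
    return opt, sched

class EarlyStopper:
    """Stop when validation loss fails to improve for `patience` epochs."""
    def __init__(self, patience: int = 10):
        self.best, self.patience, self.bad = float("inf"), patience, 0

    def step(self, val_loss: float) -> bool:
        if val_loss < self.best:
            self.best, self.bad = val_loss, 0
        else:
            self.bad += 1
        return self.bad >= self.patience
```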

Quantitative results
The performance of the proposed Swin-DAMFN model on the CBIS-DDSM and MIAS datasets is summarized in Table 7. The outcomes demonstrate the effectiveness of our dual-branch hybrid architecture in classifying breast cancer. The model achieved an outstanding overall accuracy of 99.30% on the CBIS-DDSM dataset, with sensitivity and specificity rates of 99.14% and 99.72%, respectively.
Similarly, on the MIAS dataset, the model demonstrated robust and consistent performance across all metrics, achieving an accuracy of 98.75%, a sensitivity of 98.37%, and a precision of 98.48%. The slight decrease in performance compared to the CBIS-DDSM dataset can reasonably be attributed to the smaller size of the MIAS dataset and its inherent limitations in variability. Despite this challenge, the model’s consistent performance highlights its robustness and strong generalization capability across different data sources. The high F1 scores on both datasets (99.15% for CBIS-DDSM and 98.42% for MIAS) confirm a harmonious balance between precision and recall, solidifying the model’s utility for computer-aided diagnosis.

Ablation studies
As indicated in Table 8, this section offers a comprehensive analysis of each component’s contribution to the overall performance of the Swin-DAMFN model across the CBIS-DDSM and MIAS datasets. Five important performance indicators are included in the evaluation: F1 score, accuracy, sensitivity, specificity, and precision. These metrics provide a comprehensive evaluation of the model’s ability to generalize.

Analysis of augmentation strategies
To further investigate the contribution of each augmentation strategy to the final performance, and to verify that the high accuracy does not depend unduly on synthetic samples, we conducted additional controlled experiments using the complete Swin-DAMFN architecture under three augmentation settings. With only simple geometric transformations (horizontal/vertical flips, rotations, and random scaling), the model obtained 94.80% accuracy on CBIS-DDSM and 93.20% on MIAS. Substituting photometric transformations (brightness, contrast, gamma, and CLAHE) for the geometric augmentations while disabling GAN synthesis improved the results to 96.15% and 95.40%, respectively. Finally, integrating both GAN-generated realistic mammograms and photometric augmentation yielded the reported performance of 99.30% on CBIS-DDSM and 98.75% on MIAS.
Each augmentation component contributes meaningfully to this incremental improvement (basic, photometric, and GAN plus photometric), with GAN synthesis offering the largest boost (about +3.2–3.4%), especially for the minority malignant classes. Crucially, the model retains strong generalization capacity even in the absence of synthetic data, indicating that the high performance stems from learning genuinely discriminative patterns rather than memorizing artificially similar examples. All synthetic images were generated exclusively from the training split and achieved an average FID of 16.8 against real training images, demonstrating strong visual fidelity and diversity.

Analysis of the Swin Transformer network
The Swin Transformer track serves as the foundational global context extractor on both datasets. On CBIS-DDSM, it achieved 93.52% accuracy with 92.80% sensitivity and 94.85% specificity, while on MIAS, it attained 92.15% accuracy with 91.40% sensitivity and 93.25% specificity. The consistent performance across datasets demonstrates its robust capability for capturing long-range dependencies in mammography images, though the slightly lower metrics on MIAS reflect the dataset’s smaller size and limited variability.

Analysis of the DAMFN track
The DAMFN track significantly outperformed the Swin Transformer baseline on both datasets. On CBIS-DDSM, it achieved 95.21% accuracy (vs 93.52% for Swin) with 94.65% sensitivity and 96.20% specificity. Similarly, on MIAS, it attained 93.85% accuracy (vs 92.15% for Swin) with 93.10% sensitivity and 95.05% specificity. This performance improvement highlights the DAMFN’s effectiveness in capturing discriminative local features and multi-scale patterns that are crucial for identifying subtle breast lesions.

Analysis of Swin hybrid with DAMFN without Triplet Attention
The integration of both tracks without Triplet Attention resulted in interesting performance patterns. On CBIS-DDSM, the model achieved 95.10% accuracy, slightly lower than the DAMFN track alone (95.21%). On MIAS, the performance was 93.70% accuracy, also lower than DAMFN’s 93.85%. This suggests that simple feature concatenation without proper attention mechanisms may cause feature interference, particularly affecting sensitivity (94.30% on CBIS-DDSM vs DAMFN’s 94.65%).

Analysis of the final Swin-DAMFN model
Incorporating Triplet Attention modules significantly enhanced the model’s performance on both datasets. The complete Swin-DAMFN model achieved exceptional results: 99.30% accuracy, 99.14% sensitivity, and 99.72% specificity on CBIS-DDSM, and 98.75% accuracy, 98.37% sensitivity, and 99.20% specificity on MIAS. When combined with the augmentation ablation results above, these findings confirm the synergistic contribution of (1) hybrid global–local architecture, (2) specialized attention modules, and (3) carefully controlled data augmentation, rather than any single factor dominating the performance or introducing bias.
Among several fusion mechanisms evaluated during model development (simple concatenation, Squeeze-and-Excitation, CBAM, and Triplet Attention), Triplet Attention consistently achieved the highest validation accuracy (+1.86% over CBAM) while introducing only 20K additional parameters, making it the optimal choice for efficient yet powerful global-local feature integration.

Robustness to noise
We conducted an ablation study by adding controlled noise to the test sets of CBIS-DDSM and MIAS to assess the model’s response to noise, which is ubiquitous in real-world mammography due to acquisition errors. Both salt-and-pepper noise (varying density) and Gaussian noise (varying standard deviation) were applied. Table 9 reports the performance measures (accuracy, sensitivity) and compares them to a ResNet50 baseline under identical conditions. With accuracy declining only from 99.30% to 97.85% under Gaussian noise on CBIS-DDSM, the results demonstrate that Swin-DAMFN maintains good performance even at moderate noise levels, in contrast to ResNet50’s decline to 88.20%. This robustness stems from the preprocessing steps (adaptive median filtering for impulse noise and CLAHE for contrast enhancement) and the model’s attention mechanisms, which prioritize salient features, such as microcalcifications, while suppressing noise. For salt-and-pepper noise, the drop is even smaller (to 98.10%), due to the multi-scale fusion in DAMFN, which captures contextual information across noisy pixels.
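Both corruption models are straightforward to reproduce; the sketch below injects Gaussian and salt-and-pepper noise into test images normalized to [0, 1], with the sigma and density values as illustrative placeholders since the exact levels appear in Table 9:

```python
import torch

def gaussian_noise(img: torch.Tensor, sigma: float = 0.05) -> torch.Tensor:
    """Additive Gaussian noise on an image tensor scaled to [0, 1]."""
    return (img + sigma * torch.randn_like(img)).clamp(0.0, 1.0)

def salt_and_pepper(img: torch.Tensor, density: float = 0.02) -> torch.Tensor:
    """Flip a fraction `density` of pixels to 0 (pepper) or 1 (salt)."""
    noisy = img.clone()
    mask = torch.rand_like(img)
    noisy[mask < density / 2] = 0.0                          # pepper
    noisy[(mask >= density / 2) & (mask < density)] = 1.0    # salt
    return noisy
```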

Computational efficiency analysis
We assessed the number of parameters (in millions), FLOPs (GigaFLOPs), inference time (milliseconds per image on an NVIDIA RTX 3090 GPU, averaged over 100 test images), and peak GPU memory utilization (in megabytes during inference) to determine the practical deployability of Swin-DAMFN. These measures are compared with baselines, including the full model without Triplet Attention, the Swin Transformer alone, DAMFN alone, and ResNet50 (a typical CNN used for breast cancer workloads). Table 10 displays the results. With 35.2M parameters and 6.8G FLOPs, the complete Swin-DAMFN model strikes a balance between efficiency and accuracy, only slightly exceeding ResNet50’s 23.5M parameters and 4.1G FLOPs. Peak memory usage is an efficient 820 MB, and inference time remains reasonable at 18 ms per image, which is suitable for clinical applications. The slight increase in complexity over the baselines is justified by the substantial accuracy gains (e.g., +8.05% over ResNet50 on CBIS-DDSM), attributed to the lightweight Triplet Attention (adding roughly 0.02M parameters) and the efficient multi-scale operations in DAMFN. This confirms Swin-DAMFN’s viability for real-world deployment, even on resource-limited hardware.
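Parameter counts and per-image latency of the kind reported in Table 10 can be measured as follows (a sketch; the input shape is a placeholder, and FLOP counting would additionally require a profiler such as fvcore or ptflops):

```python
import time
import torch

@torch.no_grad()
def profile(model: torch.nn.Module, input_shape=(1, 3, 224, 224), runs=100):
    """Report parameter count (millions) and mean inference latency (ms)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    for _ in range(10):                 # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    ms = (time.perf_counter() - t0) / runs * 1e3
    return params_m, ms
```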

Training and validation curves
To comprehensively assess training dynamics and computational efficiency, Figs. 11 and 12 illustrate the convergence behavior of the proposed Swin-DAMFN model alongside its computational profile. The training and validation curves exhibit stable and consistent convergence. The training loss decreases sharply during the initial 20 epochs (approximately from 1.0 to 0.2), reflecting the effective feature extraction capability of the dual hybrid branches. The validation loss closely mirrors the training trend and stabilizes around epoch 80 with a marginal gap, indicating strong generalization and the absence of overfitting. Correspondingly, the accuracy curve demonstrates a rapid rise, attaining nearly 90% by epoch 30, driven by the Triplet Attention fusion mechanism that facilitates early discrimination of mammographic patterns.
Performance plateaus are observed near the final reported test accuracies (99.30% on CBIS-DDSM and 98.75% on MIAS), confirming that the adopted learning rate schedule and data augmentation strategies effectively balanced convergence speed and regularization. Minor oscillations in the validation accuracy can be attributed to inherent dataset variability, yet the overall trajectory remains smooth and monotonic. These results collectively confirm that the optimization process was well-calibrated, with the Adam optimizer demonstrating strong adaptability to the unique challenges of medical image classification.

ROC curve comparison
We examined the Receiver Operating Characteristic (ROC) curves on the MIAS and CBIS-DDSM datasets to further assess the diagnostic performance of the proposed Swin-DAMFN framework against baseline techniques. The Area Under the Curve (AUC) provides a scalar assessment of the model’s capacity to distinguish between benign and malignant instances, while the ROC curve shows the trade-off between sensitivity (true positive rate) and 1 − specificity (false positive rate) across different classification thresholds. Our experiments demonstrate that Swin-DAMFN achieves an AUC of 0.993 on CBIS-DDSM and 0.987 on MIAS, surpassing standalone baselines such as the Swin Transformer (AUC 0.981 on CBIS-DDSM) and pure CNN models, including ResNet-50 (AUC 0.95 on CBIS-DDSM). The superior AUC of Swin-DAMFN underscores its enhanced capability to maintain high sensitivity at low false-positive rates, which is crucial for clinical applications where minimizing missed diagnoses is paramount. This comparison is depicted in Figs. 13 and 14.
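AUC values like those in Figs. 13 and 14 follow directly from the predicted malignancy probabilities; a sketch for the binary case using scikit-learn:

```python
from sklearn.metrics import roc_curve, roc_auc_score

def roc_summary(y_true, y_score):
    """ROC curve points and AUC from ground truth and predicted probabilities."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)  # sweep decision thresholds
    return fpr, tpr, roc_auc_score(y_true, y_score)
```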

Comparison with traditional transfer learning models and baseline models
Table 11 presents a detailed comparison between the proposed Swin-DAMFN model, its individual components, and widely used transfer learning models such as ResNet5067, InceptionV368, Xception65, DenseNet12169, and VGG1970. We also compare the proposed model with recent lightweight Vision Transformer-based architectures, MobileViT71 and DaViT72.
For strict fairness and reproducibility, all traditional transfer-learning baselines in Table 11 (ResNet50, InceptionV3, Xception, DenseNet121, VGG19) were trained from scratch using identical data splits, preprocessing, and the complete augmentation strategy (GAN-based synthesis + photometric transformations) as the proposed Swin-DAMFN model.
The results clearly highlight the progressive improvements achieved by our architecture and the superiority of the integrated hybrid design. Traditional CNN-based models achieved moderate performance, with accuracies ranging between 89.45% and 92.10% on the CBIS-DDSM dataset and between 87.95% and 90.45% on the MIAS dataset. These results are consistent with their limited capacity to capture subtle lesion characteristics and long-range dependencies in mammographic images. While effective in natural image tasks, their transfer to medical imaging remains constrained by the absence of specialized architectural adaptations.
In contrast, the Swin Transformer alone outperformed all traditional baselines, achieving 93.52% accuracy on CBIS-DDSM and 92.15% on MIAS. This confirms the inherent advantage of transformer-based models in capturing global contextual information through self-attention mechanisms. The DAMFN branch further improved the results, reaching 95.21% on CBIS-DDSM and 93.85% on MIAS. These gains can be attributed to the proposed multi-scale kernels and attention mechanisms, which enhance the extraction of fine-grained features such as microcalcifications and subtle tissue distortions. When the Swin Transformer and DAMFN were combined without Triplet Attention, performance showed only marginal improvements, suggesting that simple integration was insufficient. However, the inclusion of the Triplet Attention module in the full Swin-DAMFN model unlocked the synergy between the two branches, boosting performance to 99.30% on CBIS-DDSM and 98.75% on MIAS. These values represent improvements of approximately 4–5% over the best standalone component and 7–8% over traditional transfer learning models. Importantly, the model consistently achieved high sensitivity and specificity across both datasets, ensuring accurate detection of malignant cases while minimizing false positives.
To further situate Swin-DAMFN among contemporary lightweight Vision Transformer architectures, we compare it with MobileViT and DaViT, two efficient ViT variants that have been extensively investigated in medical imaging applications. On the CBIS-DDSM dataset, MobileViT obtains 96.40% accuracy and DaViT 97.20%; on MIAS, the figures are 95.10% and 96.00%, respectively. Despite their competitive performance, the proposed Swin-DAMFN hybrid substantially surpasses both pure ViT models, achieving 99.30% accuracy on CBIS-DDSM and 98.75% on MIAS. This superior performance underscores the benefit of combining the hierarchical global context of the Swin Transformer with the fine-grained multi-scale local feature extraction of the custom DAMFN branch.
The robustness of the Swin-DAMFN across two diverse datasets also demonstrates strong generalization ability, which is crucial for real-world clinical deployment. Overall, this comparison highlights that hybrid models, which leverage both CNN-based local feature extraction and transformer-based global context modeling, enhanced with attention-driven fusion, can achieve substantial improvements over conventional transfer learning methods in mammographic breast cancer classification.

Statistical validation of results
To ensure the statistical robustness of the reported results, all experiments were repeated five times with distinct random seeds, and the mean values and 95% confidence intervals (CI) of each performance metric were computed. Additionally, statistical significance between the proposed Swin-DAMFN and the most competitive baselines (e.g., DaViT, MobileViT) was evaluated using the Wilcoxon signed-rank test. Table 12 summarizes the averaged results, along with corresponding confidence intervals, for both the CBIS-DDSM and MIAS datasets. The results demonstrate that Swin-DAMFN consistently achieves higher mean performance with narrow confidence intervals, confirming the model’s stability. Moreover, all improvements over the best-performing baselines were found to be statistically significant, underscoring the reliability of the observed performance gains.
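The described significance test is available in SciPy; the sketch below compares paired per-seed accuracies of two models over the repeated runs, with purely illustrative numbers rather than the paper's raw data:

```python
from scipy.stats import wilcoxon

# Paired per-seed accuracies (illustrative values, not the paper's raw data).
swin_damfn = [99.28, 99.31, 99.30, 99.27, 99.33]
davit      = [97.18, 97.25, 97.20, 97.15, 97.22]

# One-sided paired non-parametric test: is Swin-DAMFN consistently higher?
stat, p_value = wilcoxon(swin_damfn, davit, alternative="greater")
print(f"Wilcoxon statistic={stat:.3f}, p={p_value:.4f}")
```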
The inclusion of statistical analysis further validates the reliability and robustness of the proposed Swin-DAMFN framework. As shown in Table 12, the reported confidence intervals across multiple runs are notably narrow, indicating that the model maintains stable performance across random initializations and data splits. Furthermore, the Wilcoxon signed-rank test results demonstrate that the improvements achieved by Swin-DAMFN over contemporary hybrid architectures such as DaViT and MobileViT are statistically significant rather than coincidental. This provides compelling quantitative evidence that the architectural synergy between the Swin Transformer branch and the DAMFN branch consistently enhances discriminative representation learning. From a clinical perspective, such statistical robustness is essential for translating deep learning models into real-world diagnostic workflows. By establishing significance in both accuracy and sensitivity, the Swin-DAMFN model ensures dependable detection of malignant lesions, minimizing the risk of false negatives, a critical factor in mammographic screening. Overall, this additional analysis reinforces the scientific credibility of the proposed framework and its potential for practical deployment in computer-aided diagnosis of breast cancer.

A comparative analysis of state-of-the-art techniques
Table 13 presents a comprehensive comparison between our proposed Swin-DAMFN model and recent state-of-the-art methods in breast cancer classification. The performance evaluation demonstrates the robustness and efficiency of our approach across multiple datasets and classification tasks.
The comparative analysis reveals several key advantages of our proposed Swin-DAMFN model. For the CBIS-DDSM dataset, our model achieves an accuracy of 99.30% for four-class classification (Benign Mass, Benign Calcification, Malignant Mass, Malignant Calcification). On the MIAS dataset, it attains 98.75% accuracy for three-class classification (Normal, Benign, Malignant). These results represent an improvement of approximately 1%–2% over most existing state-of-the-art methods. The consistent performance across different datasets and classification tasks (binary, 3-class, and 4-class) demonstrates the robustness and generalization capability of the proposed Swin-DAMFN model.
The superiority of our approach can be attributed to several factors. First, the hybrid architecture effectively combines the global contextual understanding of the Swin Transformer with the local feature extraction capabilities of the DAMFN track. Second, the integration of Triplet Attention mechanisms enables refined feature representation and cross-dimensional dependency modeling. Third, despite its sophisticated architecture, our model maintains computational efficiency with fewer parameters compared to many ensemble and hybrid approaches, resulting in faster convergence and reduced overfitting tendencies. To further differentiate Swin-DAMFN from recent hybrid CNN-Transformer models, we note key architectural distinctions. Unlike HybMNet, which relies on general self-supervised pretraining and simpler fusion, or MaxViT-based frameworks (Ahmed et al.32) that employ block-agnostic attention for multi-scale processing on institutional data (95.60% accuracy), Swin-DAMFN incorporates mammography-specific MSA and TSCA blocks with multi-receptive-field separable convolutions and tri-shuffle operations to explicitly target microcalcifications and subtle distortions. Compared to ViT-CNN hybrids (Boudouh et al.83), which achieve 99.22% on CBIS-DDSM using parallel branches with VGG16-style CNNs, our lightweight Triplet Attention fusion introduces minimal parameters while enabling richer cross-dimensional interactions, contributing to the observed 0.08 percentage point gain on the identical benchmark. These targeted design choices, combined with controlled augmentation, explain the consistent superiority over general hybrid paradigms rather than mere component assembly.

Conclusion and future work

In this study, we proposed Swin-DAMFN, a novel dual-branch hybrid framework for robust breast cancer classification from mammographic images. The model effectively combines the global dependency modeling power of the Swin Transformer with the fine-grained local feature extraction of the Dual-Attention Multi-scale Fusion Network (DAMFN). Specifically, the Swin branch leverages shifted-window self-attention to capture hierarchical contextual dependencies, while the DAMFN branch employs Multi-Separable Attention (MSA) and Tri-Shuffle Convolution Attention (TSCA) modules to learn discriminative local representations of masses and microcalcifications. Enhanced with a lightweight Triplet Attention (TA) mechanism and an attention-guided fusion strategy, Swin-DAMFN demonstrates powerful multi-scale representation and improved interpretability. In addition, an integrated augmentation strategy combining GAN-based synthetic generation and photometric transformations was employed to mitigate data scarcity and class imbalance.
Experimental evaluations confirmed that Swin-DAMFN achieves state-of-the-art results, attaining 99.30% accuracy on the CBIS-DDSM dataset and 98.75% accuracy on the MIAS dataset, outperforming existing CNN- and Transformer-based methods in terms of accuracy, sensitivity, and F1-score. The ablation analysis further validated the individual contributions of the Swin Transformer, DAMFN, and Triplet Attention modules, confirming that their joint interaction is key to the observed performance gain.

Limitations and future work
Despite the promising outcomes, several limitations should be acknowledged. The proposed framework was primarily evaluated using two public datasets (CBIS-DDSM and MIAS), which, although widely used as benchmarks, were acquired under controlled research settings and may not fully represent the broad spectrum of imaging artifacts, scanner variations, breast densities, and population diversity encountered in routine clinical practice. Additionally, while GAN-based augmentation effectively mitigates class imbalance and improves minority-class recall, synthetic mammograms, regardless of their visual realism, cannot perfectly replicate all the subtle variations present in real clinical acquisitions. A key limitation is the absence of external validation on independent, unseen datasets. All reported results rely on internal train/validation/test splits from the same two sources, which raises legitimate questions about true generalization across different institutions, vendors, and patient demographics. Furthermore, the study did not perform patient-level longitudinal outcome analysis or BI-RADS category stratification, limiting direct comparison with clinical radiology workflows. Another constraint is the lack of explicit robustness testing against domain shifts such as different mammographic vendors (e.g., Hologic vs. GE vs. Siemens), compression levels, or screening vs. diagnostic protocols. Although patient-wise splitting and strict augmentation discipline were enforced to prevent data leakage, performance on significantly out-of-distribution data remains unproven.
Future research will focus on addressing these limitations through the following directions. First and foremost, we plan large-scale multi-institutional validation using external datasets such as INbreast, CMMD, VinDr-Mammo, and private clinical cohorts to rigorously assess cross-center generalization. Incorporating domain-adaptation and test-time normalization techniques will help reduce dataset-specific biases. Extending the framework to full multimodal analysis (mammography, ultrasound, and MRI) and integrating tabular clinical metadata (age, breast density, family history, and genetic risk scores) is expected to further enhance diagnostic accuracy and clinical relevance.
Moreover, exploring semi-supervised and federated learning paradigms will enhance scalability and preserve patient privacy when training on decentralized hospital data. Advanced explainable AI methods beyond basic Grad-CAM, such as attention rollout, integrated gradients, and concept-based explanations, together with uncertainty quantification, will increase interpretability and clinical trust. Finally, optimizing inference speed, reducing memory footprint, and extending Swin-DAMFN to downstream tasks, including lesion localization, segmentation, and automated BI-RADS scoring, represent critical steps toward real-world deployment in computer-aided diagnosis systems.
