
A multi-scale hybrid ResNet-transformer with distance-aware learning for interpretable BI-RADS mammographic classification.


Singh M, Mohan A, Tripathi U, Pathak S, Gupta R, Kumar B


Scientific Reports, 16(1), 2026. https://doi.org/10.1038/s41598-026-40906-8 · PMID: 41721067

Abstract

Timely and accurate classification of breast lesions on mammograms is needed, as it enhances clinical decision-making and reduces unnecessary biopsies. This study proposes a Multi-Scale Hybrid ResNet–Transformer with Distance-Aware Learning for interpretable BI-RADS mammographic classification. The model integrates the spatial representation strength of ResNet-50 with the contextual modeling capability of lightweight multi-head self-attention layers, forming a unified hybrid architecture. A Distance-Aware Learning loss is introduced to account for the ordinal nature of BI-RADS categories, penalizing predictions according to their distance from the true class. The preprocessing stage includes CLAHE to enhance mammographic contrast, followed by balanced oversampling and controlled augmentations to address data imbalance. The trained model shows strong generalization across validation and test sets, achieving a test accuracy of 0.921 and a mean AUC of 0.987. Per-class discriminability is strong, with F1-scores above 0.92 for the clinically critical BI-RADS 4–5 categories. Moreover, feature-space visualization and Grad-CAM based visual explanations confirm that the model focuses on clinically relevant lesion regions, providing interpretable outputs aligned with radiologists' reasoning. The proposed framework offers a clinically meaningful and efficient approach to automated BI-RADS classification and may support future computer-aided diagnostic workflows.


Introduction

Breast cancer is the most prevalent malignancy among women globally and remains a primary cause of cancer-related death1–3. According to the GLOBOCAN 2020 report, there were roughly 2.3 million new breast cancer diagnoses and over 685,000 deaths worldwide4–6. Yearly incidence is forecast to exceed 3.2 million new cases by 2040, driven largely by population growth and aging demographics. This rising burden underscores the critical need for enhanced early detection and precise diagnosis7–10. Despite significant improvements in breast cancer survival rates in high-income nations, attributable to systematic screening programs, advancements in imaging, and access to multimodal therapy, the situation remains much less favorable in low- and middle-income countries (LMICs)11. In these settings, late-stage diagnosis is prevalent due to inadequate screening infrastructure, socioeconomic obstacles, and insufficient awareness among women. India reports about 221,000 new cases yearly, with one woman diagnosed with breast cancer every four minutes. Survival outcomes are considerably worse than in Western nations, largely because many women present with advanced disease. These disparities underscore the need for dependable, accessible, and interpretable methods for early detection across diverse health-system contexts.
Mammography remains the gold standard for breast cancer screening and is the most widely used imaging modality for population-level early detection12–15. Radiologists interpret mammographic images by identifying suspicious features such as microcalcifications, masses, and architectural distortions. To standardize communication and reporting, the American College of Radiology developed the Breast Imaging Reporting and Data System (BI-RADS)16, which classifies findings into ordered categories ranging from 0 to 6. Of particular clinical importance are categories BI-RADS 3 (probably benign, requiring follow-up), BI-RADS 4 (suspicious, biopsy recommended), and BI-RADS 5 (highly suggestive of malignancy). Correct classification into these categories guides patient management and significantly influences outcomes. However, BI-RADS classification is inherently challenging. Inter-observer variability among radiologists is well documented, with differences in experience, training, and perceptual sensitivity leading to inconsistent interpretations. Dense breast tissue, present in up to 40–50% of women17, further complicates detection by obscuring lesions or mimicking suspicious features. Consequently, misclassification can result in false positives, leading to unnecessary biopsies and patient anxiety, or false negatives, delaying diagnosis and treatment. These limitations underscore the need for reliable computer-aided diagnosis (CAD) systems to complement radiologists’ expertise.
In recent years, artificial intelligence (AI), particularly deep learning, has shown remarkable potential in medical image analysis. Convolutional neural networks (CNNs) have been successfully applied to tasks such as tumor detection, lesion segmentation, and cancer classification across a wide range of imaging modalities18–21. In mammography, CNN-based CAD systems have achieved performance comparable to expert radiologists in certain contexts, sparking optimism for their integration into clinical workflows. Recent developments, such as multitask CNNs25, self-supervised and transformer-based cascades22, and edge-aware lesion co-training frameworks23, have further advanced lesion characterization and diagnostic reproducibility.
Despite these advances, several challenges persist: limited coverage of the full BI-RADS spectrum, data imbalance, and a lack of interpretability. Many published works focus on differentiating benign from malignant cases, emphasizing binary accuracy rather than comprehensive BI-RADS categorization22–26, which does not reflect the ordinal, multi-class nature of real clinical decision pathways. Even transformer-based models, such as PatchCascade-ViT and ViT-CAD, demonstrate strong performance but are typically restricted to dichotomous or three-class tasks rather than full BI-RADS (1–5) grading. Moreover, available public mammography datasets often contain an uneven distribution of categories, with an overrepresentation of normal and benign cases relative to suspicious or malignant findings. This imbalance biases model learning and reduces generalization to rare but clinically critical BI-RADS 4 and 5 categories. Although multitask CNNs have attempted to mitigate this through auxiliary learning objectives, few frameworks explicitly address the ordinal imbalance between consecutive BI-RADS classes. Furthermore, for clinical adoption, models must be not only accurate but also transparent and explainable. Several existing approaches lack interpretable probability calibration or visual explanations, limiting their trustworthiness in clinical settings27. While Gradient-weighted Class Activation Mapping (Grad-CAM) and attention maps have been explored, many lack correlation with diagnostic confidence or fail to quantify reliability through calibration or decision-curve analysis. These limitations highlight the need for hybrid architectures that combine convolutional feature extraction with transformer-based contextual reasoning while maintaining ordinal consistency and interpretability. Unlike prior methods that emphasize accuracy alone, the present work incorporates Distance-Aware Learning (DAL) to penalize ordinal misclassifications and improve diagnostic transparency. In this work, we propose a Multi-Scale Hybrid ResNet–Transformer with DAL loss for interpretable BI-RADS classification in mammography (see Fig. 1).

The contributions of this study are summarized as follows:

A unified Hybrid ResNet–Transformer framework for mammographic BI-RADS classification that jointly learns local structural features and long-range contextual relationships from mammograms.

A Distance-Aware Learning loss formulation that penalizes larger inter-class prediction errors, enforces ordinal consistency among BI-RADS categories, and minimizes clinically significant misclassifications.

A comprehensive evaluation across all five BI-RADS categories, including quantitative performance analysis and interpretability assessment using Grad-CAM visualizations, probability calibration, decision-curve analysis, and t-SNE feature projections to ensure transparency and clinical relevance.

Methodology

This section provides an overview of the proposed framework by describing the dataset used in this study and the preprocessing steps applied to the mammograms, followed by the proposed Hybrid ResNet–Transformer model. Subsequently, we detail the training mechanism and the Distance-Aware Learning loss formulation. The main steps of the methodology are discussed as follows:

Dataset
This study utilized the publicly available Breast Imaging Reporting and Data System (BI-RADS) mammography dataset obtained from Kaggle [https://www.kaggle.com/datasets/orvile/inbreast-dataset-bi-rads-classification/data]. The dataset contains mammographic images annotated by expert radiologists according to the BI-RADS lexicon. BI-RADS is an internationally recognized reporting standard that categorizes breast findings into ordered classes reflecting increasing levels of malignancy risk:

BI-RADS 1: Negative (normal, no abnormalities).

BI-RADS 2: Benign findings.

BI-RADS 3: Probably benign (follow-up suggested).

BI-RADS 4: Suspicious abnormality (biopsy recommended).

BI-RADS 5: Highly suggestive of malignancy (biopsy strongly recommended).

Figure 2 shows sample mammograms from the dataset across the different classes. The dataset comprises five BI-RADS categories, with BI-RADS 2 containing 220 images, followed by BI-RADS 1 (67 images), BI-RADS 5 (57 images), BI-RADS 4 (44 images), and BI-RADS 3 (23 images). For consistency in ordinal classification, BI-RADS 4a/4b/4c were merged into a single BI-RADS 4 category, and BI-RADS 6 annotations were mapped to BI-RADS 5, following standard practice in prior BI-RADS classification studies. The dataset was selected for its clinical relevance and availability of ordinal labels, which are essential for training and validating a deep learning model for multi-class ordinal classification. However, the dataset exhibited significant class imbalance, with a disproportionate number of BI-RADS 2 (benign) and BI-RADS 1 (normal) cases compared to BI-RADS 3–5. Such an imbalance poses a risk of biasing models toward the majority classes and underrepresenting clinically critical malignant cases. Addressing this challenge was an integral part of the preprocessing pipeline.

Data preprocessing
Preprocessing is crucial in medical imaging, as mammograms vary significantly in size, contrast, and quality. Figure 3 highlights the following steps that were applied sequentially to ensure balance and diagnostically enhanced inputs for model training.

Image resizing
All mammogram images were resized to a uniform resolution of 224 × 224 pixels, consistent with the input requirements of ResNet-50 and other ImageNet-pretrained CNN architectures. Standardizing resolution improves computational efficiency and prevents bias due to varying image dimensions.

Contrast enhancement (CLAHE)
To improve lesion visibility, we applied Contrast Limited Adaptive Histogram Equalization (CLAHE) on the grayscale version of each image. This method enhances local contrast without excessively amplifying noise, making subtle abnormalities such as microcalcifications more detectable. The CLAHE-equalized grayscale image was then replicated across three channels to form a pseudo-RGB input compatible with the pretrained backbone. CLAHE is particularly effective for mammography, where grayscale differences between suspicious tissue and normal parenchyma can be minimal.

Normalization
Each pseudo-RGB image was converted from RGB to BGR and normalized using channel-wise mean subtraction with the constants [103.939, 116.779, 123.68], following the ‘caffe’ preprocessing convention used by ResNet-50. This ensures consistency with pretrained ImageNet statistics and improves convergence stability.
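For concreteness, a minimal sketch of this preprocessing pipeline (resizing, CLAHE, pseudo-RGB replication, and 'caffe'-style normalization) is shown below, assuming OpenCV and NumPy; the function name and CLAHE parameters (clip limit, tile grid size) are illustrative, as the paper does not report them.

```python
import cv2
import numpy as np

def preprocess_mammogram(path):
    """Resize -> CLAHE -> pseudo-RGB -> 'caffe'-style normalization."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)      # load as grayscale
    gray = cv2.resize(gray, (224, 224))                # uniform 224 x 224 input

    # CLAHE: local contrast enhancement with a clip limit to avoid noise blow-up.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    eq = clahe.apply(gray)

    # Replicate the single channel to form a pseudo-RGB image.
    rgb = np.stack([eq, eq, eq], axis=-1).astype(np.float32)

    # 'caffe' convention: RGB -> BGR, then channel-wise mean subtraction.
    bgr = rgb[..., ::-1].copy()
    bgr -= np.array([103.939, 116.779, 123.68], dtype=np.float32)
    return bgr
```

The final two steps are equivalent to `tf.keras.applications.resnet50.preprocess_input`, which applies the same 'caffe' mode by default.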

Data augmentation (training-time operation)
To increase dataset diversity and reduce overfitting, we performed lightweight on-the-fly data augmentation using TensorFlow's augmentation layers (a minimal sketch follows the list), including:

Horizontal flips: To simulate left–right breast symmetry.

Small rotations (± 10°): Mimic variations in patient positioning.
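A minimal sketch of these two augmentations with Keras preprocessing layers; note that `RandomRotation` takes a fraction of a full turn, so ±10° corresponds to a factor of 10/360.

```python
import tensorflow as tf

# On-the-fly training-time augmentation: horizontal flips and small rotations.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),      # left-right breast symmetry
    tf.keras.layers.RandomRotation(10.0 / 360.0),  # +/- 10 deg positioning
])

# Applied only during training, e.g. inside a tf.data pipeline:
# train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y))
```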

Class balancing using oversampling
Due to the high imbalance in BI-RADS categories, minority classes were randomly oversampled to achieve approximately equal class frequencies. Oversampling was achieved by replicating and shuffling samples rather than simple block duplication, to ensure diversity within the expanded subsets. This balancing reduced the model's bias towards normal and benign classes and improved performance on rare but clinically important malignant cases.
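The replicate-and-shuffle strategy can be sketched as follows; the function name, seed, and array-based interface are illustrative assumptions.

```python
import numpy as np

def oversample_to_balance(X, y, seed=42):
    """Randomly oversample minority classes to the majority-class count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for c in classes:
        c_idx = np.where(y == c)[0]
        # Sample with replacement up to the majority count.
        idx.append(rng.choice(c_idx, size=target, replace=True))
    idx = rng.permutation(np.concatenate(idx))   # shuffle the expanded set
    return X[idx], y[idx]
```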

Proposed framework: ResNet–Transformer hybrid
Mammographic BI-RADS classification requires a model that can learn both local structural patterns and contextual relationships across the breast. The proposed framework addresses this by combining convolutional feature extraction with attention-based mechanisms in a unified architecture suitable for ordinal BI-RADS prediction. Figure 4 (a-b) provides a block-level illustration of the complete workflow, showing how the model processes input images through four major stages: (i) ResNet-based spatial feature extraction, (ii) token projection and positional embedding, (iii) Transformer-based contextual modeling, and (iv) dense classification with ordinal regularization. The overall steps involved in BI-RADS classification are summarized as follows:

Backbone network (ResNet-50 feature extractor)
The model utilizes a ResNet-50 backbone pretrained on ImageNet for feature extraction. As seen in Fig. 4 (a), the input passes through the standard ResNet stem (7 × 7 convolution + batch normalization + ReLU, followed by 3 × 3 max-pooling) and the four residual convolutional blocks. The backbone is initially frozen to preserve pretrained low-level filters and later partially unfrozen during fine-tuning. This supports efficient transfer learning and domain adaptation to lesion-related structures while minimizing overfitting on the limited mammography dataset.

Transformer-based feature fusion
Following convolutional feature extraction, the output from ResNet-50 (2048 channels) is passed through a 1 × 1 convolutional projection layer to reduce dimensionality to a 128-channel embedding, as depicted in Fig. 4 (a). The resulting tensor is then reshaped into a sequence of spatial tokens, each representing a localized mammographic region. Since flattening disrupts spatial ordering, a learnable positional embedding (shown adjacent to the token block in Fig. 4 (a)) is added to encode spatial location information. This embedding ensures that the attention mechanism maintains awareness of lesion position and symmetry across the breast region, which is essential for clinical interpretability.
The token sequence is then processed by a Transformer encoder (as shown in Fig. 4 (a)) consisting of (i) Multi-Head Self-Attention (MHSA), (ii) residual addition with layer normalization, and (iii) a feed-forward MLP using Gaussian Error Linear Unit (GELU) activation and dropout. The MHSA layers capture long-range dependencies across mammographic regions, enabling the network to learn inter-region relationships across distant areas; this is important for mammography, where suspicious findings may exhibit distributed patterns of asymmetry. Each MHSA block is followed by residual layer normalization and a lightweight feed-forward MLP sub-block employing GELU activation with dropout regularization to improve stability and prevent overfitting. The output token embeddings are aggregated using Global Average Pooling (see Fig. 4 (a)) to produce a single context-aware representation that captures both local and global diagnostic information.
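A Keras sketch of this trunk is given below. The 128-d token embedding, single encoder layer, GELU MLP, and global average pooling follow the text, and the two attention heads follow the ablation study; the MLP width, dropout rate, and initializer are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

class PositionalEmbedding(layers.Layer):
    """Learnable positional embedding added to the token sequence."""
    def __init__(self, num_tokens, dim, **kwargs):
        super().__init__(**kwargs)
        self.pos = self.add_weight(
            shape=(1, num_tokens, dim), initializer="random_normal", name="pos")

    def call(self, x):
        return x + self.pos

def build_hybrid_trunk(embed_dim=128, num_heads=2, mlp_dim=256, drop=0.1):
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", input_shape=(224, 224, 3))
    backbone.trainable = False                         # Stage 1: frozen backbone

    inputs = tf.keras.Input((224, 224, 3))
    fmap = backbone(inputs)                            # (batch, 7, 7, 2048)
    fmap = layers.Conv2D(embed_dim, 1)(fmap)           # 1x1 projection to 128 ch
    tokens = layers.Reshape((7 * 7, embed_dim))(fmap)  # 49 spatial tokens
    x = PositionalEmbedding(7 * 7, embed_dim)(tokens)

    # Encoder block: MHSA -> Add + LayerNorm -> GELU MLP -> Add + LayerNorm.
    attn = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_dim // num_heads)(x, x)
    x = layers.LayerNormalization()(x + attn)
    h = layers.Dense(mlp_dim, activation="gelu")(x)
    h = layers.Dropout(drop)(h)
    h = layers.Dense(embed_dim)(h)
    x = layers.LayerNormalization()(x + h)

    # Global Average Pooling yields one context-aware representation.
    return tf.keras.Model(inputs, layers.GlobalAveragePooling1D()(x))
```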

Distance-aware regularization
The Distance-Aware Learning method is integrated into the model's loss formulation to respect the ordinal progression between BI-RADS categories. Instead of treating categories as independent, the model penalizes prediction errors more severely when the predicted class is far from the ground truth. To encode this ordinal structure, we construct a simple distance matrix

$$D_{ij} = |i - j|,$$

where $i$ and $j$ denote the true and predicted BI-RADS classes, respectively.
Given the model's SoftMax output $\hat{p}$, the expected ordinal deviation is computed as

$$d(y, \hat{p}) = \sum_{j} D_{yj}\, \hat{p}_j,$$

which represents how far, on average, the prediction lies from the correct class.
The DAL loss scales the standard categorical cross-entropy loss $\mathcal{L}_{\mathrm{CE}}$ by this deviation term:

$$\mathcal{L}_{\mathrm{DAL}} = \bigl(1 + \lambda\, d(y, \hat{p})\bigr)\,\mathcal{L}_{\mathrm{CE}},$$

where $\lambda$ is a tunable weighting factor controlling the penalty strength relative to the standard categorical cross-entropy loss. In this work, $\lambda$ was selected empirically based on validation performance to balance ordinal regularization and stable optimization.
Under this formulation, small ordinal errors (e.g., BI-RADS 2→3) produce mild penalties, whereas large, clinically significant deviations (e.g., 1→5) are penalized more strongly according to the absolute ordinal distance matrix defined between BI-RADS categories. This mechanism encourages ordinal consistency and reduces clinically significant misclassifications, leading to smoother decision boundaries and improved clinical reliability for intermediate BI-RADS scores.
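A minimal TensorFlow sketch of this loss, assuming one-hot labels and the $(1 + \lambda d)$ scaling reconstructed above; the value $\lambda = 0.5$ is a placeholder, since the paper selects it empirically without reporting the final value.

```python
import tensorflow as tf

NUM_CLASSES = 5
# Absolute ordinal distance matrix D[i, j] = |i - j| over BI-RADS classes.
D = tf.constant([[abs(i - j) for j in range(NUM_CLASSES)]
                 for i in range(NUM_CLASSES)], dtype=tf.float32)

def dal_loss(y_true, y_pred, lam=0.5):
    """DAL loss: cross-entropy scaled by the expected ordinal deviation.
    y_true: one-hot labels (batch, 5); y_pred: softmax outputs (batch, 5)."""
    y_true = tf.cast(y_true, tf.float32)
    ce = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
    dist_row = tf.matmul(y_true, D)                        # D row of true class
    deviation = tf.reduce_sum(dist_row * y_pred, axis=-1)  # expected |i - j|
    return (1.0 + lam * deviation) * ce
```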

Classification head
In Fig. 4 (b), the pooled Transformer representation is forwarded to a dense classification head that refines high-level features before the final prediction. The head comprises two fully connected layers of 512 and 128 neurons, each using ReLU activation with dropout rates of 0.5 and 0.3, respectively. The final output layer applies a SoftMax activation across five units corresponding to the BI-RADS categories (1–5). This multi-stage head integrates the contextual representation produced by the Transformer and outputs a probability distribution that is further shaped by the DAL loss to yield ordinally consistent predictions.
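Continuing the trunk and loss sketches above, the head can be attached and compiled as follows; the optimizer and learning rate follow the training section, and the wiring itself is an assumption about the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers

trunk = build_hybrid_trunk()                  # trunk sketch from above

# Dense head: 512 -> 128 ReLU units with dropout 0.5 / 0.3, then 5-way softmax.
x = layers.Dense(512, activation="relu")(trunk.output)
x = layers.Dropout(0.5)(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(5, activation="softmax")(x)   # BI-RADS 1-5

model = tf.keras.Model(trunk.input, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss=dal_loss,                  # DAL loss sketch from above
              metrics=["accuracy"])
```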

Optimization and training
A two-stage optimization pipeline was implemented to ensure stable convergence while preserving pretrained visual knowledge in the ResNet backbone.

In Stage 1 (Feature Adaptation Phase), the ResNet backbone remained frozen and acted as a stable low-level feature extractor. Only the Transformer encoder and dense classification head were trained, enabling the model to learn high-level semantic representations specific to mammography.

In Stage 2 (Fine-Tuning Phase), the final 40 convolutional layers of ResNet-50 were unfrozen to allow deeper feature adaptation.

Training continued with a reduced learning rate (1e-5) using the Adam optimizer, early stopping, and learning-rate decay to ensure stable convergence (Fig. 5). This progressive fine-tuning approach combines frozen feature reuse with domain-specific optimization, leading to superior generalization on unseen cases.
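A sketch of this two-stage schedule, continuing the model built above; `train_ds`/`val_ds` and the backbone layer name `"resnet50"` (the Keras applications default) are assumptions, and the callback hyperparameters besides the five-epoch patience are illustrative.

```python
import tensorflow as tf

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=2),
]

# Stage 1 - feature adaptation: backbone frozen, LR = 1e-4, ~4 epochs.
model.fit(train_ds, validation_data=val_ds, epochs=4)

# Stage 2 - fine-tuning: unfreeze the last 40 ResNet-50 layers, LR = 1e-5.
resnet = model.get_layer("resnet50")
for layer in resnet.layers[-40:]:
    layer.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss=dal_loss, metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=60, callbacks=callbacks)
```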

Experimental results

All experiments were conducted in a Google Colaboratory Pro environment with GPU acceleration, typically providing an NVIDIA Tesla T4 GPU (16 GB VRAM). The implementation was carried out in Python using both PyTorch and TensorFlow libraries with CUDA support. The dataset was partitioned into 70% training, 15% validation, and 15% test sets, preserving the relative distribution of BI-RADS categories and ensuring that minority classes were represented across all subsets. Model training employed a batch size of 8 with the Adam optimizer; the initial learning rate was set to 1e-4 during the first training stage and reduced to 1e-5 for fine-tuning of the last ~40 layers. Learning-rate scheduling was handled with a ReduceLROnPlateau strategy. Training was capped at 60 epochs in the fine-tuning stage, with early stopping applied after five epochs of non-improvement, while the initial head-training stage was limited to four epochs. Table 1 summarizes the parameters used during training. Performance was evaluated using accuracy, precision, recall, F1-score, Area Under the Curve (AUC), and Matthews correlation coefficient (MCC), supported by confusion matrices, ROC curves, and calibration plots. t-SNE visualizations were used to examine feature-space separability, while Grad-CAM heatmaps provided interpretability by highlighting the image regions most influential in model decisions.
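For reference, the reported metrics can be computed with scikit-learn as sketched below; `y_test` (integer labels), `test_ds`, and the macro one-vs-rest AUC averaging are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

probs = model.predict(test_ds)        # softmax outputs, shape (n, 5)
y_pred = probs.argmax(axis=1)

print("accuracy :", accuracy_score(y_test, y_pred))
print("macro F1 :", f1_score(y_test, y_pred, average="macro"))
print("MCC      :", matthews_corrcoef(y_test, y_pred))
# Mean AUC: one-vs-rest, macro-averaged over the five classes.
print("mean AUC :", roc_auc_score(y_test, probs,
                                  multi_class="ovr", average="macro"))
```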

Performance of the proposed model
The proposed Multi-Scale Hybrid ResNet–Transformer with DAL loss achieved consistent performance across all experimental splits. The training and validation curves (see Fig. 6) show a steady increase in accuracy and a corresponding decrease in loss, confirming stable convergence and minimal overfitting. On the test dataset, the proposed model achieved an accuracy of 0.921, indicating overall correctness in classification. The macro F1-score of 0.920 provides a balanced measure under class imbalance and is often more informative than accuracy alone. To evaluate discriminative ability independent of decision thresholds, a mean AUC of 0.987 was obtained, indicating how well the model separates benign from suspicious or malignant categories. Finally, the Matthews correlation coefficient (MCC) of 0.903, a strong indicator of overall multi-class reliability under class imbalance, demonstrates robust performance and outperforms the baseline architectures. The per-class evaluation indicated excellent sensitivity and precision for clinically significant categories, such as BI-RADS 4 with an F1-score of 0.923 and BI-RADS 5 with an F1-score of 0.969. These results demonstrate the model's ability to identify malignant lesions accurately while maintaining robustness across normal and benign categories. Table 2 summarizes the class-wise precision, recall, and F1-scores for the training, validation, and test sets. The model exhibits strong consistency across all categories, with minor deviations observed in BI-RADS 2 and BI-RADS 4, likely due to clinical ambiguity between benign and suspicious lesions.

Furthermore, the confusion matrices confirm that most misclassifications occur between adjacent BI-RADS categories, such as 2 ↔ 3 or 4 ↔ 5, which is clinically acceptable and expected due to their subtle inter-category differences. This behavior supports the contribution of DAL loss, which enforces ordinal consistency and reduces large misclassification penalties. Figure 7 (a) and (b) show the confusion matrix (raw counts) and the normalized confusion matrix, respectively.

Reliability and discriminative ability
In addition to the performance of the proposed model, the study highlights both high discriminative performance and strong reliability across all data splits. As summarized in Table 3, the model achieved training accuracy of 0.995, validation accuracy of 0.922, and test accuracy of 0.921, with corresponding macro F1-scores of 0.995, 0.921, and 0.920, respectively. The Matthews correlation coefficient (MCC) remained high across all splits (≥ 0.90), confirming stable and balanced predictions across BI-RADS categories. Furthermore, the mean AUC values exceeded 0.98 for all sets (train = 1.000, validation = 0.991, test = 0.987), indicating excellent separability between diagnostic categories.
The ROC curves in Fig. 8 (a) show near-perfect area under the curve for all BI-RADS classes, with BI-RADS 3 and BI-RADS 5 achieving AUC = 1.000 and 0.996, respectively, reflecting highly confident discrimination of clinically significant categories. We further evaluated the calibration of the proposed model to assess whether its predicted probabilities aligned with true outcome frequencies. Calibration is critical in clinical applications, as BI-RADS categories are interpreted probabilistically and directly inform follow-up and biopsy decisions. Figure 8 (b) illustrates the calibration curves for each BI-RADS class; the diagonal dashed line represents perfect calibration, where predicted probabilities correspond exactly to observed frequencies. The proposed Hybrid ResNet–Transformer model demonstrated good calibration across most categories, particularly BI-RADS 3, 4, and 5, which are clinically significant for distinguishing probably benign, suspicious, and malignant findings. Although BI-RADS 2 exhibited slight under-confidence in lower probability ranges, the overall trends confirm that the model's probability estimates are well calibrated. This shows that the proposed framework not only classifies mammograms effectively but also produces interpretable, risk-aware predictions, enhancing its potential as a clinical decision-support tool.
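Per-class calibration curves of this kind can be produced one-vs-rest as sketched below, reusing `y_test` and `probs` from the evaluation sketch; the bin count is an assumption.

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
for c in range(5):
    # One-vs-rest reliability curve for class c (BI-RADS c + 1).
    frac_pos, mean_pred = calibration_curve(
        (y_test == c).astype(int), probs[:, c], n_bins=10)
    plt.plot(mean_pred, frac_pos, marker="o", label=f"BI-RADS {c + 1}")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed frequency")
plt.legend()
plt.show()
```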

Feature space visualization
To interpret the learned feature representations, a t-SNE visualization of the penultimate layer embeddings is presented in Fig. 9. Distinct and well-defined clusters are observed for each BI-RADS category, indicating that the proposed hybrid architecture learns separable and discriminative feature spaces, particularly between malignant (BI-RADS 4–5) and benign categories. These clustering patterns highlight the model’s capability to differentiate malignancy-related classes while capturing the diagnostic ambiguity present in borderline cases.
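A t-SNE projection of penultimate-layer embeddings can be sketched as follows; the layer index assumes the head built earlier (softmax last, preceded by dropout and the 128-unit dense layer), and the perplexity is an assumption.

```python
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.manifold import TSNE

# Embeddings from the penultimate (128-unit) dense layer.
feat_model = tf.keras.Model(model.input, model.layers[-3].output)
emb = feat_model.predict(test_ds)

proj = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(emb)
plt.scatter(proj[:, 0], proj[:, 1], c=y_test, cmap="viridis", s=10)
plt.colorbar(label="BI-RADS class (0-indexed)")
plt.show()
```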

Interpretability with Grad-CAM
To enhance the interpretability of the proposed hybrid model, Grad-CAM was employed to visualize the regions contributing most to the BI-RADS classification decision. Figure 10 illustrates representative examples from BI-RADS 2, 3, and 5 categories, showing the original and CLAHE-enhanced images alongside the corresponding Grad-CAM heatmaps, class probability distributions, and activation intensity histograms. The visualizations reveal that the model accurately focuses on diagnostically relevant regions such as localized masses, dense glandular tissue, and microcalcifications, while suppressing background structures.
For BI-RADS 2 (benign), attention is distributed over uniform parenchymal patterns with limited high-intensity activations, indicating low suspicion. In contrast, BI-RADS 3 (probably benign) demonstrates moderate localized activations corresponding to small, well-defined masses, suggesting the model’s sensitivity to early morphological irregularities. The BI-RADS 5 (malignant) case exhibits a strong, sharply concentrated activation centered on the lesion core, reflecting the network’s ability to capture pathologically significant features. The accompanying class probability and intensity distribution plots further confirm the confidence and focus consistency across predictions. These findings underscore the clinical transparency of the proposed Distance-Aware ResNet–Transformer model, validating that its learned attention aligns with radiologically meaningful patterns.
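A minimal Grad-CAM sketch is given below; `conv_layer_name` must identify the last convolutional layer reachable in the actual model graph (with a nested ResNet-50 backbone it may need to be resolved on the sub-model), so the layer lookup is an assumption about the authors' implementation.

```python
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_idx=None):
    """Heatmap of gradient-weighted activations for one class score."""
    grad_model = tf.keras.Model(
        model.input,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        if class_idx is None:
            class_idx = int(tf.argmax(preds[0]))        # predicted class
        score = preds[:, class_idx]
    grads = tape.gradient(score, conv_out)               # d(score)/d(features)
    weights = tf.reduce_mean(grads, axis=(1, 2))         # GAP over spatial dims
    cam = tf.reduce_sum(conv_out[0] * weights[0], axis=-1)
    cam = tf.nn.relu(cam) / (tf.reduce_max(cam) + 1e-8)  # normalize to [0, 1]
    return cam.numpy()
```

The resulting map is upsampled to the input resolution and overlaid on the (CLAHE-enhanced) mammogram for visualization.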

Ablation study and model behavior
The ablation results summarized in Table 4 show the effect of each architectural modification. The baseline ResNet-50 + dense head achieved 0.879 accuracy and 0.988 AUC. Adding a Transformer layer with two attention heads improved feature interaction, raising accuracy to 0.886. However, deeper or wider attention (two layers, four heads) did not improve performance, suggesting overfitting. The best performance, with an accuracy of 0.921 and an AUC of 0.987, was obtained when the ResNet backbone was partially fine-tuned with a single Transformer layer. Reducing the dropout slightly lowered scores, confirming that regularization is essential for stable convergence. Overall, a lightweight hybrid configuration with limited attention depth and controlled fine-tuning provided the best balance between accuracy and generalization.

Discussion

The proposed Multi-Scale Hybrid ResNet–Transformer with DAL loss demonstrates both strong quantitative performance and clinically meaningful interpretability for BI-RADS mammographic classification. Beyond achieving high accuracy and F1-scores, the model exhibits stable generalization, reliable probability calibration, and consistent robustness across multiple evaluation metrics, indicating its suitability for clinical decision-support systems.

Model interpretability and probability reliability
To examine the reliability of the model’s probabilistic outputs, Fig. 11 presents the per-class probability distributions for the true BI-RADS categories. The majority of samples exhibit high predicted probabilities (≥ 0.8) for their respective classes, indicating well-calibrated confidence levels. The distributions are compact for BI-RADS 1, 3, and 5, confirming the network’s high certainty in normal, probably benign, and malignant classifications. In contrast, slightly broader spreads observed in BI-RADS 2 and 4 reflect clinically expected ambiguity between benign and suspicious categories, where imaging features often overlap. This pattern suggests that the model not only performs accurately but also captures diagnostic uncertainty patterns reflective of real radiological interpretation.

Clinical utility and generalizability
To further evaluate the clinical utility of the proposed model, Decision Curve Analysis (DCA) was performed, as shown in Fig. 12 (a). The DCA compares the net benefit of the model against “treat-all” and “treat-none” strategies across varying probability thresholds. The hybrid ResNet–Transformer consistently yields higher net benefit within the clinically relevant threshold range (0.2–0.8), confirming that it offers meaningful diagnostic support while reducing unnecessary recalls or biopsies. This finding highlights the model’s practical viability for assisting radiologists in decision-making, particularly for cases in intermediate BI-RADS categories where biopsy recommendations are uncertain.
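The net benefit underlying a decision curve follows the standard formula, sketched below for a one-vs-rest reading (e.g., "suspicious or worse"); `y_bin` and `p_bin` are illustrative binary labels and probabilities, not variables from the paper.

```python
import numpy as np

def net_benefit(y, p, thresholds):
    """Net benefit = TP/n - FP/n * t/(1 - t) at each threshold t."""
    n = len(y)
    nb = []
    for t in thresholds:
        tp = np.sum((p >= t) & (y == 1))
        fp = np.sum((p >= t) & (y == 0))
        nb.append(tp / n - fp / n * (t / (1 - t)))
    return np.array(nb)

thresholds = np.linspace(0.05, 0.95, 19)
nb_model = net_benefit(y_bin, p_bin, thresholds)               # model strategy
nb_all = net_benefit(y_bin, np.ones_like(p_bin), thresholds)   # "treat-all"
nb_none = np.zeros_like(thresholds)                            # "treat-none"
```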
Furthermore, the radar chart presented in Fig. 12 (b) summarizes six performance indicators (accuracy, precision, recall, F1-score, AUC, and MCC) across the training, validation, and test sets. The nearly hexagonal shape and overlapping contours show balanced generalization without severe overfitting. While the training metrics are slightly higher, the validation and test contours remain close, confirming that the model maintains consistency under unseen data distributions. The inclusion of the Matthews correlation coefficient (MCC) further validates robustness by incorporating true and false predictions across all categories, reinforcing the model's stability and reliability in multi-class classification.

Comparative performance of the proposed model
The performance of the proposed Hybrid ResNet–Transformer with DAL loss was compared with several recent BI-RADS classification frameworks, as summarized in Table 5. It was observed that, compared with PatchCascade-ViT (85% accuracy, 84.9% F1), the proposed hybrid framework achieves a 7% absolute improvement in both accuracy and macro F1, confirming the advantage of combining convolutional and transformer representations with ordinal DAL loss.
Unlike ultrasound-based systems (AUC ≈ 0.92) or edge-enhanced BUS models (≤ 74%), the proposed approach provides mammography-specific interpretability through calibrated probability outputs, Grad-CAM based lesion localization, and decision-curve validation, thereby demonstrating superior clinical readiness and diagnostic reliability.

Overall, the proposed hybrid model effectively integrates spatial feature learning (ResNet), contextual reasoning (Transformer), and ordinal consistency (DAL loss), yielding both strong quantitative performance and interpretable decision boundaries. Its performance across BI-RADS categories and its calibration reliability underline its potential as a clinically interpretable and trustworthy AI-based screening tool.

Limitations of the work
Despite these promising results, several limitations must be acknowledged. The dataset used in this study was relatively small, imbalanced, and drawn from a single public source, which may restrict generalizability to diverse clinical settings and imaging systems. In addition, most existing deep learning approaches for breast imaging focus on binary benign–malignant classification rather than full ordinal BI-RADS stratification22–26. As a result, direct comparisons with dedicated ordinal or multi-class BI-RADS frameworks remain limited, restricting broader benchmarking of the proposed method.
Breast density, an established factor influencing both cancer risk and mammography interpretability, was not explicitly modeled due to the absence of consistent density annotations in the dataset. This may contribute to residual misclassification, particularly between adjacent BI-RADS categories such as BI-RADS 2, 3, and 4, where dense parenchymal tissue can obscure lesion boundaries. Furthermore, external validation on independent or multi-institutional datasets was not conducted, leaving open questions about how well the model would perform across populations and screening protocols. Although Grad-CAM provided image-level interpretability, seamless clinical integration requires additional considerations such as inference speed, workflow compatibility, and interoperability with picture archiving and communication systems (PACS). Our future research will prioritize validation on larger and more diverse datasets, explicit modeling of breast density, and pilot testing within real-world radiology workflows.

Conclusion

In this study, a Multi-Scale Hybrid ResNet–Transformer model with Distance-Aware Learning was proposed for interpretable BI-RADS mammographic classification. By combining ResNet's spatial feature extraction with Transformer-based contextual reasoning and the ordinal DAL loss, the model achieved both high quantitative accuracy and clinically consistent interpretability. The results show robust generalization across training, validation, and test splits, achieving 0.921 accuracy, 0.920 macro F1-score, 0.987 mean AUC, and 0.903 MCC on the test set. Grad-CAM visualizations confirmed that the model attends to clinically relevant regions, while probability calibration and decision-curve analyses established the reliability and clinical utility of its predictions. The framework's strong alignment between prediction confidence and diagnostic uncertainty supports its potential as a trustworthy AI-assisted tool for radiologists. However, limitations include the small dataset size, the lack of breast-density modeling, and the absence of external validation across multiple institutions. Future work will address these gaps by expanding dataset diversity, integrating density-aware feature modeling, and conducting prospective clinical validation to assess real-world performance and workflow integration.
