Enhanced breast cancer detection framework based on YOLOv11n with multi-scale feature calibration.


He Z, Zhang C, Liang C, Li W


Scientific Reports, 16(1), 2026. https://doi.org/10.1038/s41598-026-39723-w · PMID: 41673415

Abstract

Breast cancer poses a persistent global health challenge, making early diagnosis indispensable for reducing mortality and improving patient prognosis. However, conventional detection paradigms are frequently impeded by the inherent complexity of lesions, characterized by minute dimensions, morphological heterogeneity, and indistinct boundaries. To address these impediments, this study proposes an advanced detection framework building upon YOLOv11n. We introduce three novel architectural components, specifically the C3k2-DCNv2-Dynamic, C2CGA, and CSFCN modules, designed to synergize feature extraction, fusion, and calibration. The C3k2-DCNv2-Dynamic module employs dynamic convolution and deformable mechanisms to robustly accommodate scale variations. Concurrently, the C2CGA module exploits a channel-guided attention mechanism within a multi-branch topology to heighten sensitivity toward complex lesion regions. Furthermore, the CSFCN module synthesizes contextual and spatial feature calibration to refine the identification of small targets. Extensive empirical evaluations validate the efficacy of the proposed method. The model achieved a precision of 66.6% and a mean Average Precision (mAP@0.5) of 86.2%, surpassing the baseline YOLOv11n by 5.5 and 2.4 percentage points, respectively. Notably, detection accuracy for small G2-class lesions improved by a substantial 14.1 percentage points. These findings substantiate the superior performance of our framework in resolving small and complex lesion detection, suggesting significant potential for clinical deployment.


Introduction
Breast cancer is one of the most prevalent malignant tumors among women worldwide, and both its incidence and mortality rates are steadily rising, posing a significant threat to public health. Epidemiological data show a consistent increase in breast cancer cases in recent years, with a noticeable trend toward younger ages at diagnosis1. The World Health Organization (WHO) predicts that by 2025 there will be approximately 2.47 million new breast cancer cases globally, with over 20% of these cases occurring in China, highlighting the severity of the prevention and control challenge. Early detection, early diagnosis, and early treatment are widely considered the most effective strategies for reducing breast cancer mortality. Studies indicate that the five-year survival rate for Stage I breast cancer patients exceeds 99%2, while for Stage IV (distant metastasis) it declines sharply to 29%3, underscoring the critical need for high-quality screening methods.
Current clinical breast cancer screening and diagnostic imaging technologies include mammography, breast ultrasound, and magnetic resonance imaging (MRI). However, these traditional methods have notable limitations. Firstly, diagnosis heavily depends on the radiologist’s experience and subjective judgment, which can be influenced by factors such as fatigue and variability in expertise4. Secondly, mammography’s sensitivity significantly diminishes in dense breast tissue, often leading to poor contrast between lesions and glandular tissue, resulting in false-negative rates of approximately 10%–20%, particularly for small lesions or early-stage abnormalities5. More critically, in primary healthcare institutions, limited specialist resources and high workloads further exacerbate the risks of missed and misdiagnosed cases. Consequently, there is an urgent need to develop more intelligent, objective, and efficient auxiliary screening tools to improve early breast cancer diagnosis rates and reduce misdiagnosis rates6.
In this context, artificial intelligence (AI), particularly deep learning (DL), has emerged as a promising solution for intelligent breast cancer image screening and diagnosis due to its powerful capabilities in automated feature extraction and pattern recognition7. Convolutional neural network (CNN)-based object detection and segmentation models can automatically learn complex spatial and texture features from large-scale breast imaging data, reducing the reliance on manually designed features and subjective experience. These models hold promise for improving the detection rates of small lesions, microcalcifications, and hidden abnormalities8. Among them, the YOLO (You Only Look Once) series of algorithms have gained considerable attention in medical image object detection due to their end-to-end, single-stage framework that balances detection accuracy and real-time performance, making them a focal point of research for automatic lesion detection in mammograms and ultrasound images9.
To enhance the accuracy and efficiency of breast cancer diagnosis, researchers have explored various technological approaches, with AI-based diagnostic models leading the way. Recent studies have further demonstrated the efficacy of deep learning frameworks in medical image analysis, particularly in improving classification accuracy and robustness against noise10–13. For instance, Lin et al.14 developed a stepwise breast cancer diagnostic model architecture. Initially, artificial neural networks (ANN) and support vector machines (SVM) were used to classify breast cancer risk factors and identify relevant risks. This was followed by the use of a pre-trained deep learning model combined with transfer learning techniques to diagnose benign and malignant tumors in mammograms. However, due to limited computational resources, the optimization of the deep learning network’s performance was constrained, which may limit its practical clinical application. Prodan et al.15 proposed a deep learning-based solution for analyzing mammogram images and detecting breast cancer, utilizing various computer vision models based on CNN and ViT architectures, alongside StyleGAN-XL-based synthetic image generation for data augmentation to improve model performance. However, this study used only positive samples for synthetic image generation, and future work could explore the use of both positive and negative samples for more balanced data augmentation to enhance the model’s generalization ability. Raaj et al.16 proposed a breast cancer detection and diagnosis method based on a hybrid CNN architecture, combining Radon transforms, data augmentation, and hybrid CNNs. This method aimed to enhance breast cancer detection rates by transforming spatial pixels into time-frequency change images and performing data augmentation, followed by morphological segmentation of cancer cell regions. However, the primary limitation of this method lies in its morphological segmentation algorithms, which can only detect internal cancer cell pixels, neglecting external cancer cell pixels and reducing the accuracy of detection. Sha et al.17 proposed an automatic breast cancer detection method based on deep learning and optimization algorithms, combining CNN with grasshopper optimization algorithms (GOA) to reduce image noise, optimize segmentation, and extract and select features, ultimately using SVM for classification. While this method improves breast cancer detection accuracy and reduces computational costs, the complexity of the deep learning and optimization algorithms makes the process time-consuming, especially during the training phase, limiting its application in real-time or resource-constrained environments. Zakareya et al.18 proposed a deep learning-based model for breast cancer diagnosis using medical images. This model combined features of GoogLeNet and residual blocks, incorporating granular computing, learnable activation functions, attention mechanisms, and wide and deep network structures to improve diagnostic accuracy. However, the inclusion of granular computing increased preprocessing time before training, and the model’s sensitivity to image granularity could affect the extraction of certain patterns. Khalid et al.19 proposed a machine learning-based breast cancer detection method that used six different classification models for diagnosis. 
This study utilized standardization processing and feature selection techniques, using scikit-learn for feature selection and evaluating model performance using confusion matrices and various performance metrics. However, when processing large datasets, this approach may require substantial computational resources. Shen et al.20 proposed an end-to-end training method for breast cancer screening using full-breast image classification. This approach trained a local image block classifier using a fully annotated dataset and subsequently fine-tuned it using a large-scale dataset. However, the method relies on limited fully annotated datasets for initial training and requires significant computational resources for handling large datasets.
Despite advancements in existing breast lesion detection algorithms, challenges persist due to the small size, diverse morphology, and blurry boundaries of breast lesions, which result in insufficient detection accuracy. These factors continue to pose significant challenges to the accuracy and robustness of algorithms. To address these issues, this paper proposes an enhanced breast cancer detection framework based on YOLOv11n. The framework introduces several innovative modules to optimize the feature extraction, fusion, and calibration processes, significantly improving breast cancer imaging detection performance. The main contributions of this paper are as follows:

C3k2-DCNv2-Dynamic Module: This module employs dynamic convolution and deformable mechanisms to automatically adapt to various image features, optimizing feature extraction and effectively addressing scale variation and geometric distortion in the images.

C2CGA Module for Feature Fusion Optimization: This module integrates a channel-guided attention mechanism (CGA) and a multi-branch structure, enabling more accurate capture of important features within the image and enhancing the model’s sensitivity to complex lesion areas, particularly when detecting small lesions and lesions with significant morphological changes.

CSFCN Module for Small Lesion Detection: This module combines contextual feature calibration (CFC) and spatial feature calibration (SFC), optimizing features from both semantic and spatial dimensions, thereby improving the model’s sensitivity to early lesions and subtle structural changes in peripheral areas.

Related work

Bidirectional feature pyramid network
The Bidirectional Feature Pyramid Network (BiFPN) is an efficient multi-scale feature fusion module that significantly enhances object detection and semantic segmentation performance through bidirectional feature interaction and weighted fusion mechanisms21. Its core design integrates both top-down and bottom-up pathways: the former propagates high-resolution features downward to strengthen small-object detection, while the latter transmits information upward to refine large-object features. Moreover, BiFPN introduces learnable weights to dynamically adjust the contributions of different feature levels, thereby addressing the limitations of traditional approaches such as insufficient information flow in FPN, redundant structures in PANet, and the high search cost of NAS-FPN22,23. Structural optimizations—including the removal of redundant nodes and the addition of extra connections between inputs and outputs at the same level—further improve efficiency. For example, normalized weights are used to balance the fusion of features across different resolutions, enabling effective multi-scale representation while maintaining low computational cost.
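To make the fusion rule concrete, the sketch below implements BiFPN's fast normalized fusion in PyTorch: one learnable non-negative weight per input, normalized to sum to one. The class and variable names are ours, and the inputs are assumed to be already resized to a common resolution.

```python
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    """BiFPN-style weighted fusion: learns one non-negative scalar per input
    and normalizes the weights so they sum to 1, letting the network decide
    how much each feature level contributes."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, features):  # features: list of tensors with equal shapes
        w = torch.relu(self.weights)      # keep weights non-negative
        w = w / (w.sum() + self.eps)      # fast normalization (no softmax)
        return sum(wi * f for wi, f in zip(w, features))

# Example: fuse a top-down feature with the same-level input feature
p4_td = FastNormalizedFusion(2)([torch.randn(1, 64, 40, 40),
                                 torch.randn(1, 64, 40, 40)])
```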

Full-dimensional dynamic convolution
Omni-Dimensional Dynamic Convolution (ODConv), proposed by Yao Anbang’s team at Intel Labs, is a novel convolutional module designed to overcome the limitations of both conventional convolution and existing dynamic convolution methods24. In standard convolution, kernel parameters remain fixed after training and cannot adapt dynamically to different input features25. Although current dynamic convolution methods introduce a degree of adaptivity through multi-kernel weighting, their flexibility is typically restricted to a single dimension—namely, the number of kernels—thus limiting their representational power26,27. The core innovation of ODConv lies in its dynamic adjustment across four key dimensions of the convolution kernel: (1) the kernel-number dimension, adaptively selecting among multiple kernels to match the input; (2) the spatial dimension, enabling location-specific receptive fields for improved adaptivity; (3) the input-channel dimension, dynamically emphasizing or suppressing different input features through learned weights; and (4) the output-channel dimension, flexibly reallocating output features to enhance representation24. To realize this design, ODConv incorporates a lightweight attention module that predicts weights for the above four dimensions based on the input features. A decomposition strategy is further employed to reduce computational complexity, thereby achieving high efficiency while significantly improving the representational capacity and adaptability of convolutional layers. Experimental results demonstrate that ODConv consistently outperforms baselines built on ResNet25, as well as the Squeeze-and-Excitation (SE) module28 and conventional dynamic convolution methods26,27, across multiple tasks, including image classification (ImageNet), object detection (COCO), and semantic segmentation. As such, ODConv is regarded as a versatile and practically valuable convolutional enhancement module24.
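The following PyTorch sketch illustrates the four attention dimensions described above. It is a simplified rendition under several assumptions (single convolution group, independent sigmoid/softmax attention heads, no temperature annealing), not the official ODConv implementation; all names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ODConv2dSketch(nn.Module):
    """Simplified omni-dimensional dynamic convolution: a pooled descriptor
    drives four attention heads, one per kernel dimension."""
    def __init__(self, c_in, c_out, k=3, num_kernels=4, reduction=16):
        super().__init__()
        self.k, self.c_in, self.c_out = k, c_in, c_out
        self.weight = nn.Parameter(torch.randn(num_kernels, c_out, c_in, k, k) * 0.02)
        hidden = max(c_in // reduction, 8)
        self.fc = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                nn.Linear(c_in, hidden), nn.ReLU(inplace=True))
        self.attn_kernel = nn.Linear(hidden, num_kernels)  # kernel-number dim
        self.attn_spatial = nn.Linear(hidden, k * k)       # spatial dim
        self.attn_cin = nn.Linear(hidden, c_in)            # input-channel dim
        self.attn_cout = nn.Linear(hidden, c_out)          # output-channel dim

    def forward(self, x):
        b, _, h, w_ = x.shape
        z = self.fc(x)
        a_k = torch.softmax(self.attn_kernel(z), dim=1)                  # (b, n)
        a_s = torch.sigmoid(self.attn_spatial(z)).view(b, 1, 1, self.k, self.k)
        a_i = torch.sigmoid(self.attn_cin(z)).view(b, 1, self.c_in, 1, 1)
        a_o = torch.sigmoid(self.attn_cout(z)).view(b, self.c_out, 1, 1, 1)
        # mix candidate kernels per sample, then modulate the other three dims
        w = torch.einsum('bn,noihw->boihw', a_k, self.weight)
        w = w * a_s * a_i * a_o                          # (b, c_out, c_in, k, k)
        # batched per-sample convolution via the grouping trick
        x = x.view(1, b * self.c_in, h, w_)
        w = w.reshape(b * self.c_out, self.c_in, self.k, self.k)
        y = F.conv2d(x, w, padding=self.k // 2, groups=b)
        return y.view(b, self.c_out, h, w_)

y = ODConv2dSketch(32, 64)(torch.randn(2, 32, 40, 40))  # usage example
```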

YOLOv11
YOLOv1129, released by the Ultralytics team on September 30, 2024, represents the latest milestone in the YOLO series. Building upon the strengths of YOLOv8, it achieves a superior balance between efficiency and accuracy, and is widely regarded as the most optimized version of the series to date. Its core architecture retains the three-part design of backbone, feature fusion network, and detection head, but introduces several key upgrades. First, in the backbone, YOLOv11 replaces the original C2f module with a newly designed C3k2 module. This enhancement not only ensures stable gradient flow during training but also significantly improves the extraction of fine-grained details and high-level semantic information from images. Second, during the feature fusion stage, the model incorporates an innovative C2PSA module, a spatial attention mechanism that guides the network to focus more precisely on critical target regions within the image, thereby improving both the specificity and accuracy of feature representation.

Method design
Breast cancer detection remains a highly challenging task, particularly when addressing small lesions, morphological variations, and blurry boundaries. To overcome these obstacles, this paper proposes an advanced breast cancer detection framework based on YOLOv11n. The selection of YOLOv11n as the baseline model is driven by its specific suitability for breast imaging tasks. First, regarding clinical deployability, the lightweight “Nano” architecture offers an optimal balance between low parameter count and high inference speed, making it highly suitable for integration into resource-constrained medical imaging devices. Second, in terms of feature extraction, the updated C3k2 module provides superior capabilities in capturing fine-grained textural variations compared to previous versions, which is critical for identifying subtle breast lesions. Third, the native C2PSA (Cross-Stage Partial Spatial Attention) mechanism effectively guides the model to focus on informative lesion regions amidst complex glandular backgrounds. The framework introduces several innovative modules designed to optimize feature extraction, fusion, and calibration, thereby significantly improving detection performance in breast cancer imaging.

Specifically, the structural integration of these modules into the YOLOv11n architecture is designed as follows: First, in the Backbone network, the original C3k2 blocks are replaced by the proposed C3k2-DCNv2-Dynamic module. This substitution allows the feature extraction stage to dynamically adapt to geometric distortions typical of breast lesions. Second, in the Feature Fusion Neck, the C2CGA module is embedded into the multi-branch PANet structure. It acts as a channel-guided filter during feature aggregation to suppress background noise. Finally, the CSFCN module is strategically positioned between the Neck and the Detection Heads. Unlike standard connections, it serves as a pre-head calibration bridge, refining the multi-scale features spatially and semantically before they enter the decoupled detection heads for final regression and classification. The network architecture of the enhanced YOLOv11n is shown in Fig. 1.

C3k2-DCNv2-dynamic module
To improve the accuracy and robustness of breast cancer detection, it is essential to address issues such as object scale variation and geometric deformation in medical images. The C3k2-DCNv2-Dynamic module enhances the model’s adaptability by introducing dynamic convolution and deformable mechanisms, which automatically adjust the convolution filters based on different image features. This enables the model to perform more precisely when handling fine details and complex structures, particularly in detecting small lesions and subtle changes, thereby significantly improving detection accuracy.
As shown in Fig. 2, C3k2-DCNv2-Dynamic is the core module of the entire architecture. The input features are initially transformed through a convolution, and then divided into two parallel feature paths. Each path consists of an independent C3k-DCNv2-Dynamic branch. In each C3k branch, the input features first undergo convolution to establish the backbone features, followed by stacking multiple Bottleneck-DCNv2-Dynamic submodules sequentially within the backbone. At the same time, a bypass feature, which is not subjected to the stacking process, is preserved. After the backbone stacking is completed, it is concatenated with the bypass feature at the end, and the output of the branch is obtained through convolutional fusion. The Bottleneck-DCNv2-Dynamic used within the C3k branch is the fundamental building block of the structure. In this submodule, the input features are divided into two paths: the main path and the shortcut path. The shortcut path directly retains the original features, while the main path sequentially passes through convolution and DCNv2-Dynamic30,31, using dynamic sampling mechanisms to enhance the modeling capability for geometric deformations and local disturbances. The outputs of the main path and shortcut path are concatenated at the end to form the final output of the bottleneck unit. Finally, the C3k2-DCNv2-Dynamic module concatenates the outputs of multiple stacked C3k-DCNv2-Dynamic branches and performs feature fusion through convolution. The formula is as follows:
$$Z = \mathrm{Conv}\big(\mathrm{Cat}\big[\Phi\big(g_1(X)\big),\ \Phi\big(g_2(X)\big)\big]\big)$$

In the formula, $g_i(\cdot)$ represents channel grouping, $\Phi(\cdot)$ represents the combined deformable and dynamic convolution (DCNv2-Dynamic), $\mathrm{Cat}(\cdot)$ represents the concatenation operation, and $Z$ represents the final output result.
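A minimal PyTorch sketch of this structure is given below, assuming torchvision's modulated DeformConv2d as the DCNv2 operator; the dynamic kernel weighting and the exact channel widths of the paper's module are simplified, and all class names are ours.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DCNv2Dynamic(nn.Module):
    """Modulated deformable convolution whose sampling offsets and masks are
    predicted from the input, so the receptive field adapts to lesion
    geometry. (The paper's additional dynamic kernel weighting is omitted.)"""
    def __init__(self, c, k=3):
        super().__init__()
        self.kk = k * k
        self.pred = nn.Conv2d(c, 3 * self.kk, k, padding=k // 2)  # offsets + masks
        self.dcn = DeformConv2d(c, c, k, padding=k // 2)
        self.bn, self.act = nn.BatchNorm2d(c), nn.SiLU()

    def forward(self, x):
        p = self.pred(x)
        offset, mask = p[:, :2 * self.kk], torch.sigmoid(p[:, 2 * self.kk:])
        return self.act(self.bn(self.dcn(x, offset, mask)))

class BottleneckDCN(nn.Module):
    """Bottleneck-DCNv2-Dynamic: the shortcut path keeps the input, the main
    path runs Conv -> DCNv2-Dynamic, and the two are concatenated and fused."""
    def __init__(self, c):
        super().__init__()
        self.cv = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False),
                                nn.BatchNorm2d(c), nn.SiLU())
        self.dcn = DCNv2Dynamic(c)
        self.fuse = nn.Conv2d(2 * c, c, 1)

    def forward(self, x):
        return self.fuse(torch.cat([x, self.dcn(self.cv(x))], dim=1))

class C3k2DCNv2Dynamic(nn.Module):
    """Two parallel branches of stacked bottleneck units, concatenated and
    fused by a 1x1 convolution, mirroring the structure described above."""
    def __init__(self, c, n=2):
        super().__init__()
        self.stem = nn.Conv2d(c, c, 1)
        self.b1 = nn.Sequential(*(BottleneckDCN(c) for _ in range(n)))
        self.b2 = nn.Sequential(*(BottleneckDCN(c) for _ in range(n)))
        self.out = nn.Conv2d(2 * c, c, 1)

    def forward(self, x):
        x = self.stem(x)
        return self.out(torch.cat([self.b1(x), self.b2(x)], dim=1))
```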

C2CGA module
To enhance the performance of breast cancer detection, the C2CGA module is introduced to optimize feature fusion. This module integrates a channel-guided attention mechanism (CGA) with a multi-branch structure, enabling more effective capture of important features within the image. By leveraging a local window attention mechanism, C2CGA focuses on key regions and efficiently fuses features across different layers, thereby improving the model’s ability to identify complex lesion areas, particularly small lesions and those with significant morphological changes.
As illustrated in Fig. 3, the C2CGA module consists of a basic unit and a multi-branch framework. Within each CGABlock, features are initially refined by convolutional/attention units, followed by the introduction of local window attention32 to capture finer contextual dependencies within the window. The resulting attention representations are then integrated with the input via a residual path, facilitating information compensation and fusion. Subsequently, the features are deepened through two convolutional layers, with additional residual connections ensuring smooth gradient and information flow. In the complete C2CGA structure, the input feature map undergoes convolution to perform initial channel and semantic adjustments, after which it is divided into several channel-specific sub-branches. Each branch independently models the features through concatenated CGABlocks. The outputs from all branches are then merged along the channel dimension and unified for fusion. Finally, a convolutional layer further refines and integrates the features, producing the final output of the module. The formula is as follows:
$$Y = \mathrm{Conv}\big(\mathrm{Cat}\big[\mathrm{CGA}\big(g_1(X)\big),\ \ldots,\ \mathrm{CGA}\big(g_n(X)\big)\big]\big)$$

In the formula, $g_i(\cdot)$ represents channel grouping, $\mathrm{CGA}(\cdot)$ represents the Channel-Guided Attention Block, $\mathrm{Cat}(\cdot)$ represents the concatenation operation, and $Y$ represents the final output result.
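One plausible realization of a CGA block is sketched below: an SE-style channel gate stands in for the channel guidance, single-head attention is computed inside non-overlapping windows, and two convolutions with residual connections deepen the features. All names and hyperparameters are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class LocalWindowAttention(nn.Module):
    """Single-head self-attention computed independently inside
    non-overlapping w x w windows (assumes H and W divisible by w)."""
    def __init__(self, c, w=8):
        super().__init__()
        self.w = w
        self.qkv = nn.Linear(c, 3 * c)
        self.proj = nn.Linear(c, c)

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        w = self.w
        x = x.view(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)
        x = x.reshape(-1, w * w, C)            # (B * num_windows, w*w, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)
        x = self.proj(attn @ v)
        x = x.view(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        return x.reshape(B, C, H, W)

class CGABlock(nn.Module):
    """Sketch of a CGA block: channel-gated refinement, local window
    attention with a residual path, then two convolutions with a second
    residual connection."""
    def __init__(self, c, w=8):
        super().__init__()
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.attn = LocalWindowAttention(c, w)
        self.ffn = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(c, c, 3, padding=1))

    def forward(self, x):
        x = x * self.gate(x)      # channel-guided weighting
        x = x + self.attn(x)      # window attention + residual
        return x + self.ffn(x)    # convolutional deepening + residual
```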

CSFCN module
To enhance the performance of the breast cancer detection model, the CSFCN33 module is incorporated into the architecture. Structurally, this module operates at the terminal end of the feature fusion network (Neck), serving as the final calibration step immediately before the decoupled Detection Heads. By performing dual calibration of contextual and spatial features, this module improves the model’s ability to detect small lesions and subtle abnormalities. The CSFCN module consists of two components: Context Feature Calibration (CFC) and Spatial Feature Calibration (SFC), which optimize features from both semantic and spatial dimensions. This dual-calibration approach significantly enhances the model’s sensitivity to early lesions and subtle structural changes in peripheral regions.
As illustrated in Fig. 4, the CFC module addresses the challenge of scale variation by acting as a semantic selector. Instead of standard pooling, it employs cascaded pyramid pooling to capture context at multiple receptive fields, generating a multi-scale context set. For each pixel, the module calculates an affinity score between local features and these global contexts. This mechanism allows the network to adaptively emphasize discriminative features—specifically, filtering out background noise that lacks global contextual support while enhancing lesion features regardless of their physical size. Given the input features $X$, they are first processed by convolution to obtain a reduced-dimensional query $Q$, followed by cascaded pyramid pooling to generate a multi-scale context set $\{c_1, \dots, c_M\}$. For each pixel $p$, the affinity with each context is computed as $\delta(Q_p, c_i)$, which is then used to weight and aggregate the contexts for semantic calibration. A pixel-level learnable recalibration factor $\gamma$ is subsequently applied to fine-tune the responses. Finally, the refined output is fused with the original features via a residual connection, yielding the final result.
$$Y_p = X_p + \gamma_p \sum_{i=1}^{M} \delta\big(Q_p, c_i\big)\, c_i$$

In the formula, $X$ denotes the input and $Y$ denotes the output, while $c_i$ represents the $i$-th context, with $i$ spanning the range $[1, M]$. $\gamma$ refers to the learnable recalibration factor, $M$ denotes the total number of contexts, and $\delta(\cdot, \cdot)$ represents the pairwise function used to compute the affinity between features.
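The sketch below renders this calibration in PyTorch under our own naming: adaptive average pooling at sizes 1, 2, and 4 produces $M = 21$ contexts, per-pixel softmax affinities aggregate them, and a learned pixel-level factor gates the residual fusion.

```python
import torch
import torch.nn as nn

class ContextFeatureCalibration(nn.Module):
    """Sketch of CFC: cascaded pyramid pooling builds M global contexts
    (here M = 1 + 4 + 16 = 21); each pixel computes softmax affinities to
    them, aggregates a context vector, and a learnable pixel-level factor
    rescales it before the residual fusion."""
    def __init__(self, c, pool_sizes=(1, 2, 4)):
        super().__init__()
        self.query = nn.Conv2d(c, c // 2, 1)
        self.key = nn.Conv2d(c, c // 2, 1)
        self.value = nn.Conv2d(c, c, 1)
        self.gamma = nn.Conv2d(c, 1, 1)   # pixel-level recalibration factor
        self.pools = nn.ModuleList(nn.AdaptiveAvgPool2d(s) for s in pool_sizes)

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)                     # (B, HW, C/2)
        ctx = [p(x) for p in self.pools]                                 # multi-scale contexts
        k = torch.cat([self.key(c_).flatten(2) for c_ in ctx], dim=2)    # (B, C/2, M)
        v = torch.cat([self.value(c_).flatten(2) for c_ in ctx], dim=2)  # (B, C, M)
        affinity = torch.softmax(q @ k, dim=-1)                          # delta(Q_p, c_i)
        agg = (affinity @ v.transpose(1, 2)).transpose(1, 2).view(B, C, H, W)
        return x + torch.sigmoid(self.gamma(x)) * agg                    # residual fusion
```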

To compensate for spatial misalignment between low-resolution semantic features and high-resolution detail features, the SFC (Spatial Feature Calibration) module introduces a learnable sampling mechanism. Rather than simple element-wise addition, SFC dynamically warps the coarse semantic features to align spatially with the fine-grained details. Guided by a predicted offset field, the module resamples the feature map, ensuring that high-level discriminative signals are accurately projected onto the precise spatial location of the lesion, preserving the morphological integrity of small targets. As shown in Fig. 5, low-resolution features $X_l$ and high-resolution features $X_h$ are first unified in channel count through convolution. Then, $X_l$ undergoes bilinear upsampling $U$ to match the spatial resolution of $X_h$ and is concatenated with it. Next, parallel convolution branches predict cross-level offset fields $\Delta$ and gating masks $\alpha$, which guide offset-based resampling and weighted fusion. Finally, the subfeatures are resampled using the offset-based calibration function $T$ and weighted and summed according to the gating masks, producing the output features.
$$Y = \alpha_l \odot T\big(U\big(f(X_l)\big), \Delta_l\big) + \alpha_h \odot T\big(f(X_h), \Delta_h\big)$$

Where $f(\cdot)$ represents the convolutional transformation with batch normalization and ReLU, $U$ represents bilinear upsampling, $T$ represents the calibration function, and $\odot$ denotes element-wise multiplication.
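A PyTorch sketch of this mechanism is shown below. The offset and gate predictors as 3 × 3 convolutions, and the use of grid_sample for the calibration function $T$, are our assumptions about one reasonable implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialFeatureCalibration(nn.Module):
    """Sketch of SFC: per-pixel offset fields and gating masks are predicted
    from the concatenated feature pair; each branch is warped with
    grid_sample and the warped features are fused by the gates."""
    def __init__(self, c_low, c_high, c):
        super().__init__()
        self.fl = nn.Sequential(nn.Conv2d(c_low, c, 1, bias=False),
                                nn.BatchNorm2d(c), nn.ReLU(inplace=True))
        self.fh = nn.Sequential(nn.Conv2d(c_high, c, 1, bias=False),
                                nn.BatchNorm2d(c), nn.ReLU(inplace=True))
        self.offset = nn.Conv2d(2 * c, 4, 3, padding=1)  # (dx, dy) per branch
        self.gate = nn.Conv2d(2 * c, 2, 3, padding=1)    # one mask per branch

    @staticmethod
    def warp(x, flow):
        """Resample x at locations shifted by the predicted pixel offsets."""
        B, _, H, W = x.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H, device=x.device),
                                torch.linspace(-1, 1, W, device=x.device),
                                indexing='ij')
        base = torch.stack((xs, ys), dim=-1).expand(B, H, W, 2)
        # normalize pixel offsets into grid_sample's [-1, 1] coordinates
        norm = torch.tensor([2.0 / max(W - 1, 1), 2.0 / max(H - 1, 1)],
                            device=x.device)
        grid = base + flow.permute(0, 2, 3, 1) * norm
        return F.grid_sample(x, grid, align_corners=True)

    def forward(self, x_low, x_high):      # x_low: coarse, x_high: fine
        low = F.interpolate(self.fl(x_low), size=x_high.shape[2:],
                            mode='bilinear', align_corners=False)
        high = self.fh(x_high)
        cat = torch.cat([low, high], dim=1)
        d = self.offset(cat)
        a = torch.softmax(self.gate(cat), dim=1)          # gating masks
        return a[:, 0:1] * self.warp(low, d[:, 0:2]) + \
               a[:, 1:2] * self.warp(high, d[:, 2:4])
```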

Experiment

Dataset
The dataset utilized in this study was sourced from the publicly available “Medical Imaging Dataset: Breast Cancer Detection” on the Roboflow platform34. It consists of high-resolution histopathology microscopy images stained with Hematoxylin and Eosin, which is the clinical gold standard for breast cancer grading. These images capture critical morphological details, including nuclear atypia and mitotic figures, essential for differentiating tumor grades.
As illustrated in Fig. 6, the dataset encompasses five distinct clinical categories: malignant tumor grades (G1, G2, G3), Benign, and Malignant (general). The annotation strategy employs region-level bounding boxes (ROI) to precisely localize lesions. These annotations strictly adhere to clinical pathological grading standards: G1 represents well-differentiated tissue, G2 represents moderately differentiated, and G3 represents poorly differentiated tissue. It is important to note that the dataset exhibits a natural class imbalance typical of medical data, with varying sample counts across categories, which poses a significant challenge for model robustness.
Original high-resolution digital slides were resized to 640 × 640 pixels during preprocessing to balance computational efficiency with the retention of fine-grained features. To facilitate rigorous evaluation, the dataset (totaling 5,131 images) was stratified into three subsets: a training set of 2,727 images used for model parameter learning; a validation set of 2,297 images used to monitor convergence and tune hyperparameters; and a test set of 107 images serving as an independent hold-out for final performance reporting.
All image samples have undergone rigorous anonymization preprocessing to ensure the complete removal of sensitive personal information, complying with privacy protection regulations.

Data preprocessing and training protocols
To ensure the reproducibility of results and optimize the model for breast imaging characteristics, we implemented a rigorous data preprocessing and training protocol. The experimental outcomes are grounded in the software and hardware configurations detailed in Table 1. Beyond basic setup, specific preprocessing strategies were designed to address the challenges of histopathology images: (1) Adaptive Image Resizing (Letterbox): Unlike standard resizing which may distort lesion shapes, we employed an adaptive letterbox strategy. Input images were resized to 640 × 640 while maintaining the original aspect ratio, with gray padding applied to the non-image areas. This step is crucial for medical imaging as it preserves the morphological fidelity of cell nuclei and tissue structures, preventing geometric distortion that could mislead the convolution kernels. (2) Stain-Invariance Augmentation: Histopathology images often exhibit color variations due to differences in staining protocols and scanner specifications. To mitigate this, we applied HSV (Hue, Saturation, Value) augmentation during training (Hue fraction: 0.015, Saturation: 0.7, Value: 0.4). This forces the model to learn structural features of malignant cells rather than relying on specific color intensity, significantly enhancing generalization across different batches. (3) Mosaic and Mixup Augmentation: To address the issue of small lesion detection (e.g., small G2 regions), we utilized Mosaic augmentation, which stitches four random training images into a single input. This technique enriches the background diversity and exposes the model to small targets more frequently within a single batch, directly contributing to the improved recall for small objects observed in our experiments. The model was trained using the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.937 and a weight decay of 0.0005 to prevent overfitting. The initial learning rate was set to 0.01 with a cosine annealing scheduler.
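For reference, the stated hyperparameters map directly onto the Ultralytics training interface; a sketch of the corresponding call follows, where the dataset YAML path and epoch count are hypothetical, and letterbox resizing to the target size is the framework's default behavior.

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")               # YOLOv11n baseline weights
model.train(
    data="breast_cancer.yaml",           # hypothetical dataset config path
    epochs=100,                          # illustrative; the paper's value may differ
    imgsz=640,                           # letterbox resize to 640 x 640
    optimizer="SGD",
    lr0=0.01,                            # initial learning rate
    cos_lr=True,                         # cosine annealing scheduler
    momentum=0.937,
    weight_decay=0.0005,                 # L2 regularization
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,   # stain-invariance HSV augmentation
    mosaic=1.0,                          # Mosaic augmentation enabled
)
```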
Given the inherent risk of overfitting in medical imaging datasets due to limited sample sizes, we implemented a multi-layered mitigation strategy to ensure model generalization: (1) Data Space Expansion: We employed Mosaic and HSV augmentations to fundamentally expand the diversity of the training distribution, preventing the model from memorizing specific background textures or staining artifacts. (2) Regularization: We applied a weight decay of 0.0005 during SGD optimization to impose L2 regularization on the model weights, suppressing the growth of overly complex decision boundaries. (3) Architectural Constraint: We deliberately selected the lightweight YOLOv11n architecture. With only 3.0 million parameters, the model imposes a structural bottleneck that reduces the risk of fitting noise, which is often observed in larger models trained on small datasets. (4) Strict Evaluation: We maintained a strictly isolated Test Set that was never involved in the training process, ensuring that the reported performance reflects true generalization capability rather than memorization.

Evaluation indicators
In this study, the evaluation metrics used are Precision (P), Recall (R), Average Precision (AP), and mean Average Precision (mAP)35. Additionally, the number of parameters (Parameters) is also considered. The calculation expressions for these metrics are as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

$$AP = \int_0^1 P(R)\, dR, \qquad mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i$$

Where $TP$ represents the number of correctly detected targets; $FP$ represents the number of incorrectly detected targets; $FN$ represents the number of missed targets; $n$ denotes the number of categories; and $AP_i$ represents the average precision for the $i$-th target class.
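As a concrete reference, the AP integral above is typically computed from a discrete precision-recall curve using a monotone precision envelope, as in the short sketch below (function names are ours).

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the precision-recall curve (all-point interpolation),
    matching AP = integral of P(R) dR."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # monotone envelope
    idx = np.where(r[1:] != r[:-1])[0]              # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(curves):
    """mAP over n classes, given per-class (recall, precision) curve pairs."""
    return sum(average_precision(r, p) for r, p in curves) / len(curves)
```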

Experimental analysis

Algorithm comparison results
To thoroughly assess the effectiveness and superiority of the proposed method, this study selects prominent single-stage detectors (YOLOv8n36, YOLOv10n37, YOLOv11n, and YOLOv12n38) and the end-to-end detection framework RT-DETR-r1839 as baseline comparisons. Quantitative experiments are conducted under identical training strategies and testing configurations. The evaluation metrics include Precision and mAP@0.5, which assess detection accuracy and overall performance. The experimental results for each method are summarized in Table 2.

As presented in Table 2, the proposed method demonstrates an exceptional balance between detection accuracy and computational efficiency. Our model achieves the highest Precision (66.6%) and mAP@0.5 (86.2%) among all compared algorithms. Notably, it outperforms the latest YOLOv12n by 8.2 percentage points in mAP, indicating that our customized architectural improvements are more effective for breast lesion features than the generic updates in newer YOLO versions. In terms of speed, the proposed method reaches 833.3 FPS on an NVIDIA RTX 4090, which is identical to the baseline YOLOv11n and significantly faster than the Transformer-based RT-DETR-r18 (312.5 FPS). This ultra-high frame rate ensures that the model can process medical video streams or high-resolution slides with negligible latency, fully meeting the requirements for real-time clinical auxiliary diagnosis. Regarding model complexity, our framework maintains a lightweight footprint with 3.00 M parameters and 7.2 GFLOPS. Although there is a marginal increase in parameters compared to the baseline (2.58 M) due to the integration of the DCNv2 and calibration modules, the model remains far more compact than RT-DETR-r18 (19.87 M). Crucially, this slight increase in computational cost yields a substantial improvement of 2.4 percentage points in mAP and 5.5 percentage points in Precision, representing a highly efficient trade-off where significant performance gains are achieved without compromising deployability on resource-constrained medical devices.
To rigorously validate the advancement of the proposed method, we benchmarked it against the most representative frameworks in the current field. Specifically, RT-DETR-r18 represents the state-of-the-art in Transformer-based detection, while YOLOv12n (released in early 2025) represents the latest iteration of CNN-based real-time detectors. Many recent breast cancer detection studies10–14 still rely on older architectures such as Faster R-CNN or standard ResNet backbones, which generally lag behind the YOLOv11/v12 series in terms of inference speed and parameter efficiency. As evidenced in Table 2, our method achieves an mAP@0.5 of 86.2%, surpassing both the Transformer-based RT-DETR (65.0%) and the latest YOLOv12n (78.0%). This quantitative comparison confirms that the proposed framework not only outperforms general-purpose SOTA models but also establishes a new performance benchmark for lightweight breast lesion detection.
In the context of medical image detection, performance metrics must be evaluated with a specific emphasis on clinical safety. Unlike general object detection, the primary objective in cancer screening is to maximize Sensitivity (Recall) to minimize False Negatives (FN), as a missed diagnosis (FN) can lead to delayed treatment and poor patient prognosis. While Precision reflects the rate of False Positives (FP) (which cause unnecessary anxiety and biopsies), a high mAP indicates a robust balance between these two conflicting objectives. As shown in Table 2, our method achieves the highest mAP@0.5 (86.2%) and Precision (66.6%). More importantly, the substantial performance gains in the difficult G2 and G3 malignant categories (as detailed in Table 3) imply a significant reduction in False Negatives for clinically ambiguous lesions. By effectively identifying these high-risk targets that baseline models often miss, the proposed framework demonstrates a superior safety profile suitable for auxiliary diagnostic workflows.

Table 3 demonstrates that, across the three categories—G1, G2, and G3—the proposed method (Ours) achieves the highest and most consistent average precision (AP). In the relatively easier G1 category, all methods surpass 85%, with Ours achieving 93.6%, slightly outperforming the second-best YOLOv11n, while YOLOv12n shows an anomalous drop to 85.0%. In the most challenging G2 category, baseline methods are significantly limited, but Ours reaches 67.9%, representing an improvement of approximately 10.6 percentage points over the strongest baseline, YOLOv10n. This highlights the most notable advantage, demonstrating superior robustness to difficult samples and complex targets. In the moderately challenging G3 category, the YOLO series shows marked improvement, with YOLOv11n approaching its upper limit, yet Ours remains the highest at 85.3%. Overall, while the YOLO series outperforms RT-DETR-r18, there are notable category-specific adaptation differences across versions. In contrast, Ours delivers substantial improvements across all categories, particularly in the difficult G2 category, thereby validating the effectiveness and stability of the proposed method.

Table 4 demonstrates significant category-specific differences in the Precision performance of each algorithm. In the challenging G2 category, overall precision remains relatively low, with YOLOv11n achieving the highest value at 54.5%, followed by Ours at 46.5%, ranking second. This is notably superior to RT-DETR-r18 and the other YOLO versions (YOLOv8n, YOLOv10n, YOLOv12n), indicating that Ours exhibits strong false detection control for G2 targets, although it still slightly trails YOLOv11n. In contrast, for the benign category, baseline methods generally exhibit low Precision, with YOLOv11n experiencing a substantial decline to only 20.5%, reflecting its tendency to generate false positives for normal samples. In comparison, Ours achieves 45.7%, significantly outperforming all baseline algorithms and surpassing the best baseline, YOLOv8n, by approximately 11.3 percentage points. This highlights the notable advantage of Ours in suppressing false positives for the normal class. Overall, while YOLOv11n delivers the highest precision in the G2 category, it lacks stability across categories, whereas Ours provides more balanced performance across both categories, particularly demonstrating more reliable detection accuracy and robustness in the benign class.

K-fold cross-validation analysis
To further verify the robustness of the proposed framework and eliminate the potential bias introduced by a single fixed train-test split, we conducted a 5-fold cross-validation experiment on the entire dataset. The dataset was randomly partitioned into five equal subsets. In each fold, four subsets were used for training and one for validation. The results, as summarized in Table 5, demonstrate the stability of the model’s performance.
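The fold generation can be reproduced with scikit-learn; in the sketch below, train_model and evaluate_map are hypothetical stand-ins for the actual training and evaluation routines.

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(5131)  # one index per image in the full dataset
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_metrics = []
for fold, (train_idx, val_idx) in enumerate(kf.split(indices), start=1):
    # model = train_model(indices[train_idx])              # hypothetical
    # fold_metrics.append(evaluate_map(model, indices[val_idx]))
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val images")

# mean, std = np.mean(fold_metrics), np.std(fold_metrics)  # mAP statistics
```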

The proposed method achieved an average mAP@0.5 of 86.1% with a standard deviation of only 0.35%, and an average Precision of 66.4% ± 0.42%. These metrics are highly consistent with the results obtained from the independent test set (86.2% mAP). Furthermore, compared to the baseline YOLOv11n (average mAP 83.6% ± 0.58%), our method exhibits not only higher accuracy but also lower variance across folds. This low standard deviation confirms that the performance gains are statistically significant and robust to data variations, validating the model’s generalization capability despite the limitations of dataset size.

Ablation experiment
To verify the effectiveness of the proposed modules (including C3k2-DCNv2-Dynamic, C2CGA, and CSFCN), ablation experiments were conducted under the same experimental settings and dataset. Using YOLOv11n as the baseline model, we evaluated the impact of each module on detection accuracy, parameter count, and computational cost by progressively integrating them. The detailed results are presented in Table 6.

The ablation results presented in Table 6 illustrate in detail the individual impact of each proposed module. The YOLOv11n baseline (Experiment 1) achieves a Precision of 61.1% and mAP@0.5 of 83.8%. Replacing the backbone blocks with the C3k2-DCNv2-Dynamic module improves Precision to 64.6% and mAP@0.5 to 84.9%. This gain confirms that introducing dynamic deformable convolution enhances the model’s ability to adapt to the geometric variations of breast lesions. Incorporating the C2CGA module into the neck yields a significant boost, raising Precision to 69.5% and mAP@0.5 to 87.3%. This demonstrates that the channel-guided attention mechanism effectively filters background noise during feature fusion. To specifically validate the contribution of the calibration mechanism, Experiment 4 isolates the CSFCN module. By integrating only CSFCN into the baseline, the model achieves the highest individual gain, with Precision reaching 72.9% and mAP@0.5 peaking at 89.4%. This result strongly supports the hypothesis that calibrating features across spatial and semantic dimensions is the most critical factor for accurate detection. Finally, integrating all three modules achieves a balanced performance (mAP 86.2%), ensuring robust generalization across diverse lesion types while maintaining a lightweight architecture.

Result visualization
To further analyze the misclassification patterns and recognition biases of each model across different categories, normalized confusion matrices were generated for each algorithm based on the validation set, as shown in Fig. 7. By comparing the confusion distributions of the various methods, we can more clearly highlight differences in model performance, particularly with respect to easily confusable categories, background false positives, and the identification of malignant targets. This provides a more granular foundation for subsequent performance analysis.

The normalized confusion matrices clearly demonstrate consistent error patterns across the models in the multi-class recognition task, with the most significant issue being the confusion between G2 and G3. For lightweight models such as YOLOv8n, YOLOv10n, and YOLOv12n, approximately half of the true G2 samples are misclassified as G3, and nearly half of the true G3 samples are misclassified as G2. This suggests that these two classes are highly similar in feature space, making it challenging for the models to distinguish between them reliably. Although YOLOv11n shows improvements in other categories, its performance in recognizing G2 deteriorates significantly, indicating a bias in its feature representation towards G1/G3, which compromises its sensitivity to the intermediate G2 class. Furthermore, all models exhibit false positives for the background class. YOLOv8n and YOLOv11n are particularly prone to misclassifying the background as benign, while YOLOv10n and YOLOv12n tend to misclassify the background as malignant. These discrepancies reflect architectural differences in how the models respond to low-texture background areas. The benign class, with its smooth appearance and ambiguous boundaries, is more easily confused with the background, contributing to the high false positive rate from background → benign observed across multiple models. In contrast, the proposed method (Ours) demonstrates the most significant improvements in the critical categories. The proportion of true G2 samples misclassified as G3 decreases from the typical 0.50 to 0.20, significantly mitigating the primary source of confusion. Additionally, the correct recognition rate of G3 increases to 0.80, the highest among all methods, highlighting the model’s superior precision in high-level feature extraction and category boundary modeling. Although Ours performs similarly to YOLOv8n in the G1 class, it achieves a comprehensive advantage across three key dimensions, underscoring that the structural enhancements in Ours significantly improve the model’s discriminative and generalization capabilities in multi-class lesion recognition tasks.

The F1-Confidence curves presented in Fig. 8 clearly highlight significant variations in overall detection capabilities and stability across different methods as the confidence threshold changes. YOLOv8n achieves an overall F1 score of 0.67, performing reasonably well on G1 and malignant classes. However, its G2 curve is the lowest and declines the most rapidly, while G3 also shows significant degradation in the mid-to-high threshold range, indicating high sensitivity to challenging classes and moderate threshold stability. YOLOv10n shows a slight improvement in overall F1, reaching 0.69, but its optimal point occurs at a lower threshold, suggesting weaker prediction confidence and the need for a more relaxed threshold to achieve a better balance. Both YOLOv11n and YOLOv12n reach a best overall F1 of approximately 0.65, with higher curves for malignant and G1 classes. However, G2 deteriorates quickly at mid-to-high thresholds, exhibiting substantial fluctuation. This reflects instability in handling imbalanced and easily confusable classes. RT-DETR-r18 achieves the highest optimal threshold with the smoothest curve, indicating a preference for high-confidence predictions and fewer false positives. However, its overall F1 is relatively low, suggesting a tendency for false negatives. In contrast, Ours achieves the highest overall F1 of 0.72, outperforming all other methods. Its curves for each category are notably smoother, particularly for G3 and malignant, which maintain high F1 scores, while benign steadily improves as the threshold increases. Meanwhile, G2’s fluctuations are significantly more contained compared to the YOLO series, demonstrating that Ours has superior confidence calibration and a more robust precision-recall tradeoff. Overall, Ours offers superior performance and threshold robustness in multi-class detection tasks.
To further assess the fine-grained recognition capabilities of various detection models in the pathological grading task, we conducted a visual analysis using the same set of tissue slide images. The prediction results of several mainstream lightweight detection models (YOLOv8n, YOLOv10n, YOLOv11n, YOLOv12n), the Transformer-based RT-DETR-r18, and the method proposed in this study are presented side by side. Since G2 is used as the ground truth in this study, the visualization provides an intuitive comparison of the models’ discriminatory biases at the G2/G3 classification boundary, their confidence stability, and differences in localization accuracy. This comparison offers compelling evidence for the subsequent performance analysis. Figure 9 displays the predicted bounding boxes and classification results for each model on three representative pathological slides.

The visualization results in Fig. 9, based on the G2 recognition benchmark, reveal distinct performance differences across the six models. The comparison highlights both the mitigation of recognition bias and the specific improvements introduced by multi-scale feature calibration: (1) Mitigation of Over-Prediction Bias: As observed in the baseline models (Fig. 9a-d), the YOLO series and RT-DETR-r18 exhibit varying degrees of “over-prediction” bias, frequently misclassifying true G2 samples as the more severe G3. Specifically, YOLOv8n and YOLOv11n show the most pronounced bias with fluctuating confidence, reflecting an inadequate ability to recognize transitional tissue structures. (2) Spatial Alignment (Effect of SFC): Crucially, the visual results demonstrate the impact of the Spatial Feature Calibration (SFC) module. In the baseline predictions (e.g., Fig. 9c), bounding boxes often appear loose or spatially shifted, failing to tightly encapsulate the irregular lesion boundaries. In contrast, our method (Fig. 9f) produces bounding boxes that align precisely with the tumor edges. This validates that the SFC unit effectively corrects spatial misalignment caused by downsampling, ensuring accurate localization. (3) Semantic Discrimination (Effect of CFC): Regarding classification stability, while RT-DETR-r18 almost exclusively predicts G3 due to global context confusion, our method consistently identifies G2 targets with stable confidence (concentrated in the 0.57–0.65 range). This consistency attests to the Context Feature Calibration (CFC) module, which aggregates multi-scale context to filter out local ambiguities. Consequently, Ours excels in both classification accuracy and bounding box precision, demonstrating superior robustness in pathological grading tasks.

The comparison of the PR curves in Fig. 10 clearly demonstrates that the improved YOLOv11n outperforms the original model in both overall detection performance and key categories. The overall mAP@0.5 across all categories increases from 0.838 to 0.862, indicating an enhanced balance between precision and recall. By category, the performance of G1 improves slightly from 0.925 to 0.936, while malignant increases from 0.992 to 0.994, both maintaining a near-ideal top-right distribution. The most significant performance gain is observed in the challenging G2 class, where its AP rises substantially from 0.538 to 0.679. Additionally, the curve in the mid-to-high recall range shows a marked reduction in the rate of decline, suggesting that the improvement effectively reduces false detections and stabilizes recall for this class. Overall, the improved model’s PR curve is smoother and maintains high precision even at high recall levels, particularly addressing the detection bottleneck in the G2 class, which leads to a stable and significant enhancement in overall performance.

Robustness analysis under image quality degradation
In clinical practice, acquired images are often susceptible to quality degradation due to factors such as equipment limitations, patient movement, or variations in staining protocols. To rigorously evaluate the robustness of the proposed framework, we conducted stress tests on the test set by synthetically introducing three common types of degradations: (1) Gaussian Noise: Gaussian noise with a variance of 0.01 was added to simulate sensor noise inherent in medical imaging devices. (2) Motion Blur: A motion blur kernel (size 15 × 15) was applied to mimic artifacts caused by slight patient movements or breathing. (3) Brightness Variation: The exposure was randomly adjusted (± 20%) to simulate inconsistent lighting or staining intensity.
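The three degradations can be synthesized with OpenCV and NumPy as sketched below; the parameter values follow the text, while the horizontal direction of the motion-blur kernel is our assumption.

```python
import cv2
import numpy as np

def degrade(img: np.ndarray, mode: str, seed: int = 0) -> np.ndarray:
    """Apply one synthetic degradation from the stress test to a uint8 image."""
    rng = np.random.default_rng(seed)
    x = img.astype(np.float32) / 255.0
    if mode == "gaussian_noise":          # additive noise, variance 0.01
        x = x + rng.normal(0.0, np.sqrt(0.01), x.shape)
    elif mode == "motion_blur":           # 15 x 15 motion-blur kernel
        k = np.zeros((15, 15), np.float32)
        k[7, :] = 1.0 / 15.0              # horizontal streak (direction assumed)
        x = cv2.filter2D(x, -1, k)
    elif mode == "brightness":            # random +/- 20% exposure shift
        x = x * rng.uniform(0.8, 1.2)
    return (np.clip(x, 0.0, 1.0) * 255.0).astype(np.uint8)
```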
The comparative results indicate that the proposed method exhibits superior resilience compared to the YOLOv11n baseline. Specifically, under Gaussian noise interference, the baseline model’s detection capability for small lesions (G2) dropped significantly, likely due to high-frequency noise disrupting the texture features. In contrast, our method retained a stable recall rate, which can be attributed to the C2CGA module’s ability to filter out background noise through channel-guided attention. Furthermore, in the Motion Blur scenario, the C3k2-DCNv2-Dynamic module demonstrated its distinct advantage. By dynamically adjusting the sampling locations of the convolution kernels, the module effectively compensated for the geometric distortions caused by blur, maintaining clearer feature boundaries than standard convolutions. Overall, while image degradation inevitably impacts performance, the proposed framework shows a minimized performance drop, validating its reliability for deployment in non-ideal clinical environments.

Failure case analysis
While the proposed framework achieves state-of-the-art performance, a qualitative analysis of failure cases on the test set reveals specific scenarios where the model still struggles with localization or classification:
The most frequent classification errors occur at the boundary between G2 (moderately differentiated) and G3 (poorly differentiated) tumors. As discussed in Sect. 5.3, these categories share overlapping morphological features. In cases where the tumor exhibits transitional characteristics—such as localized tubular formation mixed with solid sheets—the model may focus on a specific sub-region, leading to a discrepancy with the slide-level ground truth. This suggests that the model’s ability to aggregate global context needs further refinement.
In scenarios involving dense fibrous stroma or overlapping tissue structures, the model occasionally exhibits “localization drift.” Although the lesion is correctly detected (True Positive), the predicted bounding box may not perfectly align with the manual annotation (IoU < 0.5). This is primarily because the blurred boundaries of infiltrative tumors make it difficult for the regression head to define exact edges, a challenge inherent to semantic segmentation in histopathology.
We also observed isolated false positives caused by staining artifacts, such as dark dye precipitates or tissue folds. These artifacts can mimic the high-contrast appearance of malignant nuclei (nuclear atypia), misleading the model into detecting them as small lesions. This indicates that while the C2CGA module effectively suppresses general background noise, the model remains sensitive to high-frequency anomalies that resemble target features.

Conclusion
This paper presents an enhanced breast cancer detection framework based on YOLOv11n, designed to address the challenges of small lesions, morphological variability, and blurry boundaries commonly encountered in traditional breast cancer detection methods. By integrating innovative modules such as C3k2-DCNv2-Dynamic, C2CGA, and CSFCN, the proposed method significantly optimizes feature extraction, fusion, and calibration processes, thereby improving detection accuracy and robustness in breast cancer imaging. Experimental results demonstrate the advantages of this method, particularly in detecting small lesions and complex pathological regions, leading to a marked improvement in performance. Compared to traditional YOLO series models, the proposed framework achieves substantial gains in key metrics such as accuracy, recall, and mAP, with the most notable improvement in detecting small lesions (G2 class), where average precision increased by 14.1 percentage points, underscoring the framework’s effectiveness.
Despite the promising results, several limitations regarding dataset characteristics and clinical generalizability must be explicitly acknowledged. First, regarding dataset diversity and size, this study utilized a single publicly available dataset. While this ensures reproducibility, it introduces a potential single-source bias. The model may not fully capture the extensive variability in staining protocols (e.g., H&E intensity differences), scanner specifications, and patient demographics encountered in multi-center clinical environments. Consequently, the model’s robustness against unseen artifacts from different hospitals remains to be verified. Second, regarding clinical generalizability, the current evaluation is retrospective and based on curated patches. The framework has not yet undergone external validation on independent cohorts or prospective clinical trials. Therefore, its performance in a real-world, end-to-end diagnostic workflow—where issues like tissue folding, blurring, and non-informative regions are prevalent—requires further rigorous investigation.
To bridge the gap between academic research and clinical application, our future work will focus on three concrete directions. First, to further enhance robustness, we will expand the dataset to include multi-center cohorts and explore unsupervised domain adaptation (UDA) techniques. This will enable the model to generalize across different scanner manufacturers and staining protocols without requiring extensive re-annotation. Second, regarding clinical workflow adaptation, we aim to develop a prototype plugin for whole-slide imaging (WSI) viewers. This system will operate in a “Human-in-the-Loop” mode, where the model automatically flags suspicious G2/G3 regions for prioritized review by pathologists, thereby optimizing diagnostic efficiency. Third, we will integrate Explainable AI (XAI) visualization tools directly into the diagnostic interface, allowing clinicians to verify the morphological basis of the model’s predictions and building trust in the auxiliary diagnosis system.
