Towards interpretable prediction of recurrence risk in breast cancer using pathology foundation models.
APA
Kaczmarzyk JR, Van Alsten SC, et al. (2026). Towards interpretable prediction of recurrence risk in breast cancer using pathology foundation models. npj Digital Medicine, 9(1), 149. https://doi.org/10.1038/s41746-025-02334-2
MLA
Kaczmarzyk, J. R., et al. "Towards Interpretable Prediction of Recurrence Risk in Breast Cancer Using Pathology Foundation Models." npj Digital Medicine, vol. 9, no. 1, 2026, p. 149.
PMID
41545684
Abstract
Transcriptomic assays such as the PAM50-based ROR-P score guide recurrence risk stratification in non-metastatic, ER-positive, HER2-negative breast cancer but are not universally accessible. Histopathology is routinely available and may offer a scalable alternative. We introduce MAKO, a benchmarking framework evaluating 12 pathology foundation models and two non-pathology baselines for predicting ROR-P scores from H&E-stained whole-slide images using attention-based multiple instance learning. Foundation models, large neural networks pre-trained on millions of pathology images and adaptable to diverse downstream tasks, were trained and validated on the Carolina Breast Cancer Study and externally tested on TCGA BRCA. Several foundation models outperformed baseline models across classification, regression, and survival tasks. CONCH achieved the highest ROC AUC, while H-optimus-0 and Virchow2 showed the top correlation with continuous ROR-P scores. All pathology models stratified CBCS participants by recurrence similarly to transcriptomic ROR-P. Using the HIPPO interpretability method, we found that tumor regions were necessary and sufficient for high-risk predictions, and we identified candidate tissue biomarkers of recurrence. These results highlight the promise of interpretable, histology-based risk models in precision oncology.
Introduction
Hormone receptor (HR)-positive, HER2-negative breast cancers account for over 70% of all breast cancer cases and carry a substantial risk of long-term recurrence1–3. A meta-analysis of 62,923 estrogen receptor (ER)-positive women reported 20-year recurrence risks ranging from 10% to 41%4, and a Danish cohort study with over 30 years of follow-up found recurrence rates of 13.5–34.3% among ER-positive patients5. While patients at high risk of recurrence benefit most from adjuvant chemotherapy, accurately identifying those at low risk is equally important to avoid unnecessary treatment and its associated side effects6–11.
To address this clinical need, transcriptomic assays like the PAM50-based risk of recurrence (ROR-P) score have been clinically validated for the prediction of recurrence risk in ER-positive, HER2-negative patients12–20. However, these assays are not universally available and may potentially delay decision-making due to turnaround times of several days to weeks21. Given the increasing digitization of histopathology, artificial intelligence (AI) applied to hematoxylin-and-eosin (H&E)-stained whole slide images (WSIs) offers a promising and scalable alternative for biomarker inference, particularly in settings where transcriptomic testing is inaccessible. Previous studies have demonstrated that AI models can infer HR status22–25, PAM50 molecular subtypes23,26, and even ROR-P23,27,28 directly from H&E-stained WSIs.
While encouraging, most prior work has relied on task-specific models or feature extractors pretrained on natural images, such as those from ImageNet23,27–29. These models may not optimally capture the morphological complexity of histopathology. In contrast, recent advances in general-purpose, pretrained pathology foundation models have demonstrated strong performance across diverse WSI-level tasks30–36. Foundation models are large neural networks pretrained on millions of pathology images, enabling them to learn broadly useful morphological representations that can be adapted to many downstream tasks. However, their utility for ROR-P prediction and ability to predict long-term recurrence in breast cancer have not been systematically evaluated. Benchmarking these pathology foundation models for risk prediction could accelerate progress in tissue-based prognostics37.
Moreover, despite the widespread use of attention-based multiple instance learning (ABMIL) for WSI-level prediction tasks38,39, the interpretability of these ABMIL models remains poorly understood40. For clinical deployment, it is essential not only to assess predictive performance but also to understand how models arrive at their predictions. Interpretability methods can help determine whether models rely on histologically meaningful features or spurious correlations, and may aid in identifying potential biases, failure modes, or novel biomarkers. While attention weights from ABMIL are widely used for interpretation, they are best considered as tools for generating hypotheses about which tissue regions may be informative. Attention weights highlight areas the model may be focusing on, but do not necessarily reveal its true decision-making process41. In practice, attention can be unreliable, especially in the presence of redundant or correlated features. To move from hypothesis generation to hypothesis testing, perturbation-based methods offer a more rigorous framework by systematically evaluating how altering specific tissue regions affects model outputs40. Despite their potential, no prior studies to our knowledge have applied such virtual experiments to assess the interpretability of foundation models for breast cancer recurrence risk stratification.
To address these gaps, we developed Mammary Analysis for Knowledge of Outcomes (MAKO), a comprehensive benchmarking framework for inferring recurrence risk from H&E-stained WSIs in early breast cancer (Fig. 1). MAKO evaluates 14 pretrained feature extractors using ABMIL, including 12 pathology-specific foundation models and two general-purpose vision encoders. Models were trained using data from the Carolina Breast Cancer Study (CBCS), a large, diverse cohort with long-term recurrence follow-up, and externally validated on The Cancer Genome Atlas Invasive Breast Carcinoma dataset (TCGA BRCA)42. We focused specifically on predicting the PAM50-based ROR-P score, comparing models trained for classification of risk categories (i.e., low/medium vs. high) with those trained to regress the continuous ROR-P score, the latter of which may better preserve underlying biological variation43. In addition to benchmarking predictive performance, we assessed model interpretability by examining the contribution of tumor epithelium versus surrounding tissue to the predictions. We conducted further virtual experiments in tissue to identify histologic features associated with high recurrence risk, advancing efforts toward biologically grounded and clinically actionable AI in computational pathology. Together, these efforts provide a comprehensive evaluation of pathology foundation models for breast cancer risk stratification and establish MAKO as a resource for benchmarking and discovery.
Results
Using attention-based multiple instance learning (ABMIL), we evaluated 12 foundation models pretrained on histopathology and compared their performance to two non-pathology baselines: ResNet50 pretrained on ImageNet and ViT-DINOv2, a vision transformer trained on 142 million natural images using self-supervised learning. This design enabled assessment of the added value of pathology-specific pretraining. Models were evaluated on three tasks: classification of ROR-P risk categories (low/medium vs. high), prediction of continuous ROR-P scores, and stratification by time to recurrence. Training and internal validation were performed on the Carolina Breast Cancer Study (CBCS) using 10-fold cross-validation, with external validation on TCGA BRCA (Table 1). Models were trained on all available CBCS WSIs with matched ROR-P and recurrence information (n = 1339 participants, 1384 WSIs). All reported performance metrics and interpretability analyses were restricted to ER-positive, HER2-negative tumors, the target population for clinical ROR-P testing. In CBCS, this included 883 WSIs from 847 participants, and in TCGA BRCA, it included 613 WSIs from 613 participants.
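The attention-based pooling at the heart of this setup can be sketched in a few lines. This is a minimal, untrained toy of the ABMIL aggregation (random weights, hypothetical feature dimensions), not the authors' implementation: patch embeddings from a frozen encoder are combined into a slide-level embedding via softmax attention.

```python
import numpy as np

def abmil_pool(patch_feats, V, w):
    """Attention-weighted pooling of patch embeddings (ABMIL-style).

    patch_feats: (n_patches, d) tile embeddings from a frozen encoder.
    V: (h, d) and w: (h,) are learned attention parameters (random here).
    Returns the slide embedding (d,) and per-patch attention weights.
    """
    scores = w @ np.tanh(V @ patch_feats.T)   # (n_patches,) attention logits
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                        # softmax over patches
    slide_embedding = attn @ patch_feats      # (d,) weighted mean
    return slide_embedding, attn

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))  # 100 toy patches, 16-dim features
V = rng.normal(size=(8, 16))
w = rng.normal(size=8)
emb, attn = abmil_pool(feats, V, w)
```

A linear classification or regression head on `emb` then yields the slide-level ROR-P prediction; the weights `attn` are the per-patch attention maps discussed later.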
In addition to benchmarking performance, we applied the HIPPO framework to identify histologic features used by the models. HIPPO systematically adds or removes tissue patches from WSIs to generate hypothetical slides. By comparing model predictions on these hypothetical slides to the originals, HIPPO quantifies the necessity (whether removing a region reduces performance) and sufficiency (whether a region alone drives predictions) of specific tissue regions. Using HIPPO, we evaluated whether tumor regions were necessary and sufficient for accurate ROR-P prediction, and we identified regions sufficient to predict high-risk scores. These analyses offer insight into the spatial patterns driving model predictions and their potential relevance as interpretable biomarkers.
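At their core, the HIPPO-style probes reduce to comparing predictions on counterfactual bags of patches. A minimal sketch, with a stand-in `predict` function and synthetic features (both hypothetical, for illustration only):

```python
import numpy as np

def hippo_effects(predict, feats, tumor_mask):
    """Necessity/sufficiency probes in the spirit of HIPPO.

    predict: slide-level model taking an (n, d) bag of patch features.
    tumor_mask: boolean array marking tumor patches.
    Returns (necessity_effect, sufficiency_gap): the drop in prediction
    when tumor is removed, and the change when only tumor is kept.
    """
    base = predict(feats)
    without_tumor = predict(feats[~tumor_mask])   # tumor removed
    tumor_only = predict(feats[tumor_mask])       # everything else removed
    return base - without_tumor, tumor_only - base

# toy model: mean of the first feature dimension as a "risk score"
predict = lambda x: float(x[:, 0].mean())
rng = np.random.default_rng(1)
feats = rng.normal(size=(50, 4))
feats[:10, 0] += 3.0                 # make the first 10 patches "tumor-like"
mask = np.zeros(50, bool)
mask[:10] = True
necessity, sufficiency = hippo_effects(predict, feats, mask)
```

A large positive necessity effect means removing the region lowers the prediction; a sufficiency gap near zero means the region alone reproduces the original prediction.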
Benchmarking ROR-P group classification
Pathology foundation models consistently achieved higher ROC AUCs than the ResNet50 baseline for classifying ROR-P risk groups in ER-positive, HER2-negative, stage I–III specimens of the CBCS dataset. The top-performing model, CONCH, achieved an AUC of 0.809, representing an 8.6% relative improvement (, DeLong’s test). Six additional encoders also had statistically significant improvements over ResNet50 after multiple testing correction, including UNI, Phikon, CTransPath, Phikon-v2, Virchow, and DiRLv2 (, DeLong’s test). CONCH was the only model to reach a significantly higher ROC AUC than ViT-DINOv2 (, DeLong’s test) (Fig. 2a, b, Supplementary Table 1).
In the TCGA BRCA cohort of patients with non-metastatic ER-positive, HER2-negative tumors, four categorical models demonstrated statistically significant improvements in ROC AUC over the ResNet50 baseline. The CONCH model achieved the highest AUC at 0.850, corresponding to a 10.4% relative improvement (, DeLong’s test). H-optimus-0 and Virchow2 achieved ROC AUCs of 0.840, which were 9% higher than that of ResNet50 (, DeLong’s test). CTransPath also achieved a significantly higher ROC AUC of 0.829 (, DeLong’s test). Three additional models (i.e., UNI, Phikon, and Prov-GigaPath) showed numerically higher ROC AUCs compared to ResNet50, but these did not reach statistical significance after adjustment for multiple comparisons (Fig. 2c, d, Supplementary Table 1).
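The repeated "after multiple testing correction" refers to FDR adjustment of the per-model p-values. A self-contained Benjamini-Hochberg sketch (the specific BH procedure is an assumption; the paper states only that FDR correction was applied):

```python
import numpy as np

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values for a family of comparisons,
    e.g. the per-encoder DeLong tests against the ResNet50 baseline."""
    p = np.asarray(pvals, float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)   # p_(i) * n / i
    # enforce monotonicity from the largest p-value downward
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adj, 0.0, 1.0)
    return out

adjusted = bh_fdr([0.01, 0.02, 0.03, 0.5])
```

An adjusted p-value below the chosen threshold (here, presumably 0.05) is what "statistically significant after correction" means in the text.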
Benchmarking ROR-P score regression
In patients with non-metastatic, ER-positive, HER2-negative breast cancer in the CBCS cohort, eleven of twelve pathology foundation models significantly outperformed the ResNet50 baseline in predicting continuous ROR-P scores, as measured by Pearson correlation with the true ROR-P. The ResNet50 model achieved a baseline correlation of 0.541, and the H-optimus-0 encoder achieved the strongest performance, with a correlation of 0.638 (Fig. 3a, b, Supplementary Table 2).
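Pearson correlation between predicted and assay ROR-P is the regression metric here. A toy evaluation with synthetic scores (the score distribution and noise level are illustrative only):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
true_rorp = rng.normal(50, 15, size=200)         # synthetic continuous ROR-P
pred = true_rorp + rng.normal(0, 12, size=200)   # noisy model predictions
r, p = pearsonr(true_rorp, pred)                 # metric reported per encoder
```

Reported values such as 0.541 (ResNet50) and 0.638 (H-optimus-0) are this `r`, computed against the transcriptomic assay scores.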
Despite being the highest performer in CBCS, H-optimus-0 did not generalize as well in the TCGA cohort of patients with ER-positive, HER2-negative breast cancer, achieving a lower correlation with true ROR-P scores compared to ResNet50. None of the models evaluated in TCGA demonstrated statistically significant improvements in Pearson correlation relative to ResNet50 after multiple testing correction. Virchow2 and CTransPath showed the greatest numerical improvements, but their adjusted p-values narrowly exceeded the significance threshold. Several encoders that performed well in CBCS, including CONCH, Prov-GigaPath, and UNI, exhibited modest gains in correlation, while others, such as Phikon, Kaiko-L/14, and H-optimus-0 showed decreased performance (Fig. 3c, d, Supplementary Table 2). These results suggest limited cross-cohort generalizability of continuous ROR-P models trained on digital histopathology.
Benchmarking prediction of recurrence events
We evaluated histology-based models for their ability to predict actual recurrence events using the concordance index (C-index, ) as the primary performance metric. Among 847 participants with ER-positive, HER2-negative breast cancer in the CBCS cohort, 107 experienced a recurrence event within 10 years of study enrollment. We first evaluated models trained to classify binarized ROR-P risk groups by comparing their prognostic performance to that of the transcriptomic ROR-P assay. All ABMIL-based univariate Cox models demonstrated statistically significant stratification of patients based on recurrence (all , log-rank test, FDR corrected), except for the H-optimus-0 model (, log-rank test, FDR corrected). In addition, the C indices from ABMIL model predictions were compared to the C index from the transcriptomic assay using the statistical method proposed by Kang et al.44. None of the ABMIL models achieved C indices that were inferior to the transcriptomic assay (all , Kang et al. test, FDR corrected), indicating comparable performance in risk stratification within this cohort (Fig. 4a, d). None of the ABMIL models demonstrated statistically significant differences in C-index compared to the ResNet50 baseline (all , Kang et al. test, FDR corrected).
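The concordance index used throughout this section can be computed directly. A minimal pure-Python sketch of Harrell's C-index (O(n²) pairwise, which is fine at these cohort sizes; ties in time are ignored for brevity):

```python
def c_index(time, event, score):
    """Harrell's concordance index: the fraction of usable pairs in which
    the higher risk score belongs to the participant who recurs earlier.

    time: follow-up times; event: 1 if recurrence observed, else censored;
    score: model-predicted risk (higher = riskier).
    """
    concordant = comparable = 0.0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue                      # i must have an observed event
        for j in range(n):
            if time[j] > time[i]:         # j still recurrence-free at time i
                comparable += 1
                if score[i] > score[j]:
                    concordant += 1
                elif score[i] == score[j]:
                    concordant += 0.5     # tied scores count half
    return concordant / comparable
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking; the Kang et al. test compares two such indices computed on the same cohort.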
Next, we benchmarked models trained to predict continuous ROR-P scores using the C-index. The continuous prediction scores from all ABMIL models were significantly associated with recurrence-free survival (all , Wald test, FDR corrected), indicating that higher scores corresponded to increased risk of recurrence. Similar to the categorical models, no continuous ABMIL-based Cox model differed significantly from the transcriptomic ROR-P assay (all , Kang et al. test, FDR corrected) or from the ResNet50 baseline model (all , Kang et al. test, FDR corrected) (Fig. 4b).
We observed that C indices were generally higher for models trained to predict continuous ROR-P scores than for those trained to classify binarized ROR-P risk groups (Fig. 4a, b). To determine whether this performance gap reflected genuine model differences or was instead due to differences in resolution, we applied the same threshold used by the transcriptomic ROR-P assay to the predicted continuous scores. When evaluated in a directly comparable classification framework, the categorical models consistently achieved higher C indices than their thresholded continuous counterparts (Fig. 4c), but after correction for multiple comparisons, only CONCH achieved a significantly higher C index in the categorical setting (, Kang et al. test, FDR corrected). These findings suggest that while continuous models benefit from finer granularity in survival modeling, categorical models trained explicitly to align with clinical thresholds may better capture risk group distinctions relevant to treatment decision-making.
Due to limited clinical follow-up data in TCGA45, this cohort was used for secondary analysis, while CBCS remained the primary dataset for evaluating recurrence prediction. In the ER-positive, HER2-negative, non-metastatic cohort of TCGA BRCA (), 47 participants experienced a recurrence event within 10 years of study enrollment. In this subset of TCGA BRCA, the transcriptomic ROR-P assay did not significantly stratify participants by recurrence when binarized (, , log-rank test) or used as a continuous score (, , Wald test). Likewise, none of the ABMIL models achieved significant stratification in TCGA BRCA (all , log-rank test, FDR corrected).
Attention is insufficient for interpretation
With ABMIL models, each patch is assigned a weight reflecting its contribution to the model’s prediction38. To better understand how these models inferred recurrence risk, we qualitatively analyzed high-attention patches across specimens stratified by predicted ROR-P groups. In high ROR-P predictions, the most attended patches frequently captured nuclear pleomorphism, disordered architecture, and tumor–stroma interfaces (Supplementary Fig. 2a), whereas in low/medium ROR-P predictions, high-attention patches were often localized to stromal regions (Supplementary Fig. 2b). In general, attention in high-risk cases was concentrated within tumor epithelial regions, suggesting that the model prioritizes tumor morphology in its high-risk assessments (Supplementary Fig. 2c). However, we also identified specimens in which attention was diffuse and without clear focus (Supplementary Fig. 2d). These ambiguous attention maps limited interpretability and raised questions about which tissue regions were actually driving model predictions.
Tumor regions are necessary and sufficient for high ROR-P predictions
To address these limitations and rigorously quantify the contribution of tumor tissue, we used HIPPO, a perturbation-based explainability framework specifically designed to assess how localized tissue regions influence predictions in weakly supervised models, like those developed in the present study40. While attention maps offer indirect insights, they are not guaranteed to reflect causal model behavior41. Perturbation-based approaches, by contrast, enable systematic testing of tissue importance through controlled input modification. Motivated by the biology of the transcriptomic ROR-P assay (which was developed using genes that were intrinsic to tumors found in repeated sampling of tumor tissue12,19), we hypothesized that tumor would be both necessary and sufficient for high ROR-P predictions by our models. Using HIPPO, we generated synthetic slides by removing tumor patches and observed the resulting impact on ROR-P predictions. Across all models, removal of tumor regions led to a significant reduction in high-risk scores (, two-sided paired t-test), with effect sizes ranging from to (Cohen’s ) (Fig. 5a), supporting the hypothesis that tumor regions are necessary for high ROR-P predictions.
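The paired comparison of original versus tumor-ablated predictions can be sketched as follows, using synthetic probabilities. The paired-samples formulation of Cohen's d used here is an assumption about the exact variant reported:

```python
import numpy as np
from scipy.stats import ttest_rel

def paired_effect(orig_preds, perturbed_preds):
    """Necessity-style test: are predictions significantly lower after
    tumor removal, and how large is the paired effect (Cohen's d)?"""
    orig = np.asarray(orig_preds)
    pert = np.asarray(perturbed_preds)
    t, p = ttest_rel(orig, pert)          # two-sided paired t-test
    diff = orig - pert
    d = diff.mean() / diff.std(ddof=1)    # paired-samples Cohen's d
    return t, p, d

rng = np.random.default_rng(2)
orig = rng.uniform(0.6, 0.9, size=40)            # toy high-risk probabilities
pert = orig - rng.uniform(0.2, 0.4, size=40)     # after removing tumor patches
t, p, d = paired_effect(orig, pert)
```

A significant p-value with a large positive d, across specimens, is the quantitative form of "tumor regions are necessary for high-risk predictions."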
To evaluate whether tumor tissue alone was sufficient for high ROR-P predictions, we applied HIPPO to generate synthetic slides containing only tumor regions, with all non-tumor tissue patches removed (Fig. 5b). For seven models (i.e., CTransPath, DiRLv2, H-optimus-0, Kaiko-L/14, Prov-GigaPath, UNI, Virchow2), this manipulation did not result in a statistically significant change in the predicted probability of high ROR-P (, two-sided paired t-test), indicating that tumor regions alone were sufficient to reproduce the original high-risk predictions. Among the remaining models, all except Hibou-L exhibited significant increases in predicted probability following removal of non-tumor regions. However, effect sizes for most pathology foundation models were modest (Cohen’s ), further supporting the sufficiency of tumor morphology in driving high ROR-P predictions. The largest increase in predicted probability was observed for ResNet50 (Cohen’s ) and ViT-DINOv2 (Cohen’s ), suggesting that these non-pathology models were most affected by non-tumor tissue. Together, these perturbation-based analyses (Fig. 5a, b) demonstrate that tumor regions are both necessary and, in most models, sufficient to drive high ROR-P predictions. These findings strengthen confidence that high-risk predictions of pathology foundation models reflect biologically relevant tumor morphology, rather than spurious associations with non-tumor components.
Identifying candidate tissue biomarkers of high recurrence risk
The combination of ABMIL models and explainable AI provides an opportunity not only to interpret model predictions but also to discover morphology associated with recurrence risk. We developed a data-driven strategy to identify specific tissue regions that are sufficient to drive high-risk predictions. Informed by Kaelin’s (2017) emphasis on biomarker sufficiency as a critical criterion in cancer research46, we leveraged HIPPO to automatically search for minimal regions of tissue that could convert a low- or medium-risk prediction into a high-risk one. This approach treats the trained model as a proxy observer, systematically identifying tissue regions that, when introduced into a low-risk specimen, consistently elevate predicted risk, revealing patterns that may serve as candidate, interpretable biomarkers of recurrence risk.
To operationalize this idea, we used HIPPO to perform a model-guided search for tissue patches sufficient to increase ROR-P predictions. Patches from high-risk WSIs were systematically inserted into low-risk WSIs to identify the smallest region capable of flipping the model’s prediction (Fig. 5c). This process yielded a set of 120 high-risk patches (~2.0 mm2) that consistently elevated ROR-P scores across models and specimens. The identified regions exhibited features associated with aggressive tumor biology, including nuclear pleomorphism, mitotic activity, necrosis, and invasive growth, and there was a lack of tumor-infiltrating lymphocytes (Fig. 5d, Supplementary Fig. 3). To assess generalizability, these high-risk patches were introduced into additional low- and medium-risk slides. In all cases, the manipulated slides showed consistent increases in predicted ROR-P (Fig. 5e, Supplementary Fig. 4), suggesting that the identified regions are robust, transferable, and sufficient to drive high-risk predictions. These results demonstrate the potential of ABMIL models, in combination with explainable AI, to propose candidate histologic biomarkers of recurrence risk.
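The model-guided search for minimally sufficient patches can be sketched as a greedy insertion loop. The single-patch ranking heuristic, toy model, and threshold below are illustrative assumptions, not the exact HIPPO search procedure:

```python
import numpy as np

def minimal_sufficient_patches(predict, low_bag, candidates, threshold=0.5):
    """Greedy HIPPO-style search: insert candidate high-risk patches into a
    low-risk slide until the model's score crosses the high-risk threshold.
    Returns the indices of inserted patches and the final prediction."""
    bag = low_bag.copy()
    inserted = []
    # rank candidates by how much each one alone raises the prediction
    gains = [predict(np.vstack([low_bag, c[None]])) for c in candidates]
    for idx in np.argsort(gains)[::-1]:
        bag = np.vstack([bag, candidates[idx][None]])
        inserted.append(idx)
        if predict(bag) >= threshold:
            break                          # minimal set found
    return inserted, predict(bag)

# toy model: mean of the first feature dimension, squashed to (0, 1)
predict = lambda x: float(1 / (1 + np.exp(-x[:, 0].mean())))
rng = np.random.default_rng(3)
low_bag = rng.normal(-1.0, 0.1, size=(30, 4))   # "low-risk" slide features
cands = rng.normal(4.0, 0.5, size=(10, 4))      # patches from high-risk slides
patches, final = minimal_sufficient_patches(predict, low_bag, cands)
```

The returned patch set plays the role of the ~2.0 mm² high-risk regions in the text: the smallest insertion that flips a low-risk specimen to a high-risk prediction.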
Using attention-based multiple instance learning (ABMIL), we evaluated 12 foundation models pretrained on histopathology and compared their performance to two non-pathology baselines: ResNet50 pretrained on ImageNet and ViT-DINOv2, a vision transformer trained on 142 million natural images using self-supervised learning. This design enabled assessment of the added value of pathology-specific pretraining. Models were evaluated on three tasks: classification of ROR-P risk categories (low/medium vs. high), prediction of continuous ROR-P scores, and stratification by time to recurrence. Training and internal validation were performed on the Carolina Breast Cancer Study (CBCS) using 10-fold cross-validation, with external validation on TCGA BRCA (Table 1). Models were trained on all available CBCS WSIs with matched ROR-P and recurrence information (n = 1339 participants, 1384 WSIs). All reported performance metrics and interpretability analyses were restricted to ER-positive, HER2-negative tumors, the target population for clinical ROR-P testing. In CBCS, this included 883 WSIs from 847 participants, and in TCGA BRCA, it included 613 WSIs from 613 participants.
In addition to benchmarking performance, we applied the HIPPO framework to identify histologic features used by the models. HIPPO systematically adds or removes tissue patches from WSIs to generate hypothetical slides. By comparing model predictions on these hypothetical slides to the originals, HIPPO quantifies the necessity (whether removing a region reduces performance) and sufficiency (whether a region alone drives predictions) of specific tissue regions. Using HIPPO, we evaluated whether tumor regions were necessary and sufficient for accurate ROR-P prediction, and we identified regions sufficient to predict high-risk scores. These analyses offer insight into the spatial patterns driving model predictions and their potential relevance as interpretable biomarkers.
Benchmarking ROR-P group classification
Pathology foundation models consistently achieved higher ROC AUCs than the ResNet50 baseline for classifying ROR-P risk groups in ER-positive, HER2-negative, stage I–III specimens of the CBCS dataset. The top-performing model, CONCH, achieved an AUC of 0.809, representing an 8.6% relative improvement (, DeLong’s test). Six additional encoders also had statistically significant improvements over ResNet50 after multiple testing correction, including UNI, Phikon, CTransPath, Phikon-v2, Virchow, and DiRLv2 (, DeLong’s test). CONCH was the only model to reach a significantly higher ROC AUC than ViT-DINOv2 (, DeLong’s test) (Fig. 2a, b, Supplementary Table 1).
In the TCGA BRCA cohort of patients with non-metastatic ER-positive, HER2-negative tumors, four categorical models demonstrated statistically significant improvements in ROC AUC over the ResNet50 baseline. The CONCH model achieved the highest AUC at 0.850, corresponding to a 10.4% relative improvement (, DeLong’s test). H-optimus-0 and Virchow2 achieved ROC AUCs of 0.840, which were 9% higher than that of ResNet50 (, DeLong’s test). CTransPath also achieved a significantly higher ROC AUC of 0.829 (, DeLong’s test). Three additional models (i.e., UNI, Phikon, and Prov-GigaPath) showed numerically higher ROC AUCs compared to ResNet50, but these did not reach statistical significance after adjustment for multiple comparisons (Fig. 2c, d, Supplementary Table 1).
Benchmarking ROR-P score regression
In patients with non-metastatic, ER-positive, HER2-negative breast cancer in the CBCS cohort, eleven of twelve pathology foundation models significantly outperformed the ResNet50 baseline in predicting continuous ROR-P scores, as measured by Pearson correlation with the true ROR-P. The ResNet50 model achieved a baseline correlation of 0.541, and the H-optimus-0 encoder achieved the strongest performance, with a correlation of 0.638 (Fig. 3a, b, Supplementary Table 2).
Despite being the highest performer in CBCS, H-optimus-0 did not generalize as well in the TCGA cohort of patients with ER-positive, HER2-negative breast cancer, achieving a lower correlation with true ROR-P scores compared to ResNet50. None of the models evaluated in TCGA demonstrated statistically significant improvements in Pearson correlation relative to ResNet50 after multiple testing correction. Virchow2 and CTransPath showed the greatest numerical improvements, but their adjusted p-values narrowly exceeded the significance threshold. Several encoders that performed well in CBCS, including CONCH, Prov-GigaPath, and UNI, exhibited modest gains in correlation, while others, such as Phikon, Kaiko-L/14, and H-optimus-0 showed decreased performance (Fig. 3c, d, Supplementary Table 2). These results suggest limited cross-cohort generalizability of continuous ROR-P models trained on digital histopathology.
Benchmarking prediction of recurrence events
We evaluated histology-based models for their ability to predict actual recurrence events using the concordance index (C-index, ) as the primary performance metric. Among 847 participants with ER-positive, HER2-negative breast cancer participants in the CBCS cohort, 107 experienced a recurrence event within 10 years of study enrollment. We first evaluated models trained to classify binarized ROR-P risk groups by comparing their prognostic performance to that of the transcriptomic ROR-P assay. All ABMIL-based univariate Cox models demonstrated statistically significant stratification of patients based on recurrence (all , log-rank test, FDR corrected), except for the H-optimus-0 model (, log-rank test, FDR corrected). In addition, the C indices from ABMIL model predictions were compared to the C index from the transcriptomic assay using the statistical method proposed by Kang et al. 44. None of the ABMIL models achieved C indices that were inferior to the transcriptomic assay (all , Kang et al. test, FDR corrected), indicating comparable performance in risk stratification within this cohort (Fig. 4a,d). None of the ABMIL models demonstrated statistically significant differences in C-index compared to the ResNet50 baseline (all , Kang et al. test, FDR corrected).
Next, we benchmarked models trained to predict continuous ROR-P scores using the C-index. The continuous prediction scores from all ABMIL models were significantly associated with recurrence-free survival (Wald test, FDR corrected), indicating that higher scores corresponded to increased risk of recurrence. As with the categorical models, no continuous ABMIL-based Cox model differed significantly from the transcriptomic ROR-P assay or from the ResNet50 baseline model (Kang et al. test, FDR corrected) (Fig. 4b).
We observed that C-indices were generally higher for models trained to predict continuous ROR-P scores than for those trained to classify binarized ROR-P risk groups (Fig. 4a, b). To determine whether this performance gap reflected genuine model differences or was instead due to differences in resolution, we applied the same threshold used by the transcriptomic ROR-P assay to the predicted continuous scores. When evaluated in a directly comparable classification framework, the categorical models consistently achieved higher C-indices than their thresholded continuous counterparts (Fig. 4c), but after correction for multiple comparisons, only CONCH achieved a significantly higher C-index in the categorical setting (Kang et al. test, FDR corrected). These findings suggest that while continuous models benefit from finer granularity in survival modeling, categorical models trained explicitly to align with clinical thresholds may better capture risk-group distinctions relevant to treatment decision-making.
Due to limited clinical follow-up data in TCGA45, this cohort was used for secondary analysis, while CBCS remained the primary dataset for evaluating recurrence prediction. In the ER-positive, HER2-negative, non-metastatic cohort of TCGA BRCA, 47 participants experienced a recurrence event within 10 years of study enrollment. In this subset of TCGA BRCA, the transcriptomic ROR-P assay did not significantly stratify participants by recurrence when binarized (log-rank test) or used as a continuous score (Wald test). Likewise, none of the ABMIL models achieved significant stratification in TCGA BRCA (log-rank test, FDR corrected).
Attention is insufficient for interpretation
With ABMIL models, each patch is assigned a weight reflecting its contribution to the model’s prediction38. To better understand how these models inferred recurrence risk, we qualitatively analyzed high-attention patches across specimens stratified by predicted ROR-P groups. In high ROR-P predictions, the most attended patches frequently captured nuclear pleomorphism, disordered architecture, and tumor–stroma interfaces (Supplementary Fig. 2a), whereas in low/medium ROR-P predictions, high-attention patches were often localized to stromal regions (Supplementary Fig. 2b). In general, attention in high-risk cases was concentrated within tumor epithelial regions, suggesting that the model prioritizes tumor morphology in its high-risk assessments (Supplementary Fig. 2c). However, we also identified specimens in which attention was diffuse and without clear focus (Supplementary Fig. 2d). These ambiguous attention maps limited interpretability and raised questions about which tissue regions were actually driving model predictions.
Tumor regions are necessary and sufficient for high ROR-P predictions
To address these limitations and rigorously quantify the contribution of tumor tissue, we used HIPPO, a perturbation-based explainability framework specifically designed to assess how localized tissue regions influence predictions in weakly supervised models, like those developed in the present study40. While attention maps offer indirect insights, they are not guaranteed to reflect causal model behavior41. Perturbation-based approaches, by contrast, enable systematic testing of tissue importance through controlled input modification. Motivated by the biology of the transcriptomic ROR-P assay (which was developed using genes intrinsic to tumors identified through repeated sampling of tumor tissue12,19), we hypothesized that tumor tissue would be both necessary and sufficient for high ROR-P predictions by our models. Using HIPPO, we generated synthetic slides by removing tumor patches and observed the resulting impact on ROR-P predictions. Across all models, removal of tumor regions led to a significant reduction in high-risk scores (two-sided paired t-test), with effect sizes (Cohen’s d) varying across models (Fig. 5a), supporting the hypothesis that tumor regions are necessary for high ROR-P predictions.
To evaluate whether tumor tissue alone was sufficient for high ROR-P predictions, we applied HIPPO to generate synthetic slides containing only tumor regions, with all non-tumor tissue patches removed (Fig. 5b). For seven models (CTransPath, DiRLv2, H-optimus-0, Kaiko-L/14, Prov-GigaPath, UNI, and Virchow2), this manipulation did not result in a statistically significant change in the predicted probability of high ROR-P (two-sided paired t-test), indicating that tumor regions alone were sufficient to reproduce the original high-risk predictions. Among the remaining models, all except Hibou-L exhibited significant increases in predicted probability following removal of non-tumor regions. However, effect sizes for most pathology foundation models were modest as measured by Cohen’s d, further supporting the sufficiency of tumor morphology in driving high ROR-P predictions. The largest increases in predicted probability were observed for ResNet50 and ViT-DINOv2, suggesting that these non-pathology models were most affected by non-tumor tissue. Together, these perturbation-based analyses (Fig. 5a, b) demonstrate that tumor regions are both necessary and, in most models, sufficient to drive high ROR-P predictions. These findings strengthen confidence that high-risk predictions of pathology foundation models reflect biologically relevant tumor morphology rather than spurious associations with non-tumor components.
Identifying candidate tissue biomarkers of high recurrence risk
The combination of ABMIL models and explainable AI provides an opportunity not only to interpret model predictions but also to discover morphology associated with recurrence risk. We developed a data-driven strategy to identify specific tissue regions that are sufficient to drive high-risk predictions. Informed by Kaelin’s (2017) emphasis on biomarker sufficiency as a critical criterion in cancer research46, we leveraged HIPPO to automatically search for minimal regions of tissue that could convert a low- or medium-risk prediction into a high-risk one. This approach treats the trained model as a proxy observer, systematically identifying tissue regions that, when introduced into a low-risk specimen, consistently elevate predicted risk, revealing patterns that may serve as candidate, interpretable biomarkers of recurrence risk.
To operationalize this idea, we used HIPPO to perform a model-guided search for tissue patches sufficient to increase ROR-P predictions. Patches from high-risk WSIs were systematically inserted into low-risk WSIs to identify the smallest region capable of flipping the model’s prediction (Fig. 5c). This process yielded a set of 120 high-risk patches (~2.0 mm²) that consistently elevated ROR-P scores across models and specimens. The identified regions exhibited features associated with aggressive tumor biology, including nuclear pleomorphism, mitotic activity, necrosis, and invasive growth, along with a paucity of tumor-infiltrating lymphocytes (Fig. 5d, Supplementary Fig. 3). To assess generalizability, these high-risk patches were introduced into additional low- and medium-risk slides. In all cases, the manipulated slides showed consistent increases in predicted ROR-P (Fig. 5e, Supplementary Fig. 4), suggesting that the identified regions are robust, transferable, and sufficient to drive high-risk predictions. These results demonstrate the potential of ABMIL models, in combination with explainable AI, to propose candidate histologic biomarkers of recurrence risk.
Discussion
In the present study, we demonstrate that computational pathology can accurately infer recurrence risk in breast cancer and systematically benchmark 12 pathology foundation models and two non-pathology baseline models using a unified framework, MAKO. Our results show that several foundation models, particularly CONCH, achieve robust performance in predicting both categorical and continuous PAM50-based ROR-P scores from H&E-stained WSIs. When evaluated for stratifying patients by recurrence events, these histology-based models performed comparably to transcriptomic assays. Beyond prediction, we show that pathology foundation models can be interrogated using perturbation-based methods to identify minimal tissue regions sufficient to drive high-risk predictions, suggesting their utility not only for histology-based risk stratification but also for biomarker discovery. Together, these findings highlight the promise of pathology foundation models as scalable, interpretable tools for the prediction of recurrence risk in breast cancer.
Our study builds upon a growing body of work demonstrating that deep learning models can infer molecular phenotypes and recurrence risk from histology. Couture et al., for example, showed that an ImageNet-pretrained convolutional neural network could be used to predict PAM50 molecular subtype, hormone receptor status, and ROR-PT (a score related to ROR-P that includes tumor size) from breast cancer tissue microarrays (TMAs)23. We extend this work by using resections rather than TMAs, evaluating a diverse set of pathology-specific foundation models rather than a single ImageNet-based encoder, and introducing perturbation-based interpretation to assess the role of tumor regions in model predictions. More recently, Boehm et al. applied deep learning using the CTransPath pathology encoder to infer OncotypeDX® scores from histology specimens of early hormone receptor-positive breast cancer47. Our work builds on this by focusing specifically on the PAM50-based ROR-P score, benchmarking a broad set of pretrained pathology foundation models (including CTransPath), and incorporating survival analyses to directly assess model performance in predicting recurrence events. In addition, we demonstrate how perturbation-based methods can be used to identify necessary and sufficient regions for high-risk predictions and to uncover candidate histologic biomarkers of recurrence. Finally, El Nahhas et al. showed the value of regression-based deep learning for predicting continuous molecular biomarkers, such as homologous recombination deficiency43. We build on this by showing that regression-based ABMIL models can effectively predict continuous ROR-P scores and that perturbation experiments can be applied in this setting to quantify how specific tissue regions influence predicted continuous ROR-P scores. 
However, we also found that continuous predicted scores, when thresholded using the same cutoffs as the transcriptomic ROR-P score, resulted in lower concordance indices than predictions from models trained to directly predict risk categories. This suggests that the thresholds would have to be optimized for this image-based approach. Together, our study advances prior work by providing a standardized benchmarking framework, evaluating both classification and regression models, and introducing a rigorous, model-driven approach for biomarker discovery and interpretation.
This study has several notable strengths. We benchmarked 14 models using a large and diverse training cohort (CBCS) with long-term clinical follow-up and validated them on an independent external dataset (TCGA BRCA). By comparing pathology foundation models to general-purpose vision encoders, we highlight the added value of domain-specific pretraining for histopathology. We also demonstrate how ABMIL can be paired with perturbation-based interpretability to identify sufficient tissue regions for high-risk predictions, providing interpretable evidence of biologically grounded model behavior. Among the evaluated models, CONCH emerged as a consistent top performer, achieving the highest classification performance across both datasets and ranking among the best models in regression tasks, suggesting that it may offer a particularly robust feature representation for recurrence risk modeling.
The primary limitation of the present study is that models were trained and evaluated on a cohort in which participants received heterogeneous treatments. As a result, we cannot determine whether the observed stratification of recurrence risk reflects prognostic information alone or is partially confounded by treatment effects. Unlike randomized clinical trials or studies with uniform treatment protocols, CBCS does not allow us to disentangle these factors. Therefore, while the predictions of our models align well with transcriptomic ROR-P (as measured by ROC AUC, Pearson r, and C-index), we cannot conclude that they would match the assay’s performance in guiding treatment decisions or predicting outcomes in homogeneously treated populations. This limitation highlights the importance of evaluating histology-based models in randomized clinical trials. Additionally, while external validation on TCGA BRCA supports generalizability, performance was variable, potentially due to differences in cohort characteristics and slide preparation. An additional limitation is that transcriptomic ROR-P itself is subject to uncertainty due to tumor heterogeneity, technical variation, and the derivation of categorical cutoffs. As our models were trained to predict ROR-P, their performance is inherently bound by the reproducibility of the assay. Finally, although HIPPO enables localized perturbation to test model reasoning, it reflects causality as inferred by the model rather than true biological mechanisms.
Our work opens several avenues for future research. Prospective studies in clinical settings are essential to assess real-world performance and determine the clinical utility of histology-based recurrence prediction. Integrating these models with additional clinical, genomic, or spatial transcriptomic data may further improve accuracy and interpretability. The perturbation-based approach presented here could also be extended to other prediction tasks or used to guide targeted biomarker discovery. As foundation models continue to evolve, benchmarking efforts like MAKO will be critical for identifying performant, generalizable models and for ensuring that model outputs are interpretable, biologically meaningful, and clinically actionable.
Methods
Study population
This study involved data from The Carolina Breast Cancer Study (CBCS), which has been described previously48,49. Briefly, CBCS is a multidisciplinary study of invasive breast cancer that enrolled a total of 2998 female participants ages 20–74 years from 44 counties in North Carolina. CBCS oversampled self-identified Black/African American women and younger women (age <50 years). Cases were identified by rapid case ascertainment via the UNC Rapid Case Ascertainment Core in conjunction with the North Carolina Central Cancer Registry (diagnosis years 2008–2013).
For model development, we used all CBCS participants with WSIs, matched ROR-P scores, and recurrence information (n = 1339 participants; 1384 WSIs). The samples with WSIs had distributions of age, race, stage, size, and node status similar to those of the overall CBCS3 study population. Details regarding formalin-fixed paraffin-embedded (FFPE) and immunohistochemistry (IHC) preparation have been described previously50. Models were trained on this full set to maximize sample size and morphological diversity. However, all reported performance metrics and interpretability analyses were restricted to the clinically relevant subset of ER-positive, HER2-negative, stage I–III tumors, the target population for clinical ROR-P testing. In CBCS, this subset comprised 847 participants (883 WSIs) (Table 1).
For external validation, we used the invasive breast cancer cohort of The Cancer Genome Atlas (TCGA BRCA)42. Of 1050 participants with available WSIs and gene expression data, we restricted evaluation to those with ER-positive, HER2-negative, stage I–III tumors (n = 613 participants, 613 WSIs).
Molecular scoring
The methods for computing PAM50 centroid correlation coefficients, intrinsic subtypes, and the risk of recurrence (ROR-P) score have been described previously48. Briefly, tissue cores were sampled from tumor-rich regions (these cores contain far less tissue than the full whole-slide images used for training neural networks in this study). Bulk RNA counting via NanoString nCounter was performed to derive PAM50 centroid correlation coefficients, which quantify the similarity of a tumor’s gene expression profile to each PAM50 subtype51. The assigned PAM50 subtype corresponds to the subtype with the highest correlation coefficient. The ROR-P score was computed as a weighted sum of the PAM50 correlation coefficients and a proliferation-related component per the PAM50 algorithm in Parker et al.12.
Continuous ROR-P scores were stratified into three categories: low (<11.76471), intermediate (≥11.76471–<52.94118), and high (≥52.94118). The distribution of ROR-P scores in CBCS and TCGA BRCA was slightly different, with TCGA BRCA showing lower ROR-P scores on average (Supplementary Fig. 1).
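The categorization rule stated above can be expressed as a small helper; this is a sketch of the cutoff logic only (the function name is ours), not code from the PAM50 pipeline.

```python
def categorize_ror_p(score):
    """Map a continuous ROR-P score to the three clinical risk groups
    using the cutoffs given in the text: low (<11.76471),
    intermediate (>=11.76471 and <52.94118), and high (>=52.94118)."""
    if score < 11.76471:
        return "low"
    elif score < 52.94118:
        return "intermediate"
    return "high"

# Example: a score of 30 falls in the intermediate band.
group = categorize_ror_p(30.0)  # → "intermediate"
```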
Whole slide image processing
Formalin-fixed paraffin-embedded (FFPE) tissue specimens in CBCS were scanned using an Aperio scanner (Leica Biosystems, Nussloch, Germany). WSIs from TCGA BRCA were downloaded from the Genomic Data Commons Data Portal. Seven slides from TCGA BRCA were excluded because the metadata specifying the physical size of each pixel (microns per pixel, MPP) was missing. Patch coordinates were calculated using the CLAM toolkit39, which was modified to create patches of a constant physical size. Tissue image patches were extracted at this constant physical size, and the same patch coordinates were used for all models trained in the present report.
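Keeping the physical patch size constant means the pixel dimensions must be derived per slide from its MPP metadata. The sketch below illustrates that conversion; the target size and MPP values are hypothetical examples, not the parameters used in this study.

```python
def patch_size_pixels(target_microns, mpp):
    """Number of pixels spanning `target_microns` of tissue on a slide
    scanned at `mpp` microns per pixel. Constant physical size implies
    a slide-specific pixel size."""
    return round(target_microns / mpp)

# Hypothetical example: a 128 µm patch spans 512 px at 0.25 MPP
# but only 256 px at 0.50 MPP, so pixel dimensions differ per slide.
px_fine = patch_size_pixels(128, 0.25)   # → 512
px_coarse = patch_size_pixels(128, 0.50) # → 256
```

This also makes clear why slides lacking MPP metadata had to be excluded: without MPP, the physical extent of a patch cannot be determined.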
Each patch was embedded using pre-trained feature extraction models. We evaluated 12 foundation models trained on pathology images: CONCH32, CTransPath52, DiRLv253, Hibou-L34, H-optimus-033, Kaiko-L/1454, Phikon55, Phikon-v256, Prov-GigaPath35, UNI31, Virchow36, and Virchow257. As a comparison, we also embedded patches using ResNet5058, which was trained on ImageNet, and ViT-DINOv259, which was trained on over 142 million natural images. For each WSI, all patches were embedded using one model at a time, with each patch transformed into a corresponding feature vector. The resulting vectors from all patches within a WSI were concatenated to form a WSI-level matrix corresponding to that feature extraction model. This process was repeated separately for each of the 14 feature extraction models. The resulting matrices were then used as inputs to the WSI-level models.
WSI-level neural network modeling
Attention-based multiple instance learning (ABMIL) was used to learn WSI-level labels from patch embeddings38. The ABMIL models first encoded patch-level feature vectors using a fully connected layer, rectified linear unit activation, and dropout for regularization. A gated attention mechanism then computed attention scores by applying parallel tanh-activated and sigmoid-activated branches, followed by element-wise multiplication and a linear projection. The resulting attention scores were normalized via softmax, and the attention-weighted sum of patch embeddings formed a WSI-level feature vector, which was then processed by a final linear layer for classification or regression. The models output two logits for risk classification and one for ROR-P regression.
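The gated attention pooling described above can be sketched as a forward pass in NumPy. This is an illustration of the mechanism, not the trained network: the weight shapes, random initialization, and dimensionalities here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_attention_pool(H, V, U, w):
    """Gated attention pooling (Ilse et al. style, as used by ABMIL).
    H: (num_patches, dim) patch embeddings.
    V, U: (dim, attn_dim) weights of the tanh and sigmoid branches.
    w: (attn_dim,) linear projection to one attention score per patch."""
    gate = 1.0 / (1.0 + np.exp(-(H @ U)))        # sigmoid branch
    scores = (np.tanh(H @ V) * gate) @ w         # element-wise gate, project
    attn = softmax(scores)                       # normalize over patches
    return attn, attn @ H                        # WSI-level feature vector

# Hypothetical sizes: 100 patches with 32-dim embeddings, 16-dim attention.
H = rng.normal(size=(100, 32))
V = rng.normal(size=(32, 16))
U = rng.normal(size=(32, 16))
w = rng.normal(size=16)
attn, z = gated_attention_pool(H, V, U, w)
```

The pooled vector `z` would then pass through a final linear layer producing two logits (classification) or one (regression).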
We employed a 10-fold cross-validation procedure to assess the generalization performance of our models. Specifically, our cohort of 1339 CBCS participants was divided into 10 equally sized subsets (folds). The subsets were stratified such that the distribution of ROR-P groups was similar across folds. Iteratively, each fold served exactly once as the test set, while the remaining nine folds formed a combined dataset that was subsequently partitioned into separate training and validation subsets. For each iteration, a model was trained using only the training subset, optimized using predictions evaluated exclusively on the validation subset, and finally assessed using predictions generated on the independent test fold. After completing all 10 iterations, we obtained a complete set of out-of-sample predictions for the entire dataset. Critically, each participant appeared in only one partition per iteration, ensuring that the model was never trained or optimized on data from participants included in the corresponding test set. This procedure preserved the integrity of the training-validation-testing separation and prevented data leakage, thus providing an unbiased evaluation of model performance.
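The stratified fold assignment can be illustrated with a simplified round-robin scheme; this is not the authors' splitting code, merely a sketch of how each ROR-P group is spread evenly across folds.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=10):
    """Assign sample indices to folds so that every fold contains a
    similar mix of labels (here, ROR-P risk groups). Simplified
    round-robin stratification sketch."""
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)
    folds = [[] for _ in range(n_folds)]
    for members in by_label.values():
        for i, idx in enumerate(members):
            folds[i % n_folds].append(idx)  # deal members out like cards
    return folds

# Toy cohort: 100 samples with a 40/30/30 split of risk groups.
labels = ["low"] * 40 + ["intermediate"] * 30 + ["high"] * 30
folds = stratified_folds(labels, n_folds=10)
```

Each fold then serves once as the held-out test set while the remaining nine are split into training and validation subsets, so no participant contributes to both training and evaluation in the same iteration.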
Model evaluation on TCGA BRCA
Models trained on CBCS were applied to TCGA BRCA slides using the same pipeline. Patch embeddings from TCGA BRCA served as inputs to ABMIL models trained on CBCS. Because CBCS models were developed using 10-fold cross-validation, inference was performed with all 10 models. The final prediction for each TCGA BRCA specimen was obtained by averaging the softmax-transformed logits across the 10 models.
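The ensembling step, averaging softmax-transformed logits across the 10 cross-validation models, can be sketched as follows; the fold count and logit values are toy examples.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(fold_logits):
    """fold_logits: (n_folds, n_specimens, n_classes) raw outputs from
    the cross-validation models. Each fold's logits are softmaxed first,
    then probabilities are averaged over folds."""
    return softmax(np.asarray(fold_logits)).mean(axis=0)

# Toy example: 3 folds, 2 specimens, 2 classes (low/medium vs high ROR-P).
probs = ensemble_predict([[[2.0, 0.0], [0.0, 1.0]],
                          [[1.5, 0.5], [0.2, 0.8]],
                          [[2.5, -0.5], [0.1, 0.9]]])
```

Averaging probabilities (rather than raw logits) keeps each fold's contribution on a common, calibratable scale.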
Statistical analyses
For classification of ROR-P risk categories, we evaluated model performance using the area under the receiver operating characteristic curve (ROC AUC), and DeLong’s test was used to assess statistical significance of differences in model performance60. For continuous ROR-P prediction tasks, model performance was evaluated using Pearson correlation, and the statistical significance of differences in Pearson correlation coefficients was evaluated using Meng’s z-test for comparing dependent correlations61. Pairwise comparisons were performed between each foundation model and the ResNet50 baseline. To correct for multiple hypothesis testing, p-values were adjusted using the Benjamini–Hochberg procedure (FDR)62. All statistical tests were two-sided. Analyses were conducted separately for the CBCS and TCGA BRCA cohorts.
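The Benjamini–Hochberg adjustment used throughout can be written compactly; this sketch mirrors the standard step-up procedure (each adjusted p-value is the running minimum of p·m/k over equal-or-larger ranks), not any specific package implementation.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg FDR-adjusted p-values, returned in the
    original order. m = number of tests; k = rank of each p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, carrying the running minimum.
    for rank_from_end, i in enumerate(reversed(order)):
        k = m - rank_from_end              # 1-based rank of p_values[i]
        running_min = min(running_min, p_values[i] * m / k)
        adjusted[i] = running_min
    return adjusted

# Toy example with four hypothetical raw p-values.
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```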
We assessed model performance for recurrence-free survival using time-to-event data. Survival time was defined as the number of years from diagnosis to the first recurrence or censoring at 10 years. The concordance index (C-index) was used to quantify the discriminative ability of each model. C-indices were computed using the concordance() function from the “survival” R package. C-index values range from 0.5 (no discrimination) to 1.0 (perfect discrimination). For comparison between model predictions and the reference transcriptomic score, we used the “compareC” R package, which implements a statistical test for comparing correlated C-indices44. P values were adjusted for multiple comparisons using the FDR correction62. To assess whether model predictions were significantly associated with recurrence-free survival, we fit univariable Cox proportional hazards models using the coxph() function from the “survival” R package. The continuous model prediction score was used as the sole predictor. We also applied the log-rank test using the survdiff() function from the “survival” R package.
For categorical prediction models, model logits were transformed with softmax and then converted to binary risk classifications using optimized thresholds. Rather than applying a fixed threshold (e.g., 0.5), we determined an optimal threshold for each encoder by maximizing Youden’s J (sensitivity + specificity − 1)63. Specifically, we performed 10-fold cross-validation, and for each model, we concatenated the validation set predictions for ER-positive, HER2-negative specimens across all folds. Youden’s J was then computed on this pooled validation set to identify the optimal threshold, which was subsequently applied to the model’s test set predictions. This procedure was performed independently for each model. This approach ensured that threshold selection reflected optimal risk discrimination rather than arbitrary decision boundaries, thus providing a more meaningful evaluation of clinical utility.
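Threshold selection by Youden's J can be sketched in a few lines; this is an illustrative grid search over observed scores, not the exact ROC machinery used in the analysis.

```python
def youden_threshold(scores, labels):
    """Pick the score threshold maximizing Youden's J
    (sensitivity + specificity - 1) over the observed scores.
    `labels` are 1 for high ROR-P, 0 otherwise."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(s >= t and y for s, y in zip(scores, labels))
        tn = sum(s < t and not y for s, y in zip(scores, labels))
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

# Toy pooled validation set: high-risk cases score above 0.6.
t, j = youden_threshold([0.1, 0.3, 0.7, 0.9], [0, 0, 1, 1])  # → t=0.7, j=1.0
```

In the study, this threshold is fit on pooled validation predictions and then frozen before being applied to the test folds, keeping threshold selection out of the evaluation data.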
Interpretation of ABMIL models
Attention maps from ABMIL models were used as a preliminary tool for qualitative interpretation. These maps were generated in Python, converted to GeoJSON format (with each patch polygon annotated with its ranked attention score), and subsequently visualized in QuPath64 for assessment.
HIPPO experiments
To further interpret model predictions, we applied HIPPO, a quantitative, occlusion-based explainability technique40. As ROR-P was developed using tumor-intrinsic genes, we hypothesized that tumor regions are necessary and sufficient for model predictions of ROR-P. We focused on ROR-P classification models. Tumor masks were generated using PenAnnotationExtractor65, which extracted tissue regions based on pen annotations on the glass slide. These masks defined the tumor-containing areas within each WSI. To evaluate the effect of tumor removal on high ROR-P predictions, we excluded all patch embeddings that intersected with the tumor mask and compared the model’s softmax probabilities between the original and tumor-removed WSIs. This analysis was restricted to slides where the model originally predicted high ROR-P with a softmax probability greater than 0.5.
To assess the sufficiency of tumor regions for ROR-P predictions, we removed all patches that did not intersect with the tumor mask, retaining only tumor-containing regions. As in the necessity analysis, we focused on WSIs where the models originally predicted high ROR-P with a softmax probability >0.5. We then compared the model’s predictions between the original WSIs and those containing only tumor regions.
Beyond the hypothesis-driven tests described above, we conducted data-driven experiments to identify regions sufficient to flip a low ROR-P specimen to high ROR-P. This approach aimed to uncover robust tissue biomarkers that the models associated with recurrence risk. We selected five high ROR-P slides and five low ROR-P slides and applied a greedy search strategy using the ROR-P regression model trained with the h-Optimus-0 features. In each iteration, a single patch from a high ROR-P slide was inserted into a low ROR-P slide, and the model’s prediction was re-evaluated. The patch was then replaced with the next patch from the high ROR-P slide, and this process was repeated for all patches. The patch that produced the highest increase in ROR-P when inserted into the low ROR-P slide was retained, and the search continued with the remaining patches. This iterative process was performed until either a quarter of the patches in the high ROR-P specimen were used or the manipulated specimen’s predicted ROR-P reached the high-risk threshold.
To assess the robustness and generalizability of the identified high ROR-P patches, we inserted them into additional low- and medium-ROR-P specimens and measured the resulting changes in predicted ROR-P scores. This analysis was conducted across all foundation models to evaluate the robustness of learned risk-associated features.
Study population
This study involved data from The Carolina Breast Cancer Study (CBCS), which has been described previously48,49. Briefly, CBCS is a multidisciplinary study of invasive breast cancer that enrolled a total of 2998 female participants ages 20–74 years from 44 counties in North Carolina. CBCS oversampled self-identified Black/African American women and younger women (age <50 years). Cases were identified by rapid case ascertainment via the UNC Rapid Case Ascertainment Core in conjunction with the North Carolina Central Cancer Registry (diagnosis years 2008–2013).
For model development, we used all CBCS participants with WSIs, matched ROR-P scores, and recurrence information (n = 1339 participants; 1384 WSIs). The samples with WSIs had similar distributions of age, race, stage, size, and node status as the overall CBCS3 study population. Details regarding formalin-fixed paraffin-embedded (FFPE) and immunohistochemistry (IHC) preparation have been described previously50. Models were trained on this full set to maximize sample size and morphological diversity. However, all reported performance metrics and interpretability analyses were restricted to the clinically relevant subset of ER-positive, HER2-negative, stage I–III tumors, the target population for clinical ROR-P testing. In CBCS, this subset comprised 847 participants (883 WSIs) (Table 1).
For external validation, we used the invasive breast cancer cohort of The Cancer Genome Atlas (TCGA BRCA)42. Of 1050 participants with available WSIs and gene expression data, we restricted evaluation to those with ER-positive, HER2-negative, stage I–III tumors (n = 613 participants, 613 WSIs).
Molecular scoring
The methods for computing PAM50 centroid correlation coefficients, intrinsic subtypes, and the risk of recurrence (ROR-P) score have been described previously48. Briefly, tissue cores were sampled from tumor-rich regions; these cores contain substantially less tissue than the full whole-slide images used for training neural networks in this study. Bulk RNA counting via NanoString nCounter was performed to derive PAM50 centroid correlation coefficients, which quantify the similarity of a tumor’s gene expression profile to each PAM50 subtype51. The assigned PAM50 subtype corresponds to the subtype with the highest correlation coefficient. The ROR-P score was computed as a weighted sum of the PAM50 correlation coefficients and a proliferation-related component, per the PAM50 algorithm of Parker et al.12.
Continuous ROR-P scores were stratified into three categories: low (<11.76471), intermediate (≥11.76471 and <52.94118), and high (≥52.94118). The distribution of ROR-P scores in CBCS and TCGA BRCA was slightly different, with TCGA BRCA showing lower ROR-P scores on average (Supplementary Fig. 1).
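The three-way stratification above can be sketched as a small helper. The cutpoints are taken directly from the text; the function name is illustrative.

```python
def categorize_ror_p(score: float) -> str:
    """Map a continuous ROR-P score to the low/intermediate/high bins
    using the cutpoints 11.76471 and 52.94118 described in the text."""
    if score < 11.76471:
        return "low"
    if score < 52.94118:
        return "intermediate"
    return "high"
```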
Whole slide image processing
Formalin-fixed paraffin-embedded (FFPE) tissue specimens in CBCS were scanned using an Aperio scanner (Leica Biosystems, Nussloch, Germany). WSIs from TCGA BRCA were downloaded from the Genomic Data Commons Data Portal. Seven slides from TCGA BRCA were excluded because the metadata specifying the physical size of each pixel (microns per pixel, MPP) was missing. Patch coordinates were calculated using the CLAM toolkit39, which was modified to create patches of a constant physical size. Tissue image patches of this constant physical size were extracted, and the same patch coordinates were used for all models trained in the present report.
Each patch was embedded using pre-trained feature extraction models. We evaluated 12 foundation models trained on pathology images: CONCH32, CTransPath52, DiRLV253, Hibou-L34, H-optimus-033, Kaiko-L/1454, Phikon55, Phikon-v256, Prov-GigaPath35, UNI31, Virchow36, and Virchow257. As a comparison, we also embedded patches using ResNet5058, which was trained on ImageNet, and ViT-DINOv259, which was trained on over 142 million natural images. For each WSI, all patches were embedded using one model at a time, with each patch transformed into a corresponding feature vector. The resulting vectors from all patches within a WSI were concatenated to form a WSI-level matrix corresponding to that feature extraction model. This process was repeated separately for each of the 14 feature extraction models. The resulting matrices were then used as inputs to the WSI-level models.
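The per-WSI feature extraction described above can be sketched as follows. Here `embed_patch` is a stand-in for any one of the 14 frozen encoders (its output dimensionality of 768 is an illustrative assumption; the real models differ), and the stacking step shows how patch vectors form the WSI-level matrix consumed by the ABMIL models.

```python
import zlib
import numpy as np

def embed_patch(patch: np.ndarray) -> np.ndarray:
    # Stand-in for a frozen foundation-model encoder: a real encoder
    # (e.g., one of the 14 models above) maps an RGB patch to a fixed-length
    # feature vector. Here we return a deterministic pseudo-random vector.
    seed = zlib.crc32(patch.tobytes())
    rng = np.random.default_rng(seed)
    return rng.standard_normal(768).astype(np.float32)

def wsi_feature_matrix(patches: list) -> np.ndarray:
    """Stack per-patch embeddings into an (n_patches, dim) WSI-level matrix,
    the input format consumed by the downstream ABMIL models."""
    return np.stack([embed_patch(p) for p in patches])

# Five dummy RGB patches standing in for tissue tiles from one WSI
patches = [np.full((224, 224, 3), i, dtype=np.uint8) for i in range(5)]
features = wsi_feature_matrix(patches)
```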
WSI-level neural network modeling
Attention-based multiple instance learning (ABMIL) was used to learn WSI-level labels from patch embeddings38. The ABMIL models first encoded patch-level feature vectors using a fully connected layer, rectified linear unit (ReLU) activation, and dropout for regularization. A gated attention mechanism then computed attention scores by applying parallel tanh-activated and sigmoid-activated branches of matching dimensionality, followed by element-wise multiplication and a linear projection. The resulting attention scores were normalized via softmax, and the attention-weighted sum of patch embeddings formed a WSI-level feature vector, which was then processed by a final linear layer for classification or regression. The models output two logits for risk classification and one logit for ROR-P regression.
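The gated-attention forward pass described above can be sketched at inference time in NumPy. The layer sizes and random weights here are illustrative assumptions (the paper's hidden dimensions did not survive extraction), not the trained model; dropout is omitted because it is inactive at inference.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_hid, D_attn = 768, 512, 256  # illustrative sizes, not the paper's

# Random parameters of a minimal gated-attention ABMIL head (for illustration)
W_enc = rng.standard_normal((D_in, D_hid)) * 0.01
W_tanh = rng.standard_normal((D_hid, D_attn)) * 0.01
W_sig = rng.standard_normal((D_hid, D_attn)) * 0.01
w_attn = rng.standard_normal((D_attn, 1)) * 0.01
W_out = rng.standard_normal((D_hid, 2)) * 0.01  # two logits for risk classes

def abmil_forward(X):
    """X: (n_patches, D_in) patch embeddings -> (class logits, attention)."""
    H = np.maximum(X @ W_enc, 0.0)                 # fully connected + ReLU
    gate = np.tanh(H @ W_tanh) / (1 + np.exp(-(H @ W_sig)))  # gated branches
    scores = gate @ w_attn                         # linear projection per patch
    a = np.exp(scores - scores.max())
    a = a / a.sum()                                # softmax over patches
    z = (a * H).sum(axis=0)                        # attention-weighted WSI vector
    return z @ W_out, a.ravel()

logits, attn = abmil_forward(rng.standard_normal((100, D_in)))
```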
We employed a 10-fold cross-validation procedure to assess the generalization performance of our models. Specifically, our cohort of 1339 CBCS participants was divided into 10 equally sized subsets (folds). The subsets were stratified such that the distribution of ROR-P groups was similar across folds. Iteratively, each fold served exactly once as the test set, while the remaining nine folds formed a combined dataset that was subsequently partitioned into separate training and validation subsets. For each iteration, a model was trained using only the training subset, optimized using predictions evaluated exclusively on the validation subset, and finally assessed using predictions generated on the independent test fold. After completing all 10 iterations, we obtained a complete set of out-of-sample predictions for the entire dataset. Critically, each participant appeared in only one partition per iteration, ensuring that the model was never trained or optimized on data from participants included in the corresponding test set. This procedure preserved the integrity of the training-validation-testing separation and prevented data leakage, thus providing an unbiased evaluation of model performance.
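The stratified fold assignment above can be sketched with a simple round-robin scheme within each ROR-P group; this is a minimal illustration of the stratification property, not the authors' exact splitting code.

```python
import numpy as np

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each participant to one of n_folds test folds so that the
    ROR-P group distribution is roughly equal across folds."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fold = np.empty(len(labels), dtype=int)
    for group in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == group))
        fold[idx] = np.arange(len(idx)) % n_folds  # round-robin within group
    return fold

labels = np.array(["low", "intermediate", "high"] * 100)
fold = stratified_folds(labels)
```

Each participant lands in exactly one test fold, so concatenating the test-fold predictions across all 10 iterations yields one out-of-sample prediction per participant.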
Model evaluation on TCGA BRCA
Models trained on CBCS were applied to TCGA BRCA slides using the same pipeline. Patch embeddings from TCGA BRCA served as inputs to ABMIL models trained on CBCS. Because CBCS models were developed using 10-fold cross-validation, inference was performed with all 10 models. The final prediction for each TCGA BRCA specimen was obtained by averaging the softmax-transformed logits across the 10 models.
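The ensembling step above (softmax first, then average across the 10 cross-validation models) can be sketched as follows; the array shapes are illustrative.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_predict(all_logits):
    """all_logits: (n_models, n_specimens, n_classes) raw logits from the
    10 cross-validation models; returns softmax-averaged probabilities."""
    return softmax(all_logits, axis=-1).mean(axis=0)

# 10 models, 4 TCGA BRCA specimens, 2 classes (low/intermediate vs high)
probs = ensemble_predict(np.random.default_rng(1).standard_normal((10, 4, 2)))
```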
Statistical analyses
For classification of ROR-P risk categories, we evaluated model performance using the area under the receiver operating characteristic curve (ROC AUC), and DeLong’s test was used to assess statistical significance of differences in model performance60. For continuous ROR-P prediction tasks, model performance was evaluated using Pearson correlation, and the statistical significance of differences in Pearson correlation coefficients was evaluated using Meng’s z-test for comparing dependent correlations61. Pairwise comparisons were performed between each foundation model and the ResNet50 baseline. To correct for multiple hypothesis testing, p-values were adjusted using the Benjamini–Hochberg procedure (FDR)62. All statistical tests were two-sided. Analyses were conducted separately for the CBCS and TCGA BRCA cohorts.
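The Benjamini–Hochberg adjustment used above is the standard step-up procedure; a minimal NumPy implementation:

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg (FDR) adjusted p-values: sort ascending, scale the
    k-th smallest p by n/k, then enforce monotonicity from the largest down."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    adj = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1.0)
    out = np.empty(n)
    out[order] = adj
    return out
```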
We assessed model performance for recurrence-free survival using time-to-event data. Survival time was defined as the number of years from diagnosis to the first recurrence or censoring at 10 years. The concordance index (C-index) was used to quantify the discriminative ability of each model. C-indices were computed using the concordance() function from the “survival” R package. C-index values range from 0.5 (no discrimination) to 1.0 (perfect discrimination). For comparison between model predictions and the reference transcriptomic score, we used the “compareC” R package, which implements a statistical test for comparing correlated C-indices44. P values were adjusted for multiple comparisons using the FDR correction62. To assess whether model predictions were significantly associated with recurrence-free survival, we fit univariable Cox proportional hazards models using the coxph() function from the “survival” R package. The continuous model prediction score was used as the sole predictor. We also applied the log-rank test using the survdiff() function from the “survival” R package.
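The C-index computed above (Harrell's concordance) can be sketched in Python for intuition; this mirrors the behavior of the R concordance() function on comparable pairs, with ties in the risk score counted as half-concordant.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: among comparable pairs (the earlier time has an observed
    event), the fraction where the higher risk score had the earlier event."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied risk counts as half
    return concordant / comparable
```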
For categorical prediction models, model logits were transformed with softmax and then converted to binary risk classifications using optimized thresholds. Rather than applying a fixed threshold (e.g., 0.5), we determined an optimal threshold for each encoder by maximizing Youden’s J (J = sensitivity + specificity − 1)63. Specifically, we performed 10-fold cross-validation, and for each model, we concatenated the validation set predictions for ER-positive, HER2-negative specimens across all folds. Youden’s J was then computed on this pooled validation set to identify the optimal threshold, which was subsequently applied to the model’s test set predictions. This procedure was performed independently for each model. This approach ensured that threshold selection reflected optimal risk discrimination rather than arbitrary decision boundaries, thus providing a more meaningful evaluation of clinical utility.
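The threshold search above can be sketched directly: J is evaluated at each observed probability in the pooled validation predictions, and the maximizer becomes that model's test-set cutoff.

```python
import numpy as np

def youden_threshold(y_true, y_prob):
    """Return the probability cutoff maximizing Youden's J
    (J = sensitivity + specificity - 1) over all observed probabilities."""
    y_true = np.asarray(y_true, dtype=bool)
    y_prob = np.asarray(y_prob, dtype=float)
    best_t, best_j = 0.5, -1.0
    for t in np.unique(y_prob):
        pred = y_prob >= t
        sens = (pred & y_true).sum() / y_true.sum()
        spec = (~pred & ~y_true).sum() / (~y_true).sum()
        j = sens + spec - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t

# Toy pooled validation set: labels and softmax probabilities of "high ROR-P"
t = youden_threshold([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```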
Interpretation of ABMIL models
Attention maps from ABMIL models were used as a preliminary tool for qualitative interpretation. These maps were generated in Python, converted to GeoJSON format (with each patch polygon annotated with its ranked attention score), and subsequently visualized in QuPath64 for assessment.
HIPPO experiments
To further interpret model predictions, we applied HIPPO, a quantitative, occlusion-based explainability technique40. As the ROR-P was developed using tumor-intrinsic genes, we hypothesized that tumor regions are necessary and sufficient for model predictions of ROR-P. We focused on ROR-P classification models. Tumor masks were generated using PenAnnotationExtractor65, which extracted tissue regions based on pen annotations on the glass slide. These masks defined the tumor-containing areas within each WSI. To evaluate the effect of tumor removal on high ROR-P predictions, we excluded all patch embeddings that intersected with the tumor mask and compared the model’s softmax probabilities between the original and tumor-removed WSIs. This analysis was restricted to slides where the model originally predicted high ROR-P with a softmax probability greater than 0.5.
To assess the sufficiency of tumor regions for ROR-P predictions, we removed all patches that did not intersect with the tumor mask, retaining only tumor-containing regions. As in the necessity analysis, we focused on WSIs where the models originally predicted high ROR-P with a softmax probability >0.5. We then compared the model’s predictions between the original WSIs and those containing only tumor regions.
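The necessity and sufficiency tests above amount to re-running the classifier on masked patch sets and comparing probabilities. A minimal sketch, in which `model_prob_high` is a dummy stand-in for the trained ABMIL classifier (not HIPPO's actual API) and the tumor mask is simulated:

```python
import numpy as np

rng = np.random.default_rng(2)

def model_prob_high(embeddings):
    # Stand-in for an ABMIL classifier: mean-pooled linear score -> sigmoid.
    w = np.linspace(-1, 1, embeddings.shape[1])
    return 1.0 / (1.0 + np.exp(-(embeddings.mean(axis=0) @ w)))

embeddings = rng.standard_normal((200, 32))   # all patch embeddings of one WSI
in_tumor = rng.random(200) < 0.3              # mask: patches inside tumor regions

p_original = model_prob_high(embeddings)
p_tumor_removed = model_prob_high(embeddings[~in_tumor])  # necessity test
p_tumor_only = model_prob_high(embeddings[in_tumor])      # sufficiency test
```

A large drop from `p_original` to `p_tumor_removed` indicates tumor regions are necessary; `p_tumor_only` remaining above 0.5 indicates they are sufficient.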
Beyond the hypothesis-driven tests described above, we conducted data-driven experiments to identify regions sufficient to flip a low ROR-P specimen to high ROR-P. This approach aimed to uncover robust tissue biomarkers that the models associated with recurrence risk. We selected five high ROR-P slides and five low ROR-P slides and applied a greedy search strategy using the ROR-P regression model trained with the H-optimus-0 features. In each iteration, a single patch from a high ROR-P slide was inserted into a low ROR-P slide, and the model’s prediction was re-evaluated. The patch was then replaced with the next patch from the high ROR-P slide, and this process was repeated for all patches. The patch that produced the highest increase in ROR-P when inserted into the low ROR-P slide was retained, and the search continued with the remaining patches. This iterative process was performed until either a quarter of the patches from the high ROR-P specimen had been used or the manipulated specimen’s predicted ROR-P reached the high-risk threshold.
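The greedy search above can be sketched as follows. Here `predict_ror_p` is a dummy stand-in for the trained regression model (it simply averages the first feature), and the slide arrays are synthetic; only the search logic mirrors the procedure described.

```python
import numpy as np

def predict_ror_p(embeddings):
    # Dummy stand-in for the ROR-P regression ABMIL model: a higher mean of
    # the first feature yields a higher predicted ROR-P.
    return float(embeddings[:, 0].mean())

def greedy_insert(low_slide, high_slide, threshold, max_frac=0.25):
    """Greedily move patches from a high-ROR-P slide into a low-ROR-P slide,
    keeping at each step the single patch that raises the prediction most.
    Stop once a quarter of the donor patches are used or the threshold is hit."""
    current = low_slide.copy()
    remaining = list(range(len(high_slide)))
    chosen = []
    budget = int(len(high_slide) * max_frac)
    while remaining and len(chosen) < budget:
        preds = [predict_ror_p(np.vstack([current, high_slide[i:i + 1]]))
                 for i in remaining]
        best = remaining[int(np.argmax(preds))]
        current = np.vstack([current, high_slide[best:best + 1]])
        chosen.append(best)
        remaining.remove(best)
        if predict_ror_p(current) >= threshold:
            break
    return chosen, predict_ror_p(current)

rng = np.random.default_rng(3)
low = rng.standard_normal((40, 8)) - 1.0   # synthetic low ROR-P slide
high = rng.standard_normal((40, 8)) + 1.0  # synthetic high ROR-P slide
chosen, final_pred = greedy_insert(low, high, threshold=0.0)
```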
To assess the robustness and generalizability of the identified high ROR-P patches, we inserted them into additional low- and medium-ROR-P specimens and measured the resulting changes in predicted ROR-P scores. This analysis was conducted across all foundation models to evaluate the robustness of learned risk-associated features.
Supplementary information