
Random features meet MIL: a deep GP approach to colorectal MSI prediction.

NPJ Digital Medicine, 2025, Vol. 9(1), p. 40 (Open Access)

Shen S, Wang Z, Liu T, Ma K, Tian Z, Zhang F


Shen S, Wang Z, et al. (2025). Random features meet MIL: a deep GP approach to colorectal MSI prediction. NPJ Digital Medicine, 9(1), 40. https://doi.org/10.1038/s41746-025-02214-9
PMID: 41398439

Abstract

Colorectal cancer (CRC) is one of the leading causes of cancer-related deaths globally, and early diagnosis is crucial for improving patient outcomes. Despite significant advancements in the field, accurate prediction from medical images remains a challenge due to issues such as weak supervision, data heterogeneity, and large-scale datasets. In this paper, we propose a novel approach for colorectal cancer classification that integrates deep Gaussian processes (DGP) with multi-instance learning (MIL). Our method is designed to handle weakly labeled data, where only bag-level labels are available, and it improves classification performance by utilizing a deep Gaussian process with random feature expansion (DGP-RF). Additionally, our approach incorporates an attention-based aggregation mechanism to emphasize key regions in whole-slide images, enhancing model interpretability and robustness. Experimental results on the TCGA-CRC dataset demonstrate that our model outperforms existing models, achieving an AUC of 0.895, compared to 0.777 for ResNet, 0.791 for EfficientNet, and 0.784 for ShuffleNet. These results highlight the superiority of our approach in terms of both accuracy and robustness. This work offers a promising tool for automated cancer detection, with the potential for clinical deployment.


Introduction

Colorectal cancer (CRC) is one of the leading causes of cancer-related deaths worldwide1,2. Early detection and accurate diagnosis are crucial for improving patient prognosis and survival rates. However, colorectal cancer exhibits significant heterogeneity in terms of genomic features, histopathology, and clinical manifestations, which presents challenges for both diagnosis and treatment3–5. The increasing availability of large-scale medical image data, such as whole-slide images (WSIs) from histopathological scans, has created an opportunity to leverage machine learning (ML) techniques to improve colorectal cancer detection and classification6–8.
Recent advances in deep learning, particularly convolutional neural networks (CNNs), have demonstrated the potential to revolutionize the analysis of medical images6. These methods have shown promise in automating the detection of tumors and other pathological features in images9,10. However, the complexity of cancer diagnosis, coupled with the often insufficient amount of labeled data at the instance level, has limited the performance of traditional deep learning models, particularly when using weakly-supervised learning settings11,12. This is especially true for multi-instance learning (MIL) problems, where only bag-level labels (i.e., overall diagnosis of a slide) are available, but individual instances (i.e., individual tiles or patches within the slide) remain unlabeled13,14.
The motivation behind this study is to address these challenges by developing a more robust and interpretable model that can work efficiently with weak supervision while still maintaining high classification performance. The goal is to create a framework that not only improves the accuracy of colorectal cancer detection but also enhances the model’s ability to provide meaningful insights, making it more suitable for clinical applications15–17.
Despite the promising results achieved by previous studies in using deep learning for colorectal cancer classification, several key limitations remain. Firstly, many deep learning models rely heavily on large amounts of labeled data for training, which is often difficult to obtain in medical domains due to the high cost and expertise required for manual annotation7,11. Secondly, existing multi-instance learning methods fail to fully capture the heterogeneity of cancerous tissues, often overlooking critical features that can lead to more accurate classifications18,19. Lastly, while some methods integrate attention mechanisms to highlight important regions in medical images, they are often computationally expensive or fail to handle large-scale datasets effectively16,17.
These limitations present significant challenges to developing practical, reliable, and interpretable models for medical diagnosis. Our research addresses these gaps by proposing a novel framework that integrates deep Gaussian processes (DGP) with multi-instance learning (MIL) to improve the robustness, scalability, and interpretability of colorectal cancer classification systems20–23.
To overcome the shortcomings of existing methods, our work introduces a solution that addresses the following challenges in colorectal cancer prediction:
Scalability and Efficiency in Weakly-Supervised Learning: By combining multi-instance learning (MIL) with deep Gaussian processes (DGP), we develop a model that can effectively handle weak supervision while maintaining scalability to large datasets11,15. This combination allows us to improve classification performance even in situations where only bag-level annotations are available, without the need for costly instance-level labeling12,13.
Improved Feature Representation and Aggregation: Unlike traditional deep learning models that rely on fixed feature extractors, our approach utilizes a deep Gaussian process with random feature expansion (DGP-RF), enabling better representation of complex, non-linear relationships in the data20–22. This helps to improve the model’s ability to capture subtle distinctions in tissue structures and cancerous features, which are often critical for accurate classification24.
Enhanced Interpretability and Robustness: Our model integrates a robust instance selection and aggregation mechanism that leverages attention-based weight normalization, ensuring that the model focuses on the most informative instances within a whole-slide image16,17,25,26. This enhances the model’s interpretability and allows clinicians to visually identify key regions that contribute to the model’s decision-making process, which is essential for clinical acceptance27–30.
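To give the random-feature idea behind DGP-RF a concrete form, the sketch below approximates an RBF kernel with random Fourier features (in the Rahimi-Recht style) in plain NumPy. The function name, feature count, and bandwidth are illustrative choices under stated assumptions, not the paper's implementation.

```python
import numpy as np

def random_fourier_features(X, n_features=256, gamma=1.0, seed=0):
    """Map X (n, d) into a randomized feature space whose inner
    products approximate the RBF kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies drawn from the RBF kernel's Gaussian spectral density.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Sanity check: Z @ Z.T converges to the exact kernel matrix.
X = np.random.default_rng(1).normal(size=(5, 3))
Z = random_fourier_features(X, n_features=5000, gamma=0.5)
approx = Z @ Z.T
exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
```

The appeal for GP models is that this replaces an n-by-n kernel matrix with an explicit finite feature map, so downstream layers can be trained with standard stochastic optimization.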
In the field of colorectal cancer prediction, significant progress has been made with the application of deep learning techniques, particularly in the analysis of medical images. Traditional machine learning methods, such as support vector machines and random forests, were initially employed, utilizing handcrafted features extracted from imaging data. While these methods had some success, they struggled to handle the complexity and high dimensionality of medical data, particularly with large datasets, leading to the emergence of deep learning methods6,31.
Deep learning, especially convolutional neural networks (CNNs), revolutionized the approach to image analysis by automatically learning complex feature representations directly from raw image data6. This shift has been crucial for improving the accuracy of cancer detection from histopathological images, which often contain subtle and intricate patterns that are challenging for traditional models to capture24,32. CNNs are able to capture these patterns at multiple levels of abstraction, offering a more efficient and scalable solution for cancer prediction tasks7.
A notable challenge in medical image analysis, however, lies in the availability of labeled data. In many cases, only bag-level labels are available (e.g., the diagnosis of a whole-slide image), but the individual instances within these bags (such as tiles or patches from the slide) are unlabeled. This scenario is typical in multiple-instance learning (MIL), which has become a focal point in cancer prediction tasks13,14. MIL provides a way to train models with weak supervision, where the model learns to infer the presence or absence of specific features at the instance level, despite only having access to the overall label for the entire bag11,12. MIL has been applied to cancer detection, especially in scenarios where detailed labeling of individual instances is not feasible, such as in whole-slide image classification15.
In MIL, attention mechanisms have been widely used to help the model focus on the most informative regions within a bag. These mechanisms allow the model to assign more weight to certain instances that are likely to carry relevant information for the classification task16,17. This has been especially useful in pathology, where certain areas of a tissue slide may be more indicative of the presence of cancerous cells than others. Attention-based MIL methods have demonstrated improved performance and interpretability by making the model’s decisions more transparent and emphasizing key features within the data19,25.
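The attention-based MIL pooling described above can be sketched in a few lines of NumPy: instance embeddings are scored, softmax-normalized, and combined into one bag embedding. This is a minimal, framework-free illustration in the spirit of attention-based MIL; the variable names and dimensions are hypothetical, not taken from the paper.

```python
import numpy as np

def attention_mil_pool(H, V, w):
    """Combine instance embeddings H (n_instances, d) into a single
    bag embedding using softmax-normalized attention scores."""
    scores = np.tanh(H @ V) @ w              # one scalar score per instance
    scores = scores - scores.max()           # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()
    return attn @ H, attn                    # bag embedding, attention weights

rng = np.random.default_rng(0)
H = rng.normal(size=(8, 16))                 # e.g. 8 tiles, 16-dim features
V = rng.normal(size=(16, 32))                # attention projection (learned in practice)
w = rng.normal(size=32)                      # attention vector (learned in practice)
bag, attn = attention_mil_pool(H, V, w)      # weights are non-negative and sum to 1
```

Because the weights sum to one, they can be mapped back onto the slide as a heatmap, which is the source of the interpretability claims made for attention-based MIL.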
In colorectal cancer prediction, conventional deep learning models such as ResNet and EfficientNet have been frequently applied. These models are effective at capturing complex patterns in imaging data and have shown good performance in several medical image classification tasks27,33. However, despite their strong capabilities, these models sometimes fail to capture the full range of complexities inherent in colorectal cancer datasets. This includes the high variability in image quality, tumor morphology, and the subtle distinctions between different cancer subtypes18.
Recent advancements have explored the use of Gaussian processes (GPs) in medical imaging, particularly for their ability to provide uncertainty estimates alongside predictions. This is especially valuable in medical applications, where high confidence in predictions is necessary for clinical decision-making20. Gaussian processes can help quantify model uncertainty, distinguishing between cases where the model is confident and where it may require further review21. Integrating Gaussian processes with deep learning techniques, such as deep Gaussian processes (DGP), offers a powerful framework for capturing the non-linear relationships between instances and producing more robust predictions22,23. However, the use of GPs in the context of multi-instance learning remains an underexplored area, and this work addresses this gap by introducing a combination of DGP and MIL.
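To make the uncertainty argument concrete, the following minimal NumPy sketch performs exact GP regression with an RBF kernel and shows how the predictive standard deviation grows away from the training data. It is a textbook single-layer GP under assumed hyperparameters, not the paper's deep GP.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel matrix between row sets A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / length_scale**2)

def gp_predict(X_train, y_train, X_test, noise=1e-8):
    """Exact GP posterior mean and per-point standard deviation."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf_kernel(X_test, X_train)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = np.ones(len(X_test)) - (v ** 2).sum(axis=0)  # prior variance is 1
    return mean, np.sqrt(np.maximum(var, 0.0))

X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.sin(X_train).ravel()
mean, std = gp_predict(X_train, y_train, np.array([[1.0], [5.0]]))
# std is near zero at the training point x=1 and large at the distant x=5.
```

This is exactly the behavior that motivates GP-based triage in clinical settings: inputs far from the training distribution come back with wide predictive intervals and can be flagged for review.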
In addition to model architecture, the methods used to aggregate information from multiple instances within a bag are also critical in improving classification performance. Aggregation mechanisms such as weighted pooling or attention-based selection allow the model to focus on the most relevant instances in each bag14,16,17. This helps reduce the impact of noise or irrelevant information that may be present in other instances, leading to more accurate predictions. Instance selection, where only the most informative instances are retained for aggregation, has also been shown to improve performance, particularly when dealing with noisy or large-scale datasets25,26.
While ensemble methods have been employed to combine multiple models and improve predictive performance, they often come with increased computational cost and reduced interpretability31. In contrast, the approach proposed in this work integrates multiple-instance learning with deep Gaussian processes in a unified architecture, offering both predictive power and interpretability without the need for complex ensemble strategies. This method enhances the model’s ability to capture subtle distinctions in medical images while maintaining computational efficiency and transparency7,28,34.
In summary, the application of deep learning, multi-instance learning, and Gaussian processes has proven to be highly effective for cancer prediction, including colorectal cancer. However, challenges remain in dealing with weak supervision, high variability in data, and the need for interpretability6,18. Our work builds upon these advances by combining deep Gaussian processes with multi-instance learning, offering a more robust and scalable approach to colorectal cancer classification, while also benefiting from best practices in stain normalization and large-scale data curation1,32,35–37.
In addition to conventional CNN- and MIL-based frameworks, recent state-of-the-art advances in biomedical imaging have increasingly explored Transformer architectures, attention mechanisms, and kernel-based hybrid models. For example, Singh et al.38 presented a comprehensive comparative study of deep and hybrid computational models for breast cancer classification, highlighting the role of attention-augmented and transfer learning strategies for diagnostic robustness. Similarly, Banerjee et al.39 developed CICADA (UCX), a hierarchical aggressiveness delineation framework for breast cancer, while Singh et al.40 reviewed deep learning and Transformer-based architectures for anti-cancer drug response prediction.
Advanced attention networks have also achieved remarkable success in organ-specific imaging tasks. Banerjee41 proposed the DY–FSPAN pyramidal attention network for explainable lung cancer histopathology, and Banerjee et al.42 introduced a hybrid deep feature attention and statistical validation model for thyroid ultrasound segmentation. Pacal and Banerjee43 extended this family with T–FSPANNet, a Tri-Attribute pyramidal attention model for interpretable brain tumor diagnosis, complementing the pyramidal T–Network introduced by Banerjee et al.44. Narayan et al.45 proposed TrionixNet, an N-core multi-attention network for prostate cancer segmentation, demonstrating the efficacy of multi-attention pathways in organ-level delineation.
Beyond these disease-specific models, Banerjee46 introduced the Electromagnetic Interaction Algorithm (EIA) integrated with an Adaptive Kernel Attention Network (AKAttNet) for Autism Spectrum Disorder classification, combining kernelized attention with feature selection for neuroimaging applications. Singh et al.47 and Banerjee et al.48 conducted extensive studies on diabetic retinopathy detection using Transformer and kernel-based architectures, whereas Banerjee49 compared bipartite convolutional and attention-driven methods for skin cancer diagnosis. These developments collectively underscore the growing convergence between attention design, kernel learning, and explainable AI across diverse biomedical domains.
In addition, Banerjee et al. proposed the Unified Inception–U-Net hybrid gravitational optimization (UIGO) model for liver tumor detection, integrating multi-scale segmentation with feature selection optimization. Such hybrid and attention-enhanced methods exemplify the trend toward unified, interpretable, and computationally efficient architectures in clinical imaging. Together, these works reflect a coherent paradigm shift toward Transformer-based, attention-driven, and kernel-integrated frameworks across cancer, neuroimaging, and pathology domains. Our proposed DGP–MIL framework aligns with this emerging direction by incorporating Gaussian process-based kernel modeling, multi-instance attention, and uncertainty quantification, thus contributing a probabilistic and interpretable perspective to state-of-the-art biomedical imaging research.

Results


Dataset description
The TCGA-CRC dataset consists of samples from 594 colorectal adenocarcinoma (COAD) patients, part of the PanCancer Atlas initiative, aimed at answering major cancer-related questions through comprehensive genomic analysis.
Number of Samples: A total of 594 tumor samples, covering both colon and rectal adenocarcinoma.
Data Types: Includes genomic mutation, copy number variation, DNA methylation, mRNA, and miRNA expression data.
Clinical Information: Provides basic patient information, disease status, tumor classification, and mutation data.
Data Access: The data can be accessed and downloaded via the cBioPortal platform.
Studies have found that colorectal cancer tumor types show high consistency in genomic features, with approximately 16% of samples exhibiting high mutational burden, most of which are associated with microsatellite instability (MSI).
The TCGA-STAD dataset contains molecular data from 295 primary gastric adenocarcinoma (STAD) samples, part of the TCGA initiative, aimed at comprehensively assessing the molecular characteristics of gastric cancer.
Number of Samples: A total of 295 gastric adenocarcinoma samples.
Data Types: Includes genomic mutation, copy number variation, DNA methylation, mRNA, and miRNA expression data.
Subtypes: The study proposed four molecular subtypes: EBV-positive, microsatellite instability (MSI), genomically stable (GS), and chromosomal instability (CIN).
Data Access: The data can be accessed via the Genomic Data Commons (GDC) portal.

Evaluation metrics
In this study, we use the Area Under the Curve (AUC) as the main evaluation metric to assess the performance of the model. AUC is the area under the Receiver Operating Characteristic (ROC) curve, which reflects the model’s performance across different decision thresholds. The closer the AUC value is to 1, the better the model’s performance; conversely, an AUC close to 0.5 indicates that the model performs no better than random guessing.
Specifically, AUC values can be categorized into the following ranges:
AUC ≥ 0.9: Excellent classification ability.
0.7 ≤ AUC < 0.9: Good classification ability, though there is room for improvement.
0.5 ≤ AUC < 0.7: Average classification ability, possibly close to random guessing.
AUC < 0.5: Poor classification performance, even worse than random guessing.
The advantage of using AUC is that it is not affected by class imbalance, making it a widely used evaluation metric in medical image processing and bioinformatics tasks.
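For reference, AUC can be computed directly from its rank interpretation (the Mann-Whitney statistic) without tracing the ROC curve. The sketch below is a generic implementation, independent of the paper's evaluation code.

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC as the probability that a randomly chosen positive is
    scored above a randomly chosen negative (ties count half)."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # -> 0.75
```

Because the statistic depends only on the relative ranking of positives over negatives, it is unchanged by class imbalance or monotone rescaling of the scores, which is exactly the robustness property noted above.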

Comparative experiment
The results of our comparative experiment on the TCGA-CRC dataset demonstrate the performance of various deep learning models, including ResNet, EfficientNet, ShuffleNet, and our proposed model. We evaluated each model using the Area Under the Receiver Operating Characteristic Curve (AUC), a robust metric for classification tasks that effectively captures the trade-off between true positive and false positive rates across various decision thresholds. As shown in Table 1, the performance of each model is presented across different datasets.
Our proposed model outperforms all other models on the TCGA-CRC dataset, achieving an AUC of 0.895. This performance is notably higher than that of ResNet, EfficientNet, and ShuffleNet, which achieve AUC scores of 0.777, 0.791, and 0.784, respectively. While ResNet, a widely-used deep convolutional network, demonstrates respectable performance with an AUC of 0.777, it falls short in capturing the complex patterns present in the TCGA-CRC dataset when compared to more specialized models. EfficientNet, known for its scaling properties and efficiency, performs slightly better than ResNet, yielding an AUC of 0.791. Despite this improvement, it still does not surpass our model in terms of predictive accuracy. ShuffleNet, designed with a focus on computational efficiency, achieves an AUC of 0.784, which, although respectable, fails to match the robustness of our approach.
The superior performance of our model can be attributed to the advanced techniques we employed, including multi-instance aggregation and instance selection, which enhance the model’s ability to handle the inherent complexity of medical image data. These mechanisms allow our model to leverage instance-level features effectively and produce more reliable bag-level predictions, a crucial aspect in the context of weakly-supervised learning with limited labels. The significant improvement in AUC suggests that our approach is better at capturing the subtle distinctions between positive and negative instances, which is often a challenge in medical datasets with high variability and noise.
Overall, the comparative results clearly highlight the effectiveness of our model, which consistently outperforms other models across multiple datasets. This reinforces the validity and potential clinical utility of our approach for colorectal cancer classification, offering not only higher accuracy but also greater reliability in clinical settings. The combination of architectural innovations and advanced aggregation techniques enables our model to achieve state-of-the-art performance, making it a promising tool for further investigation and deployment in medical applications, as shown in Fig. 1.
To ensure that the observed improvements are statistically significant rather than due to random variation, we further conducted comprehensive statistical analyses. For each model, 95% confidence intervals (CIs) of AUC values were computed using 1000 bootstrap resamples. In addition, pairwise comparisons between our proposed DGP–MIL framework and baseline architectures (ResNet, EfficientNet, and ShuffleNet) were performed using the DeLong test for correlated ROC curves. The resulting p-values confirmed that our model significantly outperforms all baselines (p < 0.01 across datasets). We also verified the consistency of these results using one-way ANOVA on repeated runs, which further supported the statistical robustness of the observed performance gains. These analyses demonstrate that the improvements achieved by our method are statistically significant and reproducible across different experimental settings.
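The bootstrap portion of this procedure can be sketched as follows: a percentile bootstrap confidence interval for AUC, resampling (label, score) pairs with replacement. The synthetic data and helper names are illustrative assumptions; the DeLong test is not reproduced here.

```python
import numpy as np

def auc(y, s):
    """Rank-based AUC (Mann-Whitney), ties counted half."""
    pos, neg = s[y == 1], s[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample pairs with replacement,
    recompute AUC, and take empirical quantiles."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, len(y_true), len(y_true))
        yb = y_true[idx]
        if yb.min() == yb.max():          # skip resamples missing a class
            continue
        aucs.append(auc(yb, y_score[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Synthetic example: positives score higher on average than negatives.
rng = np.random.default_rng(42)
y = np.concatenate([np.zeros(100), np.ones(100)]).astype(int)
s = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(1.5, 1.0, 100)])
lo, hi = bootstrap_auc_ci(y, s)           # e.g. an interval around ~0.85
```

Resampling whole (label, score) pairs preserves the empirical class ratio in expectation, which is the standard choice when the test set is a single fixed cohort.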
The table presented in Table 2 provides a detailed comparison of the model performance on the TCGA-STAD dataset, alongside several other related datasets, including Yonsei-Classic, STMary-GC, GC-ICI, and Molecular-subtypes. The performance of each model is evaluated using the Area Under the Curve (AUC) metric, a standard measure in classification tasks that quantifies the model’s ability to differentiate between positive and negative instances at varying thresholds. This comparison involves four deep learning models: ResNet, EfficientNet, ShuffleNet, and our proposed model, each tested across five distinct evaluation sets representing different cohorts and types of gastrointestinal cancer data.
As illustrated in the table, our proposed model outperforms all other models across most datasets. Specifically, on the TCGA-STAD dataset, our model achieves an AUC of 0.794, which is notably higher than the other models, including ResNet (0.766), EfficientNet (0.735), and ShuffleNet (0.773). This demonstrates that our model is better equipped to capture the intricate patterns and heterogeneity in the TCGA-STAD dataset, which is critical for accurately distinguishing between the different cancer classes.
On the Yonsei-Classic dataset, another important gastrointestinal cancer cohort, our model achieves an AUC of 0.742, surpassing ResNet (0.646), EfficientNet (0.663), and ShuffleNet (0.605). This suggests that, while the Yonsei-Classic dataset presents some challenges, our model maintains a higher degree of stability and generalization than the other architectures.
The STMary-GC dataset, which represents gastric cancer data from a different clinical cohort, also highlights the robustness of our model, which achieves an AUC of 0.718. This performance surpasses that of EfficientNet (0.665) and ShuffleNet (0.632), although ResNet (0.769) scores slightly higher on this cohort. Even so, our model’s ability to generalize across different types of cancer data remains evident, further confirming the adaptability of our approach.
When tested on the GC-ICI dataset, which is focused on gastric cancer in the context of immune checkpoint inhibitors, our model achieves an AUC of 0.652, significantly higher than the 0.438 observed for ResNet and the 0.31 obtained by ShuffleNet. EfficientNet performs comparably, with a slightly higher AUC of 0.667. These results nonetheless reinforce the model’s ability to address the specific challenges posed by datasets related to immunotherapy treatment responses, where capturing subtle differences in the data can be particularly difficult.
Finally, the performance of our model on the Molecular-subtypes dataset, which represents a deeper level of genomic and molecular stratification, is striking. Achieving an AUC of 0.944, our model outperforms ResNet (0.822), EfficientNet (0.818), and ShuffleNet (0.673) by a wide margin. This remarkable performance reflects the power of our model to handle highly complex molecular data and to integrate various genomic, transcriptomic, and proteomic features in a way that traditional models fail to do effectively.
In summary, the results in Table 2 demonstrate that our proposed model consistently outperforms the conventional architectures across multiple datasets, with particular strength in handling the TCGA-STAD, GC-ICI, and Molecular-subtypes datasets. This superior performance across diverse colorectal cancer and gastric cancer datasets illustrates the robustness and generalizability of our approach, underscoring its potential for wide-scale deployment in clinical practice, where the ability to handle a variety of data sources and predict cancer outcomes with high accuracy is crucial. As shown in Fig. 2, the visual representation of model performance across different datasets further underscores the robustness and superiority of our approach.
To further evaluate the cross-dataset generalizability of the proposed framework, we additionally benchmarked the model on the TCGA-STAD cohort (gastric cancer) and several publicly available external datasets, including Yonsei-Classic, STMary-GC, GC-ICI, and the Molecular-Subtype dataset. This experiment was designed to test whether a model trained on colorectal cancer whole-slide images could adapt to histologically related gastrointestinal cohorts under comparable preprocessing and inference pipelines.
As summarized in Table 2, the proposed DGP–MIL framework achieved the highest AUC on four out of five evaluation sets, with particularly strong performance on the TCGA-STAD (0.794) and Molecular-subtypes (0.944) cohorts. These results demonstrate that the probabilistic instance weighting and kernel-based uncertainty modeling employed by DGP–MIL enable superior transferability across related domains compared with conventional CNN architectures such as ResNet, EfficientNet, and ShuffleNet. This cross-dataset evaluation reinforces the robustness and scalability of our framework in handling domain shifts between different gastrointestinal pathology datasets.
In addition to the Area Under the ROC Curve (AUC), we further evaluated the proposed DGP–MIL and baseline models using complementary metrics that better reflect clinical decision reliability, including accuracy, recall (sensitivity), precision, F1-score, and calibration quality measured by the Expected Calibration Error (ECE). These additional evaluations provide a more comprehensive assessment of classification performance and reliability across different operating thresholds.
Table 3 summarizes the results on the TCGA–CRC dataset. The DGP–MIL framework achieves an accuracy of 0.872, recall of 0.865, F1-score of 0.868, and an ECE of 0.021, indicating that the predicted MSI probabilities are well-calibrated. By comparison, TransMIL and CLAM show higher calibration error (ECE > 0.05) and slightly lower F1-scores, suggesting that the probabilistic kernel modeling in DGP–MIL contributes to both improved discrimination and better confidence estimation. Similar trends were observed across the external gastric cohorts (Table 4), confirming that the model maintains balanced precision-recall behavior and well-calibrated probability outputs across domains.
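A minimal implementation of the Expected Calibration Error used above might look as follows. Equal-width confidence bins for a binary classifier are assumed here, since the paper does not specify its binning scheme.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE for binary classification: bin predictions by confidence
    and average |accuracy - mean confidence|, weighted by bin size."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    pred = (y_prob >= 0.5).astype(int)
    conf = np.where(pred == 1, y_prob, 1.0 - y_prob)   # confidence in the predicted class
    edges = np.linspace(0.5, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # The last bin is closed on the right so that conf == 1.0 is counted.
        mask = (conf >= lo) & (conf < hi) if hi < 1.0 else (conf >= lo)
        if mask.any():
            acc = (pred[mask] == y_true[mask]).mean()
            ece += mask.mean() * abs(acc - conf[mask].mean())
    return ece
```

A confident, always-correct classifier yields an ECE of 0, while systematic overconfidence (high confidence, middling accuracy) drives it up, which is why a low ECE such as the 0.021 reported here indicates well-calibrated probabilities.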
Overall, the expanded quantitative analysis demonstrates that DGP–MIL not only achieves the highest discrimination (AUC) but also maintains superior calibration and balanced sensitivity-specificity trade-offs, which are critical for clinically trustworthy decision-making.
To further compare the proposed framework with state-of-the-art multiple instance learning approaches, we additionally included CLAM and TransMIL as MIL-based baselines on the TCGA-CRC dataset. Both methods were implemented using their publicly available code and pre-trained feature extractors, and were trained under the same patch sampling, feature extraction, and optimization settings as our model to ensure a fair comparison. CLAM represents a clustering-constrained attention MIL paradigm, whereas TransMIL adopts a Transformer-based correlated MIL formulation with self-attention over instances.
As summarized in Table 5, all MIL-based methods achieve strong performance on MSI prediction. DGP–MIL attains the highest AUC (0.895 ± 0.007), slightly outperforming TransMIL (0.887 ± 0.010) and CLAM (0.879 ± 0.011), while also providing well-calibrated uncertainty estimates. These results indicate that our probabilistic deep Gaussian process formulation is competitive with, and in some cases superior to, existing attention- and Transformer-based MIL architectures, while additionally offering principled uncertainty quantification and interpretability at the instance level.
To statistically validate the superiority of the proposed DGP–MIL framework over competing methods, we performed hypothesis testing and confidence interval estimation for all AUC results. For each experiment, 95% confidence intervals (CIs) were computed using non-parametric bootstrap resampling with 1000 iterations. Pairwise statistical comparisons between DGP–MIL and baseline models (ResNet, EfficientNet, CLAM, TransMIL) were conducted using the DeLong test for correlated ROC curves.
The results confirmed that the performance improvements achieved by DGP–MIL are statistically significant (p < 0.01 for all comparisons) across the TCGA–CRC and TCGA–STAD cohorts. For instance, the 95% CI of the AUC for DGP–MIL on TCGA–CRC was [0.883, 0.907], compared with [0.832, 0.857] for ResNet and [0.869, 0.883] for TransMIL. These findings indicate that the observed gains are unlikely due to random fluctuations and demonstrate the robustness of our probabilistic MIL approach.
To further verify overall consistency, we also conducted a one-way ANOVA across all methods and training regimes. The F-statistic was significant at the 0.01 level, supporting that DGP–MIL achieves consistently higher mean AUCs across repeated runs. This comprehensive statistical validation provides strong evidence that the proposed method achieves reliable and reproducible performance improvements beyond conventional CNN- and Transformer-based MIL baselines.

Ablation study
To make the ablation analysis more interpretable, we explicitly rename the components previously denoted as “Feature X” and “Feature Y”. In our framework, Feature X corresponds to the additional random-feature embedding used for deep Gaussian process kernel approximation, and Feature Y corresponds to the DGP-based latent representation refinement on top of the CNN features. For clarity, we now refer to these components as RF embeddings and DGP refinement, respectively, and we interpret the ablation results in an incremental fashion: starting from a simplified MIL model and progressively adding aggregation, attention, RF embeddings, and DGP refinement until the full DGP–MIL architecture is recovered.
Table 6 summarizes the ablation results for different model variants on colorectal cohorts. We can interpret these configurations incrementally. Starting from the variant without MIL aggregation (Row “- MIL aggregation”), the model relies on simple global pooling and achieves the lowest AUC across datasets, indicating that structured bag-level aggregation is essential for robust multiple-instance learning. Introducing MIL aggregation and attention (Row “- RF embeddings” and Row “- attention”) consistently improves performance, showing that instance-level weighting is beneficial beyond naive pooling. Next, adding RF embeddings (Row “- DGP refinement”) further enhances performance, confirming that kernelized random features provide complementary representation power over raw CNN descriptors. Finally, the full DGP–MIL model, which combines aggregation, attention, RF embeddings, and DGP refinement, achieves the highest AUC on all colorectal cohorts. This stepwise interpretation clarifies how each component contributes to the final performance, in contrast to the earlier, less transparent “Feature X/Feature Y” notation.
The ablation results on TCGA-STAD and related gastric cohorts in Table 7 exhibit a similar incremental pattern. The variant without MIL aggregation again yields the lowest performance, highlighting the central role of structured instance aggregation in capturing slide-level disease patterns. Enabling attention improves AUC across all STAD-related datasets, indicating that focusing on a subset of informative tiles is advantageous. Adding RF embeddings and DGP refinement on top of the attention-based MIL backbone leads to further gains, and the complete DGP–MIL configuration achieves the best AUC on every gastric cohort. Together with the radar plots in Figs. 3 and 4, these results demonstrate that each component—aggregation, attention, RF embeddings, and DGP refinement—contributes meaningfully to the final performance, and that the full model architecture is empirically justified.

Visualization and evidence localization
As illustrated in Fig. 5, a representative TCGA-CRC patch was visualized to demonstrate the interpretability of the proposed evidence-localization procedure. The original H&E image (panel a) shows the native histomorphology, while the hematoxylin channel (panel b) enhances nuclear distribution, making regions of epithelial crowding and glandular stratification more prominent. When the attention-like heatmap based on nuclei density and texture response is overlaid (panel c), the highlighted foci are concentrated along epithelial-luminal interfaces, within tightly packed or atypical glands, and at epithelial-stromal borders. These patterns correspond to morphologically plausible diagnostic cues, such as nuclear hyperchromasia, glandular complexity, and interface irregularities. This indicates that the model’s instance-level weighting prioritizes histologically meaningful structures, thereby offering interpretable support for bag-level decision making.
To provide a more detailed pathologic interpretation of the attention heatmaps, we analyzed the high-weighted (“hot”) regions with reference to specific histologic tissue characteristics. Across colorectal cancer (CRC) slides, areas with the strongest attention responses consistently correspond to the epithelial component of the tumor, particularly at glandular luminal borders, invasive fronts, and epithelial-stromal interfaces. Within these regions, the highlighted foci frequently display nuclear hyperchromasia, glandular architectural irregularity, cribriform patterns, and lymphocytic infiltration—all recognized morphologic correlates of microsatellite instability (MSI) in CRC. In contrast, areas assigned lower attention typically correspond to fibrotic stroma, muscularis propria, or well-differentiated glands with low nuclear atypia, features commonly associated with microsatellite stable (MSS) phenotypes.
The correspondence between attention “hot areas” and histopathologic structures was reviewed and qualitatively verified by an experienced gastrointestinal pathologist. This expert validation confirmed that the model’s high-attention zones align with diagnostically relevant regions, suggesting that the network’s feature weighting mirrors human interpretive focus in histopathologic examination. These observations reinforce that the proposed DGP–MIL framework captures biologically meaningful morphologic cues rather than artifactual color or texture variations, thus supporting its potential for clinical interpretability and integration into digital pathology workflows.
As shown in Fig. 6, the Top-K (here K = 8) peaks extracted from the attention-style map further summarize the spatial distribution of high-weight evidence. The selected foci predominantly lie along glandular contours and epithelial segments with dense nuclei, with additional peaks near transition zones where tissue architecture changes. Such localization provides compact, reviewable “evidence tiles” that can be audited by pathologists, supports tile-level attribution without pixel-level annotations, and aligns with a multiple-instance learning formulation in which a small subset of informative instances drives the bag-level prediction. Together with Fig. 5, these qualitative results substantiate that the proposed visualization pipeline highlights histomorphologic regions likely to contribute most to the decision, thereby enhancing the interpretability of the model’s predictions.
As illustrated in Fig. 7, a representative TCGA-STAD patch was used to qualitatively examine the proposed visualization procedure. The original H&E image (panel a) provides the baseline tissue morphology, while the hematoxylin channel derived from HED decomposition (panel b) highlights nuclear enrichment, with stronger signals observed in glandular epithelium and crypt bases. When the attention-like heatmap combining nuclei density and texture response is superimposed onto the patch (panel c), the high-weight regions are predominantly localized at epithelial-luminal interfaces, within crowded or structurally disorganized glands, and along tumor-stroma transition zones and invasive fronts. These localized patterns correspond well to morphologically discriminative cues recognized in histopathology, such as nuclear crowding and glandular architectural complexity. This indicates that the model assigns greater weight to diagnostically informative structures rather than uniformly sampling across the tissue, thereby offering interpretable evidence to support bag-level predictions.
As shown in Fig. 8, the top-K (K = 8) high-weight regions derived from the attention-like map highlight the specific tissue areas most influential for the model’s decision. These regions are predominantly localized at nuclear-dense epithelial fragments, along glandular contours, and at tumor-stroma interfaces, whereas background and muscle layers exhibit minimal responses. This spatial distribution suggests that the model’s predictions are largely driven by a limited subset of highly informative instances, which is consistent with the assumptions of multiple instance learning. Moreover, by isolating these evidence slices, the visualization provides interpretable cues that facilitate human verification and enable cross-validation of qualitative insights with quantitative results.
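The top-K peak extraction described above can be sketched as follows; this is a minimal stand-in (not the authors' implementation), using a maximum filter as a simple non-maximum suppression, and the function name `topk_peaks` is illustrative.

```python
import numpy as np
from scipy import ndimage

def topk_peaks(heatmap, k=8, nms_size=5):
    """Return (row, col) coordinates of the k strongest local maxima,
    using a maximum filter as a simple non-maximum suppression."""
    is_peak = heatmap == ndimage.maximum_filter(heatmap, size=nms_size)
    scores = np.where(is_peak, heatmap, -np.inf)
    order = np.argsort(scores.ravel())[::-1][:k]
    return np.column_stack(np.unravel_index(order, heatmap.shape))

# Toy attention map with two isolated hot spots.
hm = np.zeros((32, 32))
hm[8, 8], hm[24, 20] = 1.0, 0.9
peaks = topk_peaks(hm, k=2)
```

On a real attention map, the returned coordinates would index the "evidence tiles" surfaced for pathologist review.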
Beyond visual heatmaps, we further quantified feature importance to better explain why specific tiles contribute strongly to model decisions. For each tile, attribution scores were estimated by combining its learned attention weight with the gradient of the slide-level logit with respect to the tile embedding (gradient × input). Tiles exhibiting high attribution values consistently corresponded to morphologically salient regions such as epithelial-stromal interfaces and densely packed glandular areas, confirming that the model relies on histopathologically meaningful cues.
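The attribution scheme above (attention weight combined with gradient × input on the tile embedding) can be sketched as follows; the embedding dimension, head, and attention weights are synthetic stand-ins, not the trained model.

```python
import torch

torch.manual_seed(0)
n_tiles, dim = 12, 64
emb = torch.randn(n_tiles, dim, requires_grad=True)   # tile embeddings (stand-in)
att = torch.softmax(torch.randn(n_tiles), dim=0)      # stand-in attention weights
head = torch.nn.Linear(dim, 1)                        # stand-in slide-level head

# Slide-level logit from the attention-weighted bag embedding.
slide_logit = head((att.unsqueeze(1) * emb).sum(dim=0)).squeeze()
slide_logit.backward()

# Gradient x input per tile, combined with the learned attention weight.
grad_x_input = (emb.grad * emb.detach()).sum(dim=1)
attribution = att.detach() * grad_x_input
```

Tiles with large (absolute) attribution values are the ones whose embeddings most influence the slide-level decision.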
In addition, the explainability analysis was extended by integrating model uncertainty measures described in Section Inference, Calibration, and Uncertainty. We visualized both aleatoric (data-driven) and epistemic (model-related) uncertainties at the tile level, revealing that high-uncertainty areas often coincided with ambiguous tissue boundaries or poor staining quality. This joint visualization of attribution and uncertainty provides a more comprehensive interpretation of model confidence and potential failure modes.
Finally, from a clinical perspective, such interpretable visual and quantitative evidence can enhance trust among pathologists and facilitate the model’s potential deployment in diagnostic workflows. By explicitly showing which tissue regions drive predictions and how confident the model is in those decisions, the proposed DGP–MIL framework aligns with the principles of transparent and accountable AI for clinical practice.
From a pathologist-centered perspective, the interpretability analysis reveals that the DGP–MIL framework bases its predictions on morphologic cues that align closely with histopathologic knowledge of microsatellite instability (MSI) and microsatellite stability (MSS). In MSI-predicted slides, the model consistently assigns high attention weights to regions exhibiting glandular architectural distortion, irregular epithelial-stromal interfaces, nuclear pleomorphism, and dense lymphocytic infiltration—features that are well recognized as correlates of MSI-related hypermutation and immune activation. Conversely, MSS-predicted slides show attention focused on well-organized glandular structures, lower nuclear atypia, and sparse inflammatory infiltrates, reflecting morphologic patterns associated with genomic stability.
These pathologically interpretable differences indicate that the proposed framework does not rely on spurious texture or color artifacts but rather captures biologically meaningful tissue architecture and cellular context. Such findings bridge the gap between algorithmic prediction and diagnostic reasoning, facilitating potential clinical adoption by providing human-understandable explanations consistent with established histopathologic criteria for MSI detection.

Qualitative results
As shown in Fig. 9, we first visualize representative H&E tiles and their intensity-based binary masks to show the morphological correspondence exploited by our pipeline (elongate epithelial ridges, stromal strands, and erythrocyte-filled lumina).
As shown in Fig. 10, we further overlay semi-transparent masks on the same tiles so that tissue context is preserved while contiguous epithelial walls, stromal septa, and vessel lumina are highlighted for visual inspection.
As shown in Fig. 11, we estimate a proxy instance score from the hematoxylin channel of H&E tiles (via HED color deconvolution) to approximate nuclear density, and overlay a semi-transparent red heatmap on six representative tiles to indicate attention intensity; darker red denotes higher instance score/attention, typically along glandular epithelium, luminal borders, and structurally disrupted regions.
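The hematoxylin-channel proxy score can be sketched with a plain NumPy stain deconvolution; the Ruifrok–Johnston stain vectors used below are the standard defaults and are assumed to match those used in the pipeline, and `hematoxylin_score` is an illustrative name.

```python
import numpy as np

# Ruifrok-Johnston stain vectors (rows: hematoxylin, eosin, DAB),
# row-normalized; assumed to match the HED deconvolution in the pipeline.
M = np.array([[0.65, 0.70, 0.29],
              [0.07, 0.99, 0.11],
              [0.27, 0.57, 0.78]])
M /= np.linalg.norm(M, axis=1, keepdims=True)
M_inv = np.linalg.inv(M)

def hematoxylin_score(rgb):
    """rgb in [0, 1]; returns the per-pixel hematoxylin concentration
    used as a proxy instance score for nuclear density."""
    od = -np.log10(np.clip(rgb, 1e-6, 1.0))   # optical density (Beer-Lambert)
    return (od @ M_inv)[..., 0]

# A pixel synthesized with unit hematoxylin absorbance recovers a score of ~1.
pure_h = 10.0 ** (-M[0])
```

Applying `hematoxylin_score` tile-wise and smoothing the result yields the attention-style heatmaps overlaid in the figures.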
As shown in Fig. 12, we highlight the core regions used for the MIL decision by displaying a top-k (k = 15%) attention-mask overlay that retains only the most attended areas within each tile, approximating the subset of instances that contribute most to slide-level aggregation and facilitating comparison with histologic patterns.

Cross-validation analysis
To further evaluate the robustness and generalizability of the proposed DGP–MIL framework, we additionally conducted a 5-fold cross-validation experiment on the TCGA-CRC dataset. The dataset was randomly partitioned into five non-overlapping folds with an approximately equal class distribution. In each iteration, four folds were used for training and one for testing, ensuring that every sample was tested exactly once.
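The fold construction described above can be sketched as a small class-stratified splitter; this is an illustrative implementation (the paper may well use a library utility instead), and `stratified_kfold` is a hypothetical helper name.

```python
import numpy as np

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with roughly equal class
    distribution per fold; every sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for c in np.unique(labels):
        # Shuffle each class and deal its indices round-robin across folds.
        for i, j in enumerate(rng.permutation(np.where(labels == c)[0])):
            folds[i % k].append(int(j))
    for i in range(k):
        test = np.array(sorted(folds[i]))
        train = np.array(sorted(set(range(len(labels))) - set(folds[i])))
        yield train, test

labels = np.array([0] * 60 + [1] * 40)   # toy slide-level labels
splits = list(stratified_kfold(labels, k=5))
```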
The mean AUC obtained across the five folds was 0.892 ± 0.007, which is consistent with the single split result (AUC = 0.895). This indicates that the model’s performance is stable across different data partitions and not overly dependent on a particular train-test split.
These results demonstrate that the proposed method generalizes well and maintains high predictive accuracy under different data sampling conditions, confirming its robustness and potential applicability to unseen cohorts.

Case studies and feature-level analysis
To provide a more intuitive understanding of model behavior, we conducted qualitative case studies that showcase both correctly classified and misclassified whole-slide images. For each representative case, we visualized the corresponding attention-based heatmaps and extracted tile-level feature embeddings from the penultimate layer of the DGP–MIL framework.
As illustrated in the representative examples, correctly predicted cases typically exhibited high-confidence attention responses concentrated in diagnostically relevant epithelial and glandular regions, where the model captured well-defined morphological cues. In contrast, misclassified slides often displayed dispersed or ambiguous attention patterns, with high attribution assigned to tiles containing artifacts, necrotic regions, or low-quality staining. Feature-space projections using t-SNE further revealed that tiles from correctly classified slides clustered tightly within their respective class manifolds, whereas misclassified instances appeared near class boundaries, suggesting feature ambiguity.
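The feature-space projection step can be sketched as follows; the embeddings here are synthetic two-cluster stand-ins for the penultimate-layer features, so only the procedure (not the data) reflects the analysis above.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic tile embeddings: two clusters standing in for MSI-like and
# MSS-like penultimate-layer features.
emb = np.vstack([rng.normal(0.0, 1.0, (30, 64)),
                 rng.normal(4.0, 1.0, (30, 64))])

# 2-D t-SNE projection for visual inspection of class manifolds.
proj = TSNE(n_components=2, perplexity=10.0, init="pca",
            random_state=0).fit_transform(emb)
```

Plotting `proj` colored by predicted class would reproduce the qualitative picture described: tight class clusters for correct slides, boundary-straddling points for ambiguous ones.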
This comparative analysis highlights both the strengths and limitations of the proposed framework: the model excels at recognizing typical morphological signatures of MSI-positive and MSI-negative tissues, but can be confounded by heterogeneous or borderline samples. Such case-level analysis provides valuable diagnostic insight, helping to identify potential sources of model uncertainty and informing future refinements in data curation and model design.

Low-sample regime analysis
To assess the risk of overfitting and the behavior of the proposed framework under limited data availability, we further conducted a low-sample regime analysis on the TCGA-CRC dataset. Specifically, we simulated reduced training conditions by randomly subsampling 25%, 50%, and 75% of the original training slides, while keeping the validation and test sets fixed. For each sampling ratio, we repeated the experiment with three different random seeds and reported the average AUC and standard deviation across runs.
The results, summarized in Table 8, indicate that the performance of DGP–MIL degrades in a gradual and monotonic manner as the amount of training data decreases, rather than collapsing abruptly. In particular, the model retains competitive AUC even when trained with only a fraction of the available slides, suggesting that the combination of multi-instance pooling and Gaussian process-based kernel modeling provides an effective form of implicit regularization. Compared with conventional CNN baselines (e.g., ResNet, EfficientNet), DGP–MIL exhibits a smaller performance drop in the 25% and 50% regimes, highlighting its robustness in low-data settings.
We also observed that explicit regularization strategies, such as dropout, weight decay, extensive data augmentation, and early stopping based on validation loss, further mitigate overfitting in the low-sample regimes. Together, these findings suggest that the proposed framework is less prone to severe overfitting and maintains graceful degradation when training data are scarce, which is particularly relevant for institutions with limited annotated whole-slide cohorts.
To further strengthen model robustness and assess its generalization in a black-box manner, we also adopted evaluation strategies that mimic blind or hidden-feature testing. Specifically, all external datasets were used exclusively for inference without any exposure during model training or hyperparameter tuning, thereby simulating an independent blind testing scenario. In addition, the feature extraction backbone and hyperparameter settings remained fixed across datasets, ensuring that the model encountered novel tissue distributions and scanner variations during evaluation.
These black-box testing steps confirm that the proposed DGP–MIL framework maintains strong predictive performance even under unseen data distributions, demonstrating genuine generalization rather than dataset-specific fitting. Future extensions will explore additional forms of hidden-feature testing—such as partially masked clinical attributes or synthetic perturbations—to further verify the model’s stability under black-box validation regimes.
While the experimental framework primarily relies on the TCGA–CRC and TCGA–STAD cohorts, we recognize that these datasets, although large and publicly standardized, may not fully capture the variability encountered in real-world, multi-center clinical environments. In particular, differences in scanner calibration, staining protocols, and patient demographics can lead to color and texture shifts that challenge model robustness.
To partially address this issue, the proposed DGP–MIL framework was evaluated on several independent cohorts (Yonsei, STMary, GC–ICI, and Molecular-subtype datasets), which originate from distinct institutions and imaging pipelines, thereby introducing inter-center variation at both the staining and acquisition levels. The consistent performance of our model across these datasets supports its ability to generalize under moderate domain shifts. Nevertheless, we acknowledge that larger-scale, multi-center validation incorporating diverse color profiles and imaging scanners will be necessary to further substantiate real-world applicability. Future work will therefore include color-variant augmentation, stain normalization, and domain adaptation across multiple clinical centers to enhance robustness and translational potential.

Discussion
The experimental results demonstrate the superior performance of our proposed model in predicting colorectal cancer, especially when compared to existing architectures such as ResNet, EfficientNet, and ShuffleNet. Our model consistently achieved higher AUC scores across multiple datasets, underscoring its robustness in addressing the inherent complexity and heterogeneity of medical image data.
We outlined the significant contributions of our work, highlighting the use of deep Gaussian processes with random feature expansion (DGP-RF) and multi-instance learning (MIL). These innovations enhance our model’s ability to efficiently handle weakly supervised learning with large-scale medical image data. Our method is particularly valuable for colorectal cancer classification, where it not only improves classification accuracy but also offers potential for clinical deployment by providing more reliable predictions and interpretability.
Looking to the future, we aim to explore further refinements in our approach, including the integration of more advanced attention mechanisms and enhanced uncertainty calibration techniques. Additionally, we plan to investigate how our model can be adapted to other medical imaging tasks, broadening its clinical applicability. Through these efforts, we hope to push the boundaries of precision medicine by developing more efficient and scalable models for early disease detection and diagnosis.
Despite these promising results, several limitations should be acknowledged. First, the model is trained on publicly available datasets that may contain inherent sampling and demographic biases, which could limit its generalizability to broader clinical populations. Second, the weakly supervised nature of the framework—where only slide-level labels are available—may not fully capture fine-grained histological variations. Third, whole-slide image variability arising from different scanners, staining protocols, and tissue preparation procedures remains a practical challenge that can affect cross-site consistency. For clinical validation, further steps are necessary before deployment. Future work should include prospective studies and multi-center clinical trials to assess reproducibility across institutions, as well as integration with pathologists’ diagnostic workflows to ensure practical usability. Quantitative user studies involving human-AI collaboration could also help evaluate interpretability, decision confidence, and diagnostic efficiency in real-world settings.
Beyond colorectal cancer, the proposed DGP–MIL framework possesses strong potential for adaptation across a broad range of biomedical imaging domains. Because the model is fundamentally built on a multi-instance learning paradigm coupled with Gaussian process-based uncertainty estimation, it can naturally extend to other histopathological and non-histopathological tasks such as breast, lung, and gastric cancers, as well as inflammatory and infectious diseases. By treating each patient sample or imaging study as a collection of heterogeneous instances, the framework can accommodate diverse input modalities including H&E slides, immunohistochemistry images, and even radiological scans. The interpretability and uncertainty quantification components of our method are particularly relevant in these scenarios. Attribution maps and uncertainty overlays provide clinicians with visual and quantitative cues about where and how the model makes decisions, thereby facilitating cross-domain transparency and aiding diagnostic reasoning. In complex or high-stakes applications—such as early tumor screening or grading ambiguous lesions—these interpretable and calibrated predictions can assist clinical experts in assessing model reliability and guiding further review. Therefore, the proposed DGP–MIL framework not only advances colorectal cancer prediction but also establishes a generalizable foundation for trustworthy AI in biomedical image analysis, with potential to accelerate precision diagnostics across multiple clinical specialties.
Moreover, additional improvements could be achieved by incorporating active learning strategies to prioritize ambiguous or uncertain cases for expert review, and by adopting domain adaptation techniques to enhance robustness under distribution shifts. Combining these advances with rigorous uncertainty quantification and continual learning will be essential to bridge the gap between computational prediction and reliable clinical decision support.

Methods

Overall structure
As illustrated in Fig. 13, the proposed weakly supervised pipeline operates on whole-slide images (WSIs) and proceeds through four sequential stages. First, the WSI is partitioned into non-overlapping tiles at one or multiple magnifications in order to capture both global tissue context and fine-grained cellular morphology. Each tile is normalized and then processed by a lightweight encoder (e.g., a ResNet-18 without its classification head) to obtain fixed-length feature representations. These encoded features are subsequently fed into a deep Gaussian process with random-feature expansion (DGP-RF), which approximates complex kernel mappings through stacked random Fourier projections and shallow linear transformations with dropout. This module produces two complementary outputs for each tile in a single forward pass: a discriminative logit indicating class support and a learnable importance weight that quantifies the tile’s contribution to the final decision. Instance-level outputs are then aggregated by a core processing layer that normalizes weights and optionally applies probability-proportional subsampling or Top-k/ratio pruning, thereby reducing noise and computational burden while retaining the most informative regions. At the slide level, the normalized instance scores are combined to yield a single logit, which is subsequently calibrated via temperature scaling to produce well-calibrated probabilities. Importantly, instance scores and weights are preserved during inference, enabling the generation of interpretable heatmaps that highlight diagnostically relevant tissue regions. Overall, this design tightly couples multi-scale feature encoding, deep kernelized representation learning, and weight-based multiple-instance aggregation within a unified and computationally efficient framework.
The proposed pipeline starts with the partitioning of the whole-slide image (WSI) into non-overlapping tiles at one or multiple magnifications. This tiling strategy allows the system to simultaneously capture large-scale tissue organization and fine-grained cellular morphology. Each tile is normalized and then passed through a lightweight feature extractor—by default, a ResNet-18 with its classification head removed. The encoder converts each tile into a fixed-length embedding that preserves discriminative information, and the design remains flexible, allowing other pre-trained or frozen backbones to be substituted without affecting subsequent stages.
The encoded tile embeddings are then processed by a deep Gaussian process with random-feature expansion (DGP-RF). This module approximates non-linear kernel mappings by stacking random Fourier projections with shallow linear transformations, combined with dropout for regularization. In this way, it provides scalable deep kernelized representations capable of modeling complex tissue heterogeneity. Importantly, a single forward pass produces two complementary outputs for each tile: a discriminative logit reflecting how strongly the tile supports the positive class, and a learnable importance weight that quantifies the tile’s relative contribution to the slide-level decision.
To further refine instance-level representations, the framework incorporates mechanisms for weight normalization and sparsification. Normalized importance scores ensure comparability across slides, while lightweight filtering strategies—such as probability-proportional subsampling and Top-k or Top-ratio pruning—are optionally applied. These procedures serve two purposes: reducing computational overhead when slides contain a very large number of tiles, and suppressing background or noisy regions that could otherwise dilute predictive accuracy. By retaining only the most informative tiles, the model is better able to handle structural heterogeneity across different histopathological cohorts.
Finally, the normalized tile weights are used to aggregate individual logits into a unified slide-level score. This aggregated score is calibrated through temperature scaling, which adjusts the raw outputs into well-calibrated probabilities aligned with held-out validation data. The calibration step improves cross-cohort consistency and enhances the reliability of probabilistic predictions. Moreover, both instance-level logits and weights are preserved at inference time, enabling the generation of interpretable heatmaps that highlight diagnostically relevant regions. Such visualizations provide pathologists with transparent, human-interpretable evidence supporting the model’s predictions, thereby improving both clinical trust and diagnostic utility.
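The temperature-scaling step can be sketched as follows; this is a minimal illustration on synthetic validation logits, not the authors' implementation, and `fit_temperature` is an illustrative name.

```python
import torch

def fit_temperature(logits, labels, max_iter=100):
    """Learn a scalar temperature T on held-out validation logits by
    minimizing the binary NLL of sigmoid(logits / T)."""
    log_t = torch.zeros(1, requires_grad=True)   # T = exp(log_t) stays positive
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)
    bce = torch.nn.BCEWithLogitsLoss()

    def closure():
        opt.zero_grad()
        loss = bce(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()

# Overconfident toy validation logits (one confident call is wrong).
val_logits = torch.tensor([6.0, -5.0, 4.0, -6.0, 5.0, -4.0])
val_labels = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 0.0])
T = fit_temperature(val_logits, val_labels)

bce = torch.nn.BCEWithLogitsLoss()
nll_before = bce(val_logits, val_labels).item()
nll_after = bce(val_logits / T, val_labels).item()
```

Since the search starts at T = 1, the calibrated NLL can only match or improve on the uncalibrated one.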

Multiple instance learning
We consider the bag-level binary classification setting with dataset $\mathcal{D} = \{(X_n, y_n)\}_{n=1}^{M}$, where the $n$-th sample (bag) consists of a variable-length collection of instances $X_n = \{x_{n,i}\}_{i=1}^{N_n}$ with $x_{n,i} \in \mathbb{R}^{D}$ and $y_n \in \{0, 1\}$. In implementation, each bag is organized as an instance-feature matrix $X_n \in \mathbb{R}^{N_n \times D}$, corresponding to data_mat[idx] for the bag index idx, while the instance count $N_n$ is given by Nis[idx]. The total number of instances across all bags is $N_{\mathrm{tot}} = \sum_{n=1}^{M} N_n$, which is relevant for memory planning and batching. For a loss-compatible and numerically stable interface, we map binary labels to bipolar form via $Y_n = 2y_n - 1 \in \{-1, +1\}$, and collect $\mathbf{y} = (y_1, \dots, y_M)$ and $\mathbf{Y} = (Y_1, \dots, Y_M)$ for convenience. This setup bridges data representation, code interfaces, and statistical learning so that optimization objectives and stability measures can be expressed coherently.
A central challenge in MIL is the variable-length structure of bags together with unobserved instance-level supervision. Let $\mathcal{I}_n = \{1, \dots, N_n\}$ be the index set for bag $n$ and $N = (N_1, \dots, N_M)$ the vector of instance counts across bags. Computationally, we adopt a “bag-wise vertical concatenation” strategy: instances from all bags in a mini-batch are concatenated into a single tensor $X_{\mathrm{cat}} \in \mathbb{R}^{N_{\mathrm{tot}} \times D}$ while $N$ preserves bag boundaries for subsequent splitting and aggregation. In practice, this is realized by segmenting $X_{\mathrm{cat}}$ with torch.split according to Nis, applying per-bag normalization, and re-concatenating outputs at the batch level. The approach enables efficient matrix/vector operations without losing the semantics of independent bags. Because $N_n$ may vary substantially across bags, we explicitly introduce a small $\varepsilon > 0$ for numerical stability in normalization and probability computations.
Regarding supervision assumptions, classical MIL adopts the existence-based view: a positive bag ($y_n = 1$) contains at least one positive instance, whereas a negative bag ($y_n = 0$) contains none. If $z_{n,i} \in \{0, 1\}$ denotes a latent instance label, the assumption can be written as $y_n = \max_{i \in \mathcal{I}_n} z_{n,i}$. However, in pathological imaging and broader computational medicine, positive evidence is often collective or distributed; a single instance is rarely decisive for bag-level risk. We therefore adopt a more general collective assumption: there exists a learnable aggregation mapping $g$ such that $p(y_n = 1 \mid X_n) = \sigma\big(g(X_n)\big)$, where $\sigma(\cdot)$ denotes the logistic sigmoid. This formulation allows bag-level decisions to synthesize evidence across all instances and subsumes the existence-based case as a limiting scenario; see Fig. 14.
To instantiate the collective decision rule, we employ an instance-level logit function $f: \mathbb{R}^{D} \to \mathbb{R}$ together with a nonnegative weight (attention) function $w: \mathbb{R}^{D} \to \mathbb{R}_{\ge 0}$. We define the per-bag normalized weights $\tilde{w}_{n,i} = \big(w(x_{n,i}) + \varepsilon\big) / \sum_{j \in \mathcal{I}_n} \big(w(x_{n,j}) + \varepsilon\big)$ and aggregate instance scores by $F_n = \sum_{i \in \mathcal{I}_n} \tilde{w}_{n,i}\, f(x_{n,i})$ with $p_n = \sigma(a F_n)$, where $a$ is a logit scaling factor (often set to $a = 2.0$ in code). When $\tilde{w}_{n,i} \approx 1$ for a single instance and near zero otherwise, the model reduces to the “max-instance” hypothesis; when $\tilde{w}_{n,i} = 1/N_n$, it becomes mean pooling. Hence, the weighted aggregation unifies existence-based and attention-driven collective viewpoints and lets learning interpolate between them as dictated by data; the overall pipeline is illustrated in Fig. 15.
For vectorized computation, let $\mathbf{f}_n = \big(f(x_{n,1}), \dots, f(x_{n,N_n})\big)^{\top}$ and $\tilde{\mathbf{w}}_n = (\tilde{w}_{n,1}, \dots, \tilde{w}_{n,N_n})^{\top}$. The bag-level aggregation becomes $F_n = \tilde{\mathbf{w}}_n^{\top} \mathbf{f}_n$. Across a batch of $B$ bags, we write $\mathbf{F} = (F_1, \dots, F_B)^{\top}$ and $\mathbf{p} = \sigma(a\mathbf{F})$. In practice, a concatenated instance tensor is split into bag-wise segments using Nis with torch.split, each segment is normalized and aggregated, and results are concatenated back into batch-level outputs. Because bags may exhibit wide variability in $N_n$, the $\varepsilon$-stabilized normalization and safe logistic computations are essential to avoid numerical under/overflow; this also motivates adopting log-sum-exp style safeguards where appropriate.
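This split-normalize-aggregate pattern can be sketched as follows; `aggregate_bags` is an illustrative helper, not the authors' code, though it follows the torch.split usage described in the text.

```python
import torch

def aggregate_bags(inst_logits, inst_weights, Nis, a=2.0, eps=1e-8):
    """Split concatenated instance outputs back into bags with torch.split,
    normalize the nonnegative weights per bag, and aggregate."""
    F = []
    for f_n, w_n in zip(torch.split(inst_logits, Nis),
                        torch.split(inst_weights, Nis)):
        w_tilde = (w_n + eps) / (w_n + eps).sum()   # eps-stabilized normalization
        F.append((w_tilde * f_n).sum())
    F = torch.stack(F)
    return F, torch.sigmoid(a * F)   # bag-level logits and probabilities

torch.manual_seed(0)
Nis = [3, 5, 2]                      # variable-length bags
f = torch.randn(sum(Nis))            # concatenated instance logits
w = torch.ones(sum(Nis))             # uniform weights -> reduces to mean pooling
F, p = aggregate_bags(f, w, Nis)
```

With uniform weights each bag logit equals the mean of its instance logits, matching the mean-pooling limit described above.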
To harmonize the loss interface and enhance optimization stability, we work with the bipolar labels $Y_n = 2y_n - 1 \in \{-1, +1\}$ and define a correctness probability $p_n^{\mathrm{corr}} = \sigma(a\, Y_n F_n)$. A standard objective is the logistic negative log-likelihood $\mathcal{L} = -\sum_{n} \log \sigma(a\, Y_n F_n)$, with possible $\alpha$-likelihood variants for robustness. This mapping matches the vectorized code transformation $\mathbf{p}^{\mathrm{corr}} = \sigma(a\, \mathbf{Y} \odot \mathbf{F})$, ensuring end-to-end alignment between theory, interface, and numerical stability. When class imbalance is present, one may incorporate cost sensitivity via class weights—e.g., a positive-class weight $w_{\mathrm{pos}}$—or adopt temperature/threshold adjustments at the mini-batch level to balance robustness and interpretability.
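The bipolar formulation is numerically identical to the standard binary cross-entropy on the scaled logits, since $\sigma(-x) = 1 - \sigma(x)$; a short check (with arbitrary toy values) makes this concrete.

```python
import torch
import torch.nn.functional as torch_f

a = 2.0
y = torch.tensor([1.0, 0.0, 1.0, 0.0])     # binary bag labels
Y = 2.0 * y - 1.0                          # bipolar form in {-1, +1}
F = torch.tensor([0.8, -1.2, 0.3, 0.5])    # toy aggregated bag logits

# Correctness probability and its negative log-likelihood.
p_correct = torch.sigmoid(a * Y * F)
nll = -torch.log(p_correct).sum()

# Identical to binary cross-entropy on the scaled logits,
# because sigmoid(-x) = 1 - sigmoid(x).
bce = torch_f.binary_cross_entropy_with_logits(a * F, y, reduction="sum")
```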
Finally, we denote the train/validation/test splits by $\mathcal{D}_{\mathrm{tr}}, \mathcal{D}_{\mathrm{va}}, \mathcal{D}_{\mathrm{te}}$, with corresponding bag counts $M_{\mathrm{tr}}, M_{\mathrm{va}}, M_{\mathrm{te}}$. The positive class prior (imbalance) can be estimated as $\hat{\pi} = \frac{1}{M_{\mathrm{tr}}} \sum_{n \in \mathcal{D}_{\mathrm{tr}}} y_n$ and used to set cost-sensitive weights, e.g., $w_{\mathrm{pos}} = \hat{\pi}$, if desired. For clarity, we use $D$ for feature dimensionality, $N_n$ for the instance count of bag $n$, and $M$ for the total number of bags—corresponding to X.shape[1], Nis[idx], and the dataset length in code, respectively. In summary, bags are represented as variable-length instance matrices $X_n$, labels are maintained in both $\{0, 1\}$ and $\{-1, +1\}$ forms for derivational and implementation convenience, and bag-level probabilities arise from generalized weighted aggregation of instance logits with nonnegative normalizable weights. This unified view accommodates both existence-based and collective MIL regimes and integrates seamlessly with downstream modules such as random feature expansions, attention/weight learning, temperature scaling, and uncertainty estimation.

Deep Gaussian processes with random Fourier features
We adopt the kernel-mapping view of deep Gaussian processes (DGPs) and approximate a stationary kernel by random Fourier features (RFF), enabling differentiable, end-to-end training with explicit features in lieu of an implicit kernel. Let the instance input be $h \in \mathbb{R}^{D_0}$ and consider a stationary kernel $k(h, h') = k(h - h')$. By Bochner’s theorem, $k(h - h') = \mathbb{E}_{\omega \sim \rho}\big[\cos\big(\omega^{\top}(h - h')\big)\big]$ for a suitable spectral density $\rho(\omega)$. Drawing $D_1$ frequency samples $\omega_1, \dots, \omega_{D_1} \sim \rho$, we construct the first-layer feature map $\phi^{(1)}(h) = \sqrt{2/D_1}\; z_0 \odot \cos(W h + b)$, where $\odot$ denotes the Hadamard product, $W \in \mathbb{R}^{D_1 \times D_0}$ stacks the sampled frequencies, $b \in [0, 2\pi)^{D_1}$ collects random phases, and z0 = generateZ( ⋅ ) provides kernel-amplitude scaling. To interpolate between “learned” and “random” components, we parameterize $W$ and $b$ as linear combinations of learnable parameters and random priors, $W = \beta W^{(r)} + W_L$ and $b = \beta_b\, b^{(r)} + b_L$, where $\beta, \beta_b$ are typically diagonal (or channel-wise) rescaling matrices, $W^{(r)}$ and $b^{(r)}$ are random draws governed by hyperparameters $\sigma_W$ and $\sigma_b$, while $W_L$ and $b_L$ are learnable terms. Hence the first-layer map reads $\phi^{(1)}(h) = \sqrt{2/D_1}\; z_0 \odot \cos\big((\beta W^{(r)} + W_L) h + \beta_b b^{(r)} + b_L\big)$. In the RFF limit we have $\phi^{(1)}(h)^{\top} \phi^{(1)}(h') \to k(h, h')$ as $D_1 \to \infty$, thereby preserving the kernel’s second-order statistics.
To obtain deep non-linear kernel compositions, we stack cosine mappings layer by layer, forming a deep random-feature network. Let φ(0)(h) = h and, for ℓ = 1, …, L − 1, define

$$\phi^{(\ell)}(h) = \sqrt{2/D_\ell}\,\cos\!\big(W^{(\ell)}\,\phi^{(\ell-1)}(h) + b^{(\ell)}\big).$$

Each layer is followed by a dropout operator with rate p = 0.2, producing training-time features $m^{(\ell)} \odot \phi^{(\ell)}$ (with a Bernoulli mask $m^{(\ell)}$) and inference-time scaling $(1 - p)\,\phi^{(\ell)}$. To stabilize the feature variance, one may include a normalization factor (e.g., $\sqrt{2/D_\ell}$) so that $\mathbb{E}\,\|\phi^{(\ell)}(h)\|^2 \approx 1$. For sufficient depth L, the construction is equivalent to a random-feature approximation of a family of composite kernels, implicitly corresponding to inter-layer non-linear GP transforms in a DGP. Given instance i, the instance-level logit is produced by a linear head:

$$f_i = w^{\top}\phi^{(L-1)}(h_i) + b_0,$$

which can be seamlessly coupled with binary logistic modeling and downstream MIL aggregation.
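A hedged sketch of the stacked cosine maps with the train/inference dropout treatment described above (assuming Gaussian frequencies and uniform phases; `deep_rff` is our name and a simplification of the paper's pipeline):

```python
import numpy as np

def deep_rff(h, layers, p=0.2, training=False, rng=None):
    """Stacked cosine feature maps with dropout between layers.
    At inference (training=False), dropout is replaced by (1 - p) scaling."""
    phi = h
    for W, b in layers:
        D = W.shape[0]
        phi = np.sqrt(2.0 / D) * np.cos(phi @ W.T + b)
        if training:
            phi = phi * (rng.random(phi.shape) > p)  # random Bernoulli masking
        else:
            phi = phi * (1.0 - p)                    # expectation scaling
    return phi

rng = np.random.default_rng(1)
d, D = 5, 4096
W = rng.normal(size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)
phi = deep_rff(rng.normal(size=(1, d)), [(W, b)], p=0.0)
# with the sqrt(2/D) factor, ||phi||^2 concentrates around 1
```

Stacking more `(W, b)` pairs composes the cosine maps, mirroring the inter-layer non-linear GP transforms.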
To exploit external knowledge, we optionally enable a mean-function correction (flag_mean_function = True) that adds a frozen reference model’s class-difference output in the logit space. Denote the reference model’s scores (or logits) for the positive and negative classes at hi by m1(hi) and m0(hi). We then adopt an additive mean:

$$f_i = w^{\top}\phi^{(L-1)}(h_i) + b_0 + \big(m_1(h_i) - m_0(h_i)\big),$$

which can be interpreted as endowing the DGP-RF prior with a nonzero mean m( ⋅ ), i.e., $f \sim \mathcal{GP}\big(m(h),\, k(h, h')\big)$, where m(h) = m1(h) − m0(h) is supplied by a teacher model. This strategy regularizes learning under class imbalance or limited data and transfers discriminative knowledge into the deep random-feature pipeline via residual learning. In practice, m1 − m0 is frozen to avoid co-adaptation, while the random-feature parameters and the linear head are updated.
To constrain the learnable/random mixture and mitigate overfitting, we employ multiple regularizers addressing frequency alignment, scaling stability, and layer-wise weight control. First, we penalize the Frobenius norm $\|\beta W^{(r)} - \tilde{W}\|_F^2$ to encourage the β-transformed frequencies to align with the mixed frequencies, preventing degeneration to arbitrary linear maps; second, we regularize $\|\beta - I\|_F^2$ to keep β near identity, stabilizing the frequency scale and discouraging pathological channel rescaling. These two terms are jointly weighted by $\lambda_\beta$, yielding

$$R_\beta = \lambda_\beta\Big(\|\beta W^{(r)} - \tilde{W}\|_F^2 + \|\beta - I\|_F^2\Big).$$

In parallel, we impose L2 penalties on all trainable layers,

$$R_{\mathrm{L2}} = \lambda_2 \sum_{\ell}\Big(\|W^{(\ell)}\|_F^2 + \|b^{(\ell)}\|_2^2\Big),$$

so that the overall regularization becomes $R = R_\beta + R_{\mathrm{L2}}$. Because dropout introduces random masking, we apply expectation scaling at inference to maintain compatibility with the above norm regularizers, thereby aligning the train/test distributions.
For binary classification, we adopt the logistic loss on logits, with labels $Y_i \in \{-1, +1\}$ (or, in MIL, pseudo/soft targets obtained after bag-level aggregation). The per-instance negative log-likelihood is $\ell_i = \log\big(1 + e^{-Y_i f_i}\big)$, and the batch objective reads

$$\mathcal{L} = \frac{1}{B}\sum_{i=1}^{B} \ell_i + R,$$

where fi already includes the mean-function correction and the deep cosine mapping output. In the MIL setting, after computing {fi} at the instance level, a differentiable aggregator (e.g., weighted sum, softmax, or pooling) produces the bag-level logit F = Agg({fi}), upon which an analogous logistic loss is optimized with bag labels. The deep random-feature network captures high-frequency and composite-kernel structure in the instance domain, while the MIL aggregator models set-level uncertainty, yielding a unified pipeline from “kernel approximation → deep representation → set inference.” Computationally, the method scales linearly, $O\big(\sum_\ell D_\ell\big)$ per instance, in the number of random features, while statistically retaining the expressive priors of DGPs and the scalability of RFFs.
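The per-instance logistic loss is numerically delicate for large |f|; a small sketch (not the paper's code) computes it stably via logaddexp:

```python
import numpy as np

def logistic_nll(Y, f):
    """Stable log(1 + exp(-Y f)) for bipolar labels Y in {-1, +1}."""
    return np.logaddexp(0.0, -np.asarray(Y) * np.asarray(f))

# At f = 0 the loss is log 2; for a confidently correct logit it vanishes.
loss_at_zero = logistic_nll(1.0, 0.0)
loss_confident = logistic_nll(1.0, 100.0)
```

`np.logaddexp` avoids overflow in `exp(-Y f)` when the logit is strongly miscalibrated, which matters once the mean-function correction shifts logits by teacher scores.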
We visualize instance-level logits on whole slides and analyze misclassifications across biological strata. As shown in Fig. 16, positive cases exhibit concentrated high-intensity regions that co-localize with lesions, whereas negatives display no persistent hot spots. Figure 17 further summarizes boundary-case behaviour: row-normalized confusion matrices (molecular subtypes and histological grades) show errors mainly between neighboring classes, and stratified ROC AUC across TIL abundance and necrosis percentage remains high, indicating robust performance across clinically relevant cohorts.

Multi-instance aggregation and instance selection
Consider multiple-instance learning (MIL) for bag-level prediction. In the n-th bag, let the instance set be $\{(h_i, f_i)\}_{i=1}^{N}$, where $h_i$ is the instance representation and $f_i$ is the instance logit. We provide two switchable aggregators: weakly supervised weighting (WSL, used by default) and gated attention. To stabilize weights and avoid degeneration, we first compress the raw weight score $s_i$ with a sigmoid and add a uniform base (UNIF_R_BASE = 0.1), then normalize to obtain $r_i$:

$$r_i = \frac{\sigma(s_i) + 0.1}{\sum_{j=1}^{N}\big(\sigma(s_j) + 0.1\big)}.$$

In an optional “instance selection” stage, we further apply Top-k or ratio keeping with r ∈ (0, 1] to emphasize more discriminative instances. Let a binary mask mi ∈ {0, 1} indicate whether instance i is kept (for Top-k, mi = 1 iff $r_i$ is among the largest k; for ratio keeping, we keep the top ⌈rN⌉). The weights are re-normalized as

$$\tilde{r}_i = \frac{m_i r_i}{\sum_{j} m_j r_j}. \qquad (22)$$

WSL forms a bag-level evidence Fw by the weighted sum of instance logits with (22), and maps it to a bag-level probability p using a polarizing factor aextrem = 2.0:

$$F_w = \sum_{i} \tilde{r}_i f_i, \qquad p = \sigma\big(a_{\mathrm{extrem}} F_w\big). \qquad (23)$$

Gated attention adopts a decomposed additive–gating structure to generate attention weights from features and then fuses them with logits. With learnable parameters $V$, $U$, $w$ and the Hadamard product ⊙, the attention weights are

$$a_i = \frac{\exp\!\big(w^{\top}\big(\tanh(V h_i) \odot \sigma(U h_i)\big)\big)}{\sum_{j}\exp\!\big(w^{\top}\big(\tanh(V h_j) \odot \sigma(U h_j)\big)\big)}, \qquad (24)$$

and the probability is obtained with the same polarizing sigmoid:

$$p = \sigma\Big(a_{\mathrm{extrem}} \sum_{i} a_i f_i\Big). \qquad (25)$$

In practice, WSL and gated attention can share the instance representation hi and instance logit fi in the weight-generation stage. Top-k/ratio selection is likewise applicable to the attention weights {ai} to reduce the influence of noisy instances (Fig. 18).
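The WSL weighting with the uniform base and ratio keeping can be sketched as follows (UNIF_R_BASE and aextrem are taken from the text; the function names and exact masking details are our assumptions, not the paper's code):

```python
import numpy as np

UNIF_R_BASE = 0.1  # uniform base added to sigmoid-compressed scores

def wsl_weights(scores, r_keep=None):
    """Sigmoid-compressed scores + uniform base, normalized to sum to 1.
    With r_keep in (0, 1], only the top ceil(r_keep * N) weights are kept."""
    w = 1.0 / (1.0 + np.exp(-scores)) + UNIF_R_BASE
    if r_keep is not None:
        k = int(np.ceil(r_keep * len(w)))
        mask = np.zeros_like(w)
        mask[np.argsort(w)[-k:]] = 1.0   # keep the k largest weights
        w = w * mask
    return w / w.sum()

def bag_probability(weights, logits, a_extrem=2.0):
    """Bag evidence Fw = sum_i r_i f_i, polarized by a_extrem before sigmoid."""
    Fw = np.dot(weights, logits)
    return 1.0 / (1.0 + np.exp(-a_extrem * Fw))
```

For example, `wsl_weights(scores, r_keep=0.5)` zeroes the less discriminative half of a bag before re-normalizing, and the uniform base keeps every surviving instance's weight bounded away from zero.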

Training objective and optimization
Let the bag label be Y ∈ { − 1, + 1}. We define the “correctness probability” and its log loss as

$$p = \sigma\big(a_{\mathrm{extrem}}\, Y F_w\big), \qquad \ell = -\log p, \qquad (26)$$

where Fw comes from either (23) or (24)–(25), and aextrem = 2.0 sharpens the decision boundary. To address class imbalance, we use cost weights

$$c(Y) = \begin{cases} w_{\mathrm{pos}}, & Y = +1, \\ 1, & Y = -1, \end{cases}$$

where wpos can be set to the positive-class fraction in training or tuned on validation.
We adopt an approximate α-likelihood, which aggregates the cost-weighted losses via a log-mean-exp (LME) form; with S stochastic trials (e.g., data augmentation or subsampling), the mini-batch objective over B is

$$\mathcal{L}_\alpha = -\sum_{b=1}^{B} \frac{c(Y_b)}{\alpha}\,\log\!\Big(\frac{1}{S}\sum_{i=1}^{S} p_{b,i}^{\,\alpha}\Big), \qquad (28)$$

where $p_{b,i}$ denotes the correctness probability of bag b in trial i via (26). As α → 0, (28) reduces to the cost-weighted expected loss; for larger α, it upweights harder (higher-loss) examples.
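A minimal sketch of the LME-style α-objective over per-trial correctness probabilities (assuming the form given above; `alpha_objective` is our name, not the paper's):

```python
import numpy as np

def alpha_objective(p, alpha=0.5, cost=None):
    """Alpha-likelihood mini-batch objective.
    p: array of shape (B, S) holding correctness probabilities of B bags
       over S stochastic trials; cost: optional per-bag weights c(Y_b)."""
    p = np.asarray(p, dtype=float)
    if cost is None:
        cost = np.ones(p.shape[0])
    # -(c_b / alpha) * log( (1/S) * sum_s p_{b,s}^alpha ), summed over bags
    per_bag = -(cost / alpha) * np.log(np.mean(p ** alpha, axis=1))
    return float(per_bag.sum())
```

For a bag whose trials all yield p = 0.5, the objective equals −log 0.5 = log 2 for any α, and as α → 0 the expression recovers the average negative log-probability over trials.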
The final optimization problem is

$$\min_{\Theta}\; \mathcal{L}_\alpha + \lambda\, R(\Theta),$$

where R(Θ) is an optional regularizer (e.g., weight decay or attention entropy), and λ ≥ 0 is its coefficient. We use Adam with learning rate lr = 10−3.
For importance sampling without replacement, draw a sub-bag $S \subset \{1, \dots, N\}$ using inclusion probabilities $\pi_i$, and apply a Horvitz–Thompson style correction to keep Fw approximately unbiased:

$$\hat{F}_w = \sum_{i \in S} \frac{\tilde{r}_i f_i}{\pi_i}.$$

For edge-instance filtering, rank instances by the scaled score $u_i = a_{\mathrm{extrem}}\,\tilde{r}_i f_i$ and remove extreme tails of $\{u_i\}$ to reduce instability. Both strategies can be combined with the Top-k/ratio selection in (22).
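The Horvitz–Thompson style correction can be illustrated as follows (a sketch under the stated assumptions; names are ours — each kept term is divided by its inclusion probability so the sub-bag estimate matches the full weighted sum in expectation):

```python
import numpy as np

def ht_bag_evidence(weights, logits, keep_idx, incl_prob):
    """Horvitz-Thompson corrected bag evidence over a sub-bag.
    weights, logits, incl_prob: per-instance arrays over the full bag;
    keep_idx: indices of the sampled sub-bag."""
    w = weights[keep_idx]
    f = logits[keep_idx]
    pi = incl_prob[keep_idx]
    return float(np.sum(w * f / pi))

w = np.array([0.2, 0.3, 0.5])
f = np.array([1.0, -1.0, 2.0])
# keeping every instance with inclusion probability 1 recovers the plain sum
full = ht_bag_evidence(w, f, np.arange(3), np.ones(3))
```

With partial inclusion probabilities, instances that are rarely sampled contribute more strongly when they do appear, which is exactly what keeps the estimator unbiased.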

Inference, calibration, and uncertainty
In multi-instance learning, the bag-level probability prediction is computed by the following formula:

$$p = \sigma\Big(a_{\mathrm{extrem}} \sum_{i} \tilde{r}_i f_i\Big),$$

where σ( ⋅ ) denotes the sigmoid function, aextrem is a hyperparameter, $\tilde{r}_i$ is the normalized weight, and fi is the predicted value for each instance. When the number of instances is large, multiple predictions are performed using the same subsampling strategy as during training, and the probabilities of these predictions are averaged, a technique known as probability calibration (cal_BinProbs). This approach helps improve the stability and reliability of the model’s predictions, especially when handling complex data, as multiple sampling effectively reduces the risk of overfitting.
Temperature scaling is a common calibration technique aimed at adjusting the model’s output probability distribution to improve its reliability. By fitting the temperature T on an independent validation set and minimizing the binary cross-entropy loss (BCEWithLogits), the model’s output probabilities can be calibrated. The calculation is as follows:

$$p_T = \sigma(z / T),$$

where z is the unscaled prediction (logit). The introduction of temperature scaling helps adjust the model’s sensitivity to extreme predictions, making the output probabilities better aligned with the true probability distribution. This adjustment reduces overconfidence in predictions and improves the model’s generalization capability.
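A simple stand-in for fitting T (grid search over the binary cross-entropy rather than the gradient-based BCEWithLogits optimization described above; names and grid are our choices):

```python
import numpy as np

def fit_temperature(logits, labels, grid=np.linspace(0.25, 4.0, 64)):
    """Pick T minimizing binary cross-entropy of sigmoid(logits / T)
    on a held-out validation set; labels are in {0, 1}."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)

    def bce(T):
        p = 1.0 / (1.0 + np.exp(-logits / T))
        eps = 1e-12  # guard against log(0)
        return -np.mean(labels * np.log(p + eps)
                        + (1 - labels) * np.log(1 - p + eps))

    return float(grid[np.argmin([bce(T) for T in grid])])
```

When the validation logits are confidently correct, the search favors small T (sharper probabilities); when half of the confident predictions are wrong, it pushes T toward the top of the grid, flattening the probabilities as intended.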
Calibration quality is usually evaluated using the expected calibration error (ECE). ECE measures the consistency between predicted probabilities and actual labels by calculating the errors within different probability intervals and averaging them with weights. This method allows for evaluating the model’s performance across different probability ranges and adjusting the calibration strategy to better handle varying levels of uncertainty.
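The binned ECE described above can be computed as follows (a minimal sketch assuming equal-width bins and a 0.5 decision threshold for accuracy — our choices, not necessarily the paper's):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average of |accuracy - confidence| over probability bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            conf = probs[mask].mean()                          # mean confidence
            acc = (labels[mask] == (probs[mask] > 0.5)).mean() # empirical accuracy
            ece += mask.mean() * abs(acc - conf)               # bin-weighted gap
    return float(ece)
```

A model that predicts 0.95 on all-correct positives incurs an ECE of 0.05 — it is slightly underconfident relative to its realized accuracy of 1.0.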
In terms of uncertainty analysis, the model’s multiple predictions for the same bag can be used to assess both aleatoric (inherent) uncertainty and epistemic (cognitive) uncertainty. Aleatoric uncertainty reflects the inherent uncertainty in the data itself, while epistemic uncertainty is related to the model’s lack of knowledge or the limitations of its structure. The calculation of aleatoric uncertainty is done by evaluating the mean and variance of multiple prediction results, specifically calculated as

$$u_{\mathrm{alea}} = \frac{1}{S}\sum_{s=1}^{S} p^{(s)}\big(1 - p^{(s)}\big),$$

where S is the number of samples, p(s) is the probability of the s-th prediction, and $\bar{p} = \frac{1}{S}\sum_{s=1}^{S} p^{(s)}$ is the average probability.
The calculation of epistemic uncertainty focuses on evaluating the differences between the model’s various predictions, formulated as

$$u_{\mathrm{epi}} = \frac{1}{S}\sum_{s=1}^{S}\big(p^{(s)} - \bar{p}\big)^2.$$

This uncertainty analysis provides a deep understanding of the model’s confidence, which is crucial for making more precise decisions in real-world applications, particularly in high-risk scenarios, by identifying regions of the model’s uncertainty and taking appropriate measures.
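The two uncertainty terms can be computed from S repeated bag predictions as described above (a small sketch; `decompose_uncertainty` is our name):

```python
import numpy as np

def decompose_uncertainty(p_samples):
    """Aleatoric = mean of p(1-p) over S samples; epistemic = variance of p."""
    p = np.asarray(p_samples, dtype=float)
    aleatoric = float(np.mean(p * (1.0 - p)))
    epistemic = float(np.mean((p - p.mean()) ** 2))
    return aleatoric, epistemic
```

Samples agreeing on p = 0.5 yield maximal aleatoric and zero epistemic uncertainty; samples split between 0 and 1 yield the reverse, separating data noise from model disagreement.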
As shown in Fig. 7, the relationship between confidence and uncertainty varies across confidence levels. Specifically, areas of low confidence exhibit higher uncertainty, indicating regions where the model’s performance could be improved (Fig. 19).
In Fig. 8, the impact of temperature scaling (calibration) on model predictions is shown. The figure demonstrates how temperature scaling reduces overconfidence, leading to better-calibrated probabilities that align more closely with the true distribution. This highlights the role of calibration in improving the model’s overall reliability, especially in handling uncertainty in predictions (Fig. 20).

Discussion on novelty and contribution
Although the individual components employed in this framework—multi-instance learning, Gaussian processes, and attention mechanisms—have been previously explored, their integration within a unified probabilistic architecture is novel. Our DGP–MIL design bridges the gap between deterministic MIL approaches and Bayesian deep models by embedding deep Gaussian process layers into the MIL pipeline, enabling end-to-end uncertainty-aware representation learning. Unlike prior works that treat attention weights as static parameters, our model derives them from posterior uncertainty, thereby coupling interpretability with probabilistic reasoning. Furthermore, this design allows multi-level uncertainty propagation, providing both instance-level and bag-level confidence estimation in a single formulation.
To the best of our knowledge, this is the first framework that applies deep Gaussian process-based multi-instance learning to colorectal whole-slide image classification and validates its cross-domain generalization across multiple gastrointestinal datasets. This integration of probabilistic attention and hierarchical kernel learning establishes a distinctive contribution beyond existing MIL or Transformer-based histopathological models.

Reproducibility and implementation details
To promote scientific reproducibility and to address potential overfitting concerns associated with transparent (white-box) architectures, we have documented the full training configuration of the proposed DGP–MIL framework. All experiments were conducted using PyTorch (v2.1) with CUDA (v12.1) on an NVIDIA A100 GPU (40GB memory) under Ubuntu 22.04. Random seeds were fixed across runs to ensure deterministic behavior. The Adam optimizer was used with an initial learning rate of 1 × 10−4, weight decay of 1 × 10−5, and a batch size of 1 slide per iteration. Training was performed for 50 epochs with early stopping based on validation AUC. Dropout (0.3) and data augmentation (random rotation, horizontal/vertical flips, and color jitter) were applied to mitigate overfitting.

Source: PubMed Central (JATS). Licensing follows the original publisher’s policy — please cite the original article when quoting.
