SHAP-explained machine-learning model for high-risk gastric cancer identification.
APA
Oh HJ, Kim CH, et al. (2026). SHAP-explained machine-learning model for high-risk gastric cancer identification. Frontiers in Oncology, 16, 1732072. https://doi.org/10.3389/fonc.2026.1732072
MLA
Oh HJ, et al. "SHAP-explained machine-learning model for high-risk gastric cancer identification." Frontiers in Oncology, vol. 16, 2026, pp. 1732072.
PMID
41919267
Abstract
[INTRODUCTION] Gastric cancer (GC) remains a major public health concern in Asia. Risk prediction tailored to regional biological features, such as Helicobacter pylori status, and to high-risk mucosal findings, such as atrophic gastritis (AG) and intestinal metaplasia (IM), may help improve the screening workflow.
[METHODS] Using a large, real-world, nationwide screening cohort with available endoscopic AG/IM codes, we developed 2-year GC risk prediction models that integrate AG/IM with regional demographic and lifestyle factors. We compared a conventional Cox proportional hazards model (CPHM) with the following machine learning (ML) approaches: extreme gradient boosting (XGBoost), decision tree (DT), and logistic regression (LR). Discrimination and calibration were evaluated through internal and external validations. Model interpretability was assessed using Shapley Additive Explanations (SHAP).
[RESULTS] The XGBoost model demonstrated the best overall performance, achieving an AUROC of 0.764 (95% CI, 0.755-0.767), a sensitivity of 0.607 (95% CI, 0.560-0.650), and a specificity of 0.746 (95% CI, 0.744-0.750) in the internal validation. In the external validation cohort, XGBoost also showed the highest discrimination with an AUROC of 0.708 (95% CI, 0.682-0.884), a sensitivity of 0.666 (95% CI, 0.470-0.830), and a specificity of 0.597 (95% CI, 0.590-0.600). SHAP analysis consistently identified Helicobacter pylori infection, age, sex, smoking, and atrophic gastritis/intestinal metaplasia (AG/IM) as the major contributors to increased predicted gastric cancer risk.
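The discrimination metrics reported above can be illustrated with a minimal sketch. It computes AUROC as the fraction of case/non-case pairs in which the case receives the higher score, and sensitivity and specificity at a fixed cut-off. The function names, the toy labels and scores, and the 0.5 threshold are all assumptions made for illustration; none of them come from the study, and the abstract does not state which threshold or software the authors used.

```python
from typing import Sequence, Tuple


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random case outscores a random non-case (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(
    y_true: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Sensitivity and specificity when scores >= threshold are flagged as high risk."""
    tp = fn = tn = fp = 0
    for y, s in zip(y_true, scores):
        flagged = s >= threshold
        if y == 1:
            tp += flagged
            fn += not flagged
        else:
            fp += flagged
            tn += not flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one non-case")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: 1 = developed GC within the window, 0 = did not.
toy_labels = [0, 0, 0, 0, 1, 1, 0, 1]
toy_scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.5, 0.9]

toy_auc = auroc(toy_labels, toy_scores)
toy_sens, toy_spec = sensitivity_specificity(toy_labels, toy_scores, threshold=0.5)
```

The pairwise loop is quadratic in cohort size, so it is only suitable for small examples; a rank-based implementation would be needed for a nationwide cohort. Confidence intervals such as those quoted above are commonly obtained by bootstrap resampling, although the abstract does not say which method was used.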
[DISCUSSION] This externally validated and interpretable short-term GC risk model incorporating endoscopically ascertained AG/IM could provide a practical approach for informing risk-adapted screening workflows. The model could help identify individuals at a higher predicted risk for prospective evaluation and closer clinical review. In addition, SHAP clarifies the main contributors to each prediction by highlighting factors most strongly associated with a higher predicted risk.
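How SHAP attributes a single prediction to its inputs can be sketched with exact Shapley values on a toy model. The abstract does not describe how the SHAP values were computed, so the brute-force enumeration below is a stand-in that is only feasible for a handful of features. The risk function `toy_risk`, its coefficients, the binary feature names, and the all-zero baseline are hypothetical and are not taken from the study.

```python
from itertools import combinations
from math import exp, factorial
from typing import Callable, Dict


def toy_risk(x: Dict[str, float]) -> float:
    """Hypothetical logistic risk score; every coefficient is invented for illustration."""
    z = (
        -4.0
        + 1.0 * x["h_pylori"]
        + 0.8 * x["age_over_60"]
        + 0.5 * x["smoker"]
        + 0.9 * x["ag_im"]
        + 0.4 * x["h_pylori"] * x["ag_im"]  # toy interaction term
    )
    return 1.0 / (1.0 + exp(-z))


def shapley_values(
    f: Callable[[Dict[str, float]], float],
    instance: Dict[str, float],
    baseline: Dict[str, float],
) -> Dict[str, float]:
    """Exact Shapley values of f at `instance`, relative to a single `baseline` point."""
    names = list(instance)
    n = len(names)

    def value(subset: set) -> float:
        # Features in `subset` take the instance's values; the rest stay at baseline.
        mixed = {k: (instance[k] if k in subset else baseline[k]) for k in names}
        return f(mixed)

    phi: Dict[str, float] = {}
    for i in names:
        others = [k for k in names if k != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


# Hypothetical individual and reference point.
person = {"h_pylori": 1, "age_over_60": 1, "smoker": 0, "ag_im": 1}
reference = {"h_pylori": 0, "age_over_60": 0, "smoker": 0, "ag_im": 0}

contributions = shapley_values(toy_risk, person, reference)
```

The contributions sum to the difference between the individual's predicted risk and the baseline prediction, which is the additivity property that lets each prediction be decomposed factor by factor. A feature that matches the baseline, here `smoker`, receives a contribution of zero.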