Comparative study on predicting postoperative distant metastasis of lung cancer based on machine learning models.
Citation (APA)
Guo X, Xu T, et al. (2026). Comparative study on predicting postoperative distant metastasis of lung cancer based on machine learning models. Scientific Reports, 16(1), 6468. https://doi.org/10.1038/s41598-026-37113-w
PMID: 41606056
Abstract
Lung cancer remains the leading cause of cancer-related incidence and mortality worldwide. Its tendency for postoperative distant metastasis significantly compromises long-term prognosis and survival. Accurately predicting the metastatic potential in a timely manner is crucial for formulating optimal treatment strategies. This study aimed to comprehensively compare the predictive performance of nine machine learning (ML) models and to enhance interpretability through SHAP (Shapley Additive Explanations), with the goal of developing a practical and transparent risk stratification tool for postoperative lung cancer management. Clinical data from 3,120 patients with stage I-III lung cancer who underwent radical surgery were retrospectively collected and randomly divided into training and testing cohorts. A total of 52 clinical, pathological, imaging, and laboratory variables were analyzed. Nine ML models—including eXtreme Gradient Boosting (XGBoost), Random Forest (RF), Light Gradient Boosting Machine (LightGBM), Adaptive Boosting (AdaBoost), Decision Tree (DT), Gradient Boosting Decision Tree (GBDT), Gaussian Naive Bayes (GNB), Complement Naive Bayes (CNB), and Multilayer Perceptron classifier (MLP)—were developed and evaluated. Model performance was assessed using accuracy, precision, recall, F1 score, ROC-AUC, PR-AUC, calibration, and decision curve analysis (DCA). All models were evaluated using nested cross-validation (outer stratified 70/30 splits repeated 10 times; inner fivefold tuning), with decision thresholds prespecified in the inner loop and applied unchanged to held-out tests. Given the approximately 4:1 class imbalance, cost-sensitive learning was primarily adopted, and PR-AUC was reported in addition to ROC-AUC. Among the nine models, GBDT demonstrated the highest predictive performance, achieving an AUC of 0.810 (95% CI: 0.748-0.872), accuracy of 0.766, sensitivity of 0.698, and specificity of 0.786 in the test set.
SHAP analysis revealed that adjuvant chemotherapy, adjuvant radiotherapy, pathological N stage, age, body mass index (BMI), and preoperative neutrophil count (Pre-ANC) were the most influential predictors of distant metastasis. The combination of model performance and interpretability supported the model's potential for integration into clinical workflows to assist in real-time decision-making. In this work, we carried out a systematic comparison of nine machine learning algorithms in a large postoperative cohort under a coherent and interpretable framework. By jointly considering discrimination, calibration, clinical benefit (via decision curve analysis), and SHAP-based explanations, we constructed a practical prognostic tool to guide personalized treatment strategies and follow-up care. This methodology offers a data-driven basis for precision management. Ultimately, our findings provide an internally validated reference framework that warrants external and multicenter validation prior to clinical deployment.
Introduction
Lung cancer remains the leading cause of cancer-related morbidity and mortality worldwide1. Despite advances in screening, minimally invasive surgery, targeted therapy, and immunotherapy, long-term survival after curative resection is still curtailed by postoperative distant metastasis2,3. Clinical series suggest that ~ 30–50% of non-small cell lung cancer (NSCLC) patients ultimately develop metastases—most frequently in the brain, bone, adrenal glands, and liver4,5. Timely, accurate risk stratification is therefore crucial to tailor surveillance, optimize adjuvant therapy, and allocate resources efficiently.
Multiple prognostic approaches—TNM staging, clinicopathologic nomograms, imaging-derived features, and blood-based biomarkers—have been explored6–8. Yet many reported models show only modest sensitivity and specificity, limited generalizability across institutions, and suboptimal calibration, hindering clinical uptake9–13. Moreover, conventional regression may not capture the non-linear, high-dimensional interactions among heterogeneous clinical, pathological, imaging, and laboratory variables seen in practice14,15.
Machine learning (ML) offers principled tools to model such complexity and uncover latent patterns linked to metastatic risk16–18. Prior work often applied single algorithms with narrow scope, small cohorts, and limited interpretability19–21. Because opaque predictions are difficult to trust, transparent models whose reasoning can be interrogated and aligned with clinical knowledge are needed22–26.
To address these gaps, we performed a head-to-head comparison of nine ML algorithms—XGBoost, RF, LightGBM, AdaBoost, DT, GBDT, GNB, CNB, and MLP—in a single-center cohort of 3,120 stage I–III lung cancer patients treated with radical resection. Fifty-two candidate variables spanning clinical features, pathology, imaging-derived metrics, and laboratory indices were assessed. Performance was evaluated using accuracy, precision, recall, F1 score, ROC–AUC, precision–recall metrics, calibration, and decision curve analysis (DCA). We further applied SHAP to provide global and individualized interpretations of the best-performing model, with top contributors aligning with clinical knowledge—for example, nodal status, adjuvant therapy use, and systemic inflammatory indices.
We hypothesized that gradient-boosting approaches would outperform linear baselines and that SHAP would surface clinically plausible predictors. Our aim is to deliver accurate, explainable risk estimates to support postoperative stratification, follow-up planning, and shared decision-making. In summary, we contribute: (i) a large single-center cohort focused on distant metastasis; (ii) a unified, nested cross-validation pipeline for fair benchmarking of nine algorithms; and (iii) an integrated methodology combining LASSO-based feature selection, probability calibration, DCA, and SHAP. These results position our model as a methodological reference and a practical decision-support prototype warranting external, multicenter validation.
Data and methods
Data
Ethics statement
This study’s methodology was thoroughly reviewed and approved by the Ethics Committee of Yunnan Oncology Center (Authorization Code: KY2019141) and conducted in full compliance with the principles of the Helsinki Declaration. All participants provided written informed consent prior to enrollment.
Study population
In accordance with the guidelines outlined in the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement27, this study retrospectively analyzed 3,120 patients with stage I-III primary lung cancer confirmed by pathology. All patients underwent radical resection at Yunnan Oncology Center between January 2013 and December 2018.
Inclusion and exclusion criteria
Inclusion Criteria:
Pathologically confirmed primary lung cancer diagnosis.
Underwent radical surgical treatment.
Postoperative follow-up duration of at least 2 years.
Exclusion Criteria:
Presence of other malignant tumors.
Received preoperative radiotherapy or chemotherapy.
Incomplete clinical data (Fig. 1).
Postoperative monitoring and metastasis
Postoperative monitoring involved chest and abdominal CT scans and blood tumor marker tests every 3 to 6 months during the first 3 years, followed by tests every 6 to 12 months in years 4 and 5. Additional imaging tests, such as PET-CT or bone scans for bone pain, head MRI for headaches, and enhanced CT, abdominal ultrasound, or gastrointestinal endoscopy for abdominal pain, were performed as needed based on patient symptoms. Upon confirmation of metastasis, a multidisciplinary team was consulted to determine the appropriate treatment plan.
Classification and definition of first organ metastases
An analysis was conducted on the post-relapse survival (PRS) of patients with metastases. PRS was defined as the time from the first detection of metastasis to all-cause death, or to the last follow-up for patients without the event. The metastatic sites were classified into seven categories: (i) lungs, (ii) brain, (iii) bone, (iv) abdomen (including liver and adrenal glands), (v) pleura, (vi) lymph nodes, and (vii) multiple sites (involving two or more organs).
Research variables
The study included 52 variables across the following categories:
General Information: Age, gender, smoking history, number of cigarettes smoked, alcohol consumption history, family history of cancer, previous cancer history, emphysema status, and BMI.
Imaging Data: Preoperative whole-body bone scan, lesion type, preoperative pathological biopsy, imaging-based TNM staging, skeletal muscle area at the T4 level, CT attenuation of skeletal muscle at the T4 level, splenic area at the level of the splenic hilum, splenic hilum CT value, L3 skeletal muscle area, L3 skeletal muscle CT value, L3 visceral fat area, L3 visceral fat CT value, L3 subcutaneous fat area, L3 subcutaneous fat CT value, maximum lesion diameter on CT imaging, and the count of metastatic lymph nodes.
Treatment Status: Surgical method, type of resection, lesion location, and number of lymph nodes removed.
Pathological Findings: Pathological type, differentiation grade, vascular invasion, bronchial stump involvement, pleural invasion, EGFR and ALK mutation status, and pathological TNM staging.
Laboratory Data: Preoperative tumor markers (CEA, CA125, CA199, CA724, SCC, neuron-specific enolase, ferritin, malignant tumor markers), preoperative absolute neutrophil count, and preoperative absolute lymphocyte count (including CD3, CD4, CD8, NK cells).
Imaging feature extraction and reproducibility
ROI delineation and HU thresholds. Body-composition metrics, including skeletal muscle and adipose tissue areas at the T4 and L3 levels, were derived using semi-automated segmentation with independent manual corrections by two radiologists; discrepancies were resolved by consensus. Hounsfield unit thresholds followed conventional ranges—muscle: approximately −29 to +150 HU; subcutaneous/visceral fat: approximately −190 to −30 HU—with scanner-specific calibration verified monthly28.
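The HU-window step above amounts to thresholding a CT slice into tissue masks. A minimal NumPy sketch, using the windows from the text; the toy slice values and the pixel spacing are invented for illustration (real pipelines read them from DICOM headers):

```python
import numpy as np

# Toy 2D "CT slice" in Hounsfield units (real slices come from DICOM).
hu = np.array([
    [-500,  40, 120],
    [ -60, -20, -35],
    [ 300,  10, -10],
])

# Conventional HU windows cited in the text.
MUSCLE_HU = (-29, 150)
FAT_HU = (-190, -30)

def hu_mask(img, lo, hi):
    """Boolean mask of voxels whose attenuation lies within [lo, hi] HU."""
    return (img >= lo) & (img <= hi)

muscle = hu_mask(hu, *MUSCLE_HU)
fat = hu_mask(hu, *FAT_HU)

# Cross-sectional area = voxel count x in-plane pixel area.
pixel_area_cm2 = 0.01  # hypothetical 1 mm x 1 mm spacing
muscle_area = muscle.sum() * pixel_area_cm2
```

In practice the resulting masks are then manually corrected, as the text describes, before areas and mean attenuation are extracted.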
Reproducibility and harmonization. Inter-observer agreement was evaluated on an independently annotated subset, and intra-observer agreement on a separate subset with repeat annotations after a washout interval. For continuous features (areas and HU attenuation), agreement was quantified using two-way random-effects, absolute-agreement intraclass correlation coefficients, ICC(2,1), with 95% CIs obtained via bootstrap resampling, alongside Bland–Altman analysis (mean difference and 95% limits of agreement) and the coefficient of variation (CV = SD/mean)29. For mask overlap, the Dice similarity coefficient was computed for paired annotations (Dice = 2|A ∩ B|/(|A| +|B|)). HU stability was maintained through routine phantom-based QA, and sensitivity analyses included scanner type as a covariate. Detailed ICC/Bland–Altman/CV/Dice results are provided in the Supplement30.
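Three of the agreement metrics named above (Dice overlap, coefficient of variation, and Bland–Altman limits) can be sketched directly from their definitions, assuming paired readings arrive as plain arrays:

```python
import numpy as np

def dice(a, b):
    """Dice similarity coefficient for two boolean masks: 2|A∩B| / (|A| + |B|)."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def coef_variation(x):
    """Coefficient of variation, CV = SD / mean."""
    x = np.asarray(x, float)
    return x.std() / x.mean()

def bland_altman(x, y):
    """Mean difference and 95% limits of agreement between paired readings."""
    d = np.asarray(x, float) - np.asarray(y, float)
    md, sd = d.mean(), d.std(ddof=1)
    return md, (md - 1.96 * sd, md + 1.96 * sd)
```

The ICC(2,1) computation itself is more involved (a two-way random-effects ANOVA decomposition) and is usually delegated to a statistics package rather than hand-rolled.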
Methods
Overview of the evaluation workflow
All models were trained and evaluated within a nested cross-validation framework with fold-confined preprocessing, class-imbalance handling, calibration, and thresholding, as detailed in Sects. “Feature selection (LASSO)” through “Thresholding and performance reporting”. Statistical comparisons and descriptive analyses are described in Sect. “Statistical analysis”.
Handling class imbalance
The cohort was moderately imbalanced (metastasis ~ 19.1%, approximately a 4:1 ratio). To mitigate potential training bias, we primarily adopted cost-sensitive learning as the main strategy, using class_weight = “balanced” for scikit-learn models and scale_pos_weight = N_neg/N_pos for XGBoost and LightGBM. All weights were determined within the inner cross-validation and fixed for the corresponding outer test split to avoid information leakage.
We additionally reported PR-AUC, which is more sensitive to minority-class performance under class imbalance. Oversampling methods (SMOTE and ADASYN) were explored during preliminary model development but were not adopted in the final pipeline due to inferior or unstable performance. Class-imbalance handling was applied only within training folds; the held-out test folds were kept untouched to preserve the real-world prevalence. This strategy has been widely adopted in previous studies addressing class imbalance in medical prediction models31.
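The weighting scheme described above (class_weight = "balanced" and scale_pos_weight = N_neg/N_pos) reduces to simple arithmetic. A sketch on synthetic labels drawn at the paper's ~19.1% prevalence; the label vector is simulated, not the study data:

```python
import numpy as np

# Synthetic labels mimicking the ~4:1 imbalance described in the text.
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.191).astype(int)

n, n_pos = len(y), int(y.sum())
n_neg = n - n_pos

# scikit-learn's class_weight="balanced" rule: w_c = n / (n_classes * n_c),
# so the minority class receives a proportionally larger weight.
w_neg = n / (2 * n_neg)
w_pos = n / (2 * n_pos)

# Equivalent single knob for XGBoost/LightGBM.
scale_pos_weight = n_neg / n_pos
```

Note that the two parameterizations agree up to a constant factor: w_pos/w_neg equals scale_pos_weight, which is why either form implements the same cost-sensitive objective.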
Feature selection (LASSO)
We applied LASSO-based feature selection to reduce dimensionality and mitigate collinearity prior to model training. Candidate predictors included clinical, pathological, imaging-derived, and laboratory variables. Selected features from the training data were passed forward to all downstream models within the nested cross-validation (CV) pipeline to ensure a consistent predictor set and to avoid information leakage.
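For a binary outcome, LASSO selection of this kind is typically an L1-penalised logistic fit. A scikit-learn sketch under stated assumptions: the data are synthetic stand-ins for the 52 candidate predictors, and the fixed C value is illustrative (the paper selects the penalty within the inner CV):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 52 candidate predictors (the real data are clinical).
X, y = make_classification(n_samples=400, n_features=52, n_informative=8,
                           random_state=0)

# The L1 penalty drives uninformative coefficients exactly to zero;
# C is the inverse regularisation strength (a hypothetical fixed value here).
lasso = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1, random_state=0),
)
lasso.fit(X, y)

coefs = lasso.named_steps["logisticregression"].coef_.ravel()
selected = np.flatnonzero(coefs != 0)  # indices of retained predictors
```

Standardising before the L1 fit matters: without it, the penalty would favour predictors merely for having larger numeric scales.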
Overfitting mitigation and parameterization
Nine classifiers were trained within a single nested CV pipeline. The outer loop used stratified 70/30 (7:3) splits repeated 10 times to obtain unbiased test estimates; the inner loop used stratified fivefold CV for hyperparameter tuning. Tuning primarily optimized ROC–AUC, with PR–AUC and the Brier score as secondary criteria.
Given class imbalance (19.1% metastasis), imbalance handling was performed inside the inner loop to prevent leakage, via class weighting (or scale_pos_weight for boosting models). For tree-based learners (including the final GBDT), the search space covered the number of estimators, learning rate, maximum depth, subsample and column-sample fractions, and L2 regularization; early stopping on inner-CV validation folds was used to curb overfitting. Linear models were trained with L1/L2 penalties, with regularization strength selected by CV. The configuration with the highest mean inner-CV ROC–AUC advanced to the corresponding outer held-out fold. Predicted probabilities were calibrated on inner-CV predictions and then applied unchanged to the outer-test evaluation32–34.
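The nested loop described above can be illustrated with scikit-learn on synthetic data. The tiny grid and two outer repeats are purely to keep the sketch fast; the paper uses 10 outer repeats and a larger search space:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     StratifiedShuffleSplit)

# Synthetic imbalanced data (~20% positives) standing in for the cohort.
X, y = make_classification(n_samples=300, n_features=10, weights=[0.8],
                           random_state=0)

# Outer loop: stratified 70/30 splits (2 repeats here for speed).
outer = StratifiedShuffleSplit(n_splits=2, test_size=0.3, random_state=0)
# Inner loop: stratified fivefold CV for hyperparameter tuning.
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

grid = {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]}
aucs = []
for tr, te in outer.split(X, y):
    # Tuning sees only the outer-training split, so the held-out
    # split never influences model selection (no leakage).
    search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                          grid, cv=inner, scoring="roc_auc")
    search.fit(X[tr], y[tr])
    # The best inner configuration is evaluated once on the held-out split.
    p = search.predict_proba(X[te])[:, 1]
    aucs.append(roc_auc_score(y[te], p))
```

Calibration and thresholding would likewise be fitted on inner-CV predictions inside the loop and applied frozen to `X[te]`, mirroring the text.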
Thresholding and performance reporting
Decision thresholds were pre-specified in the inner CV using Youden’s J and then fixed for the outer held-out sets. In addition to AUROC and PR-AUC, we reported sensitivity, specificity, PPV, NPV, accuracy, and F1 score at the fixed operating threshold. Together, these metrics fully characterize the 2 × 2 classification outcome and are functionally equivalent to a confusion matrix. Test-set prevalence is additionally provided to contextualize PPV and NPV. Global and individualized interpretability analyses for the best-performing model were conducted using SHAP, as detailed in the corresponding Results subsection35,36.
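Youden's J thresholding, as prespecified in the inner CV, can be written directly from its definition (J = sensitivity + specificity - 1, maximised over candidate cutoffs):

```python
import numpy as np

def youden_threshold(y_true, y_score):
    """Return the score cutoff maximising Youden's J = sens + spec - 1."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, float)
    n_pos = max(int((y_true == 1).sum()), 1)
    n_neg = max(int((y_true == 0).sum()), 1)
    best_t, best_j = None, -1.0
    for t in np.unique(y_score):
        pred = y_score >= t
        sens = np.sum(pred & (y_true == 1)) / n_pos
        spec = np.sum(~pred & (y_true == 0)) / n_neg
        j = sens + spec - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

Fixing this cutoff on inner-CV predictions and reusing it unchanged on the outer test split is what keeps the reported sensitivity/specificity honest.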
Statistical analysis
The training and testing datasets were compared across all variables. Continuous variables were summarized as medians with interquartile ranges (IQRs) and compared using the Mann–Whitney U test, whereas categorical variables were expressed as counts and percentages and analyzed with the chi-square test. Effect sizes were reported as standardized mean differences (SMD) for continuous variables and Cramér’s V for categorical variables. To account for multiple testing, false discovery rate control with the Benjamini–Hochberg procedure was applied37. A two-tailed p-value < 0.05 was considered statistically significant.
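The univariate tests and the Benjamini–Hochberg correction map onto SciPy as follows; the two samples and the 2x2 table are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical continuous variable (e.g. a lab value) in two cohorts.
a = rng.normal(5.0, 1.0, 80)
b = rng.normal(5.4, 1.0, 80)

# Mann-Whitney U test for a continuous variable.
u, p_u = stats.mannwhitneyu(a, b, alternative="two-sided")

# Chi-square test for a 2x2 categorical table (illustrative counts).
table = np.array([[30, 50], [45, 35]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    p = np.asarray(pvals, float)
    order = np.argsort(p)
    m = len(p)
    adj = np.empty(m)
    prev = 1.0
    # Walk from the largest p down, enforcing monotonicity of adj values.
    for rank, i in enumerate(reversed(list(order)), 1):
        k = m - rank + 1
        prev = min(prev, p[i] * m / k)
        adj[i] = prev
    return adj
```

Applying `bh_adjust` across all 52 variable comparisons, then thresholding the adjusted values, reproduces the FDR control described in the text.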
All performance estimates were derived from internal validation only (repeated outer splits with nested hyperparameter tuning); no external or multicenter validation was available at the time of submission. Analyses were conducted using R (v4.2.3), Python (v3.11.4), and SPSS (v25.0).
Results
Patient characteristics
Baseline characteristics were comparable between the training and testing cohorts
Of the 3,120 eligible patients, 2,184 were assigned to the training cohort and 936 to the testing cohort via stratified random split. There were no statistically significant differences between cohorts across demographic, clinical, pathological, imaging, and laboratory variables (all P > 0.05), indicating that the random split achieved a balanced distribution of baseline characteristics and minimized allocation bias for subsequent model development (Table 1).
Baseline characteristics by metastasis status are summarized in Table 2.
Compared with the non-metastasis group (n = 2,524), the metastasis group (n = 596) was more likely to be male, have a history of smoking and alcohol use, and present with solid or part-solid pulmonary nodules (all P < 0.05). They also exhibited larger tumor size on CT and more advanced clinical and pathological T/N stages. Pathology features—poor differentiation, vascular tumor thrombus, and positive bronchial margins—were significantly enriched among patients who developed metastasis. In terms of treatment, the proportions receiving adjuvant chemotherapy and adjuvant radiotherapy were higher in the metastasis group (both P < 0.001). Regarding laboratory and body-composition indices, patients with metastasis had higher BMI, elevated Pre-ANC, and higher tumor markers (e.g., CEA, CA-125, CA-199, ferritin) compared with those without metastasis (most comparisons P < 0.05). Age was comparable between groups (median 59 vs 58 years, P = 0.439).
Feature selection for postoperative distant metastasis
LASSO regression analysis was conducted on the remaining independent variables, with lung cancer metastasis as the dependent variable38. LASSO reduced 52 candidates to 9 predictors at the λ minimizing inner-CV MSE (Fig. 2; Table S1). Of these, age, BMI, pathological N stage, adjuvant chemotherapy, adjuvant radiotherapy, and Pre-ANC remained significant (p < 0.05).
Multivariate logistic regression analysis
Multivariate logistic regression analysis was conducted to identify independent risk factors associated with postoperative distant metastasis (Table 3). Age (OR = 1.08, 95% CI: 1.05–1.11, P < 0.001), history of smoking (OR = 0.24, 95% CI: 0.12–0.46, P < 0.001), BMI (OR = 1.11, 95% CI: 1.03–1.24, P = 0.032), wedge resection (OR = 2.62, 95% CI: 1.08–6.92, P = 0.029), pathological N2 stage (OR = 3.87, 95% CI: 2.17–16.92, P < 0.001), adjuvant chemotherapy (OR = 9.34, 95% CI: 4.05–21.91, P < 0.001), and adjuvant radiotherapy (OR = 49.34, 95% CI: 16.18–199.29, P < 0.001) were identified as independent predictors of distant metastasis. These large effect sizes for adjuvant therapies likely reflect confounding by indication, as patients with more advanced or aggressive disease were preferentially selected for additional treatment, rather than a direct causal effect. These findings highlight that both clinicopathological features and treatment-related factors significantly contribute to metastasis risk.
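The odds ratios above come from exponentiating logistic-regression coefficients. As a hedged sketch, the recovery of ORs from a fitted model can be demonstrated on simulated data with known effects (the variable names and effect sizes are illustrative, not taken from the cohort; a very large `C` in scikit-learn approximates an unpenalized maximum-likelihood fit):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
age = rng.normal(60, 10, n)
pn2 = rng.binomial(1, 0.2, n)  # hypothetical pathological N2 indicator

# simulate with known effects: OR_age = exp(0.08), OR_N2 = exp(1.35)
logit = -6.5 + 0.08 * age + 1.35 * pn2
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

X = np.column_stack([age, pn2])
# very large C ~= unpenalized maximum-likelihood fit
fit = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
or_age, or_pn2 = np.exp(fit.coef_[0])
print("OR age: %.2f  OR pN2: %.2f" % (or_age, or_pn2))
```

Note that per-unit ORs (e.g., per year of age) and indicator ORs (e.g., N2 vs not) are on different scales, which is one reason treatment indicators can display much larger ORs than continuous covariates.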
Model performance comparison
These features were subsequently used to train nine machine-learning models (GBDT, XGBoost, RF, LightGBM, AdaBoost, DT, GNB, CNB, and MLP). Each model was trained and compared within the nested-CV pipeline (outer stratified 70/30 splits repeated 10 times; inner fivefold tuning). GBDT demonstrated the highest performance (mean AUROC ≈ 0.81), followed by XGBoost and RF (AUROC ≈ 0.79–0.80), whereas CNB and GNB performed lower (AUROC ≈ 0.74–0.75). Calibration curves indicated that GBDT exhibited the best agreement between predicted and observed probabilities, and decision-curve analysis (DCA) showed the greatest clinical net benefit across a broad threshold range (Fig. 3). Detailed operating-point metrics are provided in Sect. “Optimal Model Selection” and Table S2; representative ROC/PR/calibration/DCA curves are shown in Fig. 3. In the test set, GBDT achieved accuracy 0.766, sensitivity 0.698, specificity 0.786, and F1 0.564 (Fig. 3e-f; Sect. “Optimal Model Selection”), confirming its selection as the optimal model for this dataset.
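The nested-CV pipeline described here (outer stratified 70/30 splits repeated 10 times; inner fivefold tuning) can be sketched as follows. This is a schematic on synthetic imbalanced data, not the study code; the small hyperparameter grid is illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# imbalanced synthetic data standing in for the clinical cohort
X, y = make_classification(n_samples=600, n_features=9,
                           weights=[0.8, 0.2], random_state=0)

# outer loop: stratified 70/30 splits repeated 10 times
outer = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
aucs = []
for tr, te in outer.split(X, y):
    # inner loop: fivefold tuning confined to the outer training fold
    grid = GridSearchCV(
        GradientBoostingClassifier(random_state=0),
        param_grid={"max_depth": [2, 3], "n_estimators": [50, 100]},
        cv=5, scoring="roc_auc",
    ).fit(X[tr], y[tr])
    prob = grid.predict_proba(X[te])[:, 1]
    aucs.append(roc_auc_score(y[te], prob))

print("mean AUROC over outer repeats: %.3f" % np.mean(aucs))
```

The essential property is that tuning touches only the outer training fold, so the held-out AUROC is not inflated by hyperparameter selection.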
Optimal model selection
Based on the head-to-head comparison, GBDT was selected as the final model. Held-out performance and the fixed operating point are detailed in Sect. “Thresholded performance (confusion-matrix–equivalent)”.
Thresholded performance (confusion-matrix–equivalent)
All reported results were obtained under the imbalance-aware, cost-sensitive training framework described above, with class weights fixed within the inner cross-validation and applied unchanged to the outer test folds. On the held-out set, the final GBDT achieved AUC 0.810 (95% CI 0.748–0.872) at the pre-specified threshold 0.199 (0.185–0.212). At this operating point, the confusion-matrix–equivalent metrics were: sensitivity 0.698 (0.672–0.724), specificity 0.786 (0.759–0.814), PPV 0.478 (0.436–0.520), NPV 0.904 (0.894–0.914), accuracy 0.766 (0.748–0.785), and F1 0.564 (0.537–0.590) (in Table S2). These metrics together describe the model’s classification behavior at the selected operating point, allowing clinicians to understand how often metastasis is correctly identified and how often it may be missed in routine practice. Given sensitivity < specificity, false negatives are relatively more frequent at the observed prevalence, consistent with our DCA-based threshold–benefit trade-offs (Fig. 4).
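The confusion-matrix-equivalent metrics reported here all follow mechanically once the probability threshold is fixed. A minimal sketch (toy labels and probabilities are invented; the threshold mirrors the prespecified 0.199):

```python
import numpy as np

def thresholded_metrics(y_true, prob, threshold=0.199):
    """Confusion-matrix-equivalent metrics at a fixed operating point."""
    pred = (prob >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    tn = np.sum((pred == 0) & (y_true == 0))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / len(y_true),
    }

# toy example: 3 metastasis cases, 5 non-metastasis cases
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])
p = np.array([0.9, 0.3, 0.1, 0.05, 0.15, 0.4, 0.1, 0.05])
m = thresholded_metrics(y, p)
print(m)
```

Because PPV and NPV depend on prevalence while sensitivity and specificity do not, the high NPV (0.904) reported above partly reflects the roughly 4:1 class imbalance in the cohort.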
SHAP explanation of the optimal model
SHAP analysis was applied to enhance model transparency and interpretability. The summary plot identified adjuvant chemotherapy, adjuvant radiotherapy, pathological N stage, BMI, age, and Pre-ANC as the leading contributors to metastasis risk (Fig. 5a). Although age did not show a statistically significant difference in the univariate comparison (Table 2, P = 0.439), this does not contradict its contribution in the SHAP analysis. Univariate tests evaluate marginal group differences, whereas SHAP quantifies the conditional contribution of a feature within a multivariable, nonlinear model. In our model, age mainly contributed through interaction patterns with other clinical variables (e.g., pathological N stage, BMI, and treatment-related variables), which may not be detectable in univariate comparisons. Higher SHAP values for adjuvant therapy and advanced N stage reflected a stronger association with metastasis probability. However, these associations likely reflect treatment indication, whereby patients at higher baseline risk are more likely to receive adjuvant therapies, rather than a direct causal effect.
Dependence plots showed that low BMI and younger age were linked to elevated predicted risk, potentially reflecting more aggressive tumor biology or host frailty (Fig. 5b). Case-specific force plots further illustrated SHAP-based individualized predictions: in a high-risk case (f(x) = 0.92), multiple nodal metastases and poor differentiation were dominant risk-enhancing features (Fig. 5c); in contrast, in a low-risk case (f(x) = 0.07), a high proportion of ground-glass opacity and EGFR mutation positivity contributed to a protective profile (Fig. 5d).
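The point that a feature can earn SHAP credit purely through interactions, despite a null marginal difference, can be made concrete with exact Shapley values on a two-feature toy model. The risk function below is invented for illustration (it is not the fitted GBDT): age only matters when the N2 indicator is present, yet it still receives a nonzero attribution.

```python
import itertools
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley values by coalition enumeration (tractable for few features).
    v(S) evaluates f with features in S taken from x and the rest from background."""
    n = len(x)

    def v(S):
        z = list(background)
        for j in S:
            z[j] = x[j]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in itertools.combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (v(set(S) | {i}) - v(set(S)))
    return phi

# toy risk model with a pure interaction: age only matters when pN2 is present
def risk(z):
    age_z, pn2 = z  # age_z: 0 = young, 1 = older
    return 0.1 + 0.6 * pn2 + 0.2 * pn2 * (1 - age_z)

# young pN2-positive patient vs an older pN2-negative background
phi = shapley_values(risk, x=[0.0, 1.0], background=[1.0, 0.0])
print(phi, sum(phi))  # age receives nonzero credit via the interaction term
```

The attributions sum exactly to f(x) minus the background prediction, the efficiency property that SHAP's TreeExplainer preserves at scale for tree ensembles such as GBDT.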
Overall, the GBDT model achieved strong predictive performance and provided clinically interpretable insights when combined with SHAP analysis. The model captured adjuvant therapies as proxies for baseline severity rather than as independent causal drivers; importantly, this pattern should not be interpreted as a detrimental effect of adjuvant treatment itself, but rather as a reflection of baseline disease severity and clinical decision-making pathways. The model’s explanations therefore align with established clinical understanding and provide a credible foundation for potential clinical integration.
Discussion
Recent advances in machine learning have enabled improved risk stratification across multiple domains of lung cancer management, including diagnosis, treatment response prediction, and prognostic assessment. However, most existing models focus on imaging-based or molecular endpoints, with limited attention to postoperative distant metastasis and insufficient emphasis on interpretability39–45. This gap motivates the present study.
Performance of different machine learning models in predicting distant metastasis after lung cancer surgery
We systematically compared nine machine-learning algorithms for predicting postoperative distant metastasis. Average AUROCs were obtained within the nested-CV pipeline (outer stratified 70/30 splits repeated 10 times; inner fivefold tuning), rather than a single tenfold CV (Fig. 3). GBDT achieved the best held-out performance (AUROC 0.810, accuracy 0.766, sensitivity 0.698, specificity 0.786; Sect. “Optimal Model Selection”, Table S2). This profile of moderate sensitivity and higher specificity implies relatively more false negatives at the observed prevalence. Clinically, if the priority is to minimize missed metastasis, a left-shift of the deployment threshold (favoring higher sensitivity and NPV at the expense of more false positives) is appropriate; if avoiding unnecessary interventions is prioritized, the current threshold—or a right-shift after recalibration—may be preferable. Consistent with prior oncology studies, gradient boosting has been reported to outperform linear baselines46,47, and to maintain stable performance across internal validation48,49, although external transportability still requires confirmation in multicenter cohorts50.
Clinical application value of the optimal model
We evaluated the clinical utility of the GBDT model from multiple complementary perspectives, including discrimination, calibration, decision-curve analysis, and learning-curve stability (Figs. 3b, 3c, 3f). The GBDT achieved an AUROC of 0.810 (95% CI 0.748–0.872) at the prespecified operating threshold derived from inner cross-validation. Thresholded metrics (sensitivity, specificity, PPV, NPV, accuracy, F1) are reported in Sect. “Optimal Model Selection” and Table S2, alongside PR-AUC, calibration, and DCA summaries (Fig. 3c). In addition, learning-curve and repeated-split analyses suggested stable internal generalization and efficient data utilization for GBDT, which is particularly valuable in real-world settings where sample sizes may be limited (Fig. 4d). External transportability remains to be established. According to prior literature, GBDT offers several practical advantages. (i) Dynamic risk stratification enables tailored follow-up schedules across low-, intermediate-, and high-risk groups. (ii) Resource optimization is supported by decision-curve analysis, which indicates net clinical benefit across broad thresholds (Fig. 3c). (iii) Actionable interpretability is achievable when predictions are paired with SHAP summaries and case-level explanations, facilitating clinician review rather than replacing clinical judgment. Notably, these findings are associational rather than causal, and prospective multicenter validation is required before model-guided interventions can be recommended51,52.
Misclassification profile and operating-point strategy
At the reported operating point, sensitivity is lower than specificity, implying relatively more false negatives at the observed prevalence. This aligns with our DCA: if the clinical priority is to minimize missed metastasis, the deployment threshold can be shifted left to gain sensitivity (with more false positives); if avoiding unnecessary interventions is prioritized, the current threshold—or a right-shift after recalibration—may be preferable. We report the fixed-threshold baseline here and outline a clear path for external validation and recalibration53. Together, these findings indicate that model performance, error profile, and interpretability must be considered jointly when translating predictive tools into clinical workflows.
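The threshold-benefit trade-off invoked here is quantified by the decision-curve net benefit, NB = TP/N − (FP/N)·pt/(1 − pt), evaluated against default strategies such as treat-all. A minimal sketch on simulated data (the risk-score construction is invented for illustration):

```python
import numpy as np

def net_benefit(y_true, prob, pt):
    """Decision-curve net benefit at threshold probability pt:
    NB = TP/N - (FP/N) * pt / (1 - pt)."""
    pred = prob >= pt
    n = len(y_true)
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    return tp / n - (fp / n) * pt / (1 - pt)

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.2, 1000)  # ~20% event rate, similar to the cohort imbalance
# a noisy but informative risk score clipped to [0, 1]
prob = np.clip(0.2 + 0.5 * (y - 0.2) + rng.normal(0, 0.15, 1000), 0, 1)

pt = 0.2
nb_model = net_benefit(y, prob, pt)
nb_treat_all = y.mean() - (1 - y.mean()) * pt / (1 - pt)
print("model: %.3f  treat-all: %.3f" % (nb_model, nb_treat_all))
```

Because the weight pt/(1 − pt) on false positives grows with the threshold, a left-shift (lower pt) tolerates more false positives to avoid missed metastases, which is exactly the trade-off described above.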
Clinical significance of SHAP interpretation results
SHAP-based interpretation enhances the transparency of the GBDT model by providing clinically meaningful explanations for individual predictions. This may improve clinicians’ understanding of model predictions and facilitate more informed decision-making. Previous studies have suggested that SHAP-based interpretability tools may enhance clinicians’ confidence in model-assisted decision-making and facilitate more rational treatment adjustments45–55. The SHAP analysis based on the GBDT model in this study reveals the significant impact of clinical features such as adjuvant chemotherapy, adjuvant radiotherapy, pathological N stage, BMI, age, and Pre-ANC on distant metastasis after lung cancer surgery. Notably, age appeared among the top contributors despite a non-significant univariate difference. This is expected because SHAP reflects model-based conditional contribution rather than marginal group separation. Clinically, age may modulate risk in conjunction with nodal status, body composition and treatment decisions, highlighting the importance of considering patient factors jointly rather than in isolation. This finding is consistent with the CALGB 9633 study, which reported a survival benefit from adjuvant chemotherapy, and with the 8th edition of the AJCC staging system, which emphasizes pathological N stage as a core prognostic factor. Using LASSO preserved variable-level semantics and facilitated alignment with clinical interpretation56,57. Selection frequencies across outer repetitions suggested a core subset of predictors was repeatedly retained, indicating robustness to resampling. Complementary selectors (e.g., elastic-net for grouped correlation) could be considered as sensitivity analyses while keeping reporting focused on clinically interpretable variables. This pattern is clinically plausible, as younger patients with advanced nodal disease or aggressive pathology may experience higher metastatic risk despite similar chronological age distributions.
SHAP-based interpretation made the model’s predictions transparent and clinically interpretable. Summary plots identified adjuvant chemotherapy, adjuvant radiotherapy, pathological N stage, BMI, age, and Pre-ANC as the leading contributors to predicted metastasis risk (Fig. 5a–b).
These explanations are associational rather than causal; in particular, attributions involving adjuvant therapies likely reflect confounding by indication, as patients at higher baseline risk are more likely to receive additional treatment (Fig. 5c–d). This observation is compatible with established clinicopathologic knowledge summarized in the WHO classification of lung tumors58.
Comparison with prior work
Compared with prior studies that typically assessed a single algorithm, used narrow feature sets, or emphasized discrimination alone, our work adopts a broader algorithmic scope, a richer predictor set, and a more clinically oriented evaluation. We conducted a unified, head-to-head comparison of nine ML models; complemented ROC–AUC/PR–AUC with calibration and decision curve analysis (DCA) to quantify clinical utility; and used SHAP to connect predictions with clinical reasoning at both global and patient levels59. These design choices enhance reproducibility and interpretability, and may partly explain why specific variables emerge as dominant predictors in both traditional between-group analyses and the ML framework.
Recent radiomics-based studies have reported higher discrimination for metastasis-related endpoints. For example, a CT radiomics study published in BMC Cancer reported an AUC approaching 0.90 under an internal split validation scheme60. Importantly, such AUCs are often obtained under single-split internal validation without fold-confined preprocessing or probability calibration, which may inflate apparent performance. Direct numerical comparison is also not straightforward because the predicted endpoint (e.g., brain metastasis versus overall distant metastasis), feature modality (radiomics versus routinely available clinicopathologic and treatment variables), and validation strategy differ. In the present study, we intentionally restricted predictors to readily available perioperative variables and adopted repeated nested cross-validation, which yields more conservative but less optimistically biased estimates (Table S4).
Limitations of the research and future prospects
This study delivers a head-to-head, multi-algorithm comparison on a large, clinically rich cohort; complements discrimination with calibration, decision-curve, and precision–recall evaluation; and provides transparent explanations via SHAP at both global and individual levels. These choices enhance reproducibility, interpretability, and clinical relevance, supporting workflow integration for risk-stratified postoperative management. Nevertheless, several limitations merit discussion.
First, this was a single-center retrospective study, which may limit generalizability and transportability across different clinical settings. Although we used repeated nested cross-validation with fold-confined preprocessing, tuning, and calibration to mitigate overfitting, the absence of external validation may still leave residual optimism; systematic internal evaluation (ROC–/PR–AUC, calibration, and DCA) reduces optimism but does not substitute for external or multicenter validation, which is therefore warranted before clinical implementation. Generalizability may be constrained by center-specific case-mix, scanner/reconstruction variability, and local practice patterns. GBDT demonstrated stable convergence across increasing training set sizes (Fig. 4d), suggesting robust internal generalization. This aligns with Lee et al.61, who noted that single-center datasets are prone to selection bias affecting performance across diverse populations.
Second, the mechanistic linkage of selected features remains limited. LASSO screening (Fig. 2; e.g., 9 non-zero coefficients identified at λ = 0.028 via tenfold cross-validation) successfully recognized imaging-derived and clinicopathologic features with predictive value. However, the biological mechanisms underlying these features have not been deeply linked to pathological indicators or molecular markers (e.g., EGFR). The coefficient distribution plot (Fig. 3) only reflects the statistical significance of the features, rather than providing insights into the tumor microenvironmental context. This limitation aligns with a common gap in the field of radiomics: statistically significant features often lack mechanistic validation—a challenge also highlighted in the study by Hatt et al.62.
Third, imaging-feature variability. Imaging-derived features can vary with slice selection, segmentation, and vendor/kernel differences. We mitigated and quantified these via semi-automated HU-thresholded segmentation for body composition extraction, isotropic resampling, and inter/intra-observer assessments (ICC(2,1), Bland–Altman, Dice, CV%). Sensitivity analyses (± HU, morphological operations) suggested minimal impact on conclusions (Table Sx, Figure Sx), but multicenter, multi-vendor data and automated segmentation are needed to further reduce measurement error63.
Fourth, static postoperative inputs. The present model uses cross-sectional postoperative variables and does not incorporate time-varying information (e.g., treatment response, surveillance trajectories), which may understate risk in evolving clinical courses64.
Future work will include temporal and geographic external validation across multiple centers equipped with imaging devices from different vendors, reconstruction kernels, and case-prevalence profiles to rigorously assess model generalizability. In parallel, we will evaluate model updating strategies, incorporate SHAP interaction values to clarify synergistic feature effects, and explore dynamic prediction approaches together with multi-omics integration. Through these efforts, our goal is to transition the model from an “internally validated” construct to a deployable decision-support tool with verified performance in out-of-sample populations.
In this context, the current findings should be regarded as an internal benchmark pending external validation and head-to-head multicenter evaluation. To enable rigorous out-of-sample assessment and prospective clinical integration, we have provided experimental protocols, containerized tools, and a scoping dossier. These findings should be interpreted in the context of existing machine learning–based prognostic studies in lung cancer65.
Recent advances in machine learning have enabled improved risk stratification across multiple domains of lung cancer management, including diagnosis, treatment response prediction, and prognostic assessment. However, most existing models focus on imaging-based or molecular endpoints, with limited attention to postoperative distant metastasis and insufficient emphasis on interpretability39–45. This gap motivates the present study.
Performance of different machine learning models in predicting distant metastasis after lung cancer surgery
We systematically compared nine machine-learning algorithms for predicting postoperative distant metastasis. Average AUROCs were obtained within the nested-CV pipeline (outer stratified 70/30 splits repeated 10 times; inner fivefold tuning), rather than a single tenfold CV (Fig. 3). GBDT achieved the best held-out performance (AUROC 0.810, accuracy 0.766, sensitivity 0.698, specificity 0.786; Sect. “Optimal Model Selection”, Table S2). This profile of moderate sensitivity and higher specificity implies relatively more false negatives at the observed prevalence. Clinically, if the priority is to minimize missed metastasis, a left-shift of the deployment threshold (favoring higher sensitivity and NPV at the expense of more false positives) is appropriate; if avoiding unnecessary interventions is prioritized, the current threshold—or a right-shift after recalibration—may be preferable. Consistent with prior oncology studies, gradient boosting has been reported to outperform linear baselines46,47, and to maintain stable performance across internal validation48,49, although external transportability still requires confirmation in multicenter cohorts50.
Clinical application value of the optimal model
We evaluated the clinical utility of the GBDT model from multiple complementary perspectives, including discrimination, calibration, decision-curve analysis, and learning-curve stability (Figs. 3b, 3c, 3f). The GBDT achieved an AUROC of 0.810 (95% CI 0.748–0.872) at the prespecified operating threshold derived from inner cross-validation. Thresholded metrics (sensitivity, specificity, PPV, NPV, accuracy, F1) are reported in Sect. “Optimal Model Selection” and Table S2, alongside PR-AUC, calibration, and DCA summaries (Fig. 3c). In addition, Learning-curve and repeated-split analyses suggested stable internal generalization and efficient data utilization for GBDT, which is particularly valuable in real-world settings where sample sizes may be limited (Fig. 4d). External transportability remains to be established. According to prior literature, GBDT offers several practical advantages. (i) Dynamic risk stratification enables tailored follow-up schedules across low-, intermediate-, and high-risk groups. (ii) Resource optimization is supported by decision-curve analysis, which indicates net clinical benefit across broad thresholds (Fig. 3c). (iii) Actionable interpretability is achievable when predictions are paired with SHAP summaries and case-level explanations, facilitating clinician review rather than replacing clinical judgment. Notably, these findings are associational rather than causal, and prospective multicenter validation is required before model-guided interventions can be recommended51,52.
Misclassification profile and operating-point strategy
At the reported operating point, sensitivity is lower than specificity, implying relatively more false negatives at the observed prevalence. This aligns with our DCA: if the clinical priority is to minimize missed metastasis, the deployment threshold can be shifted left to gain sensitivity (with more false positives); if avoiding unnecessary interventions is prioritized, the current threshold—or a right-shift after recalibration—may be preferable. We report the fixed-threshold baseline here and outline a clear path for external validation and recalibration53. Together, these findings indicate that model performance, error profile, and interpretability must be considered jointly when translating predictive tools into clinical workflows.
Clinical significance of SHAP interpretation results
SHAP-based interpretation enhances the transparency of the GBDT model by providing clinically meaningful explanations for individual predictions. This may improve clinicians’ understanding of model predictions and facilitate more informed decision-making. Previous studies have suggested that SHAP-based interpretability tools may enhance clinicians’ confidence in model-assisted decision-making and facilitate more rational treatment adjustments45–55. The SHAP analysis based on the GBDT model in this study reveals the significant impact of clinical features such as adjuvant chemotherapy, adjuvant radiotherapy, pathological N stage, BMI, age, and Pre-ANC on distant metastasis after lung cancer surgery. Notably, age appeared among the top contributors despite a non-significant univariate difference. This is expected because SHAP reflects model-based conditional contribution rather than marginal group separation. Clinically, age may modulate risk in conjunction with nodal status, body composition and treatment decisions, highlighting the importance of considering patient factors jointly rather than in isolation. This finding is consistent with the CALGB 9633 study, which reported a survival benefit from adjuvant chemotherapy, and with the 8th edition of the AJCC staging system, which emphasizes pathological N stage as a core prognostic factor. Using LASSO preserved variable-level semantics and facilitated alignment56,57. Selection frequencies across outer repetitions suggested a core subset of predictors was repeatedly retained, indicating robustness to resampling. Complementary selectors (e.g., elastic‑net for grouped correlation) could be considered as sensitivity analyses while keeping reporting focused on clinically interpretable variables.This pattern is clinically plausible, as younger patients with advanced nodal disease or aggressive pathology may experience higher metastatic risk despite similar chronological age distributions.
SHAP-based interpretation made the model’s predictions transparent and clinically interpretable. Summary plots identified adjuvant chemotherapy, adjuvant radiotherapy, pathological N stage, BMI, age, and Pre-ANC as the leading contributors to predicted metastasis risk (Fig. 5a–b).
These explanations are associational rather than causal; in particular, attributions involving adjuvant therapies likely reflect confounding by indication, as patients at higher baseline risk are more likely to receive additional treatment (Fig. 5c–d). This observation is compatible with established clinicopathologic knowledge summarized in the WHO classification of lung tumors58.
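The conditional-contribution property noted above can be illustrated with a minimal, self-contained sketch. The model, coefficients, and feature names below are purely illustrative toys, not the study's actual GBDT: exact Shapley values are computed by enumerating feature coalitions, and their sum recovers the prediction relative to a reference patient, which is the additivity property underlying SHAP summary and waterfall plots.

```python
from itertools import combinations
from math import factorial

import numpy as np

# Hypothetical toy linear risk score over three standardized inputs
# (illustrative names only; not the study's fitted model).
def risk(x):
    n_stage, bmi, age = x
    return 0.6 * n_stage - 0.3 * bmi + 0.1 * age

background = np.zeros(3)                 # reference patient (features at mean 0)
patient = np.array([1.0, -0.5, 0.8])
features = [0, 1, 2]

def value(subset):
    """Model output with features outside `subset` held at the background."""
    z = background.copy()
    for i in subset:
        z[i] = patient[i]
    return risk(z)

def shapley(i):
    """Exact Shapley value of feature i via coalition enumeration."""
    others = [j for j in features if j != i]
    n, total = len(features), 0.0
    for k in range(len(others) + 1):
        for S in combinations(others, k):
            w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            total += w * (value(set(S) | {i}) - value(set(S)))
    return total

phi = [shapley(i) for i in features]
# Additivity: attributions sum to f(patient) - f(background)
assert abs(sum(phi) - (risk(patient) - risk(background))) < 1e-9
```

For a linear model each attribution reduces to coefficient × deviation from the reference, which is why SHAP recovers conditional rather than marginal effects for correlated clinical predictors in tree ensembles.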
Comparison with prior work
Compared with prior studies that typically assessed a single algorithm, used narrow feature sets, or emphasized discrimination alone, our work adopts a broader algorithmic scope, a richer predictor set, and a more clinically oriented evaluation. We conducted a unified, head-to-head comparison of nine ML models; complemented ROC–AUC/PR–AUC with calibration and decision curve analysis (DCA) to quantify clinical utility; and used SHAP to connect predictions with clinical reasoning at both global and patient levels59. These design choices enhance reproducibility and interpretability, and may partly explain why specific variables emerge as dominant predictors in both traditional between-group analyses and the ML framework.
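The net-benefit quantity underlying DCA can be sketched in a few lines (toy labels and probabilities for illustration; the study's implementation may differ). At threshold probability p_t, net benefit weighs true positives against false positives scaled by the odds of the threshold, and is compared against the "treat all" reference.

```python
import numpy as np

def net_benefit(y_true, y_prob, p_t):
    """Net benefit at threshold probability p_t:
    NB = TP/N - FP/N * p_t / (1 - p_t)."""
    pred = y_prob >= p_t
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    n = len(y_true)
    return tp / n - fp / n * p_t / (1 - p_t)

# Hypothetical toy cohort: 3 events among 8 patients
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])
p = np.array([0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.1])

nb_model = net_benefit(y, p, 0.5)          # model at threshold 0.5
nb_all = net_benefit(y, np.ones(8), 0.5)   # "treat all" reference strategy
```

Here the toy model attains a positive net benefit (0.125) while treating all patients is net harmful (−0.25) at this threshold, which is the comparison a decision curve traces across the clinically relevant threshold range.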
Recent radiomics-based studies have reported higher discrimination for metastasis-related endpoints. For example, a CT radiomics study published in BMC Cancer reported an AUC approaching 0.90 under an internal split-validation scheme60. Importantly, the higher AUCs reported in radiomics studies are often obtained under single-split internal validation without fold-confined preprocessing or probability calibration, which may inflate apparent performance. Direct numerical comparison is also not straightforward, because the predicted endpoint (e.g., brain metastasis versus overall distant metastasis), feature modality (radiomics versus routinely available clinicopathologic and treatment variables), and validation strategy differ. In the present study, we intentionally restricted predictors to readily available perioperative variables and adopted repeated nested cross-validation, which yields more conservative but less optimistically biased estimates (Table S4).
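The repeated nested cross-validation scheme described above can be sketched with scikit-learn on synthetic data. This is a minimal sketch, assuming a GBDT classifier and a ~4:1 class imbalance as in the study; outer repetitions are reduced to three here for brevity (the study used ten), and the hyperparameter grid is illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Synthetic stand-in for the cohort: ~4:1 class imbalance
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

# Outer loop: repeated stratified 70/30 splits (3 repeats for brevity)
outer = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
roc_aucs, pr_aucs = [], []
for train_idx, test_idx in outer.split(X, y):
    # Inner loop: fivefold tuning confined to the outer training fold
    grid = GridSearchCV(GradientBoostingClassifier(random_state=0),
                        {"max_depth": [2, 3]}, cv=5, scoring="roc_auc")
    grid.fit(X[train_idx], y[train_idx])
    prob = grid.predict_proba(X[test_idx])[:, 1]
    roc_aucs.append(roc_auc_score(y[test_idx], prob))
    pr_aucs.append(average_precision_score(y[test_idx], prob))

print(f"ROC-AUC {np.mean(roc_aucs):.3f}, PR-AUC {np.mean(pr_aucs):.3f}")
```

Because tuning never sees the outer test fold, the averaged held-out metrics are less optimistically biased than a single internal split, at the cost of wider variability across repetitions.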
Limitations of the research and future prospects
This study delivers a head-to-head, multi-algorithm comparison on a large, clinically rich cohort; complements discrimination with calibration, decision-curve, and precision–recall evaluation; and provides transparent explanations via SHAP at both global and individual levels. These choices enhance reproducibility, interpretability, and clinical relevance, supporting workflow integration for risk-stratified postoperative management. Nevertheless, several limitations merit discussion.
First, this was a single-center retrospective study, which may limit generalizability and transportability across clinical settings. Although we used repeated nested cross-validation with fold-confined preprocessing, tuning, and calibration to mitigate overfitting, together with systematic internal evaluation (ROC-/PR-AUC, calibration, and DCA) to reduce optimism, these steps do not substitute for external or multicenter validation, which remains warranted before clinical implementation. Generalizability may also be constrained by center-specific case mix, scanner/reconstruction variability, and local practice patterns. GBDT demonstrated stable convergence across increasing training-set sizes (Fig. 4d), suggesting robust internal generalization. This aligns with Lee et al.61, who noted that single-center datasets are prone to selection bias affecting performance across diverse populations.
Second, the mechanistic linkage of the selected features remains limited. LASSO screening (Fig. 2; e.g., 9 non-zero coefficients identified at λ = 0.028 via tenfold cross-validation) identified imaging-derived and clinicopathologic features with predictive value. However, the biological mechanisms underlying these features have not been linked to pathological indicators or molecular markers (e.g., EGFR). The coefficient distribution plot (Fig. 3) reflects only the statistical weight of the features, rather than providing insight into the tumor-microenvironmental context. This limitation mirrors a common gap in radiomics: statistically significant features often lack mechanistic validation, a challenge also highlighted by Hatt et al.62.
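The L1-penalized screening step can be sketched as follows. This is a minimal sketch on synthetic data that mimics the study's 52-predictor dimensionality; the actual variable set, penalty path, and selected λ differ, and the penalty strength here is chosen by cross-validation as in the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 52 candidate predictors, few truly informative
X, y = make_classification(n_samples=400, n_features=52,
                           n_informative=6, random_state=0)
X = StandardScaler().fit_transform(X)  # L1 penalties need comparable scales

# L1-penalized logistic regression; penalty strength picked by 10-fold CV
lasso = LogisticRegressionCV(Cs=20, cv=10, penalty="l1",
                             solver="liblinear", scoring="roc_auc",
                             random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_[0])  # indices of retained predictors
print(f"{selected.size} non-zero coefficients retained")
```

Shrinking uninformative coefficients exactly to zero is what preserves variable-level semantics: each retained predictor keeps its clinical identity, unlike projection-based reductions.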
Third, imaging-feature variability. Imaging-derived features can vary with slice selection, segmentation, and vendor/kernel differences. We mitigated and quantified these via semi-automated HU-thresholded segmentation for body composition extraction, isotropic resampling, and inter/intra-observer assessments (ICC(2,1), Bland–Altman, Dice, CV%). Sensitivity analyses (± HU, morphological operations) suggested minimal impact on conclusions (Table Sx, Figure Sx), but multicenter, multi-vendor data and automated segmentation are needed to further reduce measurement error63.
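The Dice overlap used in the inter/intra-observer assessment can be computed directly from binary masks; the sketch below uses toy masks, not study data.

```python
import numpy as np

def dice(a, b):
    """Dice similarity coefficient between two binary segmentation masks."""
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()
    denom = a.sum() + b.sum()
    return 2.0 * inter / denom if denom else 1.0  # both empty -> perfect

# Toy 8x8 masks: a 4x4 region and the same region shifted one row
m1 = np.zeros((8, 8), dtype=bool); m1[2:6, 2:6] = True   # 16 voxels
m2 = np.zeros((8, 8), dtype=bool); m2[3:7, 2:6] = True   # 16 voxels, shifted
score = dice(m1, m2)  # 12 overlapping voxels -> 2*12 / (16+16) = 0.75
```

The same masks also feed the ICC and Bland–Altman analyses once reduced to scalar measurements (e.g., area or mean HU per observer).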
Fourth, static postoperative inputs. The present model uses cross-sectional postoperative variables and does not incorporate time-varying information (e.g., treatment response, surveillance trajectories), which may understate risk in evolving clinical courses64.
Future work will include temporal and geographic external validation across multiple centers with differing imaging vendors, reconstruction kernels, and case-prevalence profiles to rigorously assess model generalizability. In parallel, we will evaluate model-updating strategies, incorporate SHAP interaction values to clarify synergistic feature effects, and explore dynamic prediction approaches together with multi-omics integration. Through these efforts, our goal is to transition the model from an “internally validated” construct to a deployable decision-support tool with verified performance in out-of-sample populations.
In this context, the current findings should be regarded as an internal benchmark pending external validation and head-to-head multicenter evaluation. To enable rigorous out-of-sample assessment and prospective clinical integration, we have provided experimental protocols, containerized tools, and a scoping dossier. These findings should be interpreted in the context of existing machine learning–based prognostic studies in lung cancer65.
Conclusion
This study established and compared nine machine learning models to predict distant metastasis after lung cancer surgery, with the GBDT model demonstrating the most robust performance (76.6% accuracy at the prespecified operating threshold; AUROC 0.810, 95% CI 0.748–0.872). SHAP-based interpretation revealed adjuvant chemotherapy, adjuvant radiotherapy, pathological N stage, BMI, age, and Pre-ANC as the principal contributors to metastatic risk. By enabling postoperative risk stratification, the model may support individualized diagnostic and therapeutic planning. Although constrained by a single-center cohort and the absence of external validation, the findings highlight the potential role of machine learning for lung cancer prognosis. Importantly, they lay a foundation for the next stage of research—integrating multi-omics data, leveraging multicenter cohorts, and moving toward a clinically deployable, transparent decision-support tool capable of enhancing personalized lung cancer management.
Supplementary Information