A machine learning framework predicts oncogenic driver mutations from SNP profiles in lung adenocarcinoma.
APA
Li J, Wang Z, et al. (2026). A machine learning framework predicts oncogenic driver mutations from SNP profiles in lung adenocarcinoma. Discover Oncology, 17(1). https://doi.org/10.1007/s12672-026-04745-3
MLA
Li J, et al. "A machine learning framework predicts oncogenic driver mutations from SNP profiles in lung adenocarcinoma." Discover Oncology, vol. 17, no. 1, 2026.
PMID
41831103
Abstract
[BACKGROUND] This study explores a new application of single nucleotide polymorphisms (SNPs) in the precision treatment of lung adenocarcinoma by integrating genomics and machine learning techniques.
[METHODS] The study is based on a cohort of 83 patients with pathologically diagnosed lung adenocarcinoma. Clinical features and SNP genotype data were integrated, and a gradient boosting decision tree (GBDT) algorithm was used to build the SNPdriver prediction framework. By adaptively learning the nonlinear interaction effects between SNP features, the framework performs binary classification of driver mutation status.
[RESULTS] The 83 patients with lung adenocarcinoma were randomly divided into two groups at a 7:3 ratio, with no significant difference in baseline characteristics (p > 0.05). The GBDT-based SNPdriver model uses an ensemble of 6 decision trees and produces a weighted prediction of mutation status through feature path splitting. In validation, the area under the curve (AUC) for predicting EGFR and KRAS mutations was 0.90 and 0.85, respectively, and the calibration curves confirmed that the predicted probabilities were highly consistent with the observed incidence.
[CONCLUSIONS] This study constructed the SNPdriver model for predicting driver gene mutations in lung adenocarcinoma based on SNP feature networks. Its high discriminatory power and clinical consistency support the potential of SNPs as multi-gene coregulatory biomarkers.
[SUPPLEMENTARY INFORMATION] The online version contains supplementary material available at 10.1007/s12672-026-04745-3.
Introduction
In recent years, genome-wide association studies (GWAS) have identified numerous single nucleotide polymorphisms (SNPs) associated with lung cancer susceptibility and treatment outcomes [1]. For instance, SNPs in DNA repair genes such as ERCC1 have been linked to platinum-based chemotherapy response in non-small cell lung cancer (NSCLC) [2], while polymorphisms in PD-L1 may influence immunotherapy efficacy [3]. Despite these advances, most SNP-based studies remain focused on prognostic or pharmacogenomic correlations, rather than predicting the presence of specific driver mutations, such as those in EGFR or KRAS, which are critical for targeted therapy selection in lung adenocarcinoma.
Meanwhile, machine learning (ML) has emerged as a powerful tool for integrating high-dimensional genomic data and modeling complex, non-linear relationships between genetic variants and phenotypes [4]. The key difference between traditional methods and machine learning is that ML models learn from instances, which are provided as input-output pairs for a given task. In imaging and pathology, computers can likewise use observation-based learning to determine the mapping from inputs to outputs, creating a model that generalizes to new, previously unseen inputs [5]. In contrast to traditional GWAS, which test single loci independently, ML approaches can capture multi-SNP interactions and improve predictive performance for polygenic traits. For example, ML models have been applied to SNP data to predict the risk of Alzheimer's disease and type 2 diabetes, demonstrating their capability to uncover synergistic genetic effects. However, the application of such approaches to predict driver mutations from SNP profiles in cancer remains underexplored, particularly in lung adenocarcinoma [6, 7]. Furthermore, specific SNP haplotypes may be co-inherited with genetic variants that predispose to particular mutagenic processes or carcinogenic pathways. SNPs within enhancer or promoter regions can modulate the expression of genes involved in DNA repair, cell cycle regulation, or apoptosis, thereby influencing mutation patterns and the selection of driver events.
To address this gap, we developed SNPdriver, a gradient boosting decision tree (GBDT)-based framework designed to predict EGFR and KRAS mutation status from genome-wide SNP data. GBDT is particularly suited for this task due to its ability to handle non-linear feature interactions, automatically assess SNP importance, and provide interpretable decision pathways. By leveraging SNP data from lung adenocarcinoma patients, we aimed to establish a proof-of-concept model that links SNP signatures to driver mutation status, offering a potential non-invasive tool for mutation screening and personalized therapeutic planning.
Materials and methods
Study population and dataset
Table 1 lists the 83 patients with lung adenocarcinoma who were pathologically diagnosed at the BC Cancer Research Centre between October 23, 2019 and July 28, 2020. Personal medical history, including gender, age, race, stage, and smoking history, was recorded for all patients. After surgical resection and evaluation by professional pathologists, samples from these patients were profiled on Affymetrix SNP 6.0 arrays, and the data were uploaded to the Gene Expression Omnibus (GEO) database under accession number GSE139294 [8].
Constructing a prediction model SNPdriver based on gradient boosting decision tree
In this study, for each prediction task (EGFR or KRAS mutation status), samples carrying the corresponding driver gene mutation were defined as the positive class. The GBDT algorithm was used to construct a driver gene mutation prediction model for lung adenocarcinoma. Compared with linear models such as logistic regression, GBDT can capture interaction effects and nonlinear patterns without manual feature engineering. In addition, GBDT iteratively corrects the errors of previous trees, refines decision boundaries, and improves the interpretability of the model by providing feature importance scores derived from tree splits [9]. The model expresses the relative likelihood that a sample belongs to the positive class as a score F(x) on the log-odds scale, which is converted into a probability through the sigmoid function:
$$p(x) = \frac{1}{1 + e^{-F(x)}}$$
Here, F(x) represents the log-odds that sample x belongs to the positive class.
The predicted values of each new tree h_m(x) are accumulated into F(x), gradually correcting the model's prediction error for the samples:
$$F_m(x) = F_{m-1}(x) + \nu \, \gamma_m \, h_m(x)$$
The learning rate ν controls the update amplitude, and γ_m is the optimal step size.
Each regression tree h_m(x) is constructed with the pseudo-residuals as the target values. The quality of node splits is evaluated using friedman_mse. In the fitted trees, samples denotes the number of samples in a node, and value is the predicted value of the node.
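The accumulation of tree outputs into a log-odds score and its conversion into a probability can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the initial score, the learning rate, and the per-tree outputs are made-up values, and the step size γ_m is assumed to be already folded into each tree's output.

```python
import math
from typing import Sequence


def sigmoid(score: float) -> float:
    """Convert a log-odds score F(x) into a probability."""
    return 1.0 / (1.0 + math.exp(-score))


def boosted_score(initial_score: float,
                  tree_outputs: Sequence[float],
                  learning_rate: float) -> float:
    """Accumulate F_m(x) = F_{m-1}(x) + learning_rate * h_m(x) over all trees."""
    score = initial_score
    for tree_output in tree_outputs:
        score += learning_rate * tree_output
    return score


# Hypothetical outputs of six trees for a single sample (illustrative only).
tree_outputs = [0.8, 0.5, -0.2, 0.4, 0.3, 0.2]
final_score = boosted_score(initial_score=0.0,
                            tree_outputs=tree_outputs,
                            learning_rate=0.5)
probability = sigmoid(final_score)
print(f"log-odds = {final_score:.3f}, probability = {probability:.3f}")
```

A positive log-odds score maps to a probability above 0.5, and each additional tree shifts the score by a small, learning-rate-scaled amount.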
Develop predictive model SNPdriver
In the binary classification task for driver gene mutations in lung adenocarcinoma, we use the GBDT algorithm to generate probability predictions by modeling the log-odds. The function is expressed as:
$$F(x) = \sum_{k=1}^{6} \sum_{i=1}^{n_k} \mathrm{value}_{k,i} \cdot I(x \in \mathrm{leaf}_{k,i})$$
where k is the index of the tree, running from 1 to 6 (there are 6 trees); n_k is the number of leaf nodes in the k-th tree; value_{k,i} is the value of the i-th leaf node of the k-th tree; and I(·) is an indicator function equal to 1 if sample x falls on the path leading to the i-th leaf node of the k-th tree, and 0 otherwise.
The output of each tree is the indicator-weighted sum of the values of all its leaf nodes, so only the leaf whose path the sample follows contributes. The final output F(x) of the model is therefore the sum of the outputs of all trees.
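The leaf-value sum can be illustrated with a small sketch. Because the indicator is 1 for exactly one leaf per tree, each tree's contribution reduces to the value of the leaf the sample reaches. The tree structures, thresholds, leaf values, and feature names (`snp_a`, `snp_b`) below are hypothetical, and only two trees are shown instead of six to keep the example short.

```python
def tree_output(tree, sample):
    """Follow the one root-to-leaf path the sample satisfies; return that leaf's value."""
    node = tree
    while isinstance(node, dict):  # internal nodes are dicts, leaves are plain floats
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node


def ensemble_score(trees, sample):
    """F(x): sum of the reached leaf values over all trees."""
    return sum(tree_output(tree, sample) for tree in trees)


# Two hypothetical trees splitting on label-encoded genotypes.
tree_1 = {"feature": "snp_a", "threshold": 1.5, "left": 0.6, "right": -0.4}
tree_2 = {
    "feature": "snp_b", "threshold": 1.5,
    "left": -0.1,
    "right": {"feature": "snp_a", "threshold": 0.5, "left": 0.3, "right": 0.1},
}

sample = {"snp_a": 0, "snp_b": 3}
score = ensemble_score([tree_1, tree_2], sample)
print(f"F(x) = {score:.2f}")
```

For this sample, the first tree contributes 0.6 and the second contributes 0.3, so the indicator-weighted double sum and the path-following shortcut give the same score.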
Label encoding of SNP genotypes
All SNP positions in this study are based on the GRCh37/hg19 human reference genome assembly. We adopted a common data preprocessing method, label encoding [10]. The genotypes ('AA', 'AB', 'NA', 'BB') in the sequencing matrix were encoded as (0, 1, 2, 3), assigning a numerical label to each genotype and converting the categorical SNP genotype data into numerical form. The numerical labels do not alter the intrinsic meaning of the genotypes, and the distance between the numerical values carries no meaning [11]. This helps improve the training efficiency of the model, ensuring equal consideration for all genotypes and reducing potential biases.
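The encoding step amounts to a fixed lookup table. The sketch below uses the mapping stated above; the function name and the whitespace stripping are illustrative choices, not part of the published pipeline.

```python
# Mapping taken from the text: ('AA', 'AB', 'NA', 'BB') -> (0, 1, 2, 3).
GENOTYPE_CODES = {"AA": 0, "AB": 1, "NA": 2, "BB": 3}


def encode_genotypes(calls):
    """Map genotype calls to integer labels; an unknown call raises KeyError."""
    # strip() is a defensive assumption to tolerate stray whitespace in the calls.
    return [GENOTYPE_CODES[call.strip()] for call in calls]


encoded = encode_genotypes(["AA", "BB", "NA", "AB", "AA "])
print(encoded)
```

With this mapping, a tree split of the form "value ≤ 1.5" separates AA and AB calls from NA and BB calls.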
SNP selection and feature importance
To ensure the translational relevance and biological interpretability of the model, a structured feature selection and importance analysis pipeline was implemented within the GBDT framework. In GBDT, the importance of each SNP feature is quantified by the total reduction in the split criterion (friedman_mse) contributed by that feature across all decision trees in the ensemble. Specifically, for each split in every tree, the improvement in MSE brought by the split is weighted by the number of samples reaching that node. These improvements are then summed across all trees and normalized to obtain a relative importance score for each SNP. SNP features were ranked according to their importance scores. For both EGFR and KRAS mutation status, we retained SNPs with importance scores above the 90th percentile of the distribution. This threshold was chosen to balance model complexity and interpretability, ensuring that only SNPs with substantial contributions to prediction were included in the final signature.
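The percentile-based filter can be sketched as follows. The importance scores and feature names are invented for illustration, and the percentile uses linear interpolation between order statistics, which is an assumption because the text does not state which percentile definition was used.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def select_top_features(importances, q=90.0):
    """Keep features whose importance is strictly above the q-th percentile."""
    cutoff = percentile(list(importances.values()), q)
    return sorted(name for name, score in importances.items() if score > cutoff)


# Hypothetical normalized importance scores for ten SNP features (sum to 1).
scores = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.10, 0.15, 0.20, 0.40]
importances = {f"snp_{i}": score for i, score in enumerate(scores)}

cutoff = percentile(list(importances.values()), 90.0)
selected = select_top_features(importances)
print(f"cutoff = {cutoff:.2f}, selected = {selected}")
```

With ten features, only the single most important one exceeds the 90th percentile, which shows how aggressive this threshold is on small feature sets.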
Evaluation of predictive performance of models
The receiver operating characteristic (ROC) curve was analyzed and the area under the curve (AUC) was calculated to evaluate the diagnostic performance of the model [12, 13]. The Youden index of the ROC curve was used to determine the optimal cut-off point, from which the accuracy, sensitivity, specificity, positive predictive value, and negative predictive value of the model were calculated to further evaluate its diagnostic ability [14, 15]. Predictions were made on the test set, and the estimated probabilities for EGFR and KRAS mutation status were grouped into deciles of predicted risk. For each bin, the mean predicted probability was plotted against the observed fraction of positive cases, and perfect calibration would align with the diagonal line. No post-hoc probability calibration was applied; the curves reflect the inherent calibration of the GBDT model's probability outputs, obtained from the sigmoid-transformed ensemble scores. Given the potential imbalance in mutation prevalence within our cohort, we assessed model performance using both the ROC curve and the precision-recall (PR) curve, which is more informative under class imbalance. During GBDT training, we did not apply explicit class re-weighting or resampling, as gradient boosting inherently adjusts sample contributions through iterative residual fitting. The model prioritizes correct classification of the minority class by focusing on samples with larger gradients, thereby naturally mitigating imbalance effects. The PR curve and the area under the PR curve (AUPR) were used as primary metrics to evaluate performance on the minority class, ensuring that high precision could be maintained at varying recall levels [16, 17].
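The Youden-index cut-off selection described above can be sketched as a scan over candidate thresholds, choosing the one that maximizes sensitivity + specificity − 1. The labels and scores are toy values, and the tie-breaking rule (keep the lowest threshold) is an assumption of this sketch.

```python
def confusion_counts(labels, scores, threshold):
    """Return (tp, fp, tn, fn), predicting positive when score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return tp, fp, tn, fn


def youden_cutoff(labels, scores):
    """Threshold maximizing J = sensitivity + specificity - 1 (needs both classes)."""
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy test-set labels (1 = mutated) and predicted probabilities.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

best_threshold, best_j = youden_cutoff(labels, scores)
tp, fp, tn, fn = confusion_counts(labels, scores, best_threshold)
accuracy = (tp + tn) / len(labels)
ppv = tp / (tp + fp)   # positive predictive value
npv = tn / (tn + fn)   # negative predictive value
print(best_threshold, best_j, accuracy, ppv, npv)
```

Once the cut-off is fixed, accuracy, sensitivity, specificity, and the predictive values all follow from the same four confusion-matrix counts.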
Statistical methods
Statistical analysis was conducted using R 4.3.3 and Python 3.8. Key Python packages included scikit-learn 1.3.0, pandas 2.1.0, NumPy 1.24.3, SciPy 1.11.3, and Matplotlib 3.7.2. Quantitative data with normal distribution and homogeneity of variance are presented as mean ± standard deviation ($\bar{x} \pm s$) [18]. For comparisons between two groups, the independent-samples t-test was used for continuous variables that conformed to a normal distribution, the Mann-Whitney U test for continuous variables that did not, and the Pearson chi-square test for categorical variables [19]. Spearman correlation analysis was used to evaluate the correlation between features, with the strength of the correlation represented by the correlation coefficient. When evaluating the diagnostic performance of a model, an AUC greater than 0.75 is generally considered to indicate good predictive performance [12]. A calibration curve that approaches or coincides with the ideal calibration line indicates high consistency between the predicted and observed probabilities, that is, good calibration of the model [16]. In all statistical tests, a two-tailed P < 0.05 was considered statistically significant.
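For readers less familiar with the AUC, it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by direct pairwise comparison on toy data. It is illustrative only and is not how the study's values were produced.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy labels (1 = mutated) and predicted probabilities.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")
```

Here 15 of the 16 positive-negative pairs are ranked correctly, so this toy classifier would clear the 0.75 threshold mentioned above.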
Results
Clinical characteristics of cohort
The population was randomly divided at a 7:3 ratio into a training set (n = 58) and a testing set (n = 25). Pack-years, gender, race, smoking history, EGFR status, KRAS status, and tumor staging did not differ significantly between the testing and training sets (p > 0.05), indicating that the two groups are well balanced in these variables and are suitable for subsequent model training and validation (Table 1).
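A 7:3 random split of 83 patients that yields 58 training and 25 testing samples can be reproduced in outline as below. The seed, the rounding rule, and the absence of stratification are assumptions of this sketch, since the text does not describe how the split was implemented.

```python
import random


def split_indices(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and cut them into training and testing parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local generator, reproducible by seed
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(83)
print(len(train_idx), len(test_idx))
```

Rounding 83 × 0.7 to the nearest integer gives the 58/25 partition reported for the cohort.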
SNPdriver output and SNP round by round filtering
When SNPdriver was used to predict EGFR driver mutations, the model gradually corrected the prediction bias over 6 rounds of iteration: the first tree (Tree1) performed a coarse screening based on the SNP8305104 genotype (accuracy of 41%), and the subsequent trees (Tree2-6) introduced interactions with SNP1787318 and other sites, raising the overall model accuracy to 89% (Δ +21%, p < 0.001) (Supplementary Fig. 1).
When SNPdriver was used to predict KRAS driver mutations, Tree1 performed baseline risk stratification based on SNP1883190 (accuracy of 41%), and subsequent trees introduced key loci such as SNP8462767 in sequence. The final ensemble model reached an AUC of 0.85 (p < 0.001). Specifically, Tree4 accurately identified 17 KRAS mutations through tandem determination at SNP8437376 ≤ 1.5 (Supplementary Fig. 2).
SNPs signatures associated with driver gene mutations
A total of 28 SNPs are associated with susceptibility to EGFR gene mutations, and they are distributed across different chromosomes. Among them, 2 SNPs (rs5942628 and rs2651176) are on the X chromosome, 3 SNPs each are on chromosomes 2 and 5, and the highest number is on chromosome 3, with 5 SNPs. In addition, A/G and C/T SNPs, which are transitions (substitutions between two purines or between two pyrimidines), are more common, while transversions are relatively rare (Table 2).
A total of 25 SNPs are associated with susceptibility to KRAS gene mutations. As for EGFR, the SNP genotypes are mainly A/G and C/T, but only one SNP (rs1965009) is on the X chromosome. Moreover, chromosome 5 carries the highest number of SNPs, with 5, which may indicate that these chromosomal regions are strongly associated with the function or regulation of the EGFR and KRAS genes (Table 3).
Constructing EGFR and KRAS mutation prediction models using SNP signatures
We used the SNP signatures screened by SNPdriver to predict EGFR and KRAS mutation status in lung adenocarcinoma patients. The model shows good discriminative ability, with an AUC of 0.90 for predicting EGFR mutations in the test set (Fig. 1A) and a slightly lower AUC of 0.85 for predicting KRAS mutations in the test set (Fig. 1D). The calibration curves show good consistency between the predicted probabilities and the observed outcomes, with calibration slopes close to ideal (Fig. 1B and E). Considering the potential imbalance in the allocation of mutation-positive cases in the partitioned dataset, we also evaluated the precision and recall of the model. The AUPR was 0.83 for EGFR mutations (Fig. 1C) and 0.91 for KRAS mutations (Fig. 1F), indicating that the model can maintain high precision even at high recall. To further assess the robustness and generalizability of the SNPdriver model, we performed external validation on two independent lung adenocarcinoma cohorts. In the TCGA-LUAD cohort, the AUC was 0.83 for predicting EGFR mutations and 0.76 for KRAS mutations. In the European LUAD cohort, the AUC was 0.85 for predicting EGFR mutations and 0.78 for KRAS mutations. These results consistently demonstrate the robustness and applicability of the SNPdriver framework across different patient populations and data sources (Supplementary Fig. 3).
Comparison with alternative machine learning models
To evaluate the advantage of our GBDT-based SNPdriver approach, we performed a comprehensive comparison against three other widely used machine learning models: XGBoost, Random Forest, and LASSO-logistic regression. The detailed performance metrics (AUC, Accuracy, Sensitivity, Specificity with 95% CIs) for both training and test sets are presented in Supplementary Table 1. Our SNPdriver model demonstrated superior performance, achieving the highest AUC (0.85) and accuracy (0.77) on the independent test set, while maintaining the best balance between sensitivity (0.95) and specificity (0.60). Notably, the performance gap between training and test sets for SNPdriver was smaller than for other models, suggesting good generalization ability without severe overfitting.
Clinical characteristics of cohort
According to a 7:3 ratio, the population was randomly divided into a training set (n = 58) and a testing set (n = 25). We can see that pack-years, gender, race, smoking history, EGFR status, KRAS status, and tumor staging did not show significant differences between the testing and training sets (p > 0.05), indicating that the two groups have good balance in these variables and are suitable for subsequent model training and validation (Table 1).
SNPdriver output and SNP round by round filtering
When SNPdriver was used to predict EGFR driver mutations, the model progressively corrected its prediction error over 6 rounds of iteration: the first tree (Tree1) performed a coarse screen based on the SNP8305104 genotype (accuracy of 41%), and the subsequent trees (Tree2-6) introduced interactions with SNP1787318 and other loci, resulting in an overall model accuracy improvement of 89% (Δ +21%, p < 0.001) (Supplementary Fig. 1).
When SNPdriver was used to predict KRAS driver mutations, Tree1 performed baseline risk stratification based on SNP1883190 (accuracy of 41%), and subsequent trees introduced key loci such as SNP8462767 in sequence. The final ensemble model reached an AUC of 0.85 (p < 0.001). Specifically, Tree4 correctly identified 17 KRAS mutations through a serial decision on SNP8437376 ≤ 1.5 (Supplementary Fig. 2).
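A split such as SNP8437376 ≤ 1.5 is easiest to read if genotypes are coded as allele counts 0, 1, and 2, in which case the threshold separates carriers of two copies from everyone else. That encoding is our assumption and is not stated in the text; the genotype vector below is hypothetical.

```python
def split_on_genotype(genotypes, threshold=1.5):
    """Partition sample indices the way a single tree node would.

    Assumes additive coding: 0, 1, or 2 copies of the counted allele.
    """
    left = [i for i, g in enumerate(genotypes) if g <= threshold]
    right = [i for i, g in enumerate(genotypes) if g > threshold]
    return left, right


# Hypothetical genotypes for six samples at one SNP.
genotypes = [0, 2, 1, 2, 0, 1]
left, right = split_on_genotype(genotypes)
```

Under this coding, the left branch holds samples with 0 or 1 copies and the right branch holds samples with 2 copies.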
SNP signatures associated with driver gene mutations
A total of 28 SNPs were associated with susceptibility to EGFR mutations, distributed across different chromosomes. Two SNPs (rs5942628 and rs2651176) are on the X chromosome, chromosomes 2 and 5 each carry 3 SNPs, and chromosome 3 carries the most, with 5 SNPs. A/G and C/T genotypes, which are transitions (substitutions between two purines or between two pyrimidines), were the most common, whereas transversions (substitutions between a purine and a pyrimidine) were relatively rare (Table 2).
A total of 25 SNPs were associated with susceptibility to KRAS mutations. As with EGFR, the genotypes were mainly A/G and C/T, but only one SNP (rs1965009) is on the X chromosome. Chromosome 5 carries the most SNPs, with 5, which may indicate that these chromosomal regions are strongly associated with the function or regulation of the EGFR and KRAS genes (Table 3).
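The distinction between A/G or C/T substitutions and the rarer purine-pyrimidine exchanges follows a simple rule, sketched below. The function name and error handling are illustrative choices of ours.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(ref, alt):
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a valid single-base substitution: {ref}/{alt}")
    # Transition: both bases are purines, or both are pyrimidines.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


classes = {pair: substitution_class(*pair.split("/"))
           for pair in ["A/G", "C/T", "A/C", "G/T"]}
```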
Discussion
Using genomic data from 83 patients with lung adenocarcinoma, this study constructed the SNPdriver model with GBDT to explore the SNP features associated with susceptibility to EGFR and KRAS mutations and their predictive efficacy. The results show that SNPdriver can identify SNP loci associated with EGFR and KRAS mutations, and that the predictive model built on these loci distinguishes mutation states with high accuracy. This finding not only provides a new approach for non-invasive prediction of driver gene mutations in lung adenocarcinoma, but also offers candidate genetic markers for understanding the molecular mechanisms underlying susceptibility to these mutations.
The model integrates the predictions of multiple decision trees into a nonlinear additive prediction function F(x). Its core idea is to approximate the optimum of the objective function step by step through a weighted combination of weak learners. Each decision tree partitions the feature space recursively, so that every sample follows one path to a leaf node, and the final predicted value for a sample is the sum of its leaf node values across all trees [20]. This architecture inherits the ability of decision trees to capture interactions among high-dimensional, nonlinear features [21], and the gradient boosting strategy improves generalization, which helps the model maintain stable predictions even with a limited sample size.
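The additive form described above, in which a sample's score is the sum of the leaf values it reaches in each tree, can be sketched as follows. The trees here are hypothetical depth-1 stumps with made-up SNP names, thresholds, and leaf values, and the logistic link used to turn the score into a probability is an assumption; none of it reproduces the fitted SNPdriver model.

```python
import math

# Each hypothetical tree is a stump: (snp_name, threshold, left_value, right_value).
TREES = [
    ("snp_a", 0.5, -0.4, 0.6),
    ("snp_b", 1.5, -0.2, 0.8),
    ("snp_a", 1.5, -0.1, 0.3),
]


def raw_score(sample, trees, base_score=0.0):
    """F(x): sum the value of the one leaf the sample reaches in every tree."""
    total = base_score
    for snp, threshold, left_value, right_value in trees:
        total += left_value if sample[snp] <= threshold else right_value
    return total


def predict_probability(sample, trees):
    """Map the additive score to a probability with a logistic link (assumed)."""
    return 1.0 / (1.0 + math.exp(-raw_score(sample, trees)))


# Hypothetical sample with genotypes coded as allele counts.
sample = {"snp_a": 2, "snp_b": 0}
score = raw_score(sample, TREES)        # 0.6 - 0.2 + 0.3
prob = predict_probability(sample, TREES)
```

Because the score is a plain sum, each tree's contribution to a given prediction can be read off directly, which is the basis for the partial interpretability discussed in this section.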
From the perspective of model design, the ensemble size of 6 trees may have been optimized through cross-validation, balancing the avoidance of overfitting against computational efficiency. The leaf node value value_{k,i} of each tree reflects the local adjustment that path i of tree k contributes to the target variable (such as EGFR or KRAS mutation status), while the indicator function path_{k,i}(x) is activated only when a sample’s SNP feature combination follows that path, which lets the model adapt to each sample. Although GBDT is often regarded as a “black box” model, the decomposability of F(x) provides a basis for partial interpretability. For example, analyzing the most frequently activated paths and their corresponding SNP combinations can identify loci that contribute substantially to predicting driver gene mutations. These loci may be associated with previously reported chromosome-specific SNPs [22], suggesting that the model may implicitly capture synergistic regulatory patterns of chromosomal regions on gene function. During training, the gradient boosting algorithm iteratively fits the negative gradient of the loss at the current model, gradually reducing the loss function. This strategy lets the model prioritize the most salient feature patterns in the data, which improves its ability to recognize minority-class samples. This may partially explain why the AUPR for KRAS mutations is notably higher than that for EGFR mutations (0.91 vs 0.83) under class imbalance: gradient optimization implicitly adjusts the weight given to samples from different classes, maintaining high precision even at high recall.
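Written out, the decomposition into leaf values and path indicator functions described above takes the following form. The symbols K (number of trees, 6 here) and L_k (number of leaves in tree k) are our notation, and the logistic link and log-loss pseudo-residual in the second line are assumptions about the training objective, which the text does not specify.

```latex
% Additive prediction function: each sample activates exactly one path per tree.
F(x) = \sum_{k=1}^{K} \sum_{i=1}^{L_k} \mathrm{value}_{k,i}\,\mathrm{path}_{k,i}(x),
\qquad \mathrm{path}_{k,i}(x) \in \{0, 1\},
\qquad \sum_{i=1}^{L_k} \mathrm{path}_{k,i}(x) = 1

% Assumed logistic link and log-loss: tree k is fitted to the negative gradient
% of the loss evaluated at the previous model F_{k-1}.
p(x) = \frac{1}{1 + e^{-F(x)}},
\qquad
r_j^{(k)} = -\left.\frac{\partial \ell\bigl(y_j, F\bigr)}{\partial F}\right|_{F = F_{k-1}(x_j)}
          = y_j - p_{k-1}(x_j)
```

Under these assumptions the pseudo-residual is largest for samples the current model classifies confidently but wrongly, which is the sense in which boosting concentrates later trees on hard cases.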
In addition, the good fit of the calibration curves indicates that the probabilities output by the model are highly consistent with the observed frequencies. We attribute this to GBDT producing outputs on a probability scale without relying on additional post-processing steps, which further enhances its practicality in clinical risk assessment.
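A calibration curve of the kind discussed here compares, within bins of predicted probability, the mean prediction with the observed event rate. The sketch below uses equal-width bins and hypothetical toy data; the binning scheme and bin count are assumptions, not the authors' settings.

```python
def calibration_table(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            rows.append((mean_pred, observed, len(members)))
    return rows


# Hypothetical toy data: 1 = mutation present, 0 = wild type.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_prob = [0.1, 0.1, 0.3, 0.3, 0.9, 0.9, 0.9, 0.9]
rows = calibration_table(y_true, y_prob)
```

A well-calibrated model yields rows whose first two entries are close; in this toy example the top bin predicts about 0.9 but observes 0.75, a mild overestimate.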
In terms of model performance, SNPdriver predicts EGFR mutations better than KRAS mutations by AUC, which may be related to the larger number of EGFR-associated SNPs or to their stronger direct regulatory effect on gene function. Although the AUC for KRAS mutations is slightly lower, its AUPR is higher than that for EGFR mutations, indicating that the model can still balance precision and recall in class-imbalanced data. The good fit of the calibration curves further indicates that the model’s predicted probabilities agree closely with the true mutation status, which is particularly important in clinical decision-making and can reduce the risk of misjudgment caused by probability bias [23]. Compared with existing research, the innovation of SNPdriver lies in integrating SNP features from multiple chromosomes into a joint prediction model rather than relying on a single locus or clinical indicator. This provides a new strategy for overcoming the impact of tumor heterogeneity on prediction performance [24, 25].
However, this study has several limitations. First, the small sample size may limit the generalization ability of the model, and its stability needs to be verified in future multi-center cohorts. Second, although the loci screened by SNPdriver are statistically significant, their biological functions have not been experimentally validated. For example, whether X chromosome SNPs affect driver gene mutations through immune surveillance evasion or hormone regulatory pathways still needs further exploration [26]. Third, the shallow ensemble of 6 trees may limit the model’s ability to characterize complex SNP interaction effects. Although this helps to avoid overfitting, it may miss the synergistic signals of low-frequency or weak-effect SNPs [27]. Fourth, the biological interpretation of leaf node values remains challenging; for example, whether a high-weight path corresponds to a specific chromosomal regulatory element or signaling pathway needs to be validated with functional genomics data. Fifth, although GBDT is robust to missing features and noise, population structure bias in the training data, such as race- or gender-related differences in SNP frequency, may affect the model’s external validation performance. Finally, the occurrence and development of lung adenocarcinoma involve the synergistic effects of multiple genes, and future work can combine multi-omics data such as copy number variation and methylation to further enhance the explanatory power of the model [28].
The robust performance of the SNPdriver model supports its potential integration into the clinical workflow for lung adenocarcinoma management, primarily as a pre-screening and decision-support tool rather than a standalone diagnostic. First, in resource-constrained settings, or for patients in whom tissue biopsy is high-risk or yields insufficient material, SNPdriver could be used as a non-invasive, low-cost pre-screening filter. Patients with a high predicted probability of harboring an EGFR or KRAS mutation could be prioritized for subsequent, more definitive tissue-based genomic testing. This triage function could optimize resource allocation and reduce the number of unnecessary invasive procedures. Second, while not replacing tissue confirmation, the model’s prediction, available from peripheral blood or saliva-derived DNA, could provide early, probabilistic insights while tissue results are pending. This could facilitate preliminary discussions between clinicians and patients about potential targeted therapy options, especially when biopsy results are delayed or the initial biopsy material is inadequate for molecular testing. Third, the continuous probability output from SNPdriver could be combined with established clinical variables to construct more comprehensive personalized risk profiles. This integrated approach could help stratify patients into different management pathways, such as intensifying surveillance for those with a high predicted risk of aggressive subtypes or guiding enrollment into clinical trials for targeted therapies. By clearly positioning SNPdriver as an adjunct to, not a replacement for, gold-standard tissue testing, this framework aims to enhance clinical efficiency and decision-making while acknowledging the current necessity of confirmatory diagnosis.
In summary, the SNPdriver model predicts driver gene mutation status in lung adenocarcinoma by mining SNP features, and its good discrimination and calibration make it a potential tool for auxiliary clinical diagnosis. The chromosome-specific SNPs identified provide new clues for understanding the genetic susceptibility mechanisms of driver gene mutations, and the model construction strategy can serve as a reference for genotype-phenotype association studies in other tumors.
Conclusions
This study constructed the SNPdriver model for predicting driver gene mutations in lung adenocarcinoma based on SNP feature networks. By analyzing nonlinear SNP interaction effects with a machine learning algorithm, it provides a framework for interpretable mapping from single nucleotide polymorphisms to driver gene phenotypes in lung adenocarcinoma. Its high discriminatory power (AUC 0.85–0.90) and clinical consistency support the potential of SNPs as multi-gene coregulatory biomarkers. In the future, the application value of the model in clinical decision support systems will be further improved by expanding the cohort and integrating multi-omics data.
Supplementary Information
Below is the link to the electronic supplementary material.
Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original article when quoting.