Gene Expression-Based Colorectal Cancer Prediction Using Machine Learning and SHAP Analysis.
To develop and validate a genetic diagnostic model for colorectal cancer (CRC).
APA
Yin Y, Yang Z, et al. (2026). Gene Expression-Based Colorectal Cancer Prediction Using Machine Learning and SHAP Analysis. Genes, 17(1). https://doi.org/10.3390/genes17010114
MLA
Yin Y, et al. "Gene Expression-Based Colorectal Cancer Prediction Using Machine Learning and SHAP Analysis." Genes, vol. 17, no. 1, 2026.
PMID
41595533
Abstract
To develop and validate a genetic diagnostic model for colorectal cancer (CRC).
First, differentially expressed genes (DEGs) between colorectal cancer and normal groups were screened using the TCGA database. Subsequently, a two-sample Mendelian randomization analysis was performed using eQTL genomic data from the IEU OpenGWAS database and colorectal cancer outcomes from the R12 Finnish database to identify associated genes. The genes identified by both methods were used to develop and validate the CRC genetic diagnostic model with nine machine learning algorithms: Lasso Regression, XGBoost, Gradient Boosting Machine (GBM), Generalized Linear Model (GLM), Neural Network (NN), Support Vector Machine (SVM), k-Nearest Neighbors (KNN), Random Forest (RF), and Decision Tree (DT).
A total of 3716 DEGs were identified from the TCGA database, while 121 genes were associated with CRC based on the eQTL Mendelian randomization analysis. The intersection of the two gene sets yielded 27 genes. Among the nine machine learning methods, XGBoost achieved the highest AUC, 0.990. The top five genes ranked by XGBoost (RIF1, GDPD5, DBNDD1, RCCD1, and CLDN5), along with the five most significantly differentially expressed genes (, , , , and ) in the GSE87211 dataset, were selected to construct the final CRC genetic diagnostic model. ROC curve analysis gave an AUC (95% CI) of 0.9875 (0.9737-0.9875) for the training set and 0.9601 (0.9145-0.9601) for the validation set, indicating strong predictive performance. SHAP model interpretation further identified and as the most influential genes in the XGBoost model, with both making positive contributions to the model's predictions. The gene expression profile in colorectal cancer is characterized by enhanced cell proliferation, elevated metabolic activity, and immune evasion.
A genetic diagnostic model constructed based on ten genes (, , , , , , , , , and ) demonstrates strong predictive performance. This model holds significant potential for the early diagnosis and intervention of colorectal cancer, contributing to the implementation of third-tier prevention strategies.
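As an illustration of two steps the abstract describes, intersecting the DEG and Mendelian randomization gene sets and scoring a classifier by AUC, the following is a minimal sketch in plain Python. It is not the authors' pipeline: the function names `intersect_genes` and `roc_auc`, the placeholder gene names, and the toy labels and scores are all assumptions made for illustration, and the AUC is computed with the pairwise ranking definition rather than any particular library.

```python
from typing import Iterable, Sequence


def intersect_genes(degs: Iterable[str], mr_genes: Iterable[str]) -> list[str]:
    """Return the genes present in both inputs, sorted for reproducibility."""
    return sorted(set(degs) & set(mr_genes))


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive sample scores higher
    than a random negative sample; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy inputs: placeholder gene names, not the study's gene lists.
toy_degs = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
toy_mr_genes = ["GENE_C", "GENE_A", "GENE_E"]
candidates = intersect_genes(toy_degs, toy_mr_genes)

# Hypothetical labels (1 = tumor, 0 = normal) and predicted probabilities.
toy_labels = [1, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.1]
auc = roc_auc(toy_labels, toy_scores)

print(candidates)
print(round(auc, 4))
```

The pairwise definition is quadratic in the number of samples, which is acceptable for a sketch; for cohorts of realistic size, a rank-based computation or an established library routine would be the more usual choice.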
MeSH Terms
Humans; Colorectal Neoplasms; Machine Learning; Gene Expression Regulation, Neoplastic; Biomarkers, Tumor; Gene Expression Profiling; Quantitative Trait Loci; Databases, Genetic
Highly cited papers by the same first author (5)
- CO2 transoral laser microsurgery for early glottic carcinoma with anterior commissure involvement.
- Reply: Persistent Racial Disparities in Lung Cancer Survival and the Overlooked Role of Post-Treatment Care.
- Racial/Ethnic Disparities in Non-Small Cell Lung Cancer Mortality in the U.S., 2000-2020: A Population-Based Study.
- Mechanistic study of deoxycholic acid in colorectal cancer based on network toxicology and machine learning approaches.
- Ultrasound Responsive Mn/Se-Nanozyme as PANoptosis Initiators for Bladder Cancer Immunotherapy.