Gene Expression-Based Colorectal Cancer Prediction Using Machine Learning and SHAP Analysis.

Genes 2026 Vol.17(1)

Yin Y, Yang Z, Li X, Gong S, Xu C

Cite this paper

APA Yin Y, Yang Z, et al. (2026). Gene Expression-Based Colorectal Cancer Prediction Using Machine Learning and SHAP Analysis. Genes, 17(1). https://doi.org/10.3390/genes17010114
MLA Yin Y, et al. "Gene Expression-Based Colorectal Cancer Prediction Using Machine Learning and SHAP Analysis." Genes, vol. 17, no. 1, 2026.
PMID 41595533

Abstract

This study aimed to develop and validate a genetic diagnostic model for colorectal cancer (CRC).

First, differentially expressed genes (DEGs) between colorectal cancer and normal groups were screened using the TCGA database. Subsequently, a two-sample Mendelian randomization analysis was performed using eQTL data from the IEU OpenGWAS database and colorectal cancer outcomes from the R12 Finnish database to identify associated genes. The genes identified by both methods were used to develop and validate the CRC genetic diagnostic model with nine machine learning algorithms: Lasso Regression, XGBoost, Gradient Boosting Machine (GBM), Generalized Linear Model (GLM), Neural Network (NN), Support Vector Machine (SVM), k-Nearest Neighbors (KNN), Random Forest (RF), and Decision Tree (DT).

A total of 3716 DEGs were identified from the TCGA database, and 121 genes were associated with CRC in the eQTL Mendelian randomization analysis. The intersection of the two gene sets contained 27 genes. Among the nine machine learning methods, XGBoost achieved the highest AUC, 0.990. The top five genes ranked by XGBoost (RIF1, GDPD5, DBNDD1, RCCD1, and CLDN5), together with the five most significantly differentially expressed genes (, , , , and ) in the GSE87211 dataset, were selected to construct the final CRC genetic diagnostic model. ROC curve analysis gave an AUC (95% CI) of 0.9875 (0.9737-0.9875) for the training set and 0.9601 (0.9145-0.9601) for the validation set, indicating strong predictive performance. SHAP model interpretation further identified and as the most influential genes in the XGBoost model, with both making positive contributions to the model's predictions. The gene expression profile in colorectal cancer is characterized by enhanced cell proliferation, elevated metabolic activity, and immune evasion.

A genetic diagnostic model built on ten genes (, , , , , , , , , and ) demonstrates strong predictive performance. The model holds significant potential for the early diagnosis and intervention of colorectal cancer and may contribute to the implementation of third-tier prevention strategies.
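
The abstract reports each AUC with a 95% confidence interval but does not say how the interval was obtained. As a rough illustration only, the sketch below computes an AUC as the probability that a randomly chosen tumor sample scores higher than a randomly chosen normal sample, and then estimates a percentile bootstrap interval. The bootstrap is an assumption made for this example and may differ from the authors' method. The function names (`roc_auc`, `bootstrap_auc_ci`), the labels, and the scores are hypothetical toy values, not data from the study.

```python
import random


def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method, for illustration)."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        # Resample cases with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical toy data: 1 = tumor, 0 = normal; scores are model probabilities.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

auc = roc_auc(labels, scores)
lower, upper = bootstrap_auc_ci(labels, scores)
print(f"AUC = {auc:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

With so few samples the interval is wide; on cohorts the size of a TCGA training set the same procedure gives much tighter bounds. The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch but slow for large datasets.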

MeSH Terms

Humans; Colorectal Neoplasms; Machine Learning; Gene Expression Regulation, Neoplastic; Biomarkers, Tumor; Gene Expression Profiling; Quantitative Trait Loci; Databases, Genetic

Highly cited papers by the same first author (5)