ColoLDB: a machine learning-based predictive model for colorectal cancer using routine laboratory parameters.
APA
Zhang X, Tong X, et al. (2026). ColoLDB: a machine learning-based predictive model for colorectal cancer using routine laboratory parameters. Journal of gastrointestinal oncology, 17(1), 12. https://doi.org/10.21037/jgo-2025-611
MLA
Zhang X, et al. "ColoLDB: a machine learning-based predictive model for colorectal cancer using routine laboratory parameters." Journal of gastrointestinal oncology, vol. 17, no. 1, 2026, pp. 12.
PMID
41816568
Abstract
[BACKGROUND] Colorectal cancer (CRC) is one of the most common cancers worldwide and poses a serious threat to public health. Current CRC screening and diagnosis depend primarily on colonoscopy, an invasive procedure that often misses early-stage tumors, contributing to delayed diagnoses. This study aimed to develop a simpler, more accessible screening method to assist clinicians in the early identification and diagnosis of CRC and its precancerous lesions.
[METHODS] Each patient's hospitalization number served as the unique identifier. Records with invalid ages were excluded, non-numerical laboratory test results were removed, and only the first diagnostic test result for each parameter per patient (i.e., the initial test value at first diagnosis) was retained. Participants were divided into a CRC experimental group and a control group. Laboratory test data were collected from each participant, including tumor markers, biochemical parameters, immunological indicators, complete blood count, coagulation tests, and routine urinalysis. Four algorithms were used to construct the models: light gradient boosting machine (LightGBM), logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost). Finally, the SHapley Additive exPlanations (SHAP) algorithm was used to interpret the models.
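The record-cleaning rules described in the methods can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the field names (`patient_id`, `age`, `parameter`, `value`, `date`), the plausible age range of 0 to 120, and the choice to drop non-numerical results before picking the earliest one are all assumptions made for the example.

```python
import math

def first_numeric_results(records, min_age=0, max_age=120):
    """Return {(patient_id, parameter): value} from each patient's earliest numeric result.

    The age bounds are assumed; the abstract does not define an "invalid" age.
    """
    earliest = {}
    for rec in records:
        age = rec.get("age")
        # Drop records whose age is missing or outside the assumed plausible range.
        if not isinstance(age, (int, float)) or not (min_age <= age <= max_age):
            continue
        # Drop non-numerical results such as "positive" or ">1000".
        try:
            value = float(rec["value"])
        except (TypeError, ValueError):
            continue
        if not math.isfinite(value):
            continue
        key = (rec["patient_id"], rec["parameter"])
        # ISO-formatted date strings compare correctly as plain strings.
        if key not in earliest or rec["date"] < earliest[key][0]:
            earliest[key] = (rec["date"], value)
    return {key: value for key, (_, value) in earliest.items()}


# Hypothetical records, for illustration only.
records = [
    {"patient_id": "H001", "age": 63, "parameter": "CEA", "value": "5.2", "date": "2023-03-01"},
    {"patient_id": "H001", "age": 63, "parameter": "CEA", "value": "7.9", "date": "2023-04-10"},
    {"patient_id": "H001", "age": 63, "parameter": "CA19-9", "value": ">1000", "date": "2023-03-01"},
    {"patient_id": "H002", "age": -1, "parameter": "CEA", "value": "2.1", "date": "2023-02-11"},
]
cleaned = first_numeric_results(records)
```

In this sketch the later CEA value, the non-numerical CA19-9 result, and the record with a negative age are all discarded, leaving one value per patient and parameter.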
[RESULTS] The top-ranked features of the four models were intersected, yielding eight parameters that were used to construct the diagnostic colorectal laboratory digital biomarker (ColoLDB) model: specific gravity (SG), carbohydrate antigen 19-9 (CA19-9), carcinoembryonic antigen (CEA), age, albumin (ALB), cytokeratin 19 fragment (CYFRA21-1), high-density lipoprotein cholesterol (HDL-C), and carbohydrate antigen 72-4 (CA72-4). In the test set, the RF model performed best in identifying CRC, achieving an area under the curve (AUC) of 0.863 (95% confidence interval: 0.792-0.922), an accuracy of 0.900, a sensitivity of 0.225, a specificity of 0.997, a positive predictive value (PPV) of 0.917, and a negative predictive value (NPV) of 0.900. When the specificity was set at 0.903, the sensitivity of the ColoLDB model reached 0.694. In comparison, a diagnostic model combining CEA and CA19-9 yielded an AUC of 0.688, a sensitivity of 0.429, and a specificity of 0.947. The RF ColoLDB model therefore showed better diagnostic efficacy than the combined CEA and CA19-9 model.
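The calculations behind these results can be illustrated with a short sketch: intersecting top-ranked features across models, deriving the reported metric types from a confusion matrix, and reading off sensitivity at a fixed specificity. The function names, the cutoff `k`, and all sample inputs are hypothetical; the abstract does not state how many top features per model were compared or how the operating threshold was chosen.

```python
def top_k_intersection(rankings, k):
    """Features that appear in the top k of every model's importance ranking."""
    common = set(rankings[0][:k])
    for ranked in rankings[1:]:
        common &= set(ranked[:k])
    return common


def confusion_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def sensitivity_at_specificity(y_true, scores, target_spec):
    """Highest sensitivity over thresholds whose specificity is at least target_spec.

    A case is predicted positive when its score is >= the threshold.
    """
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    best = 0.0  # a threshold above every score gives specificity 1 and sensitivity 0
    for thr in sorted(set(scores)):
        spec = sum(s < thr for s in neg) / len(neg)
        sens = sum(s >= thr for s in pos) / len(pos)
        if spec >= target_spec:
            best = max(best, sens)
    return best


# Hypothetical rankings and scores, for illustration only.
rankings = [
    ["CEA", "age", "ALB", "SG"],
    ["age", "CEA", "SG", "ALB"],
    ["CEA", "SG", "age", "HDL-C"],
]
shared = top_k_intersection(rankings, k=3)
metrics = confusion_metrics(tp=8, fp=2, tn=18, fn=2)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.7, 0.8]
sens = sensitivity_at_specificity(y_true, scores, target_spec=0.75)
```

The last function shows why a single model can report both a sensitivity of 0.225 and a sensitivity of 0.694: the two figures correspond to different decision thresholds on the same scores, with the second trading some specificity for more detected cases.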
[CONCLUSIONS] Our findings indicate that eight laboratory test indicators may be related to the risk of developing CRC. Our RF diagnostic ColoLDB model is an innovative and practical tool that effectively predicts the occurrence of CRC, enhancing the diagnostic efficiency for this disease. This method holds promise as a valuable tool for diagnosing CRC.