Cancer Risk Prediction Using Machine Learning for Supporting Early Cancer Diagnosis in Symptomatic Patients: A Systematic Review of Model Types.

Cancer Medicine, 2025, Vol. 14(24), e71463

Authors

Pennisi F, Borlini S, Harrison H, Cuciniello R, D'Amelio AC, Barclay M

Cite this article

APA Pennisi F, Borlini S, et al. (2025). Cancer Risk Prediction Using Machine Learning for Supporting Early Cancer Diagnosis in Symptomatic Patients: A Systematic Review of Model Types. Cancer Medicine, 14(24), e71463. https://doi.org/10.1002/cam4.71463
MLA Pennisi F, et al. "Cancer Risk Prediction Using Machine Learning for Supporting Early Cancer Diagnosis in Symptomatic Patients: A Systematic Review of Model Types." Cancer Medicine, vol. 14, no. 24, 2025, pp. e71463.
PMID 41388924
DOI 10.1002/cam4.71463

Abstract

[INTRODUCTION] Predictive models could support clinicians in identifying patients who may benefit from cancer investigations. We aimed to examine published evidence on machine learning models (ML) developed to estimate cancer risk based on symptoms and other patient characteristics.

[METHODS] Using MEDLINE, Scopus, and EMBASE, we performed a systematic review of studies published in 2014-2024, which included data on signs/symptoms for cancer risk prediction. We used the QUADAS-AI tool to assess study quality. We performed a quantitative synthesis of diagnostic performance, including accuracy, sensitivity, specificity, and area under the curve (AUC). Adherence to TRIPOD guidelines was assessed.

[RESULTS] Among the 5646 initially identified articles, 34 met inclusion criteria. Included studies most frequently examined lung (n = 9 studies), mesothelioma (n = 7), and gastrointestinal cancers (n = 4) and used hospital electronic health records (n = 8) or publicly available online datasets (n = 13). In addition to signs/symptoms (n = 34), most models included sociodemographic characteristics (n = 27) and lifestyle factors (n = 20). In 70% of studies, internal validation was performed. ML models demonstrated variable performance, with AUC values ranging from 0.60 to 1 during validation. Random Forest, Support Vector Machine, Decision Tree, and Multilayer Perceptron showed the best predictive performance. Most of the studies (94.1%) had a high risk of bias for the index test.

[CONCLUSION] ML models have been reported to demonstrate potential in managing complex data for cancer risk prediction. However, the current evidence is heterogeneous and frequently limited by bias and incomplete reporting. Further validation and thorough assessments of real-world performance are necessary before these models can be considered reliable for clinical use.

[TRIAL REGISTRATION] International Prospective Register of Systematic Reviews (PROSPERO) registration number: CRD42024548088.


1 Introduction
The majority of cancers are diagnosed after patients present with symptoms, rather than through screening [1, 2, 3, 4, 5, 6]. For example, over 85% of colorectal cancers [1, 2] are diagnosed following symptomatic presentation in the UK, despite the availability of a national colorectal cancer screening programme. The non‐specific nature of many presenting symptoms of as‐yet‐undiagnosed cancer, combined with the frequent presence of non‐neoplastic chronic conditions, complicates the diagnostic process, with clinicians facing significant challenges in determining which patients might benefit from diagnostic investigations [7, 8, 9, 10]. Thus, despite ongoing efforts to diagnose cancer early, a substantial proportion of patients continue to be diagnosed as emergencies and/or at an advanced stage [11, 12, 13]. In the UK, approximately 22% of colorectal cancers and more than 45% of lung cancers are diagnosed following an emergency presentation [14], which is associated with worse outcomes even after adjusting for stage at diagnosis [15, 16].
Advancements in technology, such as the integration of machine learning (ML) in clinical practice, may be opening new avenues to enhance diagnostic processes [17, 18, 19, 20, 21, 22]. ML's capability to analyze complex biomedical data offers unprecedented opportunities to accelerate and refine the accuracy of cancer risk assessment and diagnostic approaches. When patients present with symptoms, ML tools could help clinicians discriminate between those who require urgent investigations and those who can be safely monitored or reassured. Integrating diverse data in ML models, including symptoms, genomic sequences, behavioral risk factors, comorbidities, laboratory test results, and other patient characteristics, may enable more accurate risk assessments compared to traditional methods [23, 24, 25, 26, 27]. Some studies have reported a 20%–25% increase in risk prediction accuracy compared to conventional methods [28, 29]. While numerous studies have focused on the application of AI in medical imaging [29, 30, 31, 32, 33, 34], there remains a notable gap in research concerning ML's capabilities for risk prediction and risk stratification in clinical practice, an area requiring further exploration to fully realize its potential in supporting early cancer diagnosis.
This systematic review aims to examine and synthesize studies on the development of ML models for cancer risk prediction. By specifically focusing on models that include data on signs and symptoms, in combination with other clinical and patient characteristics, the review aims to identify ML models that could support doctors' decision‐making on referrals and diagnostic work‐up in clinical practice, when patients present with potential cancer symptoms. While previous meta‐analyses and systematic reviews have largely concentrated on imaging‐based AI tools, genomic classifiers, or population‐level screening models, the present review uniquely addresses the use of ML algorithms in symptom‐based cancer risk prediction, a clinically distinct and understudied phase of the diagnostic pathway. Previous meta‐analyses have indicated that the additive value of ML over conventional statistical methods for risk prediction is uncertain [35]. This uncertainty carries important clinical implications, indicating that current ML‐based approaches should be regarded as complementary to, rather than replacements for, conventional statistical models. Robust evidence from external validation studies and head‐to‐head comparative analyses is still required to determine whether ML techniques can achieve meaningful improvements in predictive accuracy, calibration, and clinical utility beyond established methods. In this context, our review extends beyond performance comparison to provide a critical appraisal of methodological rigor, including assessments of risk of bias, validation strategies, and reporting transparency, which have often been inconsistently evaluated in prior literature. By adopting this approach, we aim to delineate the current landscape of symptom‐based ML research, highlight existing methodological and reporting gaps, and inform future studies aiming for robust, transparent, and clinically applicable AI integration in oncology.

2 Materials and Methods
The review has been conducted and reported following the guidelines of the Cochrane Handbook for Systematic Reviews and the Preferred Reporting Items for Systematic Reviews and Meta‐Analyses (PRISMA) statement [36].
2.1 Search Strategy
A systematic search was performed to identify published studies from 2014 to 2024 that developed ML algorithms for cancer risk prediction, incorporating symptoms and other relevant clinical, genetic, socio‐demographic, and lifestyle risk factors. A comprehensive literature search was conducted across four electronic databases: PubMed/MEDLINE, EMBASE, Scopus, and Web of Science. The search string used for PubMed is provided in Table S1. Search strategies for other databases were adapted based on this string to accommodate the specific syntax and indexing of each database. Additional articles were retrieved by screening references of included studies or consulting experts in the field. Papers were eligible for inclusion if they described original studies utilizing ML tools and algorithms for cancer risk prediction, which included at least one risk factor that was a clinical sign or symptom.
As the aim was to identify ML models that could support doctors' decision‐making when patients present with symptoms, we excluded papers that did not consider symptoms in their models. This exclusion criterion was applied because our review specifically aimed to identify ML models that could be applied in clinical settings for supporting decision‐making when patients present to their doctor with symptoms potentially indicative of cancer. Models developed for population‐level screening or risk stratification, not including symptoms among input variables, have a different target group and context of application, and were therefore outside the scope of this review. Our review included studies comparing ML predictive models with traditional or other ML‐based models. Reports published from January 1, 2014, to May 17, 2024, were eligible for inclusion; the final search update was conducted on May 17, 2024. All inclusion and exclusion criteria are described below.
2.1.1 Inclusion Criteria

Studies that developed ML‐based methods for predicting a primary cancer.

Studies that used at least one clinical sign/symptom for cancer prediction in the final model.

Studies that provided at least one quantitative performance metric for the predictive model (e.g., area under the receiver operating characteristic curve [AUC], sensitivity, specificity, calibration, etc.).

Original studies that used observational data (including cohort studies, case–control studies, or randomized controlled trials [RCTs]).

Studies that predicted the risk of incident cancer when patients present with symptoms, rather than studies focused on risk of cancer diagnosis in the general population, prognosis, or cancer recurrence.

2.1.2 Exclusion Criteria

Systematic reviews or conference abstracts.

Studies that did not use ML‐based models. We recognize there is no clear‐cut distinction between traditional statistical methods and ML. For this review, ML was defined as a set of computational algorithms capable of automatically identifying patterns and learning from data without prespecified parametric relationships between variables. This category included supervised, unsupervised, and ensemble methods such as random forest (RF), support vector machines (SVM), k‐nearest neighbors (KNN), decision trees (DT), gradient boosting, and deep learning models (CNN, ANN, LSTM). Traditional statistical methods were defined as approaches based on explicit model specification and hypothesis‐driven relationships, including logistic regression, Cox regression, and linear models. Studies employing regularized regression (e.g., LASSO, Ridge, Elastic Net) or hybrid designs were classified based on their dominant analytical framework (a minimal code illustration of this distinction follows this list).

Studies that did not include at least one clinical sign or symptom as a potential predictor in the final model.

Non‐English language articles or studies published before 1 January 2014.
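As referenced in the exclusion criterion above, the distinction between a hypothesis‐driven statistical model and an ML model can be made concrete with a small sketch. The example below, in Python with synthetic data and arbitrary coefficients that are purely illustrative and not drawn from any included study, fits a main‐effects logistic regression and a random forest on the same simulated symptom data.

```python
# Illustrative only: contrasts a "traditional" hypothesis-driven model (logistic
# regression with prespecified main effects) with an ML model (random forest)
# that learns interactions automatically. Data and coefficients are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
symptoms = rng.integers(0, 2, size=(n, 5)).astype(float)  # five binary symptoms
age = rng.normal(60, 10, n)
X = np.column_stack([symptoms, age])

# Simulated outcome containing an interaction between the first two symptoms
logit = -4 + 1.2 * X[:, 0] + 0.8 * X[:, 1] + 1.5 * X[:, 0] * X[:, 1] + 0.03 * age
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)       # analyst-specified form
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

print("Logistic regression AUC:", roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1]))
print("Random forest AUC      :", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
```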

2.2 Data Collection and Synthesis of Results
Study selection occurred in two phases: initial screening of titles and abstracts, followed by full‐text review of potentially eligible articles. Two reviewers (S.B. and R.C.) independently assessed the studies at each stage. Disagreements were resolved through discussion, or by involving a third senior researcher if necessary (F.P.). Data were extracted using a pre‐defined data extraction form in Excel, which included clear operational definitions and examples for each predictor category. The data extraction form was piloted to ensure consistent interpretation of variables and domains across reviewers. Two reviewers independently extracted the data from all included studies. Any discrepancies were discussed and resolved through consensus, with arbitration by a senior reviewer when required. This process ensured reproducibility and standardization in the categorization of predictors across heterogeneous studies. Extracted data included key characteristics of each included study (publication year, authors, study period, country, and type of cancer) and information on the development of the prediction model (ML methods used, data sources, data input, sample size, number of predictive variables).
We also systematically extracted and categorized all variables used for model training as reported in each study. These were grouped into six domains: symptoms/signs, socio‐demographic characteristics, comorbidities, behavioral/lifestyle factors, laboratory and diagnostic tests, and genetic information. A detailed summary of these predictive parameters is provided in Table S2.
Predictive factors were categorized into demographics, signs and symptoms, comorbidities, behavioral/lifestyle factors, laboratory tests, and others. Key performance metrics were collected (AUC, sensitivity, specificity, PPV, NPV, R², D‐statistic, F1, calibration, precision). Data extraction was performed in duplicate, and discrepancies were resolved by discussion. The TRIPOD adherence assessment form was used to evaluate the adherence rate for the reporting of individual studies [37].
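For reference, the sketch below shows how the extracted discrimination metrics relate to one another when computed from a single set of predicted probabilities; the labels, probabilities, and the 0.5 decision threshold are illustrative assumptions, not values from any included study.

```python
# Minimal sketch: computing the performance metrics extracted in this review
# (AUC, sensitivity, specificity, PPV, NPV, F1) from example predictions.
# Probabilities, labels, and the 0.5 threshold are illustrative assumptions.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix, f1_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7, 0.6, 0.3])
y_pred = (y_prob >= 0.5).astype(int)            # fixed decision threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("AUC        :", roc_auc_score(y_true, y_prob))
print("Sensitivity:", tp / (tp + fn))           # recall for the cancer class
print("Specificity:", tn / (tn + fp))
print("PPV        :", tp / (tp + fp))           # positive predictive value
print("NPV        :", tn / (tn + fn))           # negative predictive value
print("F1 score   :", f1_score(y_true, y_pred))
```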

2.3 Risk of Bias
The methodological quality of the studies was independently assessed by two reviewers (F.P., A.C.D.A.) using the QUADAS‐AI tool. QUADAS‐AI is specifically designed for evaluating AI‐centric diagnostic precision and accuracy studies, addressing unique terminology and criteria relevant to AI applications. This tool extends the QUADAS‐2 framework to account for methodological aspects specific to AI in diagnostic research, including data preprocessing, model training, and validation. As the included studies primarily evaluated ML algorithms for diagnostic risk prediction based on clinical symptoms, QUADAS‐AI was deemed more appropriate. Articles and their supplementary materials were meticulously screened to assess potential biases and concerns regarding applicability.

3 Results

3.1 Study Selection and Study Characteristics
The database search identified a total of 5646 studies, among which we selected 4197 unique studies after removing duplicates. Following title and abstract screening, we excluded 4121 papers, as they did not consider signs or symptoms, leaving 76 papers for further assessment. We included a total of 34 studies on ML models' development and validation for early cancer diagnosis and risk stratification.
A PRISMA flowchart diagram illustrating the article selection process is presented in Figure 1.
While we searched studies published between 2014 and 2024, the majority were published in the past five years (n = 25) [38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62]. Included studies were from three geographic areas (based on the data origin): Europe (n = 7) [40, 52, 56, 57, 61, 63, 64], North America (n = 7) [38, 44, 46, 50, 55, 65, 66], and Asia (n = 17) [39, 41, 42, 43, 45, 47, 48, 49, 51, 53, 59, 60, 62, 67, 68, 69, 70] (Table 1). Most of them (n = 32) [39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 68, 69, 70, 71] focused on developing and validating ML models for cancer risk prediction by assessing performance metrics such as accuracy, sensitivity, and AUC. Among the included studies, two [38, 58] employed multiple AI models and 27 [39, 40, 41, 42, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 65, 66, 68, 70] reported more than one performance metric. The median (IQR [range]) overall adherence rate to TRIPOD checklist items was 55 (40–68 [10–81]) % (Table 1), with a breakdown of reporting per individual item shown in Figure 2.
The cancer types most frequently studied with ML techniques for predicting the risk of cancer diagnosis in symptomatic patients were lung cancer (n = 9 studies) [45, 47, 54, 55, 56, 58, 60, 63, 71], mesothelioma (n = 7) [38, 46, 59, 62, 65, 68, 70], gastrointestinal cancer (n = 4) [40, 57, 64, 69] and pancreatic cancer (n = 3) [44, 52, 66] (Figure 3).
The Random Forest (RF) algorithm was the most commonly employed technique (n = 12) [39, 40, 42, 45, 48, 49, 51, 52, 59, 61, 64, 68], followed by Support Vector Machines (SVM) (n = 8) [40, 45, 47, 51, 54, 59, 61, 70], Decision Trees (DT) (n = 8) [45, 47, 49, 51, 53, 67, 68, 69], Logistic Regression (LR) (n = 8) [40, 42, 45, 48, 49, 51, 52, 64], and XGBoost (n = 6) [40, 41, 42, 44, 45, 48] (Table 2). Nine studies utilized deep learning models [39, 47, 49, 50, 53, 59, 65, 66, 71]. Additionally, five hybrid models (models incorporating both ML and more traditional biostatistical methods) were included [55, 58, 60, 62, 70].

3.2 Type of Data Included in Risk Prediction Models
The studies were based on either non‐publicly available datasets, with access restricted to researchers belonging to specific institutions (n = 23) [38, 39, 41, 42, 43, 46, 47, 48, 49, 51, 53, 56, 57, 59, 61, 62, 63, 64, 65, 67, 68, 69, 70], or datasets available for the wider research community, with access typically requiring a research application and ethics approval (n = 10) [40, 44, 50, 52, 54, 55, 58, 60, 66, 71]. Four studies [49, 50, 57, 66] utilized two different datasets. Of these, two [49, 50] reported using one dataset for training and the other for testing.
Restricted‐access datasets originated from hospitals or primary care settings, with nine studies [39, 43, 47, 48, 49, 56, 61, 63, 64] using hospital data, one [57] using primary care data, and one study [52] including both primary and secondary care data. Three studies [45, 60, 66] derived their data from surveys. Datasets available to the wider research community, typically provided by government agencies or research institutions, were employed in 10 studies [40, 44, 50, 52, 54, 55, 58, 60, 66, 71] and are summarized in Table 3. Only one study [50] included clinical free‐text notes.
Most studies used single data sources. However, two studies [40, 69] combined different types of data to create multimodal predictive models, integrating electronic health records with cancer markers or medical imaging data.
There was considerable variability in the sample sizes used to develop cancer risk prediction models. The cancer sample sizes (cases) ranged from a minimum of 17 patients [53] to 7471 patients [40]. The maximum total sample size, including both cases and controls, was 964,579 patients. Additionally, two studies [41, 50] included an external validation sample, bringing the total to over one million patients. Two studies [70, 71] did not specify the number of cases and controls, six studies [39, 43, 47, 48, 50, 55] lacked a control sample, and one study [60] provided only the total sample size without distinguishing between cases and controls. Regarding the sample size, three studies [41, 48, 68] divided the sample into training, testing, and validation sets, with six studies [40, 44, 59, 62, 65, 66] providing the percentages of the sample used for training (ranging from 70% to 80%) and testing (ranging from 20% to 30%) (Table S2).
Only 11 studies [39, 41, 44, 49, 51, 53, 54, 58, 61, 65, 66] reported their methods for handling missing data before training. The most frequent approach, employed by four studies [51, 54, 58, 61], was complete case analysis. Five studies [39, 49, 51, 54, 58] removed missing data as part of pre‐processing and data cleaning (complete case); two studies [53, 65] reported no missing data, one [44] used statistical modeling to address missing data in real‐time systems, another [66] employed one‐hot encoding with missing values set to −1, and one study [41] relied on XGBoost's built‐in handling of missing values.
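A minimal sketch of three of the missing‐data strategies reported above (complete‐case analysis, simple imputation, and a −1 sentinel) is shown below; the toy DataFrame and column names are hypothetical.

```python
# Illustrative comparison of missing-data strategies mentioned in the included
# studies: complete-case analysis, simple imputation, and a -1 sentinel flag.
# The DataFrame and column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [54, 61, np.nan, 70, 47],
    "cough": [1, 0, 1, np.nan, 1],
    "weight_loss": [0, 1, 1, 0, np.nan],
})

# 1) Complete-case analysis: drop any row with a missing value
complete_cases = df.dropna()

# 2) Simple imputation: replace missing values with the column median
imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns
)

# 3) Sentinel encoding: set missing values to -1 so a model can treat
#    "missing" as its own category (as one included study reported).
#    Tree-boosting libraries such as XGBoost can also route NaNs natively.
sentinel = df.fillna(-1)

print(complete_cases.shape, imputed.shape, sentinel.shape)
```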

3.3 Risk Predictors
The initial number of variables considered for risk prediction ranged from 15 to 18,220, with the final selection ranging from 3 [48] to 582 [44] (excluding one article [61] with 75,000 variables). Table S3 summarizes the risk predictors across the included studies. Most articles (n = 33) included cancer‐specific symptoms or signs. Six lung cancer studies [45, 47, 55, 60, 63, 71] considered both specific symptoms/signs (cough, chest pain, dyspnea, wheezing, yellow fingers) and generic symptoms (fever, fatigue, appetite loss), while three studies [54, 57, 58] used only specific symptoms. Of the three pancreatic cancer articles, two reported specific (pruritus) and generic symptoms (pain, weight loss, fatigue, anorexia, anxiety, weakness, fever) [44, 66], while the third listed only one comorbidity (asthma) [52]. Further details are in Tables S3 and S4.
Twenty‐seven studies described a model including sociodemographic characteristics, with age (n = 27) [38, 39, 40, 42, 44, 45, 46, 47, 48, 50, 51, 53, 54, 55, 56, 58, 59, 60, 61, 62, 63, 64, 65, 66, 68, 69, 70], gender (n = 21) [38, 42, 43, 44, 45, 46, 47, 50, 51, 54, 56, 58, 59, 62, 63, 64, 65, 66, 68, 69, 70], geographic area (n = 9) [38, 46, 51, 59, 62, 63, 65, 68, 70], and education level (n = 4) [44, 56, 63, 64] being the most common.
Seventeen articles [41, 43, 44, 45, 48, 52, 53, 54, 55, 57, 58, 60, 61, 63, 64, 66, 69] reported comorbidities, either cancer‐specific (e.g., colorectal cancer: hemorrhoids) or general chronic conditions (e.g., hypertension, allergy, diabetes). Eight articles [43, 47, 48, 49, 52, 55, 63, 66] incorporated genetic data. Lifestyle factors such as smoking (n = 20) [38, 45, 46, 47, 48, 49, 52, 54, 55, 56, 58, 59, 60, 62, 63, 64, 65, 66, 68, 70] and alcohol drinking (n = 8) [45, 47, 52, 54, 55, 58, 66, 69] were also included. For mesothelioma, studies (n = 7) [38, 46, 59, 62, 65, 68, 70] considered asbestos exposure, C‐reactive protein, and pleural protein as key risk factors. In contrast, gastrointestinal cancer studies [40, 57, 64, 69] identified predictive factors such as diabetes, age, and hyperlipidemia. Most studies focused on the predictive nature of associations without inferring causality.
3.3.1 Grouping Studies by Data Input Type
To address the substantial heterogeneity in model inputs, we categorized the studies based on the types of data included in the final ML models. Six categories were defined: (1) symptoms only; (2) symptoms + sociodemographic variables (e.g., age, sex); (3) symptoms + comorbidities (e.g., diabetes, hypertension); (4) symptoms + lifestyle factors (e.g., smoking, alcohol use); (5) symptoms + laboratory, genetic, or imaging data; and (6) multimodal models (including ≥ 3 of the above domains). Most studies (n = 27) included sociodemographic data; 17 studies [41, 43, 44, 45, 48, 52, 53, 54, 55, 57, 58, 60, 61, 63, 64, 66, 69] included comorbidities, and 20 [38, 45, 46, 47, 48, 49, 52, 54, 55, 56, 58, 59, 60, 62, 63, 64, 65, 66, 68, 70] incorporated lifestyle variables. Only a minority used genetic or laboratory data (n = 8) [43, 47, 48, 49, 52, 55, 63, 66], though these were more frequent in recent studies. Multimodal models, which integrated three or more data types, were observed in 8 studies [42, 44, 52, 56, 59, 60, 62, 66]. A summary classification of each study by data type is presented in Table S5.

3.4 Evaluation and Performance Metrics
Nearly all studies (n = 24) [39, 40, 46, 47, 48, 50, 51, 52, 53, 54, 56, 57, 59, 60, 61, 62, 63, 64, 66, 67, 68, 69, 70, 71] used internal validation, which reduced the available sample size. Cross‐validation techniques were applied in three studies [39, 50, 53], while K‐fold cross‐validation was the most used method (n = 16) [40, 44, 46, 48, 49, 51, 54, 56, 57, 58, 60, 61, 64, 66, 68, 70]. Only two studies [42, 50] conducted external validation to assess their predictive model's generalizability. Of these, one study [50] was externally validated using 100 physician notes randomly selected from a dataset. The other article [42] used a separate dataset of 1 million individuals resident in Taiwan.
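For illustration, the sketch below applies stratified K‐fold cross‐validation, the internal validation approach most often reported, to a synthetic, imbalanced dataset; the model choice and all parameters are assumptions for demonstration only.

```python
# Minimal sketch of internal validation via stratified K-fold cross-validation,
# the internal validation method most often reported. Data, model, and settings
# are synthetic assumptions for demonstration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced synthetic data, loosely mimicking rare cancer outcomes
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                       X, y, cv=cv, scoring="roc_auc")
print("Fold AUCs:", np.round(aucs, 3), "mean AUC:", round(float(aucs.mean()), 3))
```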
The evaluation metrics used in the reviewed studies varied widely. Most studies (n = 33) reported performance metrics, most commonly the AUC (n = 22) [40, 41, 42, 44, 45, 46, 47, 48, 49, 51, 52, 53, 54, 56, 57, 58, 61, 63, 64, 65, 66, 70], accuracy (n = 23) [39, 40, 41, 42, 43, 45, 46, 47, 48, 49, 51, 53, 54, 55, 56, 58, 59, 60, 62, 65, 68, 70, 71], specificity (n = 15) [41, 42, 44, 45, 47, 51, 52, 53, 55, 56, 57, 61, 65, 66, 68], precision (n = 8) [40, 41, 45, 49, 50, 58, 61, 62], sensitivity (n = 24) [39, 40, 41, 42, 44, 45, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 65, 66, 68], and F1 score (n = 11) [40, 45, 48, 50, 55, 58, 60, 62, 65, 68, 70]. Despite the importance of calibrating ML models for predictive performance assessment, only two studies (n = 2) [40, 43] evaluated model calibration (Table 4).
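Because calibration was so rarely assessed, the following sketch shows one common way to examine it: a reliability curve together with the Brier score, computed here for a synthetic model and dataset rather than for any model from the included studies.

```python
# Illustrative calibration check: reliability-curve bins and the Brier score for
# a model's predicted probabilities. Model and data are synthetic placeholders,
# not taken from any included study.
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=12, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

rf = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_tr, y_tr)
prob = rf.predict_proba(X_te)[:, 1]

frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=10)
print("Brier score:", round(brier_score_loss(y_te, prob), 4))
for mp, obs in zip(mean_pred, frac_pos):
    # A well-calibrated model has observed frequencies close to predicted ones
    print(f"predicted {mp:.2f} -> observed {obs:.2f}")
```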
In terms of prediction horizon, the only studies that reported this information (n = 6) [42, 52, 61, 64, 66, 69] focused on predictions of cancer within a 5‐year timeframe.
3.4.1 Performance Comparison Between Traditional and ML‐Based Predictive Models
Some studies [40, 42, 43, 45, 48, 49, 51, 52, 64] compared the predictive capabilities of ML with conventional methods. Specifically, in one study [40] the comparison was made between several ML models and the current UK esophagogastric cancer risk‐assessment tool (ogRAT). Importantly, both the ML models and the ogRAT were developed and validated using the same dataset, derived from the UK General Practice Research Database (GPRD). This ensures a fair comparison, as the models were evaluated on identical datasets. The study used symptoms encoded as binary variables (presence/absence) and included recurrence for specific symptoms (e.g., dyspepsia, dysphagia); symptoms were parameterized individually, with no explicit pairwise or higher‐order combinations analyzed in the models. The ML models achieved similar performance, with an accuracy of 0.89 (95% CI: 0.86–0.92) and an AUROC of 0.87 (95% CI: 0.84–0.90), compared to the ogRAT's AUROC of 0.81 (95% CI: 0.79–0.83). The ML models identified 11% more cancer patients than the ogRAT, with minimal impact on false positives, or up to 25% more patients with a slight increase in false positives depending on the decision threshold.
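Confidence intervals around an AUROC, such as those quoted above, are often obtained by bootstrapping the evaluation set; the sketch below illustrates that generic approach on synthetic predictions and is not the procedure used in the cited study.

```python
# Illustrative bootstrap 95% CI for an AUROC, one common way to obtain the kind
# of interval quoted above. Predictions are synthetic; this is not the method
# of the cited study.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 1000
y_true = rng.binomial(1, 0.2, n)                        # ~20% outcome prevalence
y_prob = np.clip(0.2 + 0.4 * y_true + rng.normal(0, 0.2, n), 0, 1)

boot_aucs = []
for _ in range(2000):                                   # resample with replacement
    idx = rng.integers(0, n, n)
    if len(np.unique(y_true[idx])) < 2:                 # skip degenerate resamples
        continue
    boot_aucs.append(roc_auc_score(y_true[idx], y_prob[idx]))

lo, hi = np.percentile(boot_aucs, [2.5, 97.5])
print(f"AUROC {roc_auc_score(y_true, y_prob):.3f} (95% CI {lo:.3f}-{hi:.3f})")
```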
In another study [43], an ML‐based symptomatic assessment chatbot was evaluated for its ability to identify women with breast cancer. The chatbot utilized an NLP algorithm combined with decision trees to analyze patient‐reported symptoms and risk factors. The model was compared against benchmark breast cancer assessment scores obtained from specialist doctors. The study reported that the chatbot achieved very high accuracy, with a sensitivity of 0.98 and specificity of 0.90, showing comparable performance to the specialists' scores. However, the specific dataset used for training and testing the chatbot was not identical to the data used by the specialists, as the chatbot relied on standardized inputs, such as symptom descriptions and risk factor data, while doctors had access to broader clinical information. This discrepancy made the comparison between the chatbot and human specialists less direct, and no 95% confidence intervals or false positive rates were reported for the chatbot's performance. Symptoms were parameterized using a weighted scoring system (1–10), reflecting the seriousness of responses, as defined by breast cancer specialists. While individual symptoms were explicitly scored, the study did not address pairwise or higher‐order symptom combinations. The chatbot's rule‐based design focused on single symptom assessment per query, with the knowledge base structured to guide sequential interactions rather than explore symptom interactions.
Dirik [45] compared ML models, including LR, for lung cancer diagnosis based on 15 binary symptoms (e.g., persistent cough, chest pain, weight loss). LR achieved 86% accuracy, while NB and SVM outperformed it with 91%. Performance metrics such as sensitivity, specificity, and precision confirmed the higher reliability of ML models. Symptoms were parameterized as binary variables (presence/absence), with no analysis of pairwise or higher‐order symptom combinations. While the paper demonstrated that ML models performed better than LR, it does not provide detailed reasons for this or explore differences in feature handling between methods.
Another study [42] compared models for predicting nasopharyngeal carcinoma, with LR as the baseline achieving an AUROC of 0.80. ML models, particularly LGB, outperformed LR, reaching an AUROC of 0.83 with higher sensitivity and specificity. The study incorporated 14 features, including demographics, 28 pre‐NPC symptoms, and combined diagnostic and treatment features, all parameterized as binary variables. Although combined features were used to improve accuracy, the study did not investigate pairwise or higher‐order symptom combinations, limiting the analysis of symptom interactions. Similarly, another study [48] compared ML models, including LR, for predicting endometrial intraepithelial neoplasia and endometrial cancer risks. LR served as the baseline but was outperformed by the MLP model, which achieved the highest AUC of 0.938 for precancer prediction. The study utilized 9 features, including age, BMI, and endometrial thickness, parameterized as continuous variables, with the Boruta algorithm selecting the most important features. Pairwise or higher‐order symptom combinations were not analyzed explicitly, as ML models focused on individual predictors.
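For readers unfamiliar with Boruta, the sketch below gives a simplified, single‐pass illustration of its core idea (comparing each real feature's importance with that of shuffled "shadow" copies); the full iterative algorithm is available in the BorutaPy package, and the data here are synthetic.

```python
# Simplified illustration of the Boruta idea referenced above: compare each real
# feature's importance with the best importance achieved by shuffled "shadow"
# copies, and keep features that beat the shadows. Synthetic data only; the full
# iterative algorithm is available in the BorutaPy package.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=800, n_features=10, n_informative=4,
                           random_state=0)

# Build shadow features by permuting each column independently
X_shadow = np.apply_along_axis(rng.permutation, 0, X)
X_aug = np.hstack([X, X_shadow])

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_aug, y)
imp = rf.feature_importances_
real_imp, shadow_imp = imp[:X.shape[1]], imp[X.shape[1]:]

selected = np.where(real_imp > shadow_imp.max())[0]
print("Features beating the best shadow feature:", selected)
```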
Two further studies reported the superiority of ML models over conventional diagnostic techniques, evaluating their performance based on economic viability, practical implementation, and non‐invasiveness. Specifically, Muhammad et al. [66] developed a non‐invasive ML model for endometrial cancer risk stratification using demographic and clinical data, aiming to reduce patient burden and healthcare costs by avoiding invasive procedures like biopsies. However, no confidence intervals were reported, and the model was not directly compared with traditional methods. Gorynski et al. [63] developed an ML model for the early diagnosis of nasopharyngeal carcinoma, achieving an AUROC of 0.998 and 97.9% accuracy, though without confidence intervals or external validation. Both ML approaches showed advantages in terms of cost‐effectiveness and non‐invasiveness.

3.4.2 Performance Comparison of Different ML‐Based Predictive Models
Most of the included studies (n = 22) [39, 40, 42, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 59, 61, 63, 64, 65, 67, 68, 69, 70] compared different ML‐based predictive models. In one study [62], which developed a model for mesothelioma diagnosis using the Harris Hawk Optimization (HHO) algorithm for feature selection, SVM and ANN both achieved 100% accuracy when trained on a set of 17 selected variables. This was compared to other methods such as DE with KNN, which achieved 99.7% accuracy (95% CI not provided) during development, and a standard SVM model, which achieved 98% accuracy using 33 features. The models utilizing the HHO‐selected features demonstrated superior performance by optimizing the feature set, reducing redundancy, and enhancing diagnostic accuracy. However, no validation of these models was carried out.
GBFS and GFS demonstrated excellent performance in validation, with accuracy rates of 97.8% (95% CI: 96.4%–98.7%) and 96.2% (95% CI: 94.8%–97.6%), respectively, using fewer variables (16 and 15). SAC reached an AUC of 95% (95% CI: 93.0%–96.5%) with 16 variables. Several other models, including RF, Adaboost, LR, and NB, displayed accuracies between 93.1% and 94.4%, all using 16 variables. However, confidence intervals for these models were not consistently reported, limiting the interpretability of the results.
Another study that used the same dataset and variables to compare the performance measures of various ML models was conducted by Hossain et al. [49]. In this study, all models were trained and tested on the same dataset, consisting of 16 symptom‐based variables collected from 840 patients, including both leukemia and non‐leukemia cases. The performance metrics reported were based on the validation set, following a train‐test split where data from one hospital was used for training and data from another hospital was used for validation. The DT model achieved the highest performance, with an accuracy of 97.45%, an AUC of 0.783, and an MCC of 0.63. The RF model followed closely, with an accuracy of 95.41% and an AUC of 0.782. AB showed strong results, with an accuracy of 94.66%, while NB reached an accuracy of 93.13%, and LR achieved 91.60%. The k‐NN model exhibited the lowest performance, with an accuracy of 68.70%. Confidence intervals were not reported for these models.
A study [68] reported relevant differences in performance across various ML models, using a dataset of 324 patient records and clinical variables. All models were trained and tested on the same dataset, with performance metrics reflecting the results from the test sets, following an 80/20 random split for training and testing. RF achieved the highest performance, with a Matthews Correlation Coefficient (MCC) of +0.37, a specificity of 0.97, and a sensitivity of 0.28, using all 33 variables. MLP obtained an MCC of +0.11, with a sensitivity of 0.66 and a specificity of 0.42, also utilizing the full dataset. DT, applied only to the selected features of lung side and platelet count (2 variables), resulted in an MCC of +0.28, a specificity of 0.95, and a sensitivity of 0.28. The One Rule model demonstrated an MCC of +0.27, with a specificity of 0.97 and a low sensitivity of 0.17, using all 33 variables. No confidence intervals were reported for these performance metrics. Specific performance data are reported in Table 4.
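The Matthews correlation coefficient reported in these comparisons can be computed directly from predicted and true labels, as in the brief sketch below with hypothetical predictions.

```python
# Matthews correlation coefficient (MCC) for hypothetical predictions. MCC uses
# all four confusion-matrix cells, so it remains informative under the class
# imbalance typical of cancer datasets.
from sklearn.metrics import matthews_corrcoef

y_true = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
print("MCC:", round(matthews_corrcoef(y_true, y_pred), 3))
```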
Another study [39] demonstrated that a classifier model based on super symptom analysis performed better than classical methods such as ANN, LDA, and RF in terms of accuracy for breast cancer diagnosis. In this study, the models were trained and tested using the same dataset of 65 breast cancer patients, with a 70/30 split for training and validation. The super symptom analysis model achieved the highest accuracy at 72.7%. In comparison, ANN achieved an accuracy of 45.4%, LDA reached 57.1%, and RF achieved 63.1%. The study did not provide confidence intervals for these performance metrics. All models used the same variables and were evaluated on the validation set, allowing for a consistent comparison of their effectiveness in diagnosing breast cancer.

3.5 Quality Assessment of Included Studies: QUADAS‐AI
In the included studies, adherence to established reporting guidelines was not frequently mentioned. Out of the 34 studies, none explicitly acknowledged their adherence to the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) reporting guidelines [37]. Additionally, while making implementation codes publicly accessible is important for transparency and reproducing methods, only 5 [40, 49, 50, 60, 68] studies made their analysis code publicly available.
A detailed analysis of the risk of bias assessment and concerns regarding applicability was performed for each study and summarized in Figure 4A,B. Risk of bias: 5 studies (14.7%) [50, 55, 60, 70, 71] had a high risk of bias in patient selection, mostly due to the lack of a clear rationale for their sample size and an unspecified data source. Most of the studies had a high risk of bias for the index test (94.1%), typically due to the lack of validation or testing on external data. In the reference standard domain, which assesses whether the method used as the gold standard for diagnosis or outcome measurement is reliable and applied appropriately, 76.5% of articles were at a low risk of bias. Finally, the risk of bias in the flow and timing domain was low in 82.4% of studies. Applicability concerns: In the patient selection domain, concerns about applicability were high in 17.6% [45, 50, 55, 60, 70, 71] of the included studies. In the index test domain, concerns about applicability were low in 91.2% of the studies; in the remaining studies, concerns related to the lack of detail on the construct or architecture of the algorithm. Finally, in the reference standard domain, concerns about applicability were low in 85.3% of studies. The main reasons for lower applicability included the use of surrogate or proxy measures instead of gold‐standard diagnostic tests, reliance on internal validation only (with only two studies [42, 50] explicitly performing external validation), and patient populations that may not be representative of the broader clinical setting. Additionally, issues like improper patient flow, where inclusion or exclusion criteria are unclear, or where the patient population does not mirror real‐world clinical practice, could introduce bias and further reduce applicability.

4 Discussion
This systematic review summarizes the reported potential of different types of ML models for predicting cancer risk based on clinical signs, symptoms, and other patient characteristics. The reviewed evidence indicates that ML models demonstrated variable performance, with AUC values ranging from 0.60 to 1 during validation. This variability reflects differences in dataset quality, feature selection methods, and model optimization techniques across studies. Models such as RF, SVM, and DT were often reported to achieve high accuracy, particularly in lung cancer and mesothelioma prediction. However, variability in performance highlights the need for further evaluation to determine their robustness and suitability for clinical use.
A growing interest in using ML models for cancer risk prediction is apparent, fueled both by recent advancements in ML technology and the increasing focus on precision medicine in supporting clinical decision‐making [72, 73, 74].
The reviewed studies used a broad range of data types, including symptoms, sociodemographic characteristics, lifestyle factors, comorbidities, genetic information, laboratory, and diagnostic test results. This diversity of input data emphasized the flexibility of ML models to integrate multiple sources of information, potentially improving the accuracy of cancer risk predictions by creating a more comprehensive risk profile for each patient. On the other hand, the variability in the quality of data across studies may also explain the wide range of model performance. Studies incorporating genetic data (although used in only 8 studies [43, 47, 48, 49, 52, 55, 63, 66]) or laboratory results (n = 9) [38, 40, 46, 59, 62, 63, 65, 68, 70], in addition to symptoms, could provide more nuanced risk assessments [43] underscoring the importance of multimodal data integration in enhancing predictive capabilities. However, models with too many variables risk overfitting, particularly when the data is limited or highly specific to the training population. This highlights the need for careful selection and optimization of variables to balance model complexity with robustness [75]. Moreover, for ML algorithms to be effectively deployed in clinical practice, they need to be trained and tested on datasets that adequately represent the clinical scenarios likely to be encountered in real‐world settings [76, 77, 78, 79]. This ensures that the range and depth of variables used to develop the model—such as symptoms, diagnostic tests, and comorbidities—are also accessible in the clinical environment where the model will be implemented. The clinical relevance and applicability of AI models depend not only on the representativeness of the training data but also on the alignment between the data infrastructure available during model deployment and the wealth of variables utilized during model development [80, 81]. This highlights the importance of a robust IT support system in hospitals or clinics to enable the integration of these models into a “learning health system” framework. Furthermore, a clear and transparent description of the training data characteristics—such as how data were collected, labeled, and processed—is critical to identifying potential sources of bias and ensuring clinical applicability [82]. Despite this, there is currently no standardized approach within the medical community for documenting datasets used in AI model development, leading to calls for greater transparency in the field [83].
From a clinical perspective, ML‐based cancer risk prediction models have the potential to support earlier diagnosis, guide surveillance strategies, and enable more personalized patient management. For clinicians, these tools may assist in identifying high‐risk individuals who could benefit from intensified monitoring or preventive interventions, even when conventional risk factors [84] are absent or unclear. For patients, particularly those in underserved or high‐risk populations, such models may facilitate more equitable access to timely risk assessment and targeted care pathways. By focusing exclusively on models that included at least one symptom or sign as a predictor, this review encompasses the subset of ML tools most relevant as a clinical decision support for symptomatic patients. Consequently, our findings mainly reflect the potential of ML in supporting clinical diagnostic decision‐making in symptomatic presentation settings, rather than across the full cancer risk continuum.
A limitation identified in this review is the lack of transparency regarding the interaction and contribution of these modalities in final predictions. While fully achieving interpretability in black‐box models remains challenging, the integration of post hoc explainability methods, such as SHapley Additive exPlanations (SHAP) or Local Interpretable Model‐agnostic Explanations (LIME), could provide valuable insights into the relative influence of genetic, clinical, and sociodemographic factors. This added transparency, even if partial, could improve the clinical relevance and trustworthiness of these predictive models. Traditional statistical methods, though limited in automatic variable selection, allow for explicit definitions of non‐linear associations and interactions, encouraging thoughtful model construction [85]. In contrast, ML models automatically capture complex relationships, which may simplify modeling but also risk bypassing important contextual insights. This distinction is critical, as prediction models capture correlations that do not imply causation [86]. Cautious interpretation is therefore essential; explainable models might reveal feature importance without clarifying causal links, underscoring the need for clinicians to interpret machine learning‐generated risk predictions carefully, ensuring clinical decisions are not misinformed by correlations alone.
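As a minimal, model‐agnostic illustration of the kind of post hoc explainability discussed here, the sketch below uses scikit‐learn's permutation importance as a simple stand‐in; a dedicated SHAP or LIME analysis would instead use the shap or lime packages. The model and data are synthetic.

```python
# Minimal post hoc explainability sketch using permutation importance as a
# simple, model-agnostic stand-in for SHAP/LIME. Model and data are synthetic;
# dedicated analyses would use the shap or lime packages instead.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# Drop in AUC when each feature is shuffled on held-out data: larger drops mean
# the model relies more on that feature for its predictions (not causation).
res = permutation_importance(rf, X_te, y_te, scoring="roc_auc",
                             n_repeats=20, random_state=0)
for i, (m, s) in enumerate(zip(res.importances_mean, res.importances_std)):
    print(f"feature {i}: AUC drop {m:.3f} ± {s:.3f}")
```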
Most studies using restricted access data did not describe important demographic information, and most models were not available for additional evaluation to assess robustness. The lack of details and transparency about data and models limited our ability to systematically assess robustness and potential biases, which is an important direction for future work. Another key issue is algorithmic bias, which arises when models are trained on unrepresentative data, leading to inaccurate predictions for certain groups. This can worsen healthcare disparities. Developers must address this by using diverse datasets, evaluating model performance across demographics, and correcting biases to ensure fair and equitable healthcare outcomes [87, 88, 89].
External validation of a risk model is critical for ensuring trustworthiness before clinical deployment [90, 91, 92]. Unfortunately, in this review, only two of the 34 included studies conducted an external validation. This is an important limitation, as without proper external validation, it is difficult to determine whether the predictive performance of the models would hold up when applied to diverse patient populations or different healthcare environments. One [50] of the two studies that conducted external validation focused on using unstructured clinical notes from electronic health records to extract cancer‐related symptoms. This study developed a DL model to automatically identify symptoms documented in free‐text format, which is challenging to process at scale. The authors first trained the model using outpatient notes and then performed external validation using the MIMIC‐III dataset, which consists of physician notes from intensive care units. In this external validation, the model continued to perform well, with F1 scores ranging from 0.97 for symptoms like diarrhea and dizziness to 0.73 for swelling. These results indicate that the model was successful in capturing a wide range of clinically significant cancer‐related symptoms both in the original training setting and in a different clinical context, making it a valuable tool for scalable symptom monitoring in oncology.
The utilization of unstructured data is an emerging trend in the field of ML for healthcare. Castro et al. [93] explored the potential of ML in predicting breast cancer recurrence by combining structured and unstructured healthcare data. Tayefi et al. [94] explored the use of electronic health records for healthcare, addressing the challenges and potential of integrating structured and unstructured data. Deshmukh et al. [95] developed a clinical decision‐unifying staging method to accurately extract and predict the prognostic stage of breast cancer from unstructured medical records across various health institutions.
The comparison between ML models and traditional statistical models for cancer risk prediction suggests that ML may offer advantages in handling complex, nonlinear relationships within high‐dimensional datasets, as it does not require manually specifying the functional form of these relationships in the model. Despite these advantages, traditional models like ogRAT benefit from their simplicity and from extensive external validation, which has demonstrated their reliability in real‐world clinical settings. In contrast, a relevant number of ML models, including those previously mentioned, still require robust external validation to guarantee their generalizability and consistency across diverse patient populations and healthcare systems. Further efforts in external validation are crucial for their broader adoption in clinical practice.
A notable finding of this review is that none of the included studies explicitly adhered to the TRIPOD reporting guideline. This represents a major limitation in the methodological transparency of the current literature. Without adherence to standardized reporting frameworks such as TRIPOD or the forthcoming TRIPOD‐AI, key details regarding model development, variable selection, validation strategy, and calibration are often omitted, hindering reproducibility and independent assessment of model quality. The absence of structured reporting not only limits interpretability and external validation but also impedes the integration of these models into clinical practice. Future research should prioritize compliance with TRIPOD‐AI to ensure clarity, transparency, and comparability across ML‐based cancer risk prediction studies.
4.1 Limitations
This review has some limitations. First, only studies published in English were included, as multilingual screening, study assessment, and data extraction were not feasible within the research team. This may have led to the exclusion of relevant non‐English studies. Second, we focused on peer‐reviewed literature and did not include gray literature, preprints, theses, or conference proceedings. This approach ensured methodological rigor and reproducibility, but it might have limited the capture of emerging AI‐based research given the fast pace of developments in this field.
A major limitation of this review, and of the literature it synthesizes, is the sparsity of detailed reporting in many of the included studies, which restricted our ability to fully evaluate essential aspects such as model calibration and reproducibility. In addition, an important limitation concerns the high risk of bias observed in the index test domain, primarily due to the lack of external validation. Only 2 out of 34 studies conducted external validation, and even these were often limited to retrospective datasets from similar institutional settings. For example, while Lindvall et al. [50] validated a deep learning model on ICU notes, it was not tested in a primary care setting where the majority of symptomatic patients initially present. This severely limits the generalizability and clinical applicability of the reviewed models. Therefore, the conclusions should be interpreted with caution.
The included studies were highly heterogeneous in terms of data sources, cancer types, and predictive model architectures, which limited the possibility of direct comparison and meta‐analysis. Although all studies incorporated symptoms, the type and combination of additional predictors, such as sociodemographic characteristics, comorbidities, and laboratory or genetic data, varied substantially, contributing to the variability in model performance. This heterogeneity also influenced the assessment of methodological quality. The high proportion of studies rated as having a high risk of bias within the index test domain mainly reflects incomplete reporting, lack of external validation, and limited methodological transparency, rather than uniform methodological flaws. While alternative thresholds or subgroup analyses (for instance, stratified by study design, publication year, or validation method) might slightly modify the distribution of bias ratings, the overall conclusion remains consistent: current ML‐based cancer risk prediction studies display considerable methodological diversity and suboptimal reporting standards, highlighting the need for more rigorous and standardized research in this field.
Furthermore, our review highlights a recurring limitation in reporting performance metrics across primary studies. While AUC was the most frequently reported measure, it does not fully capture clinical utility. Only a minority of studies provided additional metrics such as calibration, positive predictive value, or net benefit, key elements for assessing real‐world applicability. Future research should prioritize comprehensive and standardized reporting of clinically relevant metrics to support the safe and effective translation of ML models into practice.
Another factor that may limit generalizability is the geographic distribution of the included studies, which was heavily skewed toward Asian countries. While this reflects the rapid integration of AI technologies and availability of large‐scale clinical datasets in these settings, differences in healthcare infrastructure, diagnostic practices, and population characteristics may restrict the applicability of these models to other regions. Future research should aim to validate and adapt ML‐based cancer risk prediction models across diverse healthcare systems to enhance external validity and global relevance.
Limited information on variable selection and data sources further restricted our assessment of model robustness. Moreover, the review focused only on indexed articles, potentially overlooking valuable insights from preprints and conference proceedings that are prominent in AI research but not yet peer‐reviewed or included in databases like PubMed, which may have narrowed our findings. Inconsistent approaches to handling missing data, validation, and variability in reporting standards across studies made it difficult to compare models and assess their practical applicability.
Another limitation is the focus on longer prediction horizons, which restricts the evaluation of model performance in shorter‐term contexts, such as 6‐month predictions. Longer follow‐up periods allow for higher predictive accuracy, as events such as cancer development are more likely to occur over longer periods. Of the included studies, only six [40, 50, 59, 62, 64, 67] reported prediction horizons, all within a 5‐year window. Short‐term predictions are more challenging but are critical for timely clinical decision‐making and early intervention. Future research should explore the performance of ML models over shorter prediction intervals to address this gap and enhance their practical utility in urgent clinical scenarios.
Additionally, this study could not determine whether the comparisons between ML models and conventional statistical techniques were entirely appropriate. In many instances, the statistical comparator used may not reflect the optimal implementation of traditional methods, potentially biasing results in favor of ML models. This raises concerns about the rigor and validity of such comparisons, as suboptimal statistical models may not provide a robust benchmark for evaluating ML performance. Greater transparency and standardization in the design and reporting of both ML and conventional models are essential to ensure fair and meaningful head‐to‐head evaluations.
Also, this review may be subject to selection bias. We excluded non‐English language studies and restricted our search to indexed literature, thereby potentially overlooking relevant gray literature and studies published in other languages. While these criteria ensured the inclusion of peer‐reviewed and methodologically sound sources, they may have narrowed the scope of the findings, especially in the context of a rapidly evolving research area such as machine learning in cancer prediction.
Finally, a key limitation of the reviewed studies is the lack of implementation of models within a learning health system framework. Such systems are critical for evaluating the real‐world performance and impact of predictive models, as they enable continuous feedback, adaptation, and integration into clinical workflows. Without this framework, it remains challenging to assess how these models perform in dynamic healthcare settings, further emphasizing the need for future research to embed predictive models into learning health systems to ensure their practical utility and scalability.

5 Conclusions
ML models show promise in managing high‐dimensional data and capturing complex, non‐linear relationships, but their practical applicability in clinical settings remains to be fully established. Techniques such as LR with carefully selected interaction terms continue to demonstrate reliability and utility. Achieving a balance between model complexity, interpretability, and clinical relevance is crucial for improving practical applicability in healthcare. Key methodological challenges, including limited external validation and calibration issues of ML models, must be addressed to enable their clinical adoption. Enhancing reporting practices—through thorough documentation of data preprocessing, model training, and validation—will be essential for developing robust and clinically useful risk prediction tools. Further research is needed to robustly compare the diagnostic accuracy of ML models and traditional statistical methods, with a focus on explaining performance differences and considering the influence of prediction horizons. Additionally, implementation studies evaluating these algorithms within learning health systems are critical to determine their real‐world applicability and impact. Future research should prioritize adherence to TRIPOD‐AI guidelines, transparent sharing of code and data, and standardized reporting of prediction horizons, missing data handling, and calibration metrics. Embedding ML models within learning health systems and ensuring fairness across diverse populations will be essential steps to enhance their real‐world utility, reliability, and ethical deployment in oncology.

Author Contributions

Flavia Pennisi: conceptualization, investigation, writing – original draft, methodology, validation, visualization, writing – review and editing, data curation, formal analysis, software. Stefania Borlini: investigation, writing – original draft, visualization, data curation, software. Hannah Harrison: writing – review and editing, methodology, supervision. Rita Cuciniello: investigation, visualization, formal analysis, writing – original draft, software. Anna Carole D'Amelio: writing – original draft, visualization, data curation, software. Matthew Barclay: writing – review and editing, methodology, formal analysis. Giovanni Emanuele Ricciardi: visualization, writing – review and editing. Georgios Lyratzopoulos: supervision, methodology, writing – review and editing. Cristina Renzi: conceptualization, writing – review and editing, writing – original draft, resources, supervision, funding acquisition, project administration.

Funding
Prof Cristina Renzi and Prof Georgios Lyratzopoulos were funded by the early detection and diagnosis committee grant EDDCPJT\100018 from Cancer Research UK. Prof Cristina Renzi, Prof Georgios Lyratzopoulos, Dr Flavia Pennisi and Dr Giovanni Emanuele Ricciardi were funded by grant May24/100066 from Cancer Research UK.

Disclosure
Role of the funder/sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Consent
The authors have nothing to report.

Conflicts of Interest
The authors declare no conflicts of interest.

Supporting Information

Table S1: Literature search strategy (PubMed).

Table S2: Data information of included studies.

Table S3: Risk predictors.

Table S4: Risk predictors: specific variables included in “other”.

Table S5: Classification of included studies according to the types of data used in machine learning models for cancer risk prediction. A study was classified as “Included” for a given data type if that type was explicitly used as input in the final model. Studies integrating three or more data types (excluding symptoms, which were present in all studies) are defined as multimodal and are shown in bold.

Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original article when reusing this content.
