
Machine Learning Models for Disease-Free Survival Analysis after Liver Resection for Hepatocellular Carcinoma: A Multicentric French Collaborative Study.


Rhaiem R, Abdo A, Calderaro J, Dokmak S, Herrero A, Toubert C


Citation (APA): Rhaiem R, Abdo A, et al. (2026). Machine Learning Models for Disease-Free Survival Analysis after Liver Resection for Hepatocellular Carcinoma: A Multicentric French Collaborative Study. Liver cancer. https://doi.org/10.1159/000550554
PMID: 41868609
DOI: 10.1159/000550554

Abstract

[INTRODUCTION] Liver resection (LR) is a potentially curative treatment for hepatocellular carcinoma (HCC), but early recurrence rates remain high, reaching up to 70%. Accurate prediction of disease-free survival (DFS) after LR is crucial to optimize patient selection for clinical trials evaluating adjuvant strategies. This study assessed machine learning (ML) models for predicting DFS after LR for HCC.

[METHODS] A total of 663 patients who underwent LR between 2010 and 2020 in 3 French HPB referral centers were analyzed. Three ML models - random survival forest (RSF), gradient boosting survival (GBS), and fast survival support vector machine (FSSVM) - were compared with the Cox proportional hazards regression model. Model performance was assessed using C-index and time-dependent area under the curve (AUC). External validation was performed using an independent cohort from a fourth center. Statistical comparisons between models were conducted using paired tests.

[RESULTS] After a median follow-up of 52 months, recurrence occurred in 43% of patients and median DFS was 15 months. In the training cohort, RSF achieved the highest discrimination (C-index = 0.721 ± 0.027) and predictive accuracy (AUC = 0.768 ± 0.039) among the four evaluated models. Cox, GBS, and FSSVM showed comparable performance (C-index = 0.684-0.699; AUC = 0.718-0.745). Paired tests demonstrated that RSF significantly outperformed Cox for both C-index (p = 0.023) and AUC (p = 0.039). In the external validation cohort, all ML models outperformed Cox regression, with RSF showing the highest performance (C-index 0.817, AUC 0.863), suggesting strong generalizability.

[CONCLUSIONS] RSF significantly improved DFS prediction compared with Cox regression and yielded the best discriminatory performance in both training and external validation cohorts. These findings highlight the value of ML-based survival models, particularly RSF, for enhancing individualized postoperative prognostication and refining patient selection for clinical trials comparing adjuvant treatment to surveillance after LR for HCC.


Introduction
Hepatocellular carcinoma (HCC) is a major global health problem, representing 90% of primary liver tumors [1]. Liver resection (LR), liver transplantation, and ablation are the 3 main curative treatments of HCC. LR remains a primary treatment option that offers a potential cure for patients with localized HCC [1]. Despite advances in surgical techniques and postoperative care, the risk of recurrence after LR remains high, reaching 70% within 2 years after surgery [2]. Several factors have been associated with a high risk of recurrence after LR, including tumor characteristics, underlying liver conditions, and surgical margins. Understanding the determinants and mechanisms behind HCC recurrence is crucial for improving long-term outcomes and developing effective strategies to monitor and manage patients after LR.

In parallel, predicting disease-free survival (DFS) after LR may represent a cutting-edge advancement in the oncological management of HCC. Patients with a high predicted risk of short DFS may be candidates for postoperative locoregional/systemic therapy [3–5]. The Cox proportional hazards model has been widely used in survival analysis with good performance metrics [6]. More recently, artificial intelligence (AI) algorithms have been applied to survival analysis with very encouraging results, outperforming Cox regression models [7–10]. Indeed, by leveraging vast datasets, including patient demographics, tumor characteristics, genetic information, and treatment protocols, AI models can analyze complex patterns and identify key predictors of relapse. Such algorithms can be trained to predict DFS after LR of HCC, offering personalized prognostic insights that can guide therapeutic strategies and postoperative monitoring.

Herein, we analyzed the performance of several ML survival algorithms to predict DFS after LR for HCC and compared them to the Cox proportional hazards regression model.

Methods

Study Population and Follow-Up
All patients treated between 2010 and 2020 in 4 French tertiary referral academic centers for hepatobiliary surgery (AP-HP Henri Mondor-Créteil, AP-HP Beaujon-Clichy, Reims and Montpellier University Hospitals) were included in a retrospective cohort to evaluate DFS after LR for HCC.

Inclusion criteria were as follows:

- Age >18 years
- Treated with curative-intent upfront LR
- Performance status (OMS 0–1)
- Preserved liver function (Child-Pugh A/B7) or MELD <9

Exclusion criteria were as follows:

- Previous/preoperative locoregional or systemic treatment
- Macroscopic positive surgical resection margins (R2)
- Missing survival and/or recurrence data

Therapeutic management was discussed within local institutional multidisciplinary team (MDT) meetings. After LR, patients were followed according to the French guidelines for the management of HCC [11], with serum AFP level and thoraco-abdominal computed tomography. In case of suspected liver recurrence, magnetic resonance imaging was performed. Confirmation and treatment of recurrence were also discussed within the MDT. Death from any cause was considered an event.

Data Collection and Ethics
Each center was responsible for the ethical approval and the anonymization of the data before any analysis. The database was built following the MR-004 protocol from the Commission Nationale de l'Informatique et des Libertés (CNIL) (No. 2206749, 13/09/2018) and was in accordance with the French authorities' requirements. Observational studies conducted under the MR-004 framework must comply with the data-protection reference methodology established by the CNIL. Our study was in accordance with MR-004 guidelines: it is a non-interventional study that does not modify patient care and uses health data collected during routine clinical practice. Patients were clearly informed of the study and their rights, including the right to object to the use of their data. No explicit written consent is required under MR-004.
All variables related to patient and tumor characteristics were recorded. Intraoperative data and in-hospital postoperative outcomes were also collected. Overall and disease-free survival were assessed. This study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guidelines [12].

Endpoints and Definitions
The primary endpoint was to compare the Cox proportional hazards model to survival machine learning (ML) algorithms in predicting DFS after LR. DFS was defined as the time from treatment until recurrence of the disease (or death) after curative-intent LR [13]. Recurrence diagnosis was based on imaging during conventional follow-up or was biopsy-proven, if required. Patients were considered uncensored (i.e., to have experienced an event) in case of HCC recurrence. Secondary outcomes were early recurrence, defined as documented relapse of HCC within 2 years after surgery [13], and overall survival (OS), calculated from the date of surgery to the last follow-up or death. Resection was considered microscopically complete (R0) if the resection margin was greater than 1 mm; otherwise, it was considered microscopically incomplete and classified as R1.

Missing Data
Covariates were not considered in the final analysis when the missing data rate was above 20%. Remaining missing data were handled by multiple imputation using the fully conditional specification implemented by the chained equations algorithm (MICE). Five imputed datasets were generated. Primary and secondary outcome variables were not included in the imputation process; in case of missing outcome data, the patient was excluded.
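The study does not publish its imputation code; as an illustrative sketch, MICE-style chained-equations imputation with five imputed datasets can be reproduced with scikit-learn's IterativeImputer (an experimental, MICE-inspired implementation). The data below are synthetic placeholders, not the study's:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))           # placeholder covariate matrix
X[rng.random(X.shape) < 0.10] = np.nan  # ~10% missingness, below the 20% cutoff

# Five imputed datasets, as in the study, obtained by varying the seed;
# sample_posterior=True draws from conditional distributions (MICE-style)
imputed = [
    IterativeImputer(sample_posterior=True, random_state=s).fit_transform(X)
    for s in range(5)
]
```

Downstream models would then be fit on each imputed dataset and their results pooled.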

Model Development
Data from 3 HPB referral centers (AP-HP Henri Mondor-Créteil, AP-HP Beaujon-Clichy, and Montpellier University Hospital) were used for model development. Model training and internal validation were performed using a 10-fold cross-validation procedure, ensuring that all patients from these centers contributed both to training and cross-validated performance estimation. To mitigate overfitting and assess generalizability, this 10-fold cross-validation was applied to the training set. Model performance from cross-validation was averaged to obtain robust estimates. Data from a fourth center, Reims University Hospital, were entirely withheld from model development and used exclusively as a fully independent external validation cohort. This design enabled assessment of each model’s generalizability across institutions and clinical practice settings. The objective was to predict DFS following LR by utilizing both preoperative and postoperative variables. Subsequently, models based solely on preoperative variables were constructed and evaluated.
As a benchmark, a Cox proportional hazards regression model was constructed. Univariable Cox analyses were first performed to identify covariates associated with DFS, and significant variables were subsequently entered into a multivariable Cox regression model. Three ML survival algorithms were also developed: random survival forest (RSF), gradient boosting survival (GBS), and fast survival support vector machine (FSSVM). An RSF is an ensemble of tree-based learners. Each tree learns patterns from a different bootstrap sample of the training data. Trees split data based on features to form groups with similar survival outcomes, and RSF aggregates the results to estimate survival probabilities. The GBS model is similar to RSF in that it uses multiple base learners, typically decision trees, for prediction. However, unlike RSF, which trains trees independently and combines their results, GBS incrementally builds an ensemble of trees, each one correcting the errors of the previous, enhancing predictive accuracy by minimizing a loss function for survival data. The FSSVM is a modified version of the standard support vector machine, specifically designed for survival data. The model is not intended to predict exact survival time but rather to identify individuals who are likely to have longer survival. It employs patient features to identify a threshold that distinguishes between high-risk and low-risk patients.
All models using pre- and postoperative features incorporated 23 clinical features (online suppl. Data 1; for all online suppl. material, see https://doi.org/10.1159/000550554). Fourteen variables were used to develop the preoperative models (online suppl. Data 1). Hyperparameters were tuned by randomized grid search. Calibration was performed within the training cohort.

Model Performance Metrics
Model discrimination was assessed using the concordance index (C-index) and the time-dependent area under the receiver operating characteristic (ROC) curve (AUC). Both metrics were computed during cross-validation (training cohort) and separately on the independent external validation cohort. Cross-validation results were used for statistical comparisons, while the independent external validation set provided an unbiased estimate of model generalization performance.
The C-index is the most commonly used evaluation metric for survival models. It is defined as the ratio of correctly ordered (concordant) pairs to comparable pairs. However, the C-index may not be the optimal performance metric when the primary focus is on a specific period. Therefore, the AUC evaluation was performed for ML models.
The area under the ROC curve is a common performance measure for binary classification regression models. It is often used to evaluate the performance of a risk score to predict a binary outcome. When the ROC curve is extended to include continuous outcomes, especially survival time, the disease status of patients is not fixed and may change over time. For instance, at a given time t, we can estimate how well the prediction model can distinguish between patients who will experience recurrence (sensitivity) and patients who will not (specificity). Thus, sensitivity and specificity become time-dependent measures. In this study, DFS was the primary outcome. By calculating the area under the ROC at a time t, we can determine the ability of the model to predict the risk of recurrence before t (ti ≤ t). Therefore, the time-dependent AUC was considered in addition to the C-index.

Statistical Analysis
The baseline characteristics of the study population are expressed as absolute numbers and relative frequencies for qualitative variables. Quantitative variables are presented as median (interquartile range) or mean (± standard deviation).
Survival rates were calculated using the Kaplan-Meier method. For Cox regression, p < 0.05 was considered significant in both univariable and multivariable analyses. To compare model performance, paired Student’s t tests were applied to cross-validation results of the C-index and AUC. The external validation cohort results were reported descriptively without statistical testing, to avoid overfitting and data reuse. Analyses were conducted using Microsoft Excel (version 16.4) for descriptive statistics and Python (version 3.11.5) with the scikit-survival library (version 0.22.2) for model development and evaluation.
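The paired comparison of cross-validated metrics can be sketched with SciPy's `ttest_rel`; the per-fold C-index values below are hypothetical illustrations, not the study's actual fold results:

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-fold C-index values from a 10-fold cross-validation
# (illustrative numbers only, not the study's fold results)
rsf_cindex = np.array([0.74, 0.70, 0.73, 0.69, 0.75, 0.71, 0.72, 0.70, 0.74, 0.73])
cox_cindex = np.array([0.70, 0.66, 0.69, 0.67, 0.71, 0.68, 0.68, 0.66, 0.70, 0.69])

# Paired test: each fold contributes one matched (RSF, Cox) pair
stat, p_value = ttest_rel(rsf_cindex, cox_cindex)
```

Pairing by fold is what justifies the Student's t test here: both models are evaluated on exactly the same held-out patients in each fold.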

Results
The total cohort enrolled 778 patients. Among them, 48 patients were excluded owing to incomplete data regarding oncological survival. Eight patients died within 90 days after surgery and were also excluded from the cohort. Baseline characteristics of patients and tumors are presented in Table 1. A total of 722 patients from 4 HPB departments were included in the overall cohort.
Patients were predominantly male (N = 591; 81.8%). Median age was 64 years (56–72). Overall, 229 patients (31.7%) were cirrhotic. All patients underwent upfront LR with no prior preoperative locoregional/systemic treatment. HCC was multifocal in 100 patients (13.8%).
The median size of the largest nodule was 40 mm (25–75). According to BCLC classification, patients were mainly BCLC A (N = 476; 66%). Median AFP level was 9 ng/mL (4–110).
The R1 resection rate was 10.4% (N = 75). Macrotrabecular-massive subtype was reported in 100 patients (13.5%). Microvascular invasion was present in 319 resected specimens (44.1%), and satellite nodules were present in 217 patients (30%).
After a median follow-up of 52 (25–68) months, 310 patients (42.9%) experienced HCC recurrence with 84 patients (11.6%) having extrahepatic recurrence. Figure 1 illustrates DFS. The rate of early recurrence was 32.5% (N = 234) and the median DFS was 15 months (95% CI: 6–36). Details of patients and tumors characteristics of both training and external validation cohort are shown in Table 1.

Results of the Different Models for the Prediction of DFS

Models Using Pre- and Postoperative Features

Comparison of the Different Models

Figure 2 and Table 2 summarize the C-index and mean time-dependent AUC for all models using the training and external validation cohorts.

Model Discrimination (C-Index)
In the training cohort, the mean C-index values for each model were as follows: Cox = 0.684 ± 0.038, RSF = 0.721 ± 0.027, GBS = 0.699 ± 0.034, and FSSVM = 0.699 ± 0.028. The Cox, GBS, and FSSVM models showed largely similar mean C-index results. RSF achieved the highest C-index among all models, significantly outperforming both the Cox model (p = 0.023) and FSSVM (p = 0.016) but not GBS (p = 0.095). There was no significant difference between the GBS and FSSVM models (p = 0.570).
In the external validation cohort, RSF again achieved the highest C-index (0.817), followed by GBS (0.810), FSSVM (0.809), and Cox (0.792). These findings demonstrate that RSF consistently shows strong discrimination ability in both training and external validation cohorts, underscoring its generalizability.

Model Predictive Accuracy (Time-Dependent AUC)
In the training cohort, the mean time-dependent AUC values were as follows: Cox = 0.718 ± 0.043, RSF = 0.768 ± 0.039, GBS = 0.745 ± 0.038, and FSSVM = 0.738 ± 0.050. Among these, RSF exhibited the highest predictive accuracy over time. Paired t tests indicated that RSF significantly outperformed both the Cox model (p = 0.0392) and GBS (p = 0.0226), while its difference with FSSVM was not statistically significant (p = 0.1685). Comparisons between the Cox model and FSSVM (p = 0.198) revealed no significant differences, suggesting comparable predictive performance among these models.
Within the external validation cohort, RSF again achieved the highest AUC (0.863), followed by GBS (0.828), FSSVM (0.828), and Cox (0.824), thus demonstrating its consistent predictive performance across independent datasets. Further detailed pairwise comparisons are provided in online supplementary Data 2.

Temporal Stability of Predictions
Throughout the postoperative follow-up period, RSF consistently outperformed GBS, FSSVM, and CoxPH in predictive accuracy (Fig. 3, 4). In both training and validation cohorts, all models achieved optimal discrimination within 40 months after LR, but their time-dependent AUC curves dropped and fluctuated beyond that period. Therefore, it appears that predicting DFS with these different models is reliable only within this specific postoperative timeframe (Table 3a, b).

Feature Importance
Feature weights for both the RSF and Cox regression models were assessed using permutation-based feature importance. Results for the 10 most relevant variables are shown in Table 3a and b. This method evaluates the decrease in model accuracy or concordance index when the values of a given feature are randomly shuffled, reflecting its contribution to the survival prediction.
In the RSF model, the presence of satellite nodules emerged as the most significant variable, followed by the vessels encapsulating tumor clusters subtype, cirrhosis, pre-intervention BCLC stage, and microvascular invasion. Additionally, the number of nodules, size of the largest nodule, surgical margin, and preoperative AFP level were identified among the top ten most influential features in the model.
The Cox regression model identified satellite nodules, microvascular invasion, number of tumors on the specimen, cirrhosis, and vessels encapsulating tumor clusters subtype as the five most significant variables. Similar to RSF, satellite nodules had the greatest effect on survival prediction within the Cox model. Notably, preoperative AFP level, surgical margin, and importantly the size of the largest nodule were not among the top ten most important features.

Application to Predict DFS after LR Using RSF
An application has also been developed implementing the RSF model with the top 10 most relevant features to estimate DFS after LR for HCC. It is available through the following link: https://survival-prediction-nks3xhygu8pknc4rhpdjtr.streamlit.app/.

Models Using Only Preoperative Features
To evaluate the clinical feasibility of preoperative DFS prediction, all models were retrained using only preoperative clinical and radiologic variables. Performance decreased substantially across all approaches. In the training cohort, the C-index ranged from 0.591 ± 0.061 (Cox) to 0.635 ± 0.066 (RSF), with similarly modest AUC values (0.600–0.661). RSF achieved the best metrics. External validation confirmed these findings (C-index: 0.671–0.715, AUC: 0.649–0.738). Although RSF and GBS showed slightly higher discrimination than Cox, the absolute gain was small (C-index: ≤0.020, AUC: ≤0.061) and of limited clinical relevance. Detailed metrics are provided in online supplementary Materials 3 and 4. Online supplementary Material 5 represents the time-dependent AUC of all preoperative models within the external validation cohort.

Discussion
This study compared the performance of a conventional Cox regression model with 3 ML approaches – RSF, GBS, and FSSVM – for predicting DFS after LR. Across all evaluated models, RSF consistently achieved the highest C-index and AUC, highlighting its robustness and flexibility in capturing complex, nonlinear relationships within the data. This finding aligns with previous studies demonstrating the advantage of ensemble tree-based survival methods in heterogeneous clinical datasets where variable interactions and nonproportional hazards may exist [14]. The Cox model, while interpretable and widely used, assumes proportional hazards and linear effects of covariates, which may limit its performance when these assumptions are violated. Both GBS and FSSVM models demonstrated comparable predictive performances, suggesting that these ML approaches can serve as practical alternatives to Cox regression, especially when prioritizing accuracy over interpretability. Paired statistical comparisons show that RSF significantly outperforms Cox regression in predictive discrimination and calibration.
These findings indicate that RSF provides the most reliable survival predictions for these datasets, especially with complex predictor relationships. However, given the comparable AUCs across models, Cox, GBS, and FSSVM may be suitable alternatives depending on the research context, data dimensionality, and need for interpretability.
The inclusion of both 10-fold cross-validated training results (with statistical comparisons) and external validation provides a balanced evaluation of the models' performances. Cross-validation enabled statistical testing and model fine-tuning, whereas the external validation set provided an unbiased assessment of generalizability, which is critical for future clinical application. The RSF model achieved the highest C-index and AUC in the external cohort, supporting its robustness across institutions with different perioperative practices, imaging workflows, and pathology protocols. The institutional and temporal differences likely enhanced the model's stability and mirror real-world clinical variability, supporting the use of RSF in multicenter settings.
Instead of relying solely on the concordance index (C-index), time-dependent AUC was used as an additional metric to evaluate survival models [15]. This metric allows a dynamic perspective on model performance over different time intervals, revealing changes in predictive accuracy and highlighting optimal periods. It provides more detailed insight than other metrics, which is important for assessing the model’s practical value in clinical settings.
Although both Cox and RSF models identified similar key features, RSF highlighted a broader set of predictors – like BCLC before intervention and preoperative AFP level – reflecting its ability to handle complex, nonlinear relationships. In contrast, the Cox model predominantly focused on linear effects, with a more concentrated set of features driving survival outcomes. These findings underscore the strengths of each model: RSF’s flexibility with variable interactions and Cox regression’s strength in modelling simpler, linear associations.
In recent years, several studies have compared the traditional Cox regression model to ML survival algorithms in various clinical contexts. Astley et al. [16] reported the results of an explainable deep learning model to predict survival of patients treated for non-small cell lung cancer (NSCLC) by radical radiotherapy. The dataset enrolled 471 stage I–IV NSCLC patients and compared the performance of the Cox proportional hazards model, RSF, and a deep learning algorithm to predict OS. The DL algorithm yielded a better C-index of 0.670 compared to the Cox model and an improved integrated Brier score (IBS) of 0.121 compared to the CPH and RSF approaches. Similarly, She et al. [17] analyzed the results of 17,332 NSCLC patients and reported better results with a deep learning survival neural network (the DeepSurv model) compared to the TNM staging system to predict cancer-specific survival (C statistic = 0.739 vs. 0.706). The model also provided an individual treatment recommendation, and patients who received the treatment recommended by the model had better survival than those who received other options [17]. Several reports have shown similar findings with the DeepSurv model in various settings [18–20].
However, conflicting results were presented by Bae et al. [21], who compared various deep learning survival networks to the Cox regression model in the prediction of 10 cancers. The authors used the blood-based cohort of the Korea Cancer Prevention Research-II Biobank. DeepSurv had the poorest predictive performance, with a C-index of about 0.5 for every cancer. nDeep and multilayer nDeep achieved the best C-indices, above 0.8 for all cancers, and outperformed the Cox model [21]. In a recent German registry-based study, Germer et al. [7] compared the Cox model to RSF, DeepSurv, and TabNet to predict survival of patients with lung cancer. The Cox regression model had similar metrics compared to the TabNet and RSF models. However, the DeepSurv approach had no predictive power and the worst prediction performance [7].
In our study, ML models outperformed the Cox regression model. RSF was the best predictive model, followed by GBS. Regarding the superiority of RSF over the Cox model, a few conflicting results have been published: Wang et al. [22] compared both models to predict mortality in patients with hemorrhagic stroke, while Kar et al. [23] reported the results of RSF, Cox, and DeepSurv to predict recurrence after surgical resection of stage I NSCLC. In both studies, RSF yielded better predictive performance than the Cox model. On the other hand, in a small dataset of 82 patients treated for high-grade glioma with proton and carbon ion radiotherapy, the Cox regression model outperformed RSF [24].
To the best of our knowledge, this is the first study comparing different survival models after LR for HCC. Our results suggest that ML models may more effectively capture the complex feature interactions required for accurate DFS prediction.
The RSF model using pre- and postoperative features yielded good predictive performance for the first 40 months after surgery, with consistently good prediction in the external validation set until 24 months. Such a model can be valuable for postoperative risk stratification and patient counseling. In patients with a predicted high risk of poor DFS, adjuvant treatment and liver transplantation as a bridge strategy in eligible and highly selected patients may be discussed within the MDT. More importantly, with ongoing research on adjuvant therapies after LR for HCC, such models can refine inclusion criteria for clinical trials and identify patients who may benefit from adjuvant treatment in case of predicted poor DFS after LR. A web application was developed and is accessible through a direct link. It generates a curve estimating DFS after LR. These estimates should be interpreted as relative risk stratification tools rather than absolute decision thresholds for individual patient management.
Although our primary objective was to predict DFS after LR using both pre- and postoperative variables to achieve maximal accuracy, we acknowledge that models relying solely on preoperative data have higher clinical utility for surgical decision-making and patient selection. When restricted to preoperative information, model performance deteriorated noticeably. Despite reasonable discrimination ability, their performances did not achieve a level of accuracy sufficient for recommendation in clinical practice. This finding highlights that postoperative pathological features carry substantial prognostic weight that cannot be fully captured preoperatively. Full models integrating intra- and postoperative factors may better reflect long-term oncologic outcomes. External validation was performed to assess the generalizability of the model. Although there were notable differences in patient and tumor characteristics between the training cohort and the external validation group (see Table 1), the RSF still produced promising prediction results, with a C-index of 0.817 and an AUC of 0.863.
Importantly, the strongest predictors identified in the full RSF model, such as satellite nodules, microvascular invasion, and histological subtypes are pathological features identified after surgery. While their inclusion substantially improves predictive performance, it inherently limits the full model’s usefulness for preoperative decision-making. Thus, the primary clinical value of the RSF model using pre- and postoperative variables lies in postoperative risk stratification, identification of patients at high risk of a short DFS after LR, and enrichment of adjuvant therapy trials rather than in preoperative decision-making.
The heterogeneity inherent to our multicenter dataset (differences in referral patterns, imaging equipment, perioperative pathways, and pathology workflows) likely contributed to improved model robustness. Temporal changes between 2010 and 2020 introduce variability, but this strengthens the ecological validity of ML-based prognostic tools by reflecting real-world clinical landscape. RSF, in particular, is well suited to learn stable survival patterns in the presence of institutional and temporal heterogeneity.
Time-dependent AUC analyses showed strong performance during the early and intermediate postoperative period (5–40 months) but a progressive decline in accuracy beyond 50 months across all models. This reduction is expected given the relatively small number of late recurrences and the increasing proportion of censored observations. Additionally, it may be partly explained by the natural history of HCC recurrence following LR: early recurrence, typically occurring within the first 2 years, is mainly driven by residual microscopic disease and aggressive tumor features, whereas late recurrence arises from de novo tumorigenesis and is closely related to the severity of the underlying liver disease. In addition, competing risks such as non-cancer-related mortality increase over time and may further attenuate model discrimination. Together, these factors and the biological differences between early and late recurrence likely contribute to the reduced long-term predictive accuracy observed across all models. Importantly, this limitation does not substantially impact clinical applicability: more than 70% of recurrences in our cohort occurred within 24 months and over 85% before 36 months, findings consistent with the natural history of HCC. Accordingly, the timeframe during which model performance is optimal aligns with the period that informs most postoperative clinical decisions, such as surveillance planning, patient counseling, and eligibility for adjuvant therapy trials. The application accompanying this manuscript also clearly informs clinicians that predictions beyond 36 months should be interpreted with caution.
While RSF demonstrated the highest overall predictive accuracy, its advantage over Cox regression in the training cohort was limited. Nevertheless, the performance gains were notably greater in the external cohort, indicating that RSF may offer substantial benefits in real-world applications. It is recognized that ML models are generally less interpretable than Cox regression. To address this limitation, variable importance measures were included, allowing clinicians to assess the relative influence and directionality of key predictors. Indeed, satellite nodules were the most important feature in both the RSF and Cox models. Relevant preoperative variables, such as the size of the largest nodule and the preoperative AFP level, carried more weight in the RSF model than in the Cox model, which may reflect the ability of RSF to encode interactions between features. Further investigation into variable importance for RSF models could provide valuable insights into the key factors influencing DFS.
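The variable importance idea above can be sketched as permutation importance for a survival model: shuffle one feature column and measure the drop in Harrell's C-index. This is a generic illustration, not the study's implementation; the toy `risk_fn` below stands in for a fitted RSF's predicted risk scores.

```python
import random

def harrell_c(times, events, risks):
    """Harrell's concordance index over comparable pairs.

    A pair (i, j) is comparable when the subject with the shorter
    follow-up had an observed event; it is concordant when that subject
    also has the higher predicted risk. Ties in time are ignored here.
    """
    num = den = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                den += 1
                if risks[i] > risks[j]:
                    num += 1
                elif risks[i] == risks[j]:
                    num += 0.5
    return num / den

def permutation_importance(X, times, events, risk_fn, feature, n_rounds=20, seed=0):
    """Mean drop in C-index when one feature column is shuffled.

    Features the model actually uses produce a positive drop; features it
    ignores produce a drop of exactly zero.
    """
    rng = random.Random(seed)
    base = harrell_c(times, events, [risk_fn(row) for row in X])
    drops = []
    for _ in range(n_rounds):
        col = [row[feature] for row in X]
        rng.shuffle(col)
        Xp = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, col)]
        drops.append(base - harrell_c(times, events, [risk_fn(r) for r in Xp]))
    return sum(drops) / n_rounds
```

For a toy cohort where only the first feature drives the (hypothetical) risk score, shuffling that feature lowers the C-index while shuffling the irrelevant second feature changes nothing, which is the behavior a clinician reads off an importance plot.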
Our objective is not to supplant classical regression methods but rather to evaluate whether ML approaches can enhance existing models by capturing complex and nonlinear survival associations. Within this framework, RSF provided added value, especially when working with diverse patient populations.
The absence of genomic, transcriptomic, radiomic, and digital pathology features represents a clear limitation of our study, as such signatures are increasingly relevant for personalized prognostication. Nonetheless, these data were not routinely collected or standardized across centers during the inclusion period. We underscore that integrating multimodal data should be a central aim of future prospective multicenter research initiatives.
Furthermore, the retrospective design of this study represents an inherent limitation. Retrospective analyses are susceptible to selection bias, information bias, and heterogeneity in follow-up strategies and recurrence assessment across centers. Therefore, prospective validation is essential prior to any clinical application of the proposed models, especially when considering their use for patient selection or stratification in adjuvant therapy trials. In addition, the external validation cohort originated from a single French center and thus remains within the same national healthcare system, surgical standards, and pathology workflows. As a result, the generalizability of our results to non-French or non-European healthcare settings remains uncertain, and further international validation in diverse healthcare systems is warranted to confirm the broader applicability of the proposed models. Finally, DFS was defined as recurrence or death after LR, and competing-risk-specific methods were not applied. While this approach may be acceptable in the context of ML-based survival analysis, competing-risk models such as Fine-Gray regression [25] represent an alternative analytical strategy that may yield complementary insights and should be explored in future studies.
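Fine-Gray regression itself is typically fitted with dedicated packages (e.g., the R cmprsk package). As a simpler, nonparametric illustration of the competing-risks idea raised above, the following sketch estimates the Aalen-Johansen cumulative incidence of one event type: competing events shrink the at-risk set without inflating the incidence of the cause of interest, unlike the naive 1 minus Kaplan-Meier approach. This is an illustrative implementation, not code from the study.

```python
def cumulative_incidence(times, causes, cause_of_interest, horizon):
    """Aalen-Johansen cumulative incidence of one event type by `horizon`.

    `causes[i]` is 0 for censoring, otherwise an integer event type
    (e.g., 1 = recurrence, 2 = non-cancer death). At each event time,
    the increment is the all-cause event-free survival just before that
    time multiplied by the hazard of the cause of interest.
    """
    data = sorted(zip(times, causes))
    n_at_risk = len(data)
    surv = 1.0   # all-cause event-free survival just before current time
    cif = 0.0
    i = 0
    while i < len(data) and data[i][0] <= horizon:
        t = data[i][0]
        d_all = d_k = censored = 0
        while i < len(data) and data[i][0] == t:  # group ties at time t
            if data[i][1] == 0:
                censored += 1
            else:
                d_all += 1
                if data[i][1] == cause_of_interest:
                    d_k += 1
            i += 1
        cif += surv * d_k / n_at_risk
        surv *= 1 - d_all / n_at_risk
        n_at_risk -= d_all + censored
    return cif
```

In a hypothetical four-patient series where recurrence (cause 1) and non-cancer death (cause 2) alternate, each cause accumulates an incidence of 0.5 and the two curves sum to 1, whereas 1 minus Kaplan-Meier for either cause alone would overestimate it.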
In conclusion, our results indicate that advanced AI models like RSF improve recurrence and DFS prediction after LR for HCC. The ability of RSF to handle complex data structures and interactions is an asset in clinical decision-making for long-term outcomes. Future multicenter studies using omics and radiomics data will be key to refining predictive accuracy and integrating ML-based methods into standard care.

Statement of Ethics

The study followed the French national MR004 protocol from the Commission Nationale de l’Informatique et des Libertés (No. 2206749, 13/09/2018) and required neither written informed consent from the patients nor additional ethics approval.

Conflict of Interest Statement

The authors have no conflicts of interest to declare.

Funding Sources

This study was not supported by any sponsor or funder.

Author Contributions

Rami Rhaiem: conceptualization, methodology, software, formal analysis, and writing – original draft. Ammar Abdo and Rudy Merieux: methodology, software, validation, formal analysis, and writing – original draft. Julien Calderaro: conceptualization, methodology, validation, data curation, writing – review and editing, and supervision. Safi Dokmak, Astrid Herrero, Perrine Zimmermann, Alexandra Heurgué, Camille Boulagnon-Rombi, and Giuliana Amaddeo: data curation, formal analysis, investigation, and visualization. Cyprien Toubert, Alain Luciani, Hélène Regnault, and Marie Alaux: data curation, investigation, and visualization. Mickael Lesurtel: data curation, formal analysis, and writing – review and editing. Valérie Paradis, Daniele Sommacale, and Olivier Soubrane: data curation, investigation, and writing – review and editing. Stefano Caruso: conceptualization, methodology, validation, formal analysis, and writing – review and editing. Reza Kianmanesh: validation, formal analysis, and writing – review and editing. Raffaele Brustia: conceptualization, methodology, validation, and supervision.
