Machine learning approaches for predicting breast cancer recurrence using clinical and histopathological data.
APA
Bhat MA, Mir MA, et al. (2025). Machine learning approaches for predicting breast cancer recurrence using clinical and histopathological data. Clinical and Experimental Medicine, 26(1), 73. https://doi.org/10.1007/s10238-025-02018-x
MLA
Bhat MA, et al. "Machine Learning Approaches for Predicting Breast Cancer Recurrence Using Clinical and Histopathological Data." Clinical and Experimental Medicine, vol. 26, no. 1, 2025, p. 73.
PMID
41467920
Abstract
Breast cancer remains the most common malignancy among women worldwide, with recurrence representing a major clinical challenge. Although significant progress has been made in early detection and treatment, recurrence affects up to 40% of patients in Brazil, influencing survival outcomes and therapeutic decisions. In this context, Machine Learning offers valuable potential for enhancing recurrence prediction by enabling data-driven risk assessment and personalized patient care. In the present study, clinical and histopathological information was extracted from unstructured medical records of breast cancer patients. A clustering technique (K-Means) was applied to identify patient subgroups with varying tumor aggressiveness profiles. Survival outcomes were further analyzed using the Cox proportional hazards model. Two distinct subgroups were identified: for less aggressive tumors, Quadratic Discriminant Analysis achieved a remarkably high recall of 0.9872, while for more aggressive tumors, Random Forest provided the most favorable trade-off between recall (0.7296) and precision (0.6811). Future research should explore validation across multiple institutions, incorporate molecular biomarkers, and leverage deep learning approaches to enhance predictive performance.
Introduction
The global burden of breast cancer continues to grow, with more than 2 million new cases expected each year, making it one of the leading causes of cancer-related deaths among women worldwide [1]. “In the United States, nearly 13% of women—about 1 in 8—are projected to develop invasive breast cancer during their lifetime” [2]. Similarly, in Brazil, breast cancer is the most commonly diagnosed cancer among women (excluding non-melanoma skin tumors), with approximately 74,000 new cases anticipated between 2023 and 2025 [3].
Although improvements in screening, early detection, and individualized treatment have enhanced survival outcomes [4], recurrence remains a significant clinical concern. It affects nearly 30% of patients globally and up to 40% in Brazil, undermining both prognosis and quality of life [3, 5]. Notably, late recurrence can occur even decades after the initial diagnosis, up to 32 years later, often influenced by the tumor’s baseline characteristics [6].
Emerging research highlights the growing role of Artificial Intelligence (AI) and Machine Learning (ML) in capturing hidden patterns within clinical datasets and improving recurrence prediction [7]. Techniques such as Support Vector Machines (SVM), Boosted Decision Trees (BDT), and Naive Bayes (NB) have achieved accuracies between 70% and 90%, leveraging features like patient age, TNM staging, histopathology, hormonal receptor status, and HER2 expression [8, 9].
Considering the clinical significance and complexity of recurrence, the present study applies ML techniques to predict breast cancer relapse using essential clinical and histopathological variables, including hormone receptor profiles and TNM staging. By combining predictive modeling with survival analysis, this study aims to support personalized treatment strategies and long-term patient monitoring, emphasizing AI’s growing influence as a decision-support tool in modern oncology.
Methodology
Ethical compliance and participant consent
All methods were carried out in accordance with relevant national and international guidelines and regulations, including those outlined by the Declaration of Helsinki and Brazilian Resolution CNS 466/2012. The research protocol was reviewed and approved by two independent ethics committees: the Research Ethics Committee of Universidade Presbiteriana Mackenzie (approval number 5.483.395/2022) and the Research Ethics Committee of Instituto do Câncer Dr. Arnaldo Vieira de Carvalho (approval number 6.297.865/2022). Prior to participation, all individuals received comprehensive information about the study objectives, procedures, potential risks, and benefits. Written informed consent was obtained from all participants, ensuring their voluntary involvement in the study. The confidentiality and anonymity of participants were rigorously maintained throughout the research process.
Data Availability
The datasets generated and/or analysed during the current study are not publicly available due to privacy and confidentiality agreements with the participating hospital in Brazil, which prohibit the public sharing of patient-related data. These data contain sensitive health information and are protected under Brazilian data protection laws (such as the General Data Protection Law – LGPD). However, de-identified datasets may be made available from the corresponding author upon reasonable request and with appropriate ethical approvals.
Data
The data used in this study were obtained from the clinical records of patients treated for breast cancer at the Instituto do Câncer Doutor Arnaldo Vieira de Carvalho (CEP 58533622.9.1001.0084). The original database included 1,500 patients diagnosed between 1996 and 2023. During data preprocessing, records from three patients were merged due to duplicate entries identified across different hospital systems, resulting in a more consistent and accurate dataset. Only patients with complete and relevant clinical information were retained for analysis. Patients who did not complete treatment or lacked follow-up data within 5 years were excluded. Consequently, the final dataset comprised 1,390 patients, of whom 33.7% had documented follow-up information for at least five years.
A total of 22 distinct variables were collected, describing clinical characteristics, histopathological features, and prognostic factors. The endpoint was a local or distant recurrence. Among the included patients, 16% experienced breast cancer recurrence. We defined recurrence only when the tumor reappeared on the same side; if it occurred on the contralateral breast, it was classified as a new primary case, regardless of the time between the initial diagnosis and the appearance of the contralateral lesion.
In Fig. 1, the presented pipeline describes the data transformation process, developed in Python 3.10, in three main stages: Raw Data, Silver Data, and Gold Data. In the first stage, raw data, stored in PDF files for different patients, is collected and subjected to an extraction process, where relevant information is converted into an initial structured format, resulting in the Raw Data level. In the next stage, Silver Data, the extracted raw data undergoes a process of organization and standardization, being converted into JSON format through an extraction script. This intermediate data, already structured but still semi-processed, is optimized for greater accessibility. Finally, in the Gold Data stage, the standardized data is refined, and the features assigned by the physicians are extracted from the JSON files for each patient using a GPT-3.5 turbo model, resulting in a dataset ready for analysis with maximum quality to support decisions and insights. This systematic workflow ensures an efficient and structured transition from unprocessed data to finalized and usable data.
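As a rough illustration, the Raw → Silver → Gold progression can be sketched as three small functions. The function names, record fields, and key-value extraction logic below are hypothetical stand-ins, not the authors' code; in particular, a simple key filter replaces the GPT-3.5 Turbo extraction step.

```python
import json

def extract_raw(pdf_text: str) -> dict:
    """Raw Data: pull key/value lines out of the (already text-extracted) PDF."""
    record = {}
    for line in pdf_text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            record[key.strip().lower().replace(" ", "_")] = value.strip()
    return record

def to_silver(record: dict) -> str:
    """Silver Data: standardize the record into JSON for downstream processing."""
    return json.dumps(record, ensure_ascii=False)

def to_gold(silver_json: str, wanted_features: list[str]) -> dict:
    """Gold Data: keep only the physician-selected features.
    (In the study this step used a GPT-3.5 Turbo model; a plain key
    filter stands in for it here.)"""
    record = json.loads(silver_json)
    return {f: record.get(f) for f in wanted_features}

pdf_text = "Diagnostic age: 54\nKi67: 30%\nMolecular subtype: Luminal B"
gold = to_gold(to_silver(extract_raw(pdf_text)), ["diagnostic_age", "ki67"])
print(gold)  # {'diagnostic_age': '54', 'ki67': '30%'}
```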
Reproducibility
To support transparency and reproducibility, we have made available all preprocessing scripts, model-training pipelines, and a synthetic dataset that preserves the statistical characteristics of the original cohort without disclosing any personal information. These resources can be accessed through the GitHub repository linked in the Data Availability section.
Detailed description of patient counts
A total of 1,500 patient records were initially collected from the institutional database. The complete preprocessing workflow involved multiple filtering, cleaning, and modeling steps, each of which resulted in a well-defined reduction in sample size. A detailed breakdown is provided below:
1. Initial dataset (N = 1,500)
These records represented all available breast cancer patient files from the study period. They included structured clinical data, unstructured PDF reports, pathology descriptions, and follow-up outcomes.
2. Removal of duplicates and incomplete cases (N = 1,390)
During the first stage of preprocessing, the following records were excluded:
Duplicate entries representing repeated uploads or multiple records for the same patient.
Missing critical outcome data, such as recurrence status or follow-up duration.
Incomplete pathology reports without essential histopathological variables (e.g., ER/PR status, Ki67, tumor subtype).
After applying these filters, 110 records (7.3%) were removed, leaving 1,390 valid patient records for further processing.
3. Final cleaned dataset before clustering (N = 1,091)
An additional quality control step excluded cases where:
GPT-based extraction returned unresolved inconsistencies,
pathology variables were outside clinical ranges,
follow-up time was insufficient for determining outcome labels.
This resulted in 299 additional removals, producing a final pre-clustering dataset of 1,091 patients. This dataset contained all variables required for machine-learning training and was considered the primary modeling dataset.
4. Cluster-Specific dataset after K-Means (N = 1,026)
The K-Means clustering algorithm (k = 2) was then applied to the 1,091 cases. However, 65 records (5.95%) were excluded from modeling for the following reasons:
They were identified as cluster-outliers based on PCA and silhouette analysis.
Their feature profiles did not consistently align with either cluster centroid.
They contained borderline or conflicting values (e.g., missing Ki67 or inconsistent HR expression).
Thus, 1,026 patients remained for cluster-based model training and validation:
Cluster 1 (Group 1): Low–intermediate risk profiles.
Cluster 2 (Group 2): Higher-risk, more aggressive profiles.
These clusters formed the basis for the development of group-specific predictive models.
5. Final test set (N = 305)
For performance evaluation, the dataset was split into:
Training set: 721 patients (70%).
Test set: 305 patients (30%).
The split was stratified according to recurrence outcome to preserve class proportions.
SMOTE oversampling was applied only to the training set to prevent data leakage.
The test set of 305 patients was held out completely from the training process and was used solely for reporting unbiased model performance.
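The stratified 70/30 hold-out described above can be reproduced with scikit-learn's `train_test_split`; synthetic features and labels stand in for the patient data (the paper's exact 721/305 counts presumably reflect their particular sample).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1026, 5))                 # 1,026 patients, 5 features
y = (rng.random(1026) < 0.16).astype(int)      # ~16% recurrence rate

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
print(len(X_tr), len(X_te))  # 718 308 (close to the paper's 721/305)
# Stratification keeps the recurrence proportion nearly identical:
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```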
After completing the entire pipeline, a final file is obtained containing all the features identified by the physicians as relevant for building the model, extracted from the PDF documents for each patient. The following describes the features extracted.
Validation of automated variable extraction
To ensure the reliability of the automated extraction process, a stratified random subset of 150 patient PDF records was manually reviewed by two independent evaluators. The overall extraction accuracy achieved by the GPT-3.5 model was 94.1%, with categorical variables demonstrating an accuracy of 96.3% and numerical variables 91.8%. Minor discrepancies occurred primarily in free-text hormone receptor descriptions. These results confirmed that the automated extraction pipeline produced sufficiently accurate structured data for subsequent modeling.
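The agreement check behind these accuracy figures amounts to a per-field comparison between the automated output and the evaluators' annotations. The records and field names below are illustrative, not the study's data.

```python
def extraction_accuracy(auto: list[dict], manual: list[dict], fields: list[str]) -> float:
    """Fraction of (record, field) pairs where the automated value
    matches the manual gold annotation."""
    matches = sum(a.get(f) == m.get(f) for a, m in zip(auto, manual) for f in fields)
    return matches / (len(auto) * len(fields))

auto   = [{"er": "positive", "ki67": 30}, {"er": "negative", "ki67": 12}]
manual = [{"er": "positive", "ki67": 30}, {"er": "positive", "ki67": 12}]
print(extraction_accuracy(auto, manual, ["er", "ki67"]))  # 0.75
```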
Extracted variables
Two authors made a preliminary selection of relevant variables based on clinical evidence. The variables selected for building the predictive models were grouped into three main categories: numerical, categorical, and binary.
Numerical Variables:
diagnostic_age: Patient’s age at the time of diagnosis.
tumor_size: Dimensions of the tumor.
n: Nodal stage of the tumor, based on the TNM system.
compromised_axillary_lymph_node: Number of compromised axillary lymph nodes.
status_re: Percentage of tumor cells expressing estrogen receptor (ER status).
status_rp: Percentage of tumor cells expressing progesterone receptor (PR status).
ki67: Percentage of proliferating tumor cells.
Categorical Variables:
surgical_type: Type of surgery performed.
surgical_margins: Status of surgical margins (clear or compromised).
histological_type: Histological type of the tumor.
tumor_degree: Histological grade of the tumor.
molecular_subtype: Molecular subtype of breast cancer.
t: Primary tumor stage, based on the TNM system.
m: Presence of distant metastases, based on the TNM system.
Binary Variables:
menopausal: Menopausal status of the patient (yes/no).
pr: Presence of progesterone receptor (yes/no).
er: Presence of estrogen receptor (yes/no).
her2: HER2 gene hyperexpression (yes/no).
vascular_embolization: Presence of vascular embolization (yes/no).
angiolymphatic_invasion: Presence of angiolymphatic invasion (yes/no).
involved_axilla: Axillary involvement (yes/no).
lymph_node_invasion: Lymph node invasion (yes/no).
axillary_dissection: Axillary dissection performed (yes/no).
The extraction of variables was based on clinical evidence and prognostic relevance. Numerical variables were included to provide quantitative information on the severity and extent of the disease. Categorical variables capture different tumor characteristics that impact the risk of recurrence, while binary variables help identify specific factors that are either present or absent in patients. Although the variables “lymph node invasion” and “axillary involvement” have the same meaning, we decided to use them separately, as the data in the medical records are unstructured, and sometimes one term appears while other times the alternative is used. In the next step, the variables were subjected to a cleaning process, normalization, and encoding, as detailed in Sect. 2.2.
Data preprocessing
Preprocessing was an essential step to ensure the quality and consistency of the data before its use in predictive models (see Fig. 2). Initially, features exhibiting more than 80% missing values were excluded to improve data quality and enhance model performance. Subsequently, missing values in numerical variables were addressed using median imputation, and the data were standardized using the StandardScaler method from the Scikit-Learn library [10], ensuring that all numerical variables were on the same scale. Missing values in categorical variables were imputed using the mode, and the variables were subsequently encoded into binary format using the OneHotEncoder method from the Scikit-Learn library [10], which generated new columns for each existing category while ignoring unknown categories. Binary variables were handled straightforwardly, with mode imputation to fill in missing values while maintaining their original format. The feature “tumor size” was excluded from the dataset because approximately 82% of its values were missing.
All these transformations were integrated into a Pipeline method from Scikit-Learn library [10], which organized and applied specific methods to each group of variables. This transformer was then incorporated into a pipeline, ensuring that preprocessing was performed automatically and consistently throughout all stages. After fitting the pipeline to the training data, the transformed data were converted into a dense matrix, resulting in a new DataFrame containing standardized numerical variables, expanded columns of categorical variables, and the original binary variables. Additionally, to address data imbalance, the SMOTE (Synthetic Minority Over-sampling Technique) [11] method was applied, improving the representation of underrepresented classes.
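A minimal sketch of such a Scikit-Learn pipeline, assuming a small illustrative subset of the study's columns, could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["diagnostic_age", "ki67"]
cat_cols = ["molecular_subtype"]
bin_cols = ["her2"]

preprocess = ColumnTransformer([
    # numerical: median imputation, then standardization
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    # categorical: mode imputation, then one-hot encoding
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
    # binary: mode imputation only, original format preserved
    ("bin", SimpleImputer(strategy="most_frequent"), bin_cols),
])

df = pd.DataFrame({
    "diagnostic_age": [54, 61, np.nan, 47],
    "ki67": [30, np.nan, 80, 15],
    "molecular_subtype": ["Luminal A", "Luminal B", np.nan, "Luminal A"],
    "her2": [0, 1, 0, np.nan],
})
X = preprocess.fit_transform(df)
print(X.shape)  # (4, 5): 2 scaled numeric + 2 one-hot categories + 1 binary
```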
Handling of class imbalance
To address the imbalance in recurrence outcomes, the SMOTE algorithm was applied exclusively to the training partition after the dataset was split into training and test sets. This ensured that no synthetic observations influenced the test data, preventing data leakage and preserving the validity of model evaluation.
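The core idea behind SMOTE, synthesizing minority samples by interpolating between a minority point and one of its nearest minority neighbours, fits in a few lines of NumPy. The study used the standard implementation [11]; this toy version is for illustration only and is applied to the training split alone.

```python
import numpy as np

def smote_like(X_min: np.ndarray, n_new: int, k: int = 3, seed: int = 0) -> np.ndarray:
    """Generate n_new synthetic minority samples by linear interpolation
    between a random minority point and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # k nearest minority neighbours
        j = rng.choice(nn)
        gap = rng.random()                    # position along the segment
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synth)

X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like(X_minority, n_new=5)
print(new_points.shape)  # (5, 2); each point lies between two minority samples
```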
Modeling
The initial modeling process applies a clustering algorithm based on k-means [12]. For each identified group, recurrence prediction models were developed using supervised algorithms. The selection of the best model was based on the recall metric, ensuring higher sensitivity in identifying recurrence cases, as shown in Fig. 3. The following procedure was used to train the models:
After preprocessing, the data were separated by cluster, ensuring that each model was trained with the specific characteristics of its respective group.
The preprocessing pipeline was fitted to the data of each cluster.
Following this, a PyCaret [13] setup was initialized to compare various algorithms, which have been presented and described in Table 1.
Hyperparameter Tuning + Cluster Selection.
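The cluster-then-classify procedure above can be sketched with scikit-learn, using synthetic data and Random Forest as the per-cluster learner (the actual study compared many algorithms via PyCaret and selected on recall).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 6))
y = (rng.random(600) < 0.2).astype(int)

# Step 1: K-Means (k = 2) partitions the cohort.
clusters = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Step 2: one supervised model per cluster, evaluated on recall.
for c in (0, 1):
    Xc, yc = X[clusters == c], y[clusters == c]
    Xtr, Xte, ytr, yte = train_test_split(Xc, yc, test_size=0.3,
                                          stratify=yc, random_state=42)
    model = RandomForestClassifier(random_state=42).fit(Xtr, ytr)
    print(f"cluster {c}: recall = {recall_score(yte, model.predict(Xte)):.3f}")
```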
Hyperparameter optimization
To ensure optimal model configuration and reproducibility, the initial comparison of algorithms was performed using PyCaret’s compare_models() function, which applies standardized internal cross-validation procedures. Following this step, the top-performing algorithms for each cluster were further refined using PyCaret’s tune_model() function. This procedure employs Bayesian optimization with PyCaret’s default parameter search space, enabling systematic hyperparameter tuning without the need for manual or external grid-based searches. All tuning was conducted within PyCaret’s automated environment to preserve methodological consistency.
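For readers without PyCaret, the underlying idea, refining a model over a predefined search space with cross-validated scoring, looks like this in plain scikit-learn. The search space and optimizer here are illustrative and differ from PyCaret's defaults.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))
y = (rng.random(300) < 0.2).astype(int)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": randint(50, 300),
                         "max_depth": randint(2, 12)},
    n_iter=5,                 # candidate configurations to sample
    scoring="recall",         # matches the study's selection metric
    cv=3, random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```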
Determination of the optimal number of clusters
The selection of k = 2 for the K-Means algorithm was based on a combination of exploratory data analysis and quantitative clustering validation indices. Three established methods were used:
Elbow Method: The within-cluster sum of squares curve showed a clear point of inflection at k = 2, indicating diminishing returns for higher values of k.
Silhouette Coefficient: The silhouette score reached its maximum at k = 2 (0.61), suggesting the best overall cluster cohesion and separation at this value.
Davies–Bouldin Index: The index was lowest for k = 2, further confirming that two clusters provided the greatest discrimination between patient groups.
These results aligned with the bimodal patterns observed in Ki67 expression, hormone receptor distribution, and lymphovascular invasion during exploratory data analysis. Therefore, k = 2 was chosen as the most appropriate clustering structure for the dataset.
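The three indices are straightforward to compute with scikit-learn; on clearly bimodal synthetic data they agree on k = 2, mirroring the selection logic above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Two well-separated synthetic groups stand in for the patient subgroups.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(5, 1, (100, 4))])

for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k,
          round(silhouette_score(X, km.labels_), 3),      # higher is better
          round(davies_bouldin_score(X, km.labels_), 3),  # lower is better
          round(km.inertia_, 1))                          # elbow criterion
```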
Disease-Free survival analysis
Disease-free survival analysis is a crucial statistical approach used in clinical and epidemiological studies to evaluate the time between the diagnosis of the disease and its recurrence. This method focuses on understanding how specific factors influence the likelihood of remaining free from disease after treatment or diagnosis. By modeling these time-to-event data, researchers can gain valuable insights into the effectiveness of therapies and the role of clinical and histopathological variables in predicting outcomes [27].
The disease-free survival analysis model was developed using the same variables employed in the recurrence prediction model, leveraging clinical and histopathological information from patients. For this purpose, a Cox proportional hazards model with Lasso regularization was applied, which automatically selects the most relevant variables by penalizing less significant coefficients [28]. This process ensures that the model focuses on the most impactful predictors while minimizing the risk of overfitting.
The implementation involved a pipeline starting with data standardization using the StandardScaler, ensuring compatibility among variables with different scales [10]. The survival analysis was performed with the CoxnetSurvivalAnalysis algorithm, configured with L1 regularization (Lasso), a penalty parameter = 0.1, and baseline model adjustment enabled. This setup allowed the model to capture nuanced relationships between predictor variables and the time to events, such as recurrence or treatment failure.
By training the model on dedicated datasets, the analysis focused on understanding the association between variables and the risks associated with survival outcomes. The inclusion of L1 regularization reduced the risk of overfitting, enhancing the model’s robustness and interpretability [16]. This approach is particularly advantageous for survival data, as it enables simultaneous modeling of survival time and the relative impact of variables on the associated risk, providing both clinical relevance and predictive accuracy.
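The objective being minimized can be written out directly. This NumPy sketch evaluates the L1-penalized negative log partial likelihood for a fixed coefficient vector on toy data with no tied event times; CoxnetSurvivalAnalysis fits essentially this objective over a regularization path.

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, time, event, alpha=0.1):
    """Cox negative log partial likelihood plus an L1 (Lasso) penalty."""
    risk = X @ beta
    order = np.argsort(time)                      # earliest time first
    X, risk, event = X[order], risk[order], event[order]
    # risk-set sum for subject i: everyone with t_j >= t_i
    log_risk_sets = np.log(np.cumsum(np.exp(risk)[::-1])[::-1])
    ll = np.sum(event * (risk - log_risk_sets))   # only events contribute
    return -ll + alpha * np.sum(np.abs(beta))     # Lasso penalty

rng = np.random.default_rng(4)
X = rng.normal(size=(8, 3))
time = rng.random(8) * 10
event = np.array([1, 0, 1, 1, 0, 1, 0, 1])        # 1 = recurrence observed
beta = np.array([0.5, -0.2, 0.0])
print(round(neg_log_partial_likelihood(beta, X, time, event, alpha=0.1), 4))
```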
Validation
The validation process employed in this study followed a rigorous methodology designed to ensure the reliability and accuracy of the results. The target variable, recurrence, was treated as a binary outcome, with the training set comprising 1,091 samples and the test set containing 305 samples. After preprocessing, a total of 136 numerical variables were retained for modeling. Preprocessing steps were applied exclusively to the training data to prevent data leakage. Subsequently, the cross-validation process was initiated using the StratifiedKFold strategy with 10 splits, ensuring a proportional distribution of the target variable’s classes across all subsets. This approach mitigated potential biases caused by class imbalance.
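The stratified 10-fold scheme guarantees that each validation fold preserves the overall recurrence rate, as a quick check with synthetic labels shows:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(5)
y = (rng.random(1091) < 0.16).astype(int)   # ~16% recurrence, as in the cohort
X = rng.normal(size=(1091, 4))

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_rates = [y[val].mean() for _, val in skf.split(X, y)]
print([round(r, 3) for r in fold_rates])    # all close to y.mean()
```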
To optimize computational efficiency, all available CPU cores were leveraged during the validation process; however, GPU acceleration was not utilized. Additionally, experimental logs were not recorded, and the default experiment naming convention was retained for simplicity and reproducibility.
Metrics employed
The metrics used for validating the models included Accuracy, AUC (Area Under the ROC Curve), Recall, Precision, and F1-Score, each serving a specific role in performance evaluation. This systematic configuration ensured that the model training and validation processes were both fair and reproducible. It effectively addressed critical challenges such as class imbalance, appropriate data partitioning, and the prevention of data leakage while emphasizing evaluation metrics aligned with the clinical significance of accurately identifying recurrence cases.
To further enhance the evaluation, the concordance index (C-index) was included as a key metric for assessing model performance. The C-index measures the ability of the model to rank predicted risks or survival times in alignment with actual outcomes, accounting for the order of events rather than just binary classification. This metric added an additional layer of rigor to the validation process, particularly for survival analysis tasks, ensuring that the models provided clinically relevant and actionable predictions.
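The C-index has a simple pairwise definition, implemented directly in this NumPy sketch: among comparable pairs (the earlier time is an observed event), count how often the model assigns the higher risk to the subject who failed earlier. Production code would use a library such as lifelines or scikit-survival.

```python
import numpy as np

def c_index(time, event, risk):
    """Concordance index over all comparable (i, j) pairs; ties in risk
    count as half-concordant."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if event[i] and time[i] < time[j]:    # i failed before j
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

time  = np.array([2.0, 4.0, 6.0, 8.0])
event = np.array([1, 1, 0, 1])
risk  = np.array([0.9, 0.7, 0.4, 0.1])  # perfectly anti-ordered with time
print(c_index(time, event, risk))  # 1.0
```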
Ethical compliance and participant consent
All methods were carried out in accordance with relevant national and international guidelines and regulations, including those outlined by the Declaration of Helsinki and Brazilian Resolution CNS 466/2012. The research protocol was reviewed and approved by two independent ethics committees: the Research Ethics Committee of Universidade Presbiteriana Mackenzie (approval number 5.483.395/2022) and the Research Ethics Committee of Instituto do Câncer Dr. Arnaldo Vieira de Carvalho (approval number 6.297.865/2022). Prior to participation, all individuals received comprehensive information about the study objectives, procedures, potential risks, and benefits. Written informed consent was obtained from all participants, ensuring their voluntary involvement in the study. The confidentiality and anonymity of participants were rigorously maintained throughout the research process.
Data Availability
The datasets generated and/or analysed during the current study are not publicly available due to privacy and confidentiality agreements with the participating hospital in Brazil, which prohibit the public sharing of patient-related data. These data contain sensitive health information and are protected under Brazilian data protection laws (such as the General Data Protection Law – LGPD). However, de-identified datasets may be made available from the corresponding author upon reasonable request and with appropriate ethical approvals.
Data
The data used in this study were obtained from the clinical records of patients treated for breast cancer at the Instituto do Câncer Doutor Arnaldo Vieira de Carvalho (CEP 58533622.9.1001.0084). The original database included 1,500 patients diagnosed between 1996 and 2023. During data preprocessing, records from three patients were merged due to duplicate entries identified across different hospital systems, resulting in a more consistent and accurate dataset. Only patients with complete and relevant clinical information were retained for analysis. Patients who did not complete treatment or lacked follow-up data within 5 years were excluded. Consequently, the final dataset comprised 1,390 patients, of whom 33.7% had documented follow-up information for at least five years.
A total of 22 distinct variables were collected, describing clinical characteristics, histopathological features, and prognostic factors. The endpoint was a local or distant recurrence. Among the included patients, 16% experienced breast cancer recurrence. We defined recurrence only when the tumor reappeared on the same side; if it occurred on the contralateral breast, it was classified as a new primary case, regardless of the time between the initial diagnosis and the appearance of the contralateral lesion.
In Fig. 1, the presented pipeline describes the data transformation process, developed in Python 3.10, in three main stages: Raw Data, Silver Data, and Gold Data. In the first stage, raw data, stored in PDF files for different patients, is collected and subjected to an extraction process, where relevant information is converted into an initial structured format, resulting in the Raw Data level. In the next stage, Silver Data, the extracted raw data undergoes a process of organization and standardization, being converted into JSON format through an extraction script. This intermediate data, already structured but still semi-processed, is optimized for greater accessibility. Finally, in the Gold Data stage, the standardized data is refined, and the features assigned by the physicians are extracted from the JSON files for each patient using a GPT-3.5 turbo model, resulting in a dataset ready for analysis with maximum quality to support decisions and insights. This systematic workflow ensures an efficient and structured transition from unprocessed data to finalized and usable data.
Reproducibility
To support transparency and reproducibility, we have made available all preprocessing scripts, model-training pipelines, and a synthetic dataset that preserves the statistical characteristics of the original cohort without disclosing any personal information. These resources can be accessed through the GitHub repository linked in the Data Availability section.
Detailed description of patient counts
A total of 1,500 patient records were initially collected from the institutional database. The complete preprocessing workflow involved multiple filtering, cleaning, and modeling steps, each of which resulted in a well-defined reduction in sample size. A detailed breakdown is provided below:
1. Initial dataset (N = 1,500)
These records represented all available breast cancer patient files from the study period. They included structured clinical data, unstructured PDF reports, pathology descriptions, and follow-up outcomes.
2. Removal of duplicates and incomplete cases (N = 1,390)
During the first stage of preprocessing, the following records were excluded:Duplicate entries representing repeated uploads or multiple records for the same patient.
Missing critical outcome data, such as recurrence status or follow-up duration.
Incomplete pathology reports without essential histopathological variables (e.g., ER/PR status, Ki67, tumor subtype).
After applying these filters, 110 records (7.3%) were removed, leaving 1,390 valid patient records for further processing.
3. Final cleaned dataset before clustering (N = 1,091)
An additional quality control step excluded cases where:GPT-based extraction returned unresolved inconsistencies,
pathology variables were outside clinical ranges,
follow-up time was insufficient for determining outcome labels.
This resulted in 299 additional removals, producing a final pre-clustering dataset of 1,091 patients. This dataset contained all variables required for machine-learning training and was considered the primary modeling dataset.
4. Cluster-Specific dataset after K-Means (N = 1,026)
The K-Means clustering algorithm (k = 2) was then applied to the 1,091 cases. However, 65 records (5.95%) were excluded from modeling for the following reasons:They were identified as cluster-outliers based on PCA and silhouette analysis.
Their feature profiles did not consistently align with either cluster centroid.
They contained borderline or conflicting values (e.g., missing Ki67 or inconsistent HR expression).
Thus, 1,026 patients remained for cluster-based model training and validation:Cluster 1 (Group 1): Low–intermediate risk profiles.
Cluster 2 (Group 2): Higher-risk, more aggressive profiles.
These clusters formed the basis for the development of group-specific predictive models.
5. Final test set (N = 305)
For performance evaluation, the dataset was split into:
Training set: 721 patients (70%).
Test set: 305 patients (30%).
The split was stratified according to recurrence outcome to preserve class proportions.
SMOTE oversampling was applied only to the training set to prevent data leakage.
The test set of 305 patients was held out completely from the training process and was used solely for reporting unbiased model performance.
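The split in step 5 can be sketched with scikit-learn's train_test_split. The feature matrix and recurrence labels below are synthetic stand-ins (the study's data are not reproduced here), and the ~30% positive rate, seed, and rounding behavior are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the 1,026 post-clustering patients,
# with roughly 30% recurrence to mimic an imbalanced outcome.
X = rng.normal(size=(1026, 10))
y = (rng.random(1026) < 0.30).astype(int)

# 70/30 split, stratified on the recurrence label so that class
# proportions are preserved in both partitions (SMOTE, not shown
# here, would then be fitted on the training partition only).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

The exact 721/305 counts reported above depend on how the 30% fraction was rounded; the essential property is that stratification keeps the recurrence rate nearly identical in both partitions.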
After completing the entire pipeline, a final file was obtained containing all the features identified by the physicians as relevant for building the model, extracted from the PDF documents for each patient. The extracted features are described below.
Validation of automated variable extraction
To ensure the reliability of the automated extraction process, a stratified random subset of 150 patient PDF records was manually reviewed by two independent evaluators. The overall extraction accuracy achieved by the GPT-3.5 model was 94.1%, with categorical variables demonstrating an accuracy of 96.3% and numerical variables 91.8%. Minor discrepancies occurred primarily in free-text hormone receptor descriptions. These results confirmed that the automated extraction pipeline produced sufficiently accurate structured data for subsequent modeling.
Extracted variables
Two authors made a preliminary selection of relevant variables based on clinical evidence. The variables selected for building the predictive models were grouped into three main categories: numerical, categorical, and binary.
Numerical Variables:
diagnostic_age: Patient’s age at the time of diagnosis.
tumor_size: Dimensions of the tumor.
n: Nodal stage of the tumor, based on the TNM system.
compromised_axillary_lymph_node: Number of compromised axillary lymph nodes.
status_re: Percentage of tumor cells expressing estrogen receptor (ER status).
status_rp: Percentage of tumor cells expressing progesterone receptor (PR status).
ki67: Percentage of proliferating tumor cells.
Categorical Variables:
surgical_type: Type of surgery performed.
surgical_margins: Status of surgical margins (clear or compromised).
histological_type: Histological type of the tumor.
tumor_degree: Histological grade of the tumor.
molecular_subtype: Molecular subtype of breast cancer.
t: Primary tumor stage, based on the TNM system.
m: Presence of distant metastases, based on the TNM system.
Binary Variables:
menopausal: Menopausal status of the patient (yes/no).
pr: Presence of progesterone receptor (yes/no).
er: Presence of estrogen receptor (yes/no).
her2: HER2 gene hyperexpression (yes/no).
vascular_embolization: Presence of vascular embolization (yes/no).
angiolymphatic_invasion: Presence of angiolymphatic invasion (yes/no).
involved_axilla: Axillary involvement (yes/no).
lymph_node_invasion: Lymph node invasion (yes/no).
axillary_dissection: Axillary dissection performed (yes/no).
The extraction of variables was based on clinical evidence and prognostic relevance. Numerical variables were included to provide quantitative information on the severity and extent of the disease. Categorical variables capture different tumor characteristics that impact the risk of recurrence, while binary variables help identify specific factors that are either present or absent in patients. Although the variables “lymph node invasion” and “axillary involvement” have the same meaning, we decided to use them separately, as the data in the medical records are unstructured, and sometimes one term appears while other times the alternative is used. In the next step, the variables were subjected to a cleaning process, normalization, and encoding, as detailed in Sect. 2.2.
Data preprocessing
Preprocessing was an essential step to ensure the quality and consistency of the data before its use in predictive models (see Fig. 2). Initially, features exhibiting more than 80% missing values were excluded to improve data quality and enhance model performance. Subsequently, missing values in numerical variables were addressed using median imputation, and the data were standardized using the StandardScaler method from the Scikit-Learn library [10], ensuring that all numerical variables were on the same scale. Missing values in categorical variables were imputed using the mode, and the variables were subsequently encoded into binary format using the OneHotEncoder method from the Scikit-Learn library [10], which generated new columns for each existing category while ignoring unknown categories. Binary variables were handled straightforwardly, with mode imputation to fill in missing values while maintaining their original format. The feature “tumor size” was excluded from the dataset because approximately 82% of its values were missing.
All these transformations were integrated using the Pipeline class from the Scikit-Learn library [10], which organized and applied the appropriate methods to each group of variables, ensuring that preprocessing was performed automatically and consistently throughout all stages. After fitting the pipeline to the training data, the transformed data were converted into a dense matrix, resulting in a new DataFrame containing standardized numerical variables, expanded columns of categorical variables, and the original binary variables. Additionally, to address class imbalance, the SMOTE (Synthetic Minority Over-sampling Technique) [11] method was applied, improving the representation of underrepresented classes.
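The preprocessing steps described above map naturally onto a scikit-learn ColumnTransformer. The toy frame and column lists below are illustrative assumptions, not the study's actual data:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame; real variable names follow the extracted-variables list.
df = pd.DataFrame({
    "diagnostic_age": [45, 62, np.nan, 58],
    "ki67":           [10.0, 40.0, 25.0, np.nan],
    "tumor_degree":   ["II", "III", np.nan, "II"],
    "her2":           [0, 1, 1, 0],
})
num_cols, cat_cols, bin_cols = ["diagnostic_age", "ki67"], ["tumor_degree"], ["her2"]

preprocess = ColumnTransformer([
    # numerical: median imputation, then standardization
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    # categorical: mode imputation, then one-hot encoding (unknowns ignored)
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
    # binary: mode imputation only, original 0/1 format kept
    ("bin", SimpleImputer(strategy="most_frequent"), bin_cols),
])

X = preprocess.fit_transform(df)
print(X.shape)  # 2 scaled numeric + 2 one-hot ("II", "III") + 1 binary = (4, 5)
```

Because the transformer is fitted only on training data, the same imputation medians, modes, and scaling parameters are reused on the test set, which is what prevents leakage.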
Handling of class imbalance
To address the imbalance in recurrence outcomes, the SMOTE algorithm was applied exclusively to the training partition after the dataset was split into training and test sets. This ensured that no synthetic observations influenced the test data, preventing data leakage and preserving the validity of model evaluation.
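The study used the SMOTE implementation from imbalanced-learn; the core idea, interpolating new minority samples between a real sample and one of its k nearest minority neighbors, can be sketched in plain NumPy. This is an illustrative simplification, not the library's algorithm verbatim:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples (simplified SMOTE idea)."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # k nearest minority neighbors of sample i (excluding itself)
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]
        j = rng.choice(nn)
        gap = rng.random()
        # interpolate on the segment between the sample and its neighbor
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 3))
synthetic = smote_sketch(X_minority, n_new=30)
print(synthetic.shape)  # (30, 3)
```

Since each synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the convex hull of the observed minority data, which is why applying it only to the training partition leaves the test distribution untouched.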
Modeling
The initial modeling process applied a clustering algorithm based on K-Means [12]. For each identified group, recurrence prediction models were developed using supervised algorithms. The selection of the best model was based on the recall metric, ensuring higher sensitivity in identifying recurrence cases, as shown in Fig. 3. The following procedure was used to train the models:
After preprocessing, the data were separated by cluster, ensuring that each model was trained with the specific characteristics of its respective group.
The preprocessing pipeline was fitted to the data of each cluster.
Following this, a PyCaret [13] setup was initialized to compare various algorithms, which have been presented and described in Table 1.
Finally, hyperparameter tuning and cluster selection were performed, as detailed in the following subsections.
Hyperparameter optimization
To ensure optimal model configuration and reproducibility, the initial comparison of algorithms was performed using PyCaret’s compare_models() function, which applies standardized internal cross-validation procedures. Following this step, the top-performing algorithms for each cluster were further refined using PyCaret’s tune_model() function. This procedure employs Bayesian optimization with PyCaret’s default parameter search space, enabling systematic hyperparameter tuning without the need for manual or external grid-based searches. All tuning was conducted within PyCaret’s automated environment to preserve methodological consistency.
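PyCaret's compare_models() and tune_model() hide the underlying search loop. For readers outside PyCaret, an analogous search over a Random Forest space might look like the sketch below; the parameter grid, scoring choice, and data are illustrative assumptions, and RandomizedSearchCV samples candidates at random rather than performing Bayesian optimization:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic imbalanced classification data standing in for one cluster.
X, y = make_classification(n_samples=300, n_features=10, weights=[0.7, 0.3],
                           random_state=0)

param_space = {                      # illustrative search space, not PyCaret's default
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 4, 8],
    "min_samples_leaf": [1, 2, 5],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_space,
    n_iter=8,
    scoring="recall",                # recall prioritized, as in the study
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```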
Determination of the optimal number of clusters
The selection of k = 2 for the K-Means algorithm was based on a combination of exploratory data analysis and quantitative clustering validation indices. Three established methods were used:
Elbow Method: The within-cluster sum of squares curve showed a clear point of inflection at k = 2, indicating diminishing returns for higher values of k.
Silhouette Coefficient: The silhouette score reached its maximum at k = 2 (0.61), suggesting the best overall cluster cohesion and separation at this value.
Davies–Bouldin Index: The index was lowest for k = 2, further confirming that two clusters provided the greatest discrimination between patient groups.
These results aligned with the bimodal patterns observed in Ki67 expression, hormone receptor distribution, and lymphovascular invasion during exploratory data analysis. Therefore, k = 2 was chosen as the most appropriate clustering structure for the dataset.
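The three validation indices above are all available in scikit-learn and can be computed in one sweep over candidate k values. The two well-separated synthetic blobs below are a hypothetical stand-in for the two patient profiles:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Two synthetic blobs stand in for the two patient subgroups.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8]],
                  cluster_std=1.0, random_state=0)

for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k,
          round(km.inertia_, 1),                          # elbow: within-cluster SS
          round(silhouette_score(X, km.labels_), 3),      # cohesion/separation, higher is better
          round(davies_bouldin_score(X, km.labels_), 3))  # lower is better
```

On data with two genuine modes, the silhouette score peaks and the Davies–Bouldin index bottoms out at k = 2, mirroring the selection criteria described above.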
Disease-Free survival analysis
Disease-free survival analysis is a crucial statistical approach used in clinical and epidemiological studies to evaluate the time between the diagnosis of the disease and its recurrence. This method focuses on understanding how specific factors influence the likelihood of remaining free from disease after treatment or diagnosis. By modeling these time-to-event data, researchers can gain valuable insights into the effectiveness of therapies and the role of clinical and histopathological variables in predicting outcomes [27].
The disease-free survival analysis model was developed using the same variables employed in the recurrence prediction model, leveraging clinical and histopathological information from patients. For this purpose, a Cox proportional hazards model with Lasso regularization was applied, which automatically selects the most relevant variables by penalizing less significant coefficients [28]. This process ensures that the model focuses on the most impactful predictors while minimizing the risk of overfitting.
The implementation involved a pipeline starting with data standardization using the StandardScaler, ensuring compatibility among variables with different scales [10]. The survival analysis was performed with the CoxnetSurvivalAnalysis algorithm, configured with L1 regularization (Lasso), a penalty parameter = 0.1, and baseline model adjustment enabled. This setup allowed the model to capture nuanced relationships between predictor variables and the time to events, such as recurrence or treatment failure.
By training the model on dedicated datasets, the analysis focused on understanding the association between variables and the risks associated with survival outcomes. The inclusion of L1 regularization reduced the risk of overfitting, enhancing the model’s robustness and interpretability [16]. This approach is particularly advantageous for survival data, as it enables simultaneous modeling of survival time and the relative impact of variables on the associated risk, providing both clinical relevance and predictive accuracy.
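The study fitted this model with CoxnetSurvivalAnalysis from scikit-survival. Since that package is less widely installed, here is a minimal NumPy sketch of the underlying idea on synthetic data: a Breslow partial likelihood minimized by proximal gradient descent with L1 soft-thresholding. The finite-difference gradient and all constants are illustrative simplifications, not the library's implementation:

```python
import numpy as np

def neg_partial_loglik(beta, X, time, event):
    """Negative Cox partial log-likelihood (Breslow, no tie correction)."""
    eta = X @ beta
    order = np.argsort(-time)              # descending time
    eta_o, ev_o = eta[order], event[order]
    # log of the risk-set sum: running logsumexp over samples still at risk
    run = np.logaddexp.accumulate(eta_o)
    return -np.sum((eta_o - run)[ev_o == 1])

def cox_lasso(X, time, event, lam=0.1, lr=0.005, steps=800):
    """Proximal gradient descent: numeric gradient + soft-thresholding (L1)."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = np.zeros_like(beta)
        for j in range(len(beta)):         # finite-difference gradient (sketch only)
            e = np.zeros_like(beta)
            e[j] = 1e-5
            grad[j] = (neg_partial_loglik(beta + e, X, time, event)
                       - neg_partial_loglik(beta - e, X, time, event)) / 2e-5
        beta = beta - lr * grad
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)  # L1 prox
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
true_beta = np.array([1.0, -1.0, 0.0, 0.0])   # two truly irrelevant features
time = rng.exponential(np.exp(-X @ true_beta))
event = (rng.random(80) < 0.8).astype(int)    # ~80% observed events

beta_hat = cox_lasso(X, time, event)
print(np.round(beta_hat, 2))
```

The soft-thresholding step is what drives weak coefficients exactly to zero, which is the variable-selection behavior attributed to the Lasso penalty above.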
Validation
The validation process employed in this study followed a rigorous methodology designed to ensure the reliability and accuracy of the results. The target variable, recurrence, was treated as a binary outcome; after the stratified split, the training partition comprised 721 samples and the held-out test set 305 samples. After preprocessing, a total of 136 numerical variables were retained for modeling. Preprocessing steps were fitted exclusively on the training data to prevent data leakage. Cross-validation was then performed using the StratifiedKFold strategy with 10 splits, ensuring a proportional distribution of the target variable's classes across all subsets and mitigating potential biases caused by class imbalance.
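The stratification behavior is easy to verify directly. The labels below are synthetic and the ~30% positive rate is an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
y = (rng.random(1091) < 0.30).astype(int)  # synthetic recurrence labels
X = rng.normal(size=(1091, 5))

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
# Positive-class rate inside each training fold stays close to the global rate.
rates = [y[train_idx].mean() for train_idx, _ in skf.split(X, y)]
print(round(y.mean(), 3), [round(r, 3) for r in rates[:3]])
```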
To optimize computational efficiency, all available CPU cores were leveraged during the validation process; however, GPU acceleration was not utilized. Additionally, experimental logs were not recorded, and the default experiment naming convention was retained for simplicity and reproducibility.
Metrics employed
The metrics used for validating the models included Accuracy, AUC (Area Under the ROC Curve), Recall, Precision, and F1-Score, each serving a specific role in performance evaluation. This systematic configuration ensured that model training and validation were both fair and reproducible, effectively addressing critical challenges such as class imbalance, appropriate data partitioning, and the prevention of data leakage, while emphasizing evaluation metrics aligned with the clinical significance of accurately identifying recurrence cases.
To further enhance the evaluation, the concordance index (C-index) was included as a key metric for assessing model performance. The C-index measures the ability of the model to rank predicted risks or survival times in alignment with actual outcomes, accounting for the order of events rather than just binary classification. This metric added an additional layer of rigor to the validation process, particularly for survival analysis tasks, ensuring that the models provided clinically relevant and actionable predictions.
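Harrell's C-index is compact enough to compute directly: among all comparable pairs (an observed event and a patient who outlived it), it is the fraction where the model assigned the higher risk to the earlier event. The follow-up data below are an invented toy example:

```python
import numpy as np

def c_index(time, event, risk):
    """Harrell's concordance index: fraction of comparable pairs in which
    the higher predicted risk corresponds to the earlier observed event."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        if event[i] != 1:
            continue                      # pairs are anchored at observed events
        for j in range(n):
            if time[j] > time[i]:         # j outlived i -> comparable pair
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5     # ties in predicted risk count half
    return concordant / comparable

# Perfectly ordered toy example: higher risk, earlier event.
time = np.array([2.0, 4.0, 6.0, 8.0])
event = np.array([1, 1, 1, 0])
risk = np.array([4.0, 3.0, 2.0, 1.0])
print(c_index(time, event, risk))  # 1.0
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect concordance, which is why it complements threshold-based classification metrics for survival tasks.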
Results
The study cohort comprised 1,500 patients diagnosed with breast cancer between 1996 and 2023, refined to 1,390 patients after excluding cases with insufficient clinical information or less than five years of follow-up. The dataset comprises clinical and pathological variables relevant to breast cancer recurrence prediction. The median age at diagnosis was 58 years, ranging from 18 to 98 years. Among the patients, 79.73% were postmenopausal. Regarding tumor staging, 56.09% of patients were diagnosed at Stage II and 14.04% at Stage III. Additionally, 71.85% of the patients exhibited lymph node involvement, a significant prognostic factor. The dataset also captures molecular characteristics of the tumors: 12.11% of the patients were classified as triple-negative breast cancer, a subtype known for its aggressive nature and limited treatment options. Furthermore, 47.77% of the patients with Stage II and 57.14% of those with Stage III experienced recurrence, emphasizing the importance of early detection and tailored treatment strategies. Tables 2 and 3 summarize the distribution of key numerical, categorical, and binary variables, providing insights into their respective ranges, frequencies, and central tendencies.
To arrive at the experimental stage, we began with a comprehensive exploratory data analysis (EDA) to understand the structure and patterns within the dataset. This analysis allowed us to identify meaningful distinctions in the patient population, laying the groundwork for subsequent clustering and predictive modeling. During this phase, two distinct patient groups emerged, characterized by unique clinical and molecular profiles. These groups were initially observed through descriptive statistics and exploratory visualizations, and their separation was later validated using the K-Means clustering algorithm, which consistently identified two clusters aligned with the initial observations.
Group 1:
Lower mean value of Ki67 and a slightly higher mean value of idade_diagnostico (age at diagnosis). Patients in this cluster exhibited lower levels of Ki67 expression, a marker of cellular proliferation.
Higher expression percentage of hormonal receptors (receptor_progesterona and receptor_estrogenio, i.e., progesterone and estrogen receptors). These patients tend to have greater hormonal sensitivity.
A lower prevalence of certain tumor subtypes and stages was observed, indicating a less aggressive tumor profile.
Group 2:
A higher mean value of Ki67 indicates increased cellular proliferation and potentially more aggressive disease behavior.
A reduced expression of hormonal receptors (receptor_progesterona and receptor_estrogenio) is indicative of a hormone-independent tumor profile.
A higher prevalence of features associated with tumor invasion and progression, such as invasao_angiolinfatica (angiolymphatic invasion), was observed.
Comparison with a unified (non-clustered) model
To evaluate the impact of clustering on predictive performance, a single non-clustered model was trained using the full dataset. This unified model achieved a Recall of 0.672 and an F1-score of 0.531, both substantially lower than the performance of the cluster-specific models. In contrast, the QDA model for Group 1 achieved a Recall of 0.9872, and the Random Forest model for Group 2 achieved a Recall of 0.7296 with an F1-score of 0.6998. These results demonstrate that clustering substantially improves predictive sensitivity and overall performance by accommodating underlying heterogeneity in tumor biology.
Recurrence prediction modeling
The identification of two distinct patient groups through exploratory data analysis and clustering laid the foundation for tailored predictive modeling of recurrence. These groups, characterized by unique clinical and molecular profiles, provided critical insights into the heterogeneity of the patient population, highlighting the need for personalized modeling approaches. By leveraging these stratified clusters, we were able to evaluate supervised learning algorithms with a focus on addressing the specific characteristics and challenges associated with each group. For each identified group, the PyCaret library was used to evaluate the performance of various supervised learning algorithms. Tables 4 and 5 present the results obtained for the tested models, considering metrics such as Accuracy, AUC, Recall, Precision, F1-Score, among others. These metrics were calculated independently for each group to determine the most suitable model for recurrence prediction according to the specific characteristics of each cluster.
The tested models included classical approaches such as Logistic Regression, Random Forest, and Support Vector Machine, as well as ensemble-based techniques like Gradient Boosting, AdaBoost, and LightGBM. Model performance was primarily evaluated based on Recall, given the importance of correctly identifying positive cases, and the F1-Score, which combines Recall and Precision to provide a more balanced measure of overall performance.
The results highlight the importance of tailoring models to the specific characteristics of each group. While algorithms like QDA may be useful in scenarios where identifying all positive cases is a priority, as seen in breast cancer modeling, the use of diverse algorithms allowed for a comprehensive and efficient analysis, enabling the identification of the most promising models for practical application. These results will be used to develop more robust and personalized predictive strategies, contributing to improved patient monitoring and treatment [29, 30].
Statistical validation of model performance
To ensure that model performance differences were not due to random variation, we conducted repeated 10-fold cross-validation to generate confidence intervals for all metrics. Additionally, McNemar’s test was applied to paired model predictions, confirming that QDA (Group 1) and Random Forest (Group 2) significantly outperformed alternative classifiers (p < 0.05). Bootstrap confidence intervals for AUC and Recall further verified model stability [31, 32].
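McNemar's test compares two classifiers on the same test cases through their discordant predictions. A sketch using SciPy follows; the 2-vs-10 discordant counts are an invented toy example, not the study's data:

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's chi-square test (with continuity correction) on the
    discordant pairs of two classifiers' predictions."""
    a_ok = pred_a == y_true
    b_ok = pred_b == y_true
    b = np.sum(a_ok & ~b_ok)   # A right, B wrong
    c = np.sum(~a_ok & b_ok)   # A wrong, B right
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)

# Toy example: A is wrong on 2 cases, B is wrong on 10 other cases.
y = np.zeros(100, dtype=int)
pred_a = y.copy()
pred_a[:2] = 1
pred_b = y.copy()
pred_b[2:12] = 1
stat, p = mcnemar_test(y, pred_a, pred_b)
print(round(stat, 3), round(p, 4))  # stat ≈ 4.083, p ≈ 0.043 (< 0.05)
```

Because the test conditions on cases where the two models disagree, it is well suited to paired comparisons on a single held-out test set.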
In Group 1, composed of patients with low and intermediate risk, Quadratic Discriminant Analysis (QDA) demonstrated the best performance in Recall, achieving an average value of 0.9872 with a standard deviation of 0.0213, confirming its ability to consistently identify nearly all recurrence cases (see Table 4). QDA also showed an average F1 score of 0.6093, reflecting a moderate balance between Recall and Precision. However, its average Precision (0.4407) remained low, indicating a high number of false positives. This behavior is expected in a highly sensitive model and can be acceptable in this case, given that minimizing false negatives is a priority. The average AUC (0.5527) suggests that the model does not perform as well in overall class separation, but its high sensitivity compensates for this limitation in the context of Group 1.
In Group 2, composed of patients with more aggressive cancer, Random Forest (RF) stood out as the most robust model, with an average Recall of 0.7296 and a standard deviation of 0.1076 (see Table 5). Although RF’s Recall is lower compared to QDA in Group 1, it remains the best model for capturing most recurrence cases in patients with aggressive cancer. Additionally, RF achieved an average Precision of 0.6811 and an F1 score of 0.6998, well-balanced metrics that demonstrate its ability to reduce false positives while maintaining good sensitivity. The average AUC (0.6915) also reflects a greater ability to separate classes in this group, indicating that RF is more effective in handling the complexity of data from patients with aggressive cancer.
In Group 1, QDA is confirmed as the best model for maximizing Recall and avoiding false negatives, with an almost perfect performance in Recall (0.9872), even if this results in an increase in false positives. This model is ideal for patients with milder and moderate cancer, where capturing all recurrence cases is essential. In contrast, in Group 2, Random Forest emerges as the most effective model, balancing high Recall (0.7296) with solid complementary metrics like Precision and F1. RF proves to be particularly robust in dealing with the complexity of patients with aggressive cancer, making it the best choice for this scenario. These results demonstrate that different modeling strategies are necessary to address the specific characteristics of each patient group, ensuring more accurate predictions and better-informed clinical decisions.
Disease-Free survival analysis modeling
The tailored predictive models developed for recurrence prediction provided valuable insights into the unique risk profiles of each patient cluster. These models highlighted key differences in recurrence patterns and the associated clinical variables, underscoring the importance of stratified approaches in addressing the heterogeneity within the patient population. Building on these findings, the next phase of analysis focused on disease-free survival to further explore the long-term outcomes for each cluster and validate the clinical relevance of the stratified modeling approach.
In Fig. 4, the survival curve for Cluster 0 (patients with milder cancer) shows a higher survival rate over time, with a slower decline compared to Cluster 1. This is consistent with the more favorable prognosis expected for patients in this group. On the other hand, the curve for Cluster 1 (patients with more severe cancer) exhibits a steeper decline in survival rate, reflecting the severity and greater complexity of treatment for these patients. These differences are expected, as the group with more aggressive cancer (Cluster 1) tends to face a higher risk of recurrence and, consequently, a lower probability of survival over time. This scenario also justifies the choice of predictive models with specific characteristics tailored to each group.
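The curves in Fig. 4 are Kaplan–Meier survival estimates, and the estimator itself is compact enough to sketch in NumPy. The follow-up times below are invented for illustration:

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan–Meier estimator: S(t) = product over event times of (1 - d_i / n_i)."""
    order = np.argsort(time)
    t_sorted, e_sorted = time[order], event[order]
    surv, s = [], 1.0
    for t in np.unique(t_sorted[e_sorted == 1]):
        n_i = np.sum(t_sorted >= t)                    # still at risk just before t
        d_i = np.sum((t_sorted == t) & (e_sorted == 1))  # events at t
        s *= 1.0 - d_i / n_i
        surv.append((t, s))
    return surv

# Toy follow-up times (months); 0 = censored, 1 = recurrence.
time = np.array([6.0, 12.0, 12.0, 20.0, 30.0, 36.0])
event = np.array([1, 1, 0, 1, 0, 1])
for t, s in kaplan_meier(time, event):
    print(t, round(s, 3))
```

Censored patients (event = 0) leave the risk set without triggering a drop in the curve, which is why the estimator remains valid when follow-up lengths differ between patients.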
Survival model outputs
The Cox proportional hazards model with Lasso regularization retained four variables with non-zero coefficients: Ki67, axillary lymph node involvement, vascular embolization, and molecular subtype. All four predictors demonstrated positive hazard ratios, indicating increased recurrence risk. The final model achieved a C-index of 0.742, suggesting good discriminative capacity for disease-free survival outcomes. The proportional hazards assumption was evaluated using Schoenfeld residuals, with no significant violations observed (global p = 0.18), confirming the suitability of the Cox model for this dataset.
Cluster 0 – Low risk (Group 1)
Quadratic Discriminant Analysis (QDA), which achieved an average Recall of 0.9872, is the most appropriate choice for this group. QDA’s high sensitivity is crucial for detecting recurrence cases in this group, ensuring that nearly all at-risk patients are identified for early treatment.
The survival curve for Cluster 0 reinforces that patients in this group have a relatively lower risk of death or progression. However, early identification of recurrence cases is essential to maintain the high survival rates observed.
Cluster 1 – More severe cancer (Group 2)
Random Forest (RF) was the best-performing model, with an average Recall of 0.7296. Although its Recall is lower than QDA’s in Cluster 0, it is suitable for Cluster 1 as it offers a better balance between Recall, Precision, and F1-score. This robustness is important for managing the complexity of the data in this group.
The survival curve for Cluster 1 highlights the impact of cancer severity, with a rapid decline in survival rates. Here, the priority is to identify patients at higher risk of recurrence to intervene quickly and attempt to improve survival outcomes.
The survival curves emphasize the difference in clinical behavior between the two groups, aligning with the results of the predictive models. The personalized approach, using QDA for Cluster 0 and RF for Cluster 1, reflects a strategy tailored to the characteristics of each group, maximizing the chances of effective interventions and improving clinical outcomes.
Cluster visualization and clinical interpretation
To further evaluate the distinctiveness of the two clusters, dimensionality reduction techniques were applied using Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE). Both visualizations demonstrated clear separation between the clusters. Group 2 displayed tighter clustering driven by high Ki67 levels, reduced hormone receptor expression, and higher frequencies of lymphovascular invasion. These findings reinforce the biological and clinical relevance of the two-cluster structure and support the subsequent development of cluster-specific predictive models.
This study included patients diagnosed with breast cancer comprised 1,500 patients diagnosed between 1996 and 2023, but was refined to 1,390 patients after excluding cases with insufficient clinical information or less than five years of follow-up. The dataset comprises clinical and pathological variables relevant to breast cancer recurrence prediction. The median age at diagnosis was 58 years, ranging from 18 to 98 years. Among the patients, 79.73% were postmenopausal. Regarding tumor staging, 56.09% of patients were diagnosed at Stage II and 14.04% at Stage III. Additionally, 71.85% of the patients exhibited lymph node involvement, reflecting a significant prognostic factor. The dataset also captures molecular characteristics of the tumors. 12.11% of the patients were classified as triple-negative breast cancer, a subtype known for its aggressive nature and limited treatment options. Furthermore, 47.77% of the patients with Stage II and 57.14% of those with Stage III experienced recurrence, emphasizing the importance of early detection and tailored treatment strategies. Tables 2 and 3 summarize the distribution of key numerical, categorical, and binary variables, providing insights into their respective ranges, frequencies, and central tendencies.
To arrive at the experimental stage, we began with a comprehensive exploratory data analysis (EDA) to understand the structure and patterns within the dataset. This analysis allowed us to identify meaningful distinctions in the patient population, laying the groundwork for subsequent clustering and predictive modeling. During this phase, two distinct patient groups emerged, characterized by unique clinical and molecular profiles. These groups were initially observed through descriptive statistics and exploratory visualizations, and their separation was later validated using the K-Means clustering algorithm, which consistently identified two clusters aligned with the initial observations.
Group 1:
Lower mean value of ki67 and a slightly higher mean value of idade_diagnostico. Patients in this cluster exhibited lower levels of Ki67 expression, a marker of cellular proliferation.
Higher expression percentage of hormonal receptors (receptor_progesterona and receptor_estrogenio). These patients tend to have greater hormonal sensitivity.
A lower prevalence of certain tumor subtypes and stages was observed, indicating a less aggressive tumor profile.
Group 2:
A higher mean value of Ki67 indicates increased cellular proliferation and potentially more aggressive disease behavior.
A reduced expression of hormonal receptors (receptor_progesterona and receptor_estrogenio) is indicative of a hormone-independent tumor profile.
A higher prevalence of features associated with tumor invasion and progression, such as invasao_angiolinfatica, was observed.
Comparison With a Unified (Non-Clustered) Model.
To evaluate the impact of clustering on predictive performance, a single non-clustered model was trained using the full dataset. This unified model achieved a Recall of 0.672 and an F1-score of 0.531, both substantially lower than the performance of the cluster-specific models. In contrast, the QDA model for Group 1 achieved a Recall of 0.9872, and the Random Forest model for Group 2 achieved a Recall of 0.7296 with an F1-score of 0.6998. These results demonstrate that clustering substantially improves predictive sensitivity and overall performance by accommodating underlying heterogeneity in tumor biology.
Recurrence prediction modeling
The identification of two distinct patient groups through exploratory data analysis and clustering laid the foundation for tailored predictive modeling of recurrence. These groups, characterized by unique clinical and molecular profiles, provided critical insights into the heterogeneity of the patient population, highlighting the need for personalized modeling approaches. By leveraging these stratified clusters, we were able to evaluate supervised learning algorithms with a focus on addressing the specific characteristics and challenges associated with each group. For each identified group, the PyCaret library was used to evaluate the performance of various supervised learning algorithms. Tables 4 and 5 present the results obtained for the tested models, considering metrics such as Accuracy, AUC, Recall, Precision, F1-Score, among others. These metrics were calculated independently for each group to determine the most suitable model for recurrence prediction according to the specific characteristics of each cluster.
The tested models included classical approaches such as Logistic Regression, Random Forest, and Support Vector Machine, as well as ensemble-based techniques like Gradient Boosting, AdaBoost, and LightGBM. Model performance was primarily evaluated based on Recall, given the importance of correctly identifying positive cases, and the F1-Score, which combines Recall and Precision to provide a more balanced measure of overall performance.
The results highlight the importance of tailoring models to the specific characteristics of each group. While algorithms like QDA may be useful in scenarios where identifying all positive cases is a priority, as seen in breast cancer modeling, the use of diverse algorithms allowed for a comprehensive and efficient analysis, enabling the identification of the most promising models for practical application. These results will be used to develop more robust and personalized predictive strategies, contributing to improved patient monitoring and treatment.[29]-[30]
Statistical validation of model performance
To ensure that model performance differences were not due to random variation, we conducted repeated 10-fold cross-validation to generate confidence intervals for all metrics. Additionally, McNemar’s test was applied to paired model predictions, confirming that QDA (Group 1) and Random Forest (Group 2) significantly outperformed alternative classifiers (p < 0.05). Bootstrap confidence intervals for AUC and Recall further verified model stability.[31]-[32]
In Group 1, composed of patients with low and intermediate risk, Quadratic Discriminant Analysis (QDA) demonstrated the best performance in Recall, achieving an average value of 0.9872 with a standard deviation of 0.0213, confirming its ability to consistently identify nearly all recurrence cases (see Table 4). QDA also showed an average F1 score of 0.6093, reflecting a moderate balance between Recall and Precision. However, its average Precision (0.4407) remained low, indicating a high number of false positives. This behavior is expected in a highly sensitive model and can be acceptable in this case, given that minimizing false negatives is a priority. The average AUC (0.5527) suggests that the model does not perform as well in overall class separation, but its high sensitivity compensates for this limitation in the context of Group 1.
In Group 2, composed of patients with more aggressive cancer, Random Forest (RF) stood out as the most robust model, with an average Recall of 0.7296 and a standard deviation of 0.1076 (see Table 5). Although RF’s Recall is lower compared to QDA in Group 1, it remains the best model for capturing most recurrence cases in patients with aggressive cancer. Additionally, RF achieved an average Precision of 0.6811 and an F1 score of 0.6998, well-balanced metrics that demonstrate its ability to reduce false positives while maintaining good sensitivity. The average AUC (0.6915) also reflects a greater ability to separate classes in this group, indicating that RF is more effective in handling the complexity of data from patients with aggressive cancer.
In Group 1, QDA is confirmed as the best model for maximizing Recall and avoiding false negatives, with an almost perfect performance in Recall (0.9872), even if this results in an increase in false positives. This model is ideal for patients with milder and moderate cancer, where capturing all recurrence cases is essential. In contrast, in Group 2, Random Forest emerges as the most effective model, balancing high Recall (0.7296) with solid complementary metrics like Precision and F1. RF proves to be particularly robust in dealing with the complexity of patients with aggressive cancer, making it the best choice for this scenario. These results demonstrate that different modeling strategies are necessary to address the specific characteristics of each patient group, ensuring more accurate predictions and better-informed clinical decisions.
Disease-free survival analysis modeling
The tailored predictive models developed for recurrence prediction provided valuable insights into the unique risk profiles of each patient cluster. These models highlighted key differences in recurrence patterns and the associated clinical variables, underscoring the importance of stratified approaches in addressing the heterogeneity within the patient population. Building on these findings, the next phase of analysis focused on disease-free survival to further explore the long-term outcomes for each cluster and validate the clinical relevance of the stratified modeling approach.
In Fig. 4, the survival curve for Cluster 0 (patients with milder cancer) shows a higher survival rate over time, with a slower decline compared to Cluster 1. This is consistent with the more favorable prognosis expected for patients in this group. On the other hand, the curve for Cluster 1 (patients with more severe cancer) exhibits a steeper decline in survival rate, reflecting the severity and greater complexity of treatment for these patients. These differences are expected, as the group with more aggressive cancer (Cluster 1) tends to face a higher risk of recurrence and, consequently, a lower probability of survival over time. This scenario also justifies the choice of predictive models with specific characteristics tailored to each group.
Survival model outputs
The Cox proportional hazards model with Lasso regularization retained four variables with non-zero coefficients: Ki67, axillary lymph node involvement, vascular embolization, and molecular subtype. All four predictors demonstrated positive hazard ratios, indicating increased recurrence risk. The final model achieved a C-index of 0.742, suggesting good discriminative capacity for disease-free survival outcomes. The proportional hazards assumption was evaluated using Schoenfeld residuals, with no significant violations observed (global p = 0.18), confirming the suitability of the Cox model for this dataset.
Cluster 0 – Low risk (Group 1)
Quadratic Discriminant Analysis (QDA), which achieved an average Recall of 0.9872, is the most appropriate choice for this group. QDA’s high sensitivity is crucial for detecting recurrence cases in this group, ensuring that nearly all at-risk patients are identified for early treatment.
The survival curve for Cluster 0 reinforces that patients in this group have a relatively lower risk of death or progression. However, early identification of recurrence cases is essential to maintain the high survival rates observed.
Cluster 1 – More severe cancer (Group 2)
Random Forest (RF) was the best-performing model, with an average Recall of 0.7296. Although its Recall is lower than QDA’s in Cluster 0, it is suitable for Cluster 1 as it offers a better balance between Recall, Precision, and F1-score. This robustness is important for managing the complexity of the data in this group.
The survival curve for Cluster 1 highlights the impact of cancer severity, with a rapid decline in survival rates. Here, the priority is to identify patients at higher risk of recurrence to intervene quickly and attempt to improve survival outcomes.
The survival curves emphasize the difference in clinical behavior between the two groups, aligning with the results of the predictive models. The personalized approach, using QDA for Cluster 0 and RF for Cluster 1, reflects a strategy tailored to the characteristics of each group, maximizing the chances of effective interventions and improving clinical outcomes.
Cluster visualization and clinical interpretation
To further evaluate the distinctiveness of the two clusters, dimensionality reduction techniques were applied using Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE). Both visualizations demonstrated clear separation between the clusters. Group 2 displayed tighter clustering driven by high Ki67 levels, reduced hormone receptor expression, and higher frequencies of lymphovascular invasion. These findings reinforce the biological and clinical relevance of the two-cluster structure and support the subsequent development of cluster-specific predictive models.
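The projection step can be sketched as below, assuming a standardized feature matrix and K-Means cluster labels as in the study; the data here are a synthetic two-cluster blob, not patient records.

```python
# Sketch: 2-D PCA and t-SNE projections of a standardized feature matrix,
# colored in practice by K-Means cluster labels. Synthetic data only.
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=2, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

X_pca = PCA(n_components=2).fit_transform(X)                      # linear projection
X_tsne = TSNE(n_components=2, perplexity=30,
              random_state=0).fit_transform(X)                    # nonlinear embedding
print("PCA shape:", X_pca.shape, "t-SNE shape:", X_tsne.shape)
```

The two embeddings are complementary: PCA preserves global variance structure, while t-SNE emphasizes local neighborhood separation, which is why both were inspected.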
Discussion
Artificial Intelligence (AI) can significantly enhance breast cancer recurrence prediction by analyzing large volumes of clinical, histopathological, and imaging data to uncover relationships that traditional statistical methods may miss. Machine learning models integrate prognostic factors including tumor biology, treatment history, and patient demographics to deliver personalized, data-driven risk assessments. Advanced methods such as deep learning, NLP for medical records, and survival modeling further improve accuracy, enabling earlier interventions, tailored therapies, and optimized follow-up strategies.[33, 34]
In this study, clinical records from the Instituto do Câncer Doutor Arnaldo Vieira de Carvalho were processed through a three-stage pipeline: raw PDF extraction, JSON standardization, and GPT-3.5–based variable extraction. Twenty-two clinically relevant features were retained after rigorous cleaning and the removal of variables with > 80% missing values. Numerical features were standardized, categorical features encoded, and SMOTE was applied to address class imbalance.
Exploratory analysis revealed two distinct patient subgroups. One cluster exhibited high Ki67, low ER/PR expression, and elevated angiolymphatic invasion, indicating aggressive tumors. The other showed lower proliferation and higher hormonal receptor expression. K-Means clustering was then used to model these groups separately. Supervised models were trained within each cluster, prioritizing Recall to minimize missed recurrence cases. Model comparison using PyCaret identified QDA as optimal for the less aggressive group (Recall = 0.9872) and Random Forest for the aggressive group (Recall = 0.7296), each balancing sensitivity and precision according to clinical needs.
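The cluster-then-classify strategy just described can be sketched as follows: K-Means partitions patients into two groups, and a separate classifier is fitted per group, with QDA and Random Forest standing in for the two winners the study reports. The data are synthetic and the recall values are not meaningful clinically.

```python
# Sketch: per-cluster supervised modeling, scored on recall.
# Synthetic data; QDA and RF mirror the study's chosen models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=15, random_state=1)
clusters = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

per_cluster = {0: QuadraticDiscriminantAnalysis(),
               1: RandomForestClassifier(random_state=1)}
recalls = {}
for c, model in per_cluster.items():
    mask = clusters == c
    if np.bincount(y[mask], minlength=2).min() < 5:
        continue  # need enough of each class for stratified 5-fold CV
    recalls[c] = cross_val_score(model, X[mask], y[mask],
                                 cv=5, scoring="recall").mean()
    print(f"Cluster {c}: mean recall = {recalls[c]:.3f}")
```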
A Cox-Lasso survival model further validated key predictors such as Ki67, lymph node invasion, and angiolymphatic invasion and demonstrated good performance through its C-index. The results aligned with established literature, confirming that high-proliferation, hormone-independent tumors carry greater recurrence risk.
Overall, stratifying patients into clinically meaningful clusters and applying tailored machine learning models improved predictive accuracy and sensitivity. The approach reinforces the importance of personalized modeling in oncology and demonstrates how AI can enhance recurrence prediction using routinely available clinical data. However, as the dataset reflects practices within Brazil’s public healthcare system (SUS), model recalibration may be required before applying it in different healthcare environments.
Influential predictors and comparison with previous studies
The Random Forest model trained on the more aggressive tumor group identified Ki67, lymph node invasion, angiolymphatic invasion, ER/PR expression, and tumor grade as the most influential predictors of recurrence. These findings align with established clinical evidence and underscore the importance of tumor proliferative capacity and nodal involvement.
The performance of our models was compared with key benchmark studies in breast cancer prediction [35–37]. Reported accuracies and sensitivities in prior machine learning approaches range between 70% and 90%. In comparison, our QDA model achieved a Recall of 0.9872 in the less aggressive group, and the Random Forest model achieved an F1-score of 0.6998 in the more aggressive group. These values are comparable to, or better than, previously reported metrics, despite relying exclusively on routinely collected clinical variables rather than genomic or imaging data.
Interpretation of QDA performance
Although QDA presented a lower Precision in Group 1, its extremely high Recall was prioritized because minimizing false negatives is clinically more important in this lower-risk population. False positives from this model typically result in shorter follow-up intervals or additional imaging, interventions that carry minimal risk compared to the potential consequences of an undetected recurrence. This trade-off makes QDA the clinically preferable model for Group 1.[38]
Comparison with established prognostic tools
Our predictive models were also compared conceptually with established clinical tools such as the Nottingham Prognostic Index (NPI) and Oncotype DX. While these tools incorporate tumor size, nodal involvement, and genomic markers, our machine-learning models achieved similar sensitivity using only standard clinical and histopathological variables. This makes the proposed models more accessible and cost-effective, particularly in resource-limited healthcare settings.[39]
Limitations
Despite the promising results achieved in this study, some limitations must be acknowledged. First, the dataset used was derived from unstructured electronic medical records, which required extensive preprocessing and variable extraction. Although the data underwent rigorous cleaning, standardization, and transformation processes—including the removal of features with more than 80% missing values—the quality and completeness of the original records inherently limited the availability of certain clinical variables. This constraint may have affected the model’s ability to capture additional prognostic factors that could enhance recurrence prediction.
The study was based on retrospective data collected from a single institution, which may introduce selection bias and limit the generalizability of the findings. Differences in clinical documentation, diagnostic criteria, and treatment protocols across institutions may impact the applicability of the trained models to broader populations. Future research should validate these models using multicentric datasets to ensure robustness across different healthcare settings.
Additionally, while the clustering approach allowed for the differentiation of patient subgroups based on tumor aggressiveness, it was based solely on clinical and histopathological features. The incorporation of molecular profiling data, genetic mutations, and imaging biomarkers could further refine patient stratification and enhance predictive performance. Deep learning techniques, such as convolutional neural networks (CNNs) for histopathological image analysis or transformer-based models for text mining from medical reports, could provide deeper insights into recurrence patterns.
Despite these limitations, our study provides a solid foundation for leveraging AI in recurrence risk prediction and underscores the importance of personalized modeling approaches in breast cancer prognosis. Future research should focus on expanding datasets, integrating additional biomarkers, and exploring more sophisticated ensemble modeling techniques to enhance predictive accuracy and clinical applicability.
Conclusion
This study highlights the potential of machine learning and AI-driven models in predicting breast cancer recurrence by leveraging clinical and histopathological data. By employing a structured data processing pipeline, we successfully extracted and standardized key prognostic variables from unstructured medical records, ensuring high-quality input for predictive modeling. The clustering approach identified two distinct patient subgroups, reflecting differences in tumor aggressiveness and recurrence risk, enabling the development of tailored predictive models.
The modeling phase demonstrated that QDA was the most effective model for patients with less aggressive tumors (Group 1), achieving an outstanding recall of 0.9872, ensuring nearly all recurrence cases were identified. Conversely, for patients with more aggressive tumors (Group 2), RF provided a more balanced prediction, optimizing sensitivity and precision. These findings reinforce the importance of stratified modeling strategies in personalized oncology.
Beyond recurrence prediction, our disease-free survival analysis further validated the predictive models, confirming that patients in the more aggressive group (Cluster 1) had lower survival probabilities over time, aligning with established clinical knowledge. The integration of survival modeling with machine learning approaches enhances risk stratification and supports data-driven decision-making in clinical practice.
Despite the study’s limitations, including the retrospective nature of the dataset, class imbalance, and the exclusion of certain missing variables, the results align with existing literature, demonstrating that AI-based models can achieve predictive performance comparable to traditional clinical methods, with the added benefit of improved automation and scalability. Future research should prioritize the incorporation of multi-institutional datasets, the integration of molecular and genetic biomarkers, and the exploration of deep learning methodologies to further improve predictive accuracy.
In conclusion, this study underscores the value of AI-driven models in breast cancer recurrence prediction, emphasizing the need for personalized risk assessment strategies. By optimizing early detection and patient monitoring, these models have the potential to improve long-term outcomes and guide more effective treatment planning, ultimately advancing precision oncology.