Integrative transcriptomic and machine learning framework reveals candidate genes and potential mechanisms of aflatoxin B1 exposure in breast cancer.

Wang W; Liu M; Li X

doi:10.1038/s41598-026-39844-2

← 뒤로

Integrative transcriptomic and machine learning framework reveals candidate genes and potential mechanisms of aflatoxin B1 exposure in breast cancer.

1/5 보강

Scientific reports 📖 저널 OA 95.5% 2026 Vol.16(1)

Wang W, Liu M, Li X

📖 무료 전문 🟢 PMC 전문 PMC12982753

PubMed ↗ DOI ↗ BibTeX ↓ RIS ↓

📝 환자 설명용 한 줄

Aflatoxin B1 (AFB1), a known mycotoxin and environmental hazard, has been linked to breast cancer, yet the exact biological pathways remain poorly characterized.

이 논문을 인용하기

↓ .bib ↓ .ris

APA Wang W, Liu M, Li X (2026). Integrative transcriptomic and machine learning framework reveals candidate genes and potential mechanisms of aflatoxin B1 exposure in breast cancer.. Scientific reports, 16(1). https://doi.org/10.1038/s41598-026-39844-2

MLA Wang W, et al.. "Integrative transcriptomic and machine learning framework reveals candidate genes and potential mechanisms of aflatoxin B1 exposure in breast cancer.." Scientific reports, vol. 16, no. 1, 2026.

PMID 41688730 ↗

DOI 10.1038/s41598-026-39844-2

Abstract

Aflatoxin B1 (AFB1), a known mycotoxin and environmental hazard, has been linked to breast cancer, yet the exact biological pathways remain poorly characterized. We performed a comprehensive multi-omics assessment to investigate how AFB1 may influence breast tumor biology. This encompassed transcriptomic analysis, co-expression network modeling (WGCNA), immune landscape profiling, transcription factor regulatory mapping, and spatial plus single-cell transcriptomics. Predictive biomarkers were determined through a machine learning pipeline. Twenty-two genes were identified at the intersection of AFB1-predicted targets and disease-associated expression modules. A refined panel of seven biomarkers (EGFR, MIF, MET, PPARG, MME, NQO2, NR3C2) was established through model optimization. A composite classifier using glmBoost and StepGLM achieved high discriminative accuracy (area under the curve = 0.996). SHAP interpretability indicated PPARG may act protectively, while MIF showed risk-promoting characteristics. Expression heterogeneity was observed across cell populations and spatial regions. Our integrated analytical framework offers new insights into the oncogenic potential of AFB1 in breast cancer. The identified gene set may serve as both mechanistic mediators and diagnostic markers, underscoring the value of multi-omics and machine learning approaches in environmental carcinogenesis research.

🏷️ 키워드 / MeSH

같은 제1저자의 인용 많은 논문 (5)

Flap Failure and Salvage in Head and Neck Reconstruction.
Seminars in plastic surgery 2020 cited 1
Integrative Analysis Combining Machine Learning and Functional Experiments Uncovers ISG15 As a Key Determinant of Cisplatin Resistance in Gastric Cancer.
Anticancer research 2026
A multi-branch ensemble learning framework for detection of non-small cell lung cancer via T-cell receptor sequencing.
BMC cancer 2026
Regenerative strategies for post-prostatectomy incontinence: stem cells, exosomes, and the path to clinical resolution.
American journal of clinical and experimental urology 2026
Management of pleural relapse after breast cancer resection in a middle-aged man: a case report.
Frontiers in oncology 2026

📖 전문 본문 읽기 PMC JATS · ~80 KB · 영문

Introduction

Introduction
Aflatoxin B1 (AFB1), a secondary fungal metabolite produced by Aspergillus spp., mainly contaminates major crops such as corn and peanut in a warm and humid climate1, and is listed as a class 1 carcinogen by the international agency for research on cancer (IARC). AFB1 has good stability, and it is difficult to remove AFB1 from food by conventional heating and disinfection methods2. Moreover, AFB1 is also closely related to a variety of toxicities, including growth disorders, malnutrition and immune regulation1. The main target organ of AFB1 is the liver, which is one of the important promoting factors of liver cancer3. AFB1 can be metabolized in the liver under the action of microsomal mixed function oxidase (MFO) enzymes belonging to the Cytochrome P450 (CYP450) family, thus forming the related carcinogen afb1-exo-8,9-epoxide (AFBO), thereby promoting carcinogenesis2,4,5. And AFB1 can also induce Reactive Oxygen Species (ROS) production and oxidative stress, DNA damage, induce the generation of immune microenvironment conducive to tumorigenesis and other ways to promote the occurrence and development of cancer6–8. AFB1 can not only promote the occurrence of liver cancer, but also be a risk factor for lung cancer, gastrointestinal cancer, etc2.
Breast cancer is one of the most prevalent malignancies among women, with its global incidence steadily rising. In 2020 alone, approximately 2.3 million new cases were reported9. Although the incidence of breast cancer is relatively lower in developing countries compared to developed nations, the associated mortality rate is significantly higher9. The onset and progression of breast cancer are influenced by a complex interplay of factors, including genetic predisposition, environmental exposures, and hormonal regulation10. Recent studies have identified AFB1 as a potential risk contributor to breast cancer11,12. AFB1 has been shown to modulate breast cancer development by regulating miRNAs that affect cell proliferation, apoptosis, and invasiveness2,13,14. In addition, AFB1 can alter the expression of estrogen receptors and exert cytotoxic effects via the MAPK signaling pathway, thereby playing a key role in tumor initiation and progression15. Furthermore, AFB1 may induce breast cell death through multiple mechanisms, including oxidative stress, mitochondrial dysfunction, and immune-inflammatory signaling crosstalk16.
Recent studies have employed machine learning approaches to elucidate environmental carcinogen-related oncogenesis, such as AFB1-induced hepatocellular carcinoma17. Building on this framework, our study expands the investigation to breast cancer, integrating multi-omics data, single-cell transcriptomics, and spatial transcriptomics to uncover AFB1-associated molecular mechanisms specific to the breast tissue microenvironment. In addition, we constructed transcription factor regulatory networks and quantitatively assessed the tumor immune microenvironment using immune infiltration scoring models, to reveal their potential roles in breast cancer (BRCA) progression.

Materials and methods

Materials and methods

Acquisition of AFB1 structure and computational target profiling
Data on AFB1 were compiled from several public repositories. Its structural fingerprint was obtained from PubChem (https://pubchem.ncbi.nlm.nih.gov/), and the canonical SMILES string was used as the input descriptor for further analysis. Three independent platforms were applied for target identification. ChEMBL (https://www.ebi.ac.uk/chembl/) provided experimentally curated molecular interaction data. SwissTargetPrediction (https://www.swisstargetprediction.ch/) generated predictions by comparing the compound with structurally similar ligands, while PharmMapper (http://lilab-ecust.cn/pharmmapper/) screened potential binding proteins through a pharmacophore-based approach. Only human proteins were considered in the final dataset to restrict predictions to a relevant biological context.

Acquisition and preprocessing of GEO datasets
Five breast cancer transcriptome datasets (GSE15852, GSE10810, GSE109169, GSE29431, and GSE45827) were downloaded from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/). Among them, GSE15852, GSE10810, and GSE109169 were assigned as the training cohort, while GSE29431 and GSE45827 were used for validation. GSE15852: 43 healthy samples (control group) vs. 43 breast cancer patients (disease group), GSE10810: 27 healthy samples (control group) vs. 31 breast cancer patients (disease group), GSE109169: 25 healthy samples (control group) vs. 25 breast cancer patients (disease group), GSE29431: 12 healthy samples (control group) vs. 54 breast cancer patients (disease group), GSE45827: 11 healthy samples (control group) vs. 130 breast cancer patients (disease group). For preprocessing, expression matrices were aligned by the intersection of common genes across datasets, and redundant probes were averaged. The merged data were then corrected for batch effects using the ComBat algorithm implemented in the sva (v 3.56.0) R package18,19. Notably, batch correction was applied solely to the training dataset, and the validation datasets remained unaffected by this process.Principal component analysis (PCA) was performed before and after correction to evaluate the effectiveness of batch adjustment. The resulting normalized matrices, batch information, and PCA scores were exported for subsequent analyses. Sample groups were assigned based on the sample names, with samples ending in ‘_con’ designated as control and those ending in ‘_tra’ designated as treatment. This approach allowed for automated identification of groups for differential expression analysis. After preprocessing, we performed propensity score matching to address the potential class imbalance in the validation datasets (GSE29431 and GSE45827). Propensity scores were calculated using logistic regression with gene expression features as predictors, and the matching was performed using the nearest-neighbor method. However, after matching, the number of samples in the validation datasets decreased significantly to approximately 20 samples per dataset (including both healthy and disease groups). This reduction in sample size may limit the statistical power of the model evaluation and introduce the risk of overfitting. Therefore, we decided to use the original, pre-matching data for the final model evaluation, rather than relying on the matched data with reduced sample sizes.

Differential gene expression analysis
Differential expression analysis was conducted in R using the limma package20 (v 3.64.3), which applies linear modeling combined with empirical Bayes moderation to improve variance estimation. This analysis was performed solely on the training dataset. Additionally, differential expression analysis was also performed on the training dataset before batch correction to assess the impact of batch effects on gene expression. Genes were considered differentially expressed when meeting both criteria: adjusted P value (FDR) < 0.05 and |log2 fold change|> 0.585 (equivalent to a 1.5-fold difference). Data visualization, including volcano plots and expression distribution graphics, was performed with ggplot221 (v 4.0.0).

Construction of weighted gene co-expression networks
Normalized expression profiles obtained after batch correction were subjected to weighted gene co-expression network analysis using the WGCNA (v 1.73) R package22. This analysis was performed solely on the training dataset. Prior to network construction, genes with low variance (standard deviation ≤ 0.5) were excluded to reduce noise and focus on genes with higher variability, which are more likely to contribute to meaningful biological patterns. Sample quality was assessed through hierarchical clustering, and outlier samples were removed. A similarity matrix was generated based on Pearson correlations among all gene pairs. To approximate a scale-free topology, the soft-thresholding power (β) was optimized using the pickSoftThreshold function, with the optimal value chosen based on the scale-free topology criterion, where R2 ≥ 0.9. The adjacency matrix was subsequently transformed into a topological overlap matrix (TOM), which was used to compute gene–gene dissimilarity. Genes were hierarchically clustered, and modules were identified via dynamic tree cutting using the cutTreeDynamic function, with a minimum module size of 50 and a cut height of 0.25. This threshold was chosen to balance module specificity and the ability to capture biologically meaningful gene sets. Modules with highly correlated eigengenes were merged. Module eigengenes were then correlated with clinical traits (disease vs. control groups), and the significance of module–trait associations was evaluated using Pearson correlation and Student asymptotic P values. A P value threshold of < 0.05 was considered statistically significant. Gene significance (GS) and module membership (MM) metrics were computed to identify hub genes within biologically relevant modules, with hub genes identified based on high GS and MM values, indicating genes most strongly associated with clinical traits.

Identification of candidate genes via multi-set intersection
To refine potential biomarkers, an integrative strategy was employed. Differentially expressed genes (DEGs) obtained from limma-based analysis were intersected with genes belonging to the most significant WGCNA modules, yielding co-expression–associated DEGs. This intermediate gene set was subsequently overlapped with the predicted molecular targets of aflatoxin B1 (AFB1) derived from three complementary databases: ChEMBL, SwissTargetPrediction, and PharmMapper. For PharmMapper, only predictions with a z-score greater than 0.6 were retained to ensure the reliability of the predicted targets. The results from ChEMBL and SwissTargetPrediction were included without further filtering. All predicted targets were restricted to human proteins to ensure biological relevance to human diseases. The final target list was further refined by merging the results from the three databases and eliminating redundant entries, resulting in a non-redundant set of potential targets. The resulting intersection represented candidate genes simultaneously supported by transcriptomic evidence, network topology, and compound–target prediction.

Integrative analysis of PPI networks and functional enrichment
Candidate genes were imported into the STRING database (https://string-db.org/) to construct a protein–protein interaction (PPI) network, with the interaction confidence score set to ≥ 0.4. The resulting network was visualized using Cytoscape software. No further advanced network analysis, such as centrality metrics, cluster detection, or hub identification, was performed in this study. The primary focus was on visualizing the interactions between the candidate genes. To explore the biological significance of candidate genes, Gene Ontology (GO) annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway23,24 enrichment analyses were conducted. Enrichment was performed using the clusterProfiler R package, with statistical significance defined as adjusted P < 0.05. Results were categorized into three GO domains (biological process, cellular component, molecular function), while KEGG terms were grouped by functional pathways. Enrichment outcomes were visualized through bar plots and bubble plots to highlight the most significant biological processes and signaling pathways.

Construction and evaluation of machine learning models
To develop a robust model for distinguishing breast cancer from non-breast cancer, machine learning algorithms were employed in a two-stage feature selection and classification process. In the first stage, candidate predictors were identified using a combination of penalized regression approaches, including Lasso, Ridge, and Elastic Net with varying alpha parameters, as well as stepwise generalized linear models (Stepglm) (forward, backward, and both). ("Stepglm[both]" refers to the combination of both forward and backward stepwise generalized linear model approaches, where variables are added or removed iteratively based on the Akaike Information Criterion (AIC) to select the most relevant features). These methods were applied to the training set for feature selection, ensuring that the validation set remained independent of this process. For Elastic Net regularization, the alpha parameter was optimized using 10-fold cross-validation, selecting an alpha value of 0.4, which balances the L1 and L2 penalties to minimize the cross-validation error. Stepwise generalized linear models were employed to iteratively add or remove predictors based on the AIC, selecting the most relevant features from the training set. Additionally, boosting-based variable screening was performed to evaluate feature importance and refine the feature set. In the second stage, the selected features were used to train various classifiers, including support vector machines (SVM), random forest (RF), gradient boosting machine (GBM), extreme gradient boosting (XGBoost), linear discriminant analysis (LDA), Naïve Bayes, and logistic regression. Hybrid models were created by combining feature selection approaches with classifiers, such as Lasso + SVM, glmBoost + RF, and StepGLM + XGBoost, as specified in the reference method list. This two-stage strategy enabled the selection of the most predictive features and the combination of models to improve performance. To evaluate model performance, 127 distinct machine learning model combinations were assessed using 10-fold cross-validation and independent test sets. The training set was split into 10 folds, where 90% of the data were used for training and the remaining 10% for validation. This cross-validation procedure was repeated 10 times, with each fold serving as the validation set once, ensuring the results were reproducible with a random seed of 123. The models were evaluated using multiple metrics, including the area under the receiver operating characteristic curve (AUC), calibration curves, nomogram analysis, and decision curve analysis (DCA). AUC was used to assess the model’s ability to discriminate between breast cancer and non-breast cancer groups. Calibration curves were plotted to evaluate the agreement between predicted probabilities and actual outcomes, and DCA was used to assess the net benefit of the models at different decision thresholds, further validating their clinical applicability. For interpretability, SHapley Additive exPlanations (SHAP) were used to quantify gene-level contributions to model predictions. SHAP values were calculated to identify the most influential biomarkers, thereby providing biological insights into the model’s predictions. This approach helped explain the predictive features and enhanced the transparency of the models. Prior to model training, data preprocessing was performed, including normalization and noise addition. The training set was standardized to have zero mean and unit variance using the scaleData function, ensuring that the test set was not normalized to avoid data leakage. Additionally, Gaussian noise with a mean of 0 and a standard deviation of 0.01 was added to the training set to simulate real-world variability and improve model robustness. To further confirm the biological stability of the selected biomarkers, a simple logistic regression model was fitted using the final seven biomarkers on the pre-Combat datasets. The data was Z-scaled prior to fitting to ensure the comparability of the coefficients.

Integration of TF databases with gene expression correlation
Potential transcription factor (TF)–gene regulatory interactions were explored by integrating curated resources and correlation analysis. Candidate TFs associated with the selected feature genes were retrieved from two well-established databases: TRRUST (https://www.grnpedia.org/trrust/) and GeneMANIA (https://genemania.org/). To further characterize the regulatory architecture, pairwise correlations between TFs and hub genes were assessed using expression profiles derived from our study cohort. Correlation matrices were generated in R with the GGally and ggplot2 packages. For each TF–gene pair, Pearson’s correlation coefficients were computed, and statistical significance was evaluated. Enhanced scatterplots with fitted regression lines and density distributions were applied to visualize gene–gene relationships. Results were summarized in a customized correlation heatmap, where significance levels and effect sizes were annotated directly on the matrix panels. This integrated approach enabled identification of TFs most strongly associated with the expression patterns of diagnostic candidate genes.

Integration of immune cell infiltration patterns with gene expression data
To explore the immune microenvironment, we performed immune cell infiltration analysis using a modified version of the CIBERSORT algorithm. The LM22 reference signature was applied to estimate the relative fractions of 22 immune cell subsets. Prior to deconvolution, gene expression data were normalized by quantile normalization, and permutation testing (1000 iterations) was conducted to assess the robustness of the results. Only samples with P values < 0.05 were retained for downstream analyses. To visualize the infiltration patterns, multiple approaches were employed, including boxplots to compare immune cell distributions, heatmaps to display infiltration landscapes, and bar plots to summarize immune composition across groups. The statistical significance of immune cell differences between groups was assessed using Wilcoxon rank-sum tests, and only immune cells with P values < 0.05 were considered significantly different between the control and treatment groups. For each immune cell type, the relative fraction change between the control and treatment groups was used as a criterion for significant alteration, in addition to the p-value threshold. Immune cells with large changes in relative fraction (as shown in the boxplots) were identified as significantly altered, and the p-values were labeled in the plots to highlight these differences.

Single-cell and spatial transcriptomic profiling
Single-cell transcriptomic data from the GSE161529 cohort were processed using the Seurat25 framework (v5.3.0). To account for computational constraints, samples were stratified into three breast cancer subtypes (HER2-positive (Human Epidermal Growth Factor Receptor 2), ER-positive (Estrogen Receptor), and triple-negative (TNBC) prior to analysis. Low-quality cells were excluded according to standard quality control thresholds, including the number of detected genes per cell and the percentage of mitochondrial transcripts. Data were subsequently normalized, and highly variable genes were identified. Principal component analysis (PCA) was applied for dimensionality reduction, followed by clustering and visualization using the Uniform Manifold Approximation and Projection (UMAP) algorithm. Marker genes were identified by differential expression testing across clusters. Cell type annotation was conducted through the SingleR algorithm with the Human Primary Cell Atlas as reference. Lineage reconstruction and pseudotime inference were performed using monocle3. For selected candidate genes, multiple visualization approaches—including UMAP feature plots, violin plots, dot plots, and bubble plots—were generated to illustrate their expression profiles across cell types and pseudotemporal trajectories.
Spatially resolved transcriptomic data from GSE203612 were analyzed using Seurat (v5.3.0). Raw data in 10 × Genomics format were reformatted to standard input files, and tissue images were aligned with the expression matrix. Normalization was performed with the SCTransform algorithm, followed by dimensionality reduction through PCA and UMAP, and graph-based clustering. Differentially expressed genes and spatially variable features were identified using log-fold change and adjusted P-value thresholds. Cell type annotation was assigned via SingleR, and spatial distribution of clusters and gene expression was visualized using SpatialFeaturePlot and SpatialDimPlot. In addition, expression patterns of key candidate genes (e.g., EGFR, NR3C2, MIF, PPARG, MET, MME, NQO2) were projected onto tissue coordinates to reveal their spatial heterogeneity.

Gene expression analysis
Gene expression data for the selected gene across 33 cancer types were retrieved from TCGA (The Cancer Genome Atlas, https://portal.gdc.cancer.gov/). Expression levels in normal and tumor tissues were compared using the Wilcoxon rank-sum test, and for cancer types with multiple tumor groups, the Kruskal–Wallis test was applied. Statistical significance was determined at P < 0.05, with Benjamini–Hochberg correction for multiple comparisons. Analysis was conducted using the R 4.5.1 environment.

Results

Results

Prediction of potential targets of AFB1
The molecular structure of AFB1 was obtained from the PubChem database (Fig. 1A). To explore its possible molecular targets, three complementary platforms were applied: ChEMBL, PharmMapper, and SwissTargetPrediction. After merging and eliminating duplicate entries, a total of 170 unique targets were collected (Fig. 1B). dditionally, the specific genes found in the overlapping regions of the database screening (as shown in Fig. 1B) are provided in the supplementary Table S1-S3 for reference.

Identification of differentially expressed genes
Principal component analysis (PCA) revealed strong batch effects across datasets before correction (Fig. 1C), with samples from different batches clustering separately, indicating significant technical variation. These batch effects were effectively removed after batch correction (Combat) (Fig. 1D), resulting in a clear separation between control and treatment groups. Based on the criteria of adjusted P < 0.05 and |log2FC|> 0.585, a number of differentially expressed genes (DEGs) were identified. The volcano plot (Fig. 1E) highlights significantly upregulated (red) and downregulated (blue) genes, with the top 25 labeled. Notably, after batch correction, a number of genes that were not identified as differentially expressed before correction became significant. This indicates that the batch correction process (Combat) revealed previously obscured biological differences. The heatmap (Fig. 1F) shows distinct gene expression patterns between control and treatment groups, confirming clear transcriptional differences. In the supplementary materials (Table S4-S6), we provide a detailed DEG analysis both before and after the Combat step, illustrating how batch effects influenced the original results. The comparison between pre- and post-correction results further emphasizes the effectiveness of batch correction in revealing true biological effects.

The turquoise co-expression module exhibits strong association with disease phenotype
To explore gene expression modules associated with disease phenotypes, we performed WGCNA. As shown in Fig. 2A, a soft-thresholding power of 4 was chosen to approximate scale-free topology (signed R2 > 0.85) while preserving sufficient mean connectivity. Sample clustering confirmed the absence of outliers across all subjects (Fig. 2B). Following dynamic tree cutting and eigengene-based merging, a total of four merged modules were obtained (Fig. 2C), including the blue, brown, grey, and turquoise modules. The turquoise module exhibited the most significant correlation with disease status (cor = − 0.84, P = 1.9 × 10−52; Fig. 2D), and was thus selected for further analysis. The correlation between gene significance (GS) and module membership (MM) for the turquoise module was remarkably strong (cor = − 0.99, P < 1e−10; Fig. 2F), suggesting this module harbors key disease-related genes. Gene expression profiles in the turquoise module demonstrated clear differential patterns between disease and normal groups (Fig. 2E). To further evaluate the turquoise module’s network integrity and biological coherence, we examined its topological structure and module placement: Supplementary Fig. S1A reveals the turquoise module to be largely independent from other modules, while Supplementary Fig. S2 highlights its strong internal connectivity as evidenced by the TOM.Additionally, Supplementary Fig. S1B provides the gene counts per module, and Supplementary Fig. S1C shows the dendrogram used to define and merge modules via eigengene clustering, supporting the reliability of the module definition process.

Integrated network and pathway analysis of genes co-involved in AFB1 exposure and breast cancer
To explore potential molecular links between AFB1 exposure and breast cancer, we first intersected differentially expressed genes with WGCNA-derived turquoise module genes, yielding 989 overlapping candidates (Fig. 3A). A further comparison between known AFB1-related targets and BRCA-associated genes identified 22 shared genes (Fig. 3B), suggesting a small but potentially critical set of intersection targets. Subsequent PPI network analysis (Fig. 3C) revealed several hub genes with high connectivity, including CDK1, MMP9, EGFR, and AURKA, which may act as central regulators in AFB1-mediated breast carcinogenesis. KEGG pathway23,24 enrichment of these 22 genes (Fig. 3D) demonstrated significant involvement in cancer-related and signaling pathways, such as MAPK signaling, EGFR resistance, and PI3K-Akt signaling. GO enrichment analysis (Fig. 3E) further highlighted biological processes related to cell proliferation, oxidative stress response, and localization in membrane-associated complexes. These results collectively suggest that AFB1 may influence breast cancer development through specific oncogenic signaling networks.

Integrative machine learning approach identifies a robust seven-gene signature with high predictive accuracy for clinical risk assessment
Fig. 4A shows the AUC values for 127 models composed of the machine learning methods utilized. It is important to note that this classifier is designed to distinguish between breast cancer and non-breast cancer. Fig. S3 displays the top ten genes among these 127 combined models. We selected the glmBoost + Stepglm[both] model for subsequent analysis, which includes EGFR, NR3C2, MIF, PPARG, MET, MME, and NQO2. This model achieved an AUC of 0.985, which was very close to that of other more complex models but with fewer genes. The decision to select the seven-gene model was based on its ability to strike a balance between model performance and simplicity. The model with seven genes performed similarly to more complex models in terms of AUC but was less prone to overfitting, ensuring stability and generalization to new data. Compared to the predictive ability of single-gene ROC curves (Fig. 4B), the AUC value of the combined model (95% CI 0.992, 0.996AUC = 0.996) is higher (Fig. 4C). The calibration curve indicates that the predicted probabilities of this model are highly consistent with actual observations, with the curve approaching the ideal line (Fig. 4E). Decision curve analysis (DCA) shows that this model outperforms the “treat-all” and “treat-none” strategies across a wide range of thresholds (Fig. 4D). Further cost–benefit analysis indicates that the model maintains a high sensitivity while controlling the false positive rate (Fig. 4F). The volcano plot further verifies that the differences in the seven key genes are statistically significant between groups (Fig. 4G). Based on these results, we constructed a nomogram for clinical application (Fig. 4H). Each gene’s score is derived from its regression coefficient in a logistic regression model, which reflects the gene’s contribution to disease risk prediction. The scores are based on the gene expression levels, and the cumulative score corresponds to an estimated probability of disease.

Model performance evaluation and SHAP-based interpretation of gene contributions
As shown in Fig. 5A, we also plotted ROC curves based on five mainstream machine learning methods, including RF, SVM, XGB, GBM, and KNN. The AUC values of all models range from 0.948 to 0.979. To further reveal the basis of model predictions, we introduced SHAP values for interpretive analysis. The cumulative SHAP contribution curve displayed in Fig. 5B shows that the top five features contribute over 80% of the model’s explanatory power, indicating the critical role of certain genes in the model. According to the ranking of average absolute SHAP values (Fig. 5C), PPARG, MIF, and MME are the most influential genes in the model. The single gene dependence plot (Fig. 5D) shows that the high expression of PPARG, MME, EGFR, and NQO2 presents a negative SHAP contribution, suggesting they may act as protective factors; whereas increased expression of MIF shows a positive impact, possibly participating in the pathological promotion process. The intergroup SHAP importance comparison between the control group and treatment group (Fig. 5E) further confirms that PPARG is the most critical feature in both subgroups. The global SHAP beeswarm plot (Fig. 5F) intuitively displays the breadth and direction of influence of each gene across all samples, while the individual SHAP force plot (Fig. 5G) further validates that high expression of PPARG and MIF significantly drives the model output towards negative predictions (non-disease categories), consistent with their SHAP weights. We assessed the stability and biological relevance of the influential biomarkers by fitting a simple logistic regression model using the final seven biomarkers (EGFR, NR3C2, MIF, PPARG, MET, MME, NQO2) across multiple datasets. As shown in Supplementary Table S7, the coefficients for these biomarkers varied across datasets, reflecting their potential role in different cohorts. For instance, EGFR exhibited a strong negative coefficient in GSE10810 (− 9.14071), suggesting its significant contribution to breast cancer prediction in this dataset. However, in GSE109169 and GSE15852, the effect of EGFR was less pronounced, with coefficients of 4.08654 and − 1.99625, respectively, indicating that its biological role may differ between datasets. Similarly, NR3C2 showed large negative coefficients in GSE10810 (− 26.3588) and GSE109169 (− 16.1603), but had a much smaller effect in GSE15852 (− 0.88841), pointing to dataset-dependent variations in its role. In contrast, MIF consistently showed a positive contribution across all datasets, though the magnitude of the effect varied, with coefficients of 0.10759 in GSE10810, 3.41252 in GSE109169, and 1.74748 in GSE15852. Despite these variations, PPARG consistently showed a negative effect across all datasets, though its influence diminished in GSE15852, with coefficients of − 12.2157, − 10.9871, and − 2.34601 for GSE10810, GSE109169, and GSE15852, respectively. These findings highlight the importance of these biomarkers in predicting breast cancer, but also underscore the dataset-specific variations that should be considered when interpreting their biological significance. Furthermore, while the results from the pre-Combat data suggested some dataset-specific variations, we observed consistent trends in the SHAP analysis, where PPARG, MIF, and other biomarkers showed relatively stable contributions across datasets, further supporting their role in breast cancer prediction.

Construction of TF-target gene regulatory networks based on co-expression and functional annotation.
Subsequently, we conducted TF (transcription factor) related analysis for the genes in the model. As shown in Table 1, transcription factors such as SP1, YBX1, HIF1A, EGR1, and TP53 all have significant relevance (both P value and Q-value are less than 0.05). In Fig. 6A, we constructed a correlation matrix between TFs and model genes. The results show that there is a significant correlation between model genes and transcription factors. For example, EGR1 is positively correlated with MME (r = 0.3, P < 0.01) and PPARG (r = 0.37, P < 0.001), and negatively correlated with MIF (r = − 0.39, P < 0.001). This indicates that there is a complex regulatory network between model genes and transcription factors, jointly involved in the occurrence and development of AFB1 and breast cancer. In Fig. 6B, key TFs such as RELA, NFKB1, TP53, HIF1A, EGR1, and SP1 not only have interactions with multiple model genes but are also involved in multiple functional pathways such as "response to oxidative stress," "regulation of pri-miRNA transcription by RNA polymerase II," "positive regulation of defense response," and "regulation of hormone biosynthetic process." Furthermore, we employed Bootstrap resampling to validate the statistical significance of the observed correlations, ensuring that the findings are not due to chance (Table S8).

Alterations in macrophage infiltration and its association with model gene expression
In the training and validation sets, the infiltration of M1 macrophages in the tumor group is significantly increased compared to the normal group (Fig. S4, Fig. S5, Fig. S6A, Fig. S7A). Moreover, the infiltration level of resting mast cells decreased in the training set, while the infiltration level of M0 macrophages is increased (Fig. S4, Fig. S6A,). In the model genes, MME and NR3C2 both show significant and consistent negative correlation with M1 macrophage infiltration in both the training and validation cohorts (Fig. S6C, Fig. S7C, Fig. S7E).

Single-cell pseudotime and expression landscape of model genes in TNBC, HER2+, and ER+ breast cancer
To further investigate the expression patterns and dynamic changes of model genes in different cell types at the single-cell level, we performed pseudotime trajectory and cell annotation analyses based on breast cancer subtypes (ER+, HER2+, TNBC). Fig. 7 presents the visualization analysis results for TNBC. After clustering and annotating the single-cell data from TNBC (Fig. 7A), we further utilized the SingleR method to annotate cell types for each cluster and displayed the main cell types through UMAP plots (Fig. 7F). The results showed that the single-cell samples mainly consisted of various cell types, including epithelial cells, macrophages, fibroblasts, and endothelial cells. MIF was significantly elevated in macrophage and epithelial cell clusters, while genes such as MME and NR3C2 showed low expression in most cells (Fig. 7E, Fig. S8). Most model genes (e.g., MME, NR3C2, PPARG) exhibited higher expression at early pseudotime stages and then progressively decreased (Fig. 7B–D, Fig. S9A-E). Figure 7G,H indicated that cells with high expression of genes like MME and NR3C2 were primarily concentrated in the early trajectory branches, while low-expressing cells tended to expand along the direction of tumor progression (Fig. S9F-G). Additionally, similar results were observed in ER+ and HER2+ breast cancers (Fig. S10-13).

Spatial transcriptomics analysis reveals different gene expression patterns and cellular heterogeneity in AFB1-related breast cancer
To study the impact of AFB1 on the spatial heterogeneity of gene expression in breast cancer tissues, we conducted spatial transcriptomic sequencing on tumor sections and performed a visual analysis of gene expression distribution within the tissue architecture. Unsupervised clustering identified 9 spatial clusters with characteristic transcriptional expression patterns, which exhibited clear differences in tissue structure on both UMAP dimensionality reduction and spatial localization maps (Fig. 8A, B, D). Automated cell type annotation using SingleR revealed a rich variety of cell types in the tumor tissue, including epithelial cells, fibroblasts, endothelial cells, macrophages, and multiple lymphocyte lineages (T cells, B cells, NK cells), which were spatially organized (Fig. 8C, E). We performed expression analysis of model genes (EGFR, MET, MIF, MME, NQO2, NR3C2, PPARG) at the spatial level (Fig. S14A-G). The results showed that EGFR, MET, and MIF exhibited significant local high expression in tumor enrichment areas and immune infiltration regions (Fig. S14A-C), while NR3C2 had extremely low expression (Fig. S14F). Moreover, genes with strong spatial variability, such as CXCL10, IGHA2, and FDCSP, displayed highly localized expression patterns (Fig. S15), while the expression of marker genes like COL1A1 and POSTN indicated potentially active fibroblast activity and extracellular matrix remodeling in specific regions (Fig. S16).

Gene expression profiling in breast cancer and across multiple cancer types: a pan-cancer analysis
We analyzed the expression profiles of seven model genes (EGFR, MET, MIF, MME, NQO2, NR3C2, and PPARG) across various cancer types using TCGA data. The expression levels of these genes were compared between normal and tumor tissues across 33 cancer types (Fig. S17-S18). MIF was upregulated in BRCA and other cancer types, EGFR and MET showed variable expression across cancer types, with EGFR upregulated in cancers like HNSC (head and neck squamous cell carcinoma), and MET showing downregulation in BRCA, but upregulated in cancers like LUAD (lung adenocarcinoma) and PAAD (pancreatic adenocarcinoma). MME, NR3C2, and PPARG exhibited significant downregulation in BRCA.

Discussion

Discussion
AFB1 may promote breast cancer development through multiple interrelated molecular mechanisms. One such pathway involves oxidative stress (OS), a key driver of DNA damage and protein dysregulation in cancer cells26. The p53 signaling axis plays a central role in cellular responses to oxidative damage and genomic instability27, AFB1 has been shown to induce both oxidative stress and DNA damage via activation of the p53 pathway, which may contribute to tumor initiation28. In addition, AFB1 enhances tumor cell survival and proliferation by modulating immune regulatory mechanisms and activating oncogenic cascades such as NF-κB and PI3K29. Another critical oncogenic route involves the PKCα → ERK → PRMT5 axis, through which AFB1 upregulates cancer-associated methyltransferases, thereby facilitating tumor progression30. Immune evasion also represents a hallmark of AFB1-mediated carcinogenesis, with the IL-6/STAT3 pathway acting as a central hub in immune suppression and tumor-promoting inflammation8. Furthermore, AFB1 exerts cytotoxic effects through the p53–MAPK–IL6 axis, disrupting cellular homeostasis and promoting malignant transformation16. Notably, the immunomodulatory microenvironment plays a pivotal role in AFB1-driven tumorigenesis. AFB1 can activate the TLR2/TLR4 → NF-κB signaling cascade, resulting in the upregulation of pro-inflammatory cytokines such as IL-1β, IL-6, IL-8, IFN-γ, and TNF-α, thereby fostering chronic inflammation and an immune milieu conducive to cancer development31. These potential carcinogenic mechanisms are consistent with the pathways identified in our study.
This study demonstrates that AFB1 potentially influences breast cancer progression by modulating gene expression profiles, immune infiltration patterns, transcriptional regulation, and the spatial organization of the transcriptome. By integrating multi-omics data with machine learning models, we identified seven key genes (EGFR, MIF, MET, PPARG, MME, NQO2, NR3C2), which play crucial roles in MAPK signaling, EGFR resistance, and PI3K-Akt signaling. These genes are also significantly involved in biological processes related to cell proliferation, oxidative stress response, and membrane-related complexes. These findings suggest that these genes serve as critical mediators linking AFB1 exposure to breast cancer development and may have potential as diagnostic and prognostic biomarkers. The machine learning model (AUC = 0.985) based on the seven-gene panel demonstrated high predictive accuracy, providing a promising tool for early breast cancer detection. Further immune infiltration analysis revealed that MME and NR3C2 were negatively correlated with M1 macrophage infiltration, implying their potential role in immune regulation within the tumor microenvironment. Single-cell and spatial transcriptomics analyses offered deeper insights, showing that candidate genes exhibit distinct spatial and temporal expression patterns. MIF was found to be highly expressed in macrophage-rich regions, while NR3C2 and MME were predominantly expressed in early pseudotemporal stages, suggesting their involvement in early tumorigenesis. Additionally, a pan-cancer expression analysis of the seven genes (EGFR, MET, MIF, MME, NQO2, NR3C2, and PPARG) across various cancer types, using TCGA data, highlighted diverse expression patterns. MIF showed significant upregulation in multiple cancer types, reinforcing its potential as a therapeutic target. On the other hand, NR3C2 and PPARG exhibited lower expression in certain cancer types, possibly indicating a suppressive role, warranting further investigation in future studies.
Seven model genes exhibit distinct expression trends in breast cancer. MIF is the only gene identified in the model with an increased expression trend, while other genes generally show lower expression trends. Importantly, the model does not rely strictly on whether a gene is upregulated or downregulated. Instead, it uses the overall expression trends of these genes to predict risk. While EGFR and MET are typically recognized as cancer-promoting genes, their downregulation in our dataset could be associated with the heterogeneity of breast cancer subtypes and regulatory feedback in the tumor microenvironment. In our model, the expression trends of these genes, rather than their upregulation or downregulation in isolation, play a critical role in predicting patient outcomes and risk. MIF can regulate the expression of (Aldolase C) ALDOC through the MIF-β536 CATENIN-MYC axis, increasing glucose uptake in cancer cells and inducing metabolic reprogramming of macrophages in the tumor microenvironment, while suppressing T cell activity, leading to immune evasion and promoting the proliferation, migration, and invasion of cancer cells32. Previous studies have widely acknowledged GFR and MET as cancer-promoting genes33,34. EGFR can promote cancer progression through the PI3K/AKT pathway35. MET is an important oncogene, with abnormal activation closely related to poor prognosis in breast cancer, enhanced invasiveness, and drug resistance36. Circulating RNA circNR3C2 derived from NR3C2 also plays a significant role in breast cancer development; it can absorb miR-513a-3p and promote HRD1-mediated degradation of wave proteins through the ubiquitin–proteasome pathway, ultimately inhibiting tumor growth and metastasis37,38. Enhanced expression of MME can inhibit the metastatic ability of triple-negative breast cancer and lead to better Relapse-Free Survival (RFS) and chemotherapy resistance for patients39,40. NQO2 not only participates in the regulation of cell death in oxidative stress responses but also in the metabolism of estrogen carcinogenic metabolites41,42. PPARG levels are positively correlated with the expression of immune-related genes and immune checkpoints, and PPARG may reduce breast cancer (BC) development by modulating the immune microenvironment43.
Several anti-inflammatory and detoxification-related tumor suppressor genes (e.g., MME, NR3C2, NQO2) are significantly downregulated in breast cancer, indicating that AFB1 may contribute to tumor immune microenvironment imbalance through these genes. Specifically, AFB1 could promote an inflammatory microenvironment by downregulating MME and NR3C2, which may affect immune regulation and tumor progression. This suggests a complex interplay of oncogenic and tumor-suppressive signals in breast cancer. Enrichment and transcription factor analyses further reveal that these genes are linked to critical pathways, such as MAPK signaling and defense response regulation, emphasizing that AFB1 may influence these pathways to drive breast cancer development. This panel of genes holds potential as biomarkers for understanding AFB1’s role in breast cancer and could be utilized for early diagnosis and therapeutic intervention strategies.
Given the high predictive performance of our seven-gene signature model (AUC = 0.996) and the development of a nomogram for risk prediction, we define the model as a diagnostic tool for the early detection of breast cancer. This model aims to assist clinicians in identifying breast cancer patients at an early stage, enabling more informed clinical decision-making and personalized treatment planning. To facilitate the clinical implementation of this model, we propose the following steps: Prospective Validation on New Independent Cohorts: To confirm the robustness and generalizability of the model across diverse populations and clinical settings, we will conduct prospective validation on new, independent cohorts of breast cancer patients. Comparison with Existing Clinical Gold Standards: The model will be compared against established clinical gold standards, such as Oncotype DX and MammaPrint, which are widely used in clinical practice for assessing the recurrence risk and guiding treatment decisions. Oncotype DX analyzes 21 genes, while MammaPrint evaluates 70 genes to predict recurrence and inform chemotherapy decisions. By comparing our model with these widely accepted methods, we aim to evaluate how our model could improve diagnostic accuracy and clinical decision-making, particularly in early-stage breast cancer detection. Development of a Clinical Assay: To make the model clinically actionable, we plan to develop an easy-to-use clinical assay, such as a qPCR panel, which can be integrated into routine clinical practice. This would enable rapid, cost-effective testing, enhancing the model’s feasibility and applicability in clinical settings.
As noted by previous studies, many random gene signatures correlate with breast cancer outcome, likely because a large fraction of the transcriptome is influenced by proliferation‑related programs such as those captured by the meta‑PCNA signature. The meta‑PCNA signature consists of genes that show the strongest positive correlation with PCNA, a canonical proliferation marker (e.g., CKS2, NUSAP1, RRM2, AURKA)44. These genes predominantly reflect tumor cell proliferation. To contextualize our findings within this framework, we directly compared our classifier with the meta‑PCNA signature. Whereas the meta‑PCNA signature primarily encodes proliferation‑related signals, our classifier incorporates biomarkers associated with multiple biological processes, including immune regulation, signaling, metabolism, and transcriptional control (e.g., EGFR, MIF, MET, PPARG, MME, NQO2, NR3C2). This reflects a broader representation of tumor biology beyond proliferation alone. Although both approaches capture aspects of breast cancer biology, our analysis indicates that the pathology and prognostic information embedded in our classifier are not limited to proliferation signals. The comparison with meta‑PCNA highlights that our integrated multi‑omics and machine learning strategy captures additional dimensions of tumor heterogeneity and microenvironmental interactions that are not represented in a proliferation‑centric signature.
This study presents several limitations that may impact the clinical translation and biological significance of the 7-gene panel. One of the main challenges was the dominance of a single platform in the final list of AFB1 targets (~ 51%). This bias could affect the reliability of the selected targets, potentially skewing the biological insights derived from the model. To ensure more robust clinical applications, future studies should aim to incorporate a more balanced approach, integrating multiple platforms to enhance the diversity and biological relevance of the target list. Another limitation arises from the small sample sizes in the validation datasets, which were reduced after propensity score matching. The small number of samples (approximately 20 per dataset) introduces the risk of overfitting, compromising both the model’s stability and its generalizability. This, in turn, could impact its clinical reliability, especially in predicting patient outcomes. Moreover, the imbalance between healthy and disease groups in the validation datasets could affect model performance, making it less applicable in clinical settings where group distributions can vary. Despite employing evaluation metrics such as AUC, calibration curves, and decision curve analysis (DCA), which are valuable when class imbalance is not as pronounced, these techniques were applied to the original, pre-matching datasets for more reliable results. This choice was made to ensure a more representative assessment of the model’s real-world performance. However, the inclusion of larger datasets and more advanced data augmentation techniques in future studies will help address these issues and strengthen the model’s applicability across different populations. Additionally, the application of batch correction (Combat) to only the training data introduces potential risks when merging datasets. While we ensured that the validation set remained independent and unaltered, future research should include validations on pre-Combat datasets to mitigate any bias introduced by batch effects. Furthermore, further in vivo and in vitro validations, as well as larger cohort studies, are essential to confirm the biological significance, robustness, and generalizability of the findings. These efforts will help ensure that the 7-gene panel is not only accurate but also biologically meaningful, offering a solid foundation for its clinical use. Moreover, this study did not fully implement certain methods that could enhance classifier stability, such as continuous random sampling to balance the data and iterating performance assessments multiple times. Future work could consider incorporating these methods along with techniques like Bayesian Model Averaging to further improve the classifier’s performance and interpretability. Lastly, the results were derived from the analysis of publicly available data. Due to the limited sample size in the training dataset, further validation on pre-Combat individual datasets is required. The lack of validation on pre-Combat datasets introduces potential risks when merging datasets. To mitigate this, the validation set was kept independent and was not subjected to batch correction, preserving the integrity of external validation. However, this limitation warrants further investigation in future studies. Additional validation through larger cohort analyses and subsequent in vivo and in vitro studies will be crucial to confirm the biological significance, robustness, and generalizability of the findings. These validations will help avoid overestimation of the model’s performance, especially considering the AUC of 0.996, which is based on batch-corrected training data.

Conclusion

Conclusion
Overall, our findings provide multi-dimensional evidence supporting the hypothesis that AFB1 exposure contributes to breast cancer pathogenesis through intricate immunological, and spatial mechanisms. By integrating transcriptomic profiling, co-expression network analysis, immune cell deconvolution, spatial transcriptomics, and machine learning frameworks, we identified seven key genes—EGFR, MIF, MET, PPARG, MME, NQO2, and NR3C2—with both strong predictive performance and biological relevance. These results not only deepen our understanding of environmental carcinogens in breast cancer but also offer novel biomarkers and robust predictive models for individualized risk stratification.

Supplementary Information

출처: PubMed Central (JATS). 라이선스는 원 publisher 정책을 따릅니다 — 인용 시 원문을 표기해 주세요.