Exploring the association between air pollutants and non-small cell lung cancer using network toxicology and machine learning.
APA
Zhao Q, Zhao Z, Du K (2025). Exploring the association between air pollutants and non-small cell lung cancer using network toxicology and machine learning. Discover Oncology, 16(1), 2327. https://doi.org/10.1007/s12672-025-04143-1
MLA
Zhao Q, et al. "Exploring the association between air pollutants and non-small cell lung cancer using network toxicology and machine learning." Discover Oncology, vol. 16, no. 1, 2025, p. 2327.
PMID
41307603
Abstract
[BACKGROUND] Non-small cell lung cancer (NSCLC) is a leading cause of cancer-related mortality worldwide. The urgent need to understand its risk factors and develop effective treatment strategies drives ongoing research in this field. Among various environmental factors, air pollutants have emerged as potential risk factors. Therefore, in-depth exploration is necessary to elucidate their impact on NSCLC pathogenesis.
[METHODS] This study combines transcriptomic data analysis, machine learning, and molecular docking simulations to assess the association between NSCLC and seven air pollutants: carbon monoxide (CO), nitric oxide (NO), nitrogen dioxide (NO₂), sulfur dioxide (SO₂), benzo[a]anthracene (BaA), benzo[a]pyrene (BaP), and 3-methylcholanthrene (3-MC). We identified a total of 30 gene targets associated with air pollutants in NSCLC. These targets point to significant molecular alterations. Pathway enrichment analysis was then performed to identify crucial pathways implicated in tumorigenesis, with particular emphasis on the cell cycle and p53 signaling pathways.
[RESULTS] Using machine learning, seven core genes (CKS1B, GAPDH, TYMS, AURKA, CCNE1, PARP1, and MGLL) were identified as promising diagnostic markers, achieving an area under the curve (AUC) value exceeding 0.95 during validation. Additionally, molecular docking revealed strong binding interactions between these core genes and selected air pollutants, with molecular dynamics simulations confirming the stability of these interactions.
[CONCLUSIONS] Our findings suggest a significant association between air pollutants and the development of NSCLC and propose potential biomarkers for enhanced diagnostic accuracy, alongside potential therapeutic targets. Future research should prioritize the clinical validation of these findings and the investigation of targeted therapies that consider environmental risk factors, thereby enhancing NSCLC management strategies and patient outcomes.
Introduction
Non-small cell lung cancer (NSCLC) is a significant health concern globally, ranking as one of the primary causes of cancer-related mortality [1–3]. NSCLC imposes a considerable economic burden arising from treatment costs and the loss of productivity associated with patient morbidity and mortality. Current diagnostic and therapeutic strategies, including surgical intervention, chemotherapy, and targeted therapies, have yielded suboptimal outcomes [4]. These outcomes are often hampered by late-stage diagnoses and the emergence of drug resistance. These limitations underscore the pressing need for innovative approaches to improve NSCLC management and patient outcomes.
Previous research has established a connection between environmental factors, particularly air pollution, and the incidence and progression of various cancers, with a focus on lung cancer [5–7]. Studies have indicated that exposure to air pollutants can increase the risk of developing NSCLC and may play a role in its advancement [8–14]. Therefore, integrating diverse analytical techniques is essential to elucidate the molecular mechanisms linking air pollution to NSCLC, which represents a critical research gap in the field.
This study employs an integrative approach that combines transcriptomic data analysis, machine learning, and molecular docking to investigate the effect of air pollutants on NSCLC. Transcriptomic data analysis is used to identify differential gene expression patterns associated with air pollution exposure. Machine learning techniques help to pinpoint potential biomarkers from large datasets. Molecular docking is employed to explore potential molecular interactions between identified biomarkers and therapeutic compounds. By leveraging these methodologies, we aim to systematically investigate the complex relationship between environmental exposures and cancer development.
The primary objective of this research is to identify core genes whose expression is influenced by air pollutants, which may serve as diagnostic markers for NSCLC. The identification of such markers could provide valuable insights into the underlying mechanisms that link environmental exposure to air pollutants with cancer progression, and may pave the way for the development of novel therapeutic strategies. Additionally, this study aims to contribute evidence emphasizing the need for public health interventions aimed at reducing air pollution and its associated respiratory risks.
In summary, the interaction between air pollution and NSCLC represents a critical area of research that necessitates further exploration. By employing an integrative approach that combines data analysis techniques and molecular modeling, this study aims to elucidate the specific mechanisms underlying the relationship between environmental factors and cancer pathogenesis. Ultimately, it seeks to contribute to improved diagnostic and therapeutic strategies for NSCLC. This research holds the potential to address significant gaps in our understanding and management of this prevalent and deadly disease, thereby fostering advancements in public health and clinical practice.
Methods and materials
Workflow and data preparation
The complete analytical workflow is illustrated in Fig. 1.
Six transcriptomic datasets relevant to NSCLC were carefully curated from the NCBI GEO database. These datasets include GSE19188, GSE10072, GSE43458, GSE151102, GSE7670, and GSE19804. The datasets GSE19188, GSE10072, and GSE43458 were designated as the training cohort, and GSE151102, GSE7670, and GSE19804 constituted the validation cohort. To mitigate potential batch effects, a thorough multi-stage normalization protocol was implemented.
Surrogate Variable Analysis (SVA), as implemented in the sva package in R, was applied to identify latent confounding variables within the training cohort and to adjust for them. ComBat harmonization was subsequently applied to correct residual variance attributed to batch effects using a parametric empirical Bayes framework. After correction, principal component analysis (PCA) showed improved sample clustering across batches in the two-dimensional space defined by the first two principal components, supporting the effectiveness of the data harmonization procedure.
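To make the idea of batch adjustment concrete, the following minimal sketch removes a per-batch mean shift from a single gene. It is a deliberately simplified location-only adjustment and not ComBat's empirical Bayes procedure or SVA; the function name, expression values, and batch labels are hypothetical, and a real analysis would also protect biological covariates such as tumor versus normal status.

```python
from statistics import mean

def center_by_batch(values, batches):
    """Subtract each batch's mean from its samples, then add back the grand mean.

    Simplified stand-in for batch correction: it only removes additive
    per-batch shifts and ignores variance differences and covariates.
    """
    grand = mean(values)
    batch_means = {
        b: mean(v for v, vb in zip(values, batches) if vb == b)
        for b in set(batches)
    }
    return [v - batch_means[b] + grand for v, b in zip(values, batches)]

# Hypothetical log-expression values of one gene measured in two batches.
expr = [5.0, 5.2, 4.8, 7.0, 7.2, 6.8]
batch = ["A", "A", "A", "B", "B", "B"]
adjusted = center_by_batch(expr, batch)
```

After the adjustment both batches share the same mean, which is the property a PCA plot would reflect as batches overlapping rather than separating.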
Identification of the toxicity of air pollutants
The chemical structures and relevant molecular data for the seven air pollutants, namely carbon monoxide (CO), nitric oxide (NO), nitrogen dioxide (NO₂), sulfur dioxide (SO₂), benzo[a]anthracene (BaA), benzo[a]pyrene (BaP), and 3-methylcholanthrene (3-MC), were acquired from the PubChem database (https://pubchem.ncbi.nlm.nih.gov). The carcinogenic potential of these pollutants was then evaluated using the ADMETlab 3.0 platform (https://admetlab3.scbdd.com) and the ProTox-3 platform (https://tox.charite.de/protox3) (Table 1).
Collection of genes targeted by air pollutants
Targets of the seven air pollutants were identified using three complementary resources: the ChEMBL database (https://www.ebi.ac.uk/chembl/) for ligand-receptor interactions, SwissTargetPrediction (http://www.swisstargetprediction.ch) for target prediction based on chemical genomics, and PharmMapper (http://lilab-ecust.cn/pharmmapper) for 3D pharmacophore matching. All identified targets were restricted to the Homo sapiens proteome. The gene lists from the three sources were then merged and duplicate entries were removed, yielding a consolidated set of target genes associated with the air pollutants.
Analysis of differentially expressed genes (DEGs)
The transcriptomic data were analyzed using the limma package. DEGs were identified based on an FDR-adjusted p-value of less than 0.05 and an absolute log₂ fold change (|log₂FC|) greater than or equal to 0.585, corresponding to a 1.5-fold change. The findings were subsequently visualized using ggplot2.
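The relationship between the 0.585 cutoff and a 1.5-fold change, and the way the two thresholds combine, can be checked with a short sketch. The function names and the "up"/"down"/"ns" labels are illustrative assumptions; the study itself applied these thresholds to limma output in R.

```python
import math

# log2(1.5) is about 0.585, which is where the fold-change cutoff comes from.
FC_CUTOFF = round(math.log2(1.5), 3)

def classify_gene(log2_fc, adj_p, fc_cutoff=FC_CUTOFF, p_cutoff=0.05):
    """Label a gene from its log2 fold change and FDR-adjusted p-value."""
    if adj_p < p_cutoff and abs(log2_fc) >= fc_cutoff:
        return "up" if log2_fc > 0 else "down"
    return "ns"  # not significant under the stated thresholds
```

For example, classify_gene(1.0, 0.01) gives "up", whereas a gene with a large fold change but an adjusted p-value of exactly 0.05 is labeled "ns" because the p-value criterion is a strict inequality.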
Weighted gene co-expression network analysis (WGCNA)
A scale-free co-expression network was established using the WGCNA package through the following procedures:
Sample quality control: identifying outliers through hierarchical clustering and excluding them;
Soft thresholding: determining the optimal power to ensure the scale-free topology fit index (R²) > 0.9;
Detecting modules: performing hierarchical clustering based on the topological overlap matrix (TOM) using the dynamic tree cut method, with minModuleSize set to 30 and mergeCutHeight set to 0.25;
Associating modules with traits: evaluating the correlation between module eigengenes and phenotypes by Pearson’s correlation coefficient, considering correlations with |r| > 0.5 and p < 0.05 as meaningful;
Measuring intramodular connectivity: identifying hub genes with module eigengene-based connectivity (kME) > 0.8.
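The numeric decision rules in the steps above can be sketched as three small filters. This is only an illustration of the thresholds, not of WGCNA itself, which computes the fit indices, eigengene correlations, and kME values in R; all function names and input values here are hypothetical.

```python
def pick_soft_power(fit_by_power, min_r2=0.9):
    """Return the smallest power whose scale-free fit R^2 exceeds min_r2, or None."""
    passing = [p for p, r2 in sorted(fit_by_power.items()) if r2 > min_r2]
    return passing[0] if passing else None

def module_is_relevant(r, p, min_abs_r=0.5, max_p=0.05):
    """Module-trait rule: |r| > 0.5 and p < 0.05."""
    return abs(r) > min_abs_r and p < max_p

def hub_genes(kme_by_gene, min_kme=0.8):
    """Genes whose module eigengene-based connectivity (kME) exceeds min_kme."""
    return sorted(g for g, k in kme_by_gene.items() if k > min_kme)

# Hypothetical scale-free fit indices for candidate powers 1 to 6.
fits = {1: 0.20, 2: 0.50, 3: 0.70, 4: 0.85, 5: 0.91, 6: 0.93}
chosen_power = pick_soft_power(fits)
hubs = hub_genes({"G1": 0.92, "G2": 0.75, "G3": 0.81})
```

Choosing the smallest passing power, rather than the best-fitting one, keeps the network as densely connected as the scale-free criterion allows.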
Identification of air pollutants-associated targets in NSCLC
We intersected the NSCLC-related gene set (DEGs and WGCNA hub genes) with the genes whose expression is affected by air pollutants; the overlap was visualized using Venn diagrams.
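The intersection step amounts to plain set operations. The sketch below assumes, following the description in this paper, that DEGs and WGCNA hub genes are first pooled into one NSCLC-related set and then intersected with the pollutant targets; the gene identifiers are placeholders, not genes from the study.

```python
def candidate_targets(degs, hub_genes, pollutant_targets):
    """Pool the disease-related genes, then keep those also targeted by pollutants."""
    disease_genes = set(degs) | set(hub_genes)   # pooling removes duplicates
    return sorted(disease_genes & set(pollutant_targets))

# Placeholder gene identifiers for illustration only.
degs = ["G1", "G2", "G3", "G4"]
hubs = ["G3", "G5"]
targets = ["G2", "G5", "G9"]
overlap = candidate_targets(degs, hubs, targets)
```

Using sets makes the duplicate removal implicit, and sorting the result gives a stable order for downstream tables and Venn diagram labels.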
GO/KEGG enrichment analysis
Using R, we employed the clusterProfiler, enrichplot, and org.Hs.eg.db packages to perform Gene Ontology (GO) functional annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis on the genes shared between the air pollutant-related and NSCLC-related gene sets. These analyses characterize the potential functions of key genes in terms of biological processes (BP), molecular functions (MF), and cellular components (CC), as well as the signaling pathways in which they participate.
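Over-representation analysis of this kind is commonly based on a hypergeometric test: given a gene list of size n drawn from a background of N genes, of which K belong to a pathway, how surprising is an overlap of at least k? The sketch below computes that tail probability directly. The background size and pathway size in the example are hypothetical, and multiple-testing correction across pathways is omitted.

```python
from math import comb

def enrichment_p(k, n, K, N):
    """P(overlap >= k) under the hypergeometric null.

    k: genes in the list that belong to the pathway
    n: size of the gene list
    K: pathway genes in the background
    N: size of the background
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical example: 5 of 30 listed genes fall in a 120-gene pathway,
# against a background of 20,000 genes.
p_value = enrichment_p(5, 30, 120, 20000)
```

With these illustrative numbers the expected overlap is well below one gene, so an overlap of five yields a very small p-value.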
Machine learning-based screening of core genes
To systematically identify air pollutant-associated diagnostic markers for NSCLC, we developed a prediction framework that combines multiple machine learning algorithms. Using expression profile data from our training set, we applied 12 established machine learning algorithms (Lasso, Ridge, Elastic Net (Enet), stepwise generalized linear model (stepGLM), Support Vector Machine (SVM), glmBoost, Linear Discriminant Analysis (LDA), partial least squares regression for generalized linear models (plsRglm), Random Forest, Gradient Boosting Machine (GBM), XGBoost, and Naive Bayes) to build a total of 130 predictive models. The performance of these models was assessed based on the area under the receiver operating characteristic curve (AUC), overall accuracy, and F1-score. Subsequently, the best-performing individual models were combined using a stacking ensemble learning method. Models with high performance (AUC > 0.9) were selected, and the genes identified by these models were ranked according to their frequency of selection across models to pinpoint candidate key genes.
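The final ranking step, keeping models with AUC > 0.9 and counting how often each gene is selected, can be sketched as follows. The model names, AUC values, and gene identifiers are hypothetical placeholders, and the tie-breaking by gene name is an assumption made only to keep the output deterministic.

```python
from collections import Counter

def rank_genes(models, min_auc=0.9):
    """Rank genes by how many high-performing models selected them.

    models: iterable of (model_name, auc, selected_genes) tuples.
    Returns (gene, count) pairs, most frequently selected first.
    """
    counts = Counter()
    for _name, auc, genes in models:
        if auc > min_auc:
            counts.update(set(genes))  # count each gene once per model
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical models for illustration.
models = [
    ("model_a", 0.97, ["G1", "G2", "G3"]),
    ("model_b", 0.93, ["G1", "G3"]),
    ("model_c", 0.85, ["G2", "G4"]),   # excluded: AUC not above 0.9
    ("model_d", 0.95, ["G1", "G5"]),
]
ranking = rank_genes(models)
```

Genes chosen only by the excluded model, such as G4 here, do not appear in the ranking at all.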
Model interpretation
Given the inherent opacity of many machine learning models, we employed the SHAP (SHapley Additive exPlanations) algorithm to elucidate the feature importance within our predictive model. SHAP provides a model-agnostic approach to quantify the contribution of each input variable to the final prediction. The use of SHAP enhances the transparency and trustworthiness of our findings by providing insights beyond simple feature ranking.
Specifically, samples were labeled as control or treatment based on sample names, and the dataset was split into training (70%) and testing (30%) sets using stratified sampling. SHAP assigns each feature a SHAP value, representing its marginal contribution to the model’s output. SHAP values were computed for the test set using kernel-based SHAP explanations for the final model. Global feature importance was derived from mean absolute SHAP values across samples, and top features were identified. Local explanations were examined for representative samples via SHAP dependence and force plots.
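To illustrate what a SHAP value measures, the sketch below computes exact Shapley values by enumerating all feature subsets for a toy model. This is a swapped-in technique for illustration: kernel-based SHAP, as used in the study, approximates these quantities by sampling, and exact enumeration is only feasible for a handful of features. The toy linear model, the zero baseline, and the input values are assumptions.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature for one sample.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def mean_abs_importance(shap_rows):
    """Global importance: mean absolute SHAP value per feature across samples."""
    n = len(shap_rows[0])
    return [sum(abs(row[j]) for row in shap_rows) / len(shap_rows) for j in range(n)]

# Toy linear model standing in for a trained classifier's score.
def toy_model(v):
    return 2.0 * v[0] - 1.0 * v[1] + 0.5 * v[2]

phi = shapley_values(toy_model, [1.0, 2.0, 4.0], [0.0, 0.0, 0.0])
```

For a linear model each value reduces to the coefficient times the feature's offset from the baseline, and the values sum to the difference between the prediction and the baseline prediction.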
Immune infiltration analysis
The CIBERSORT algorithm was employed to assess the infiltration scores of immune-related cells in both NSCLC and normal tissues. To determine statistical significance, we utilized the Wilcoxon rank-sum test. Following this, we further examined the relationships between the core target genes and immune cell infiltration.
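The group comparison can be illustrated with a compact Wilcoxon rank-sum test. The sketch uses the normal approximation without continuity or tie correction, so its p-values are approximate and will differ slightly from those of statistical packages, particularly for small samples; the infiltration fractions are hypothetical.

```python
from math import erfc, sqrt

def rank_sum_test(a, b):
    """Wilcoxon rank-sum test (normal approximation, two-sided).

    Returns (U statistic for group a, approximate p-value).
    """
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1      # tied values share the mean rank
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(a), len(b)
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_a - mu) / sigma
    return u_a, erfc(abs(z) / sqrt(2))

# Hypothetical infiltration fractions of one immune cell type.
normal = [0.10, 0.20, 0.30, 0.40, 0.50]
tumor = [0.60, 0.70, 0.80, 0.90, 1.00]
u_stat, p_value = rank_sum_test(normal, tumor)
```

In this toy example every normal sample ranks below every tumor sample, so the U statistic for the normal group is zero.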
Molecular docking
We employed molecular docking to investigate the binding interactions between seven air pollutants and their target proteins. Three-dimensional protein structures were obtained from the Protein Data Bank (PDB, http://www.rcsb.org/), and structures of seven air pollutants were sourced from PubChem. These structures were converted to Mol2 format using Open Babel software. Preprocessing steps, including water removal and adding hydrogens, were performed on the protein structures prior to docking, which was carried out using AutoDock Vina. Docking results, in pdbqt file format, were converted to pdb format and visualized using PyMOL 3.2. The analysis focused on identifying protein-pollutant complexes exhibiting the highest binding affinities. This provided detailed insights into the specific interaction mechanisms and guided further investigations into the biological effects of these pollutants.
Molecular dynamics simulation
Molecular dynamics (MD) simulation was performed using the GROMACS 2022 program. Small molecule ligands were parameterized using the GAFF force field, while protein parameters were assigned using the AMBER14SB force field, and the TIP3P water model was employed for solvent molecules. The protein and ligand structures were combined to construct the complex simulation system. The simulation was conducted under isothermal-isobaric conditions with periodic boundary conditions. During the MD simulation, all bonds involving hydrogen atoms were constrained using the LINCS algorithm. The integration time step was set to 2 fs. Electrostatic interactions were calculated using the particle-mesh Ewald (PME) method, with a cutoff distance of 1.2 nm. The non-bonded interaction cutoff distance was set to 10 Å (1.0 nm), and the neighbor list was updated every 10 steps. The V-rescale thermostat was used to control the simulation temperature at 298 K, and the Berendsen barostat was employed to maintain the pressure at 1 bar. At 298 K, a 100-ps NVT equilibration was performed, followed by a 100-ps NPT equilibration. Subsequently, a 100-ns MD simulation of the complex system was performed, with trajectory frames saved every 10 ps. After the simulation was completed, VMD and PyMOL were used to analyze the trajectory, and the g_mmpbsa program was used to calculate the MM-PBSA binding free energy of the protein-ligand complex.
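The simulation lengths above translate into integration step counts through simple arithmetic, sketched below. The helper name is hypothetical, and the source does not list its input files; in a GROMACS .mdp file such counts would typically be supplied through options such as nsteps and nstxout-compressed, which is stated here as an assumption.

```python
def n_steps(duration_ps, dt_fs=2.0):
    """Number of integration steps needed to cover duration_ps at a dt_fs time step."""
    return round(duration_ps * 1000.0 / dt_fs)   # 1 ps = 1000 fs

equilibration_steps = n_steps(100)         # 100 ps, for NVT and again for NPT
production_steps = n_steps(100 * 1000)     # 100 ns expressed in ps
save_every = n_steps(10)                   # one trajectory frame per 10 ps
saved_intervals = production_steps // save_every
vdw_cutoff_nm = 10 / 10.0                  # 10 angstrom equals 1.0 nm
```

With a 2 fs time step, the 100 ns production run corresponds to 50,000,000 steps, and saving every 10 ps gives 10,000 saved intervals (one more frame if the starting configuration is also written).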
Results
Evaluation of the toxicological effects of seven air pollutants
Table 1 summarizes the molecular weights, SMILES (Simplified Molecular Input Line Entry System) structures, and carcinogenicity assessments for the seven air pollutants (CO, NO, NO₂, SO₂, BaA, BaP, and 3-MC). Carcinogenic potential was evaluated using two independent prediction platforms, ADMETlab 3.0 and ProTox-3. Under the selection criterion, a pollutant was classified as toxic if either platform predicted it to be carcinogenic. By this criterion, all seven air pollutants were classified as toxic.
Identification of genes associated with air pollutants
The molecular structures of the seven air pollutants were retrieved from the PubChem database (Fig. 2A). We then identified genes targeted by these air pollutants using the ChEMBL, PharmMapper, and SwissTargetPrediction databases. Specifically, we found 1,078 potential target genes from ChEMBL, 152 from PharmMapper, and 125 from SwissTargetPrediction (Supplementary Tables 1–3). After removing duplicate entries, we obtained a total of 1,242 target genes related to the air pollutants (Fig. 2B).
Collection of NSCLC-related genes
We combined the datasets GSE10072, GSE19188, and GSE43458 and, to mitigate batch effects, normalized the merged gene expression matrices. PCA indicated a more favorable data distribution after normalization, with more discernible clustering patterns (Fig. 3A, B, Supplementary Fig. 1A-D). Differential expression analysis identified 1,730 genes with significant alterations in NSCLC, which were visualized in a volcano plot (Fig. 3C). For WGCNA, we first determined the optimal soft-thresholding power (β) needed to maintain a scale-free network topology. A systematic assessment of power values from 1 to 20 indicated that β = 5 was the lowest value meeting the scale-free topology requirement (R² ≥ 0.9; Supplementary Fig. 2A-B). Using this parameter, we constructed a topological overlap matrix (TOM) and performed hierarchical clustering to delineate co-expression modules, which yielded six gene modules, each assigned a distinct color for visualization (Fig. 3D, Supplementary Fig. 2C). Module-trait relationship analysis revealed significant correlations (p < 0.05) between specific modules and NSCLC (Fig. 3E). To integrate these findings, we merged the DEGs from the differential expression analysis (Supplementary Table 4) with the module genes obtained from WGCNA (MEblue; Supplementary Table 5) and eliminated duplicates. This process yielded a final set of 247 genes associated with NSCLC (Fig. 3F).
Identification and enrichment analysis of air pollutants-associated gene targets in NSCLC
We intersected the genes targeted by air pollutants with those associated with NSCLC and identified 30 potential key targets that may be involved in air pollutant-triggered NSCLC pathogenesis (Fig. 4A). These 30 intersecting target genes were submitted to the STRING database for protein-protein interaction (PPI) analysis, and the resulting network was visualized in Cytoscape version 3.10.3 (Fig. 4B). This network depicts the interactions among the key targets and provides a basis for further exploration of the molecular mechanisms connecting air pollutants to NSCLC. Functional characterization was performed through KEGG and GO enrichment analyses (Fig. 4C, D, Supplementary Table 6). KEGG analysis revealed significant enrichment in the cell cycle, cellular senescence, p53 signaling, and oocyte meiosis pathways. GO analysis indicated enrichment across the biological process (BP), cellular component (CC), and molecular function (MF) categories, including mitotic cell cycle phase transition, cyclin-dependent protein kinase holoenzyme complex, and cyclin-dependent protein serine/threonine kinase regulator activity.
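Over-representation analyses of this kind are commonly based on a hypergeometric (one-sided Fisher) test. The source does not state which tool or test was used, so the following is only a sketch of that general technique; the universe size, pathway size, and overlap are hypothetical numbers, not results from Supplementary Table 6.

```python
from math import comb


def hypergeom_enrichment_p(universe, pathway_size, query_size, overlap):
    """P(X >= overlap) when query_size genes are drawn from a universe
    that contains pathway_size genes annotated to the pathway."""
    upper = min(pathway_size, query_size)
    total = comb(universe, query_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 30 query genes, 6 of which fall in a 120-gene pathway,
# against an assumed background of 20,000 genes.
p_value = hypergeom_enrichment_p(universe=20000, pathway_size=120, query_size=30, overlap=6)
print(p_value)
```

In practice the raw p-values would also be adjusted for multiple testing across all pathways, which this sketch omits.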
Development and validation of a prognostic model
Through machine learning analysis of the 30 candidate targets, we constructed 130 predictive models to identify core genes involved in NSCLC associated with air pollutants. Considering both the accuracy and the simplicity of the models’ predictions in the training and validation groups, we selected the Lasso + RF combination for feature selection, which identified seven key genes: CKS1B, GAPDH, TYMS, AURKA, CCNE1, PARP1, and MGLL (Fig. 5A, Supplementary Fig. 3, Supplementary Table 7). The diagnostic potential of these core genes was demonstrated by ROC curve analysis (AUC > 0.95; Fig. 5B, C, Supplementary Table 8), while their differential expression patterns in NSCLC tissues were visualized through a volcano plot (Fig. 5D). The nomogram (Fig. 5E) and calibration curve (Fig. 5F) further evaluated the predictive performance of the model based on these seven key genes.
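For readers unfamiliar with the AUC metric quoted above, it can be read as the probability that a randomly chosen tumor sample receives a higher model score than a randomly chosen normal sample. The sketch below computes it that way on hypothetical labels and scores; it is not the study's evaluation code, and `roc_auc` is an illustrative name.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.
    Ties between a positive and a negative count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = NSCLC, 0 = normal) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```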
SHAP interpretability analysis revealed distinct contributions of the individual genes, with CKS1B and GAPDH ranked as the most important predictors (Fig. 6A). Consistent with this, the SHAP force plot showed that CKS1B (4.01, Δ = −0.259) and GAPDH (6.1, Δ = −0.198) made the largest negative contributions, so that the prediction value (f(x) = 0.0419) fell well below the expected baseline value (E[f(x)] = 0.583) (Fig. 6B). We also observed non-linear associations, notably an inverse relationship between the expression levels of MGLL and CKS1B (Fig. 6C-I).
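SHAP values are additive: the baseline E[f(x)] plus the contributions of all features equals the prediction f(x). Assuming that property holds in the output space shown in Fig. 6B and that the model's only features are the seven core genes, the reported numbers allow a quick consistency check of how much of the shift is left to the other five genes.

```python
base_value = 0.583   # E[f(x)] reported for Fig. 6B
prediction = 0.0419  # f(x) reported for the same sample
reported = {"CKS1B": -0.259, "GAPDH": -0.198}

# Shift explained by the two reported genes.
explained = sum(reported.values())
# Shift that the remaining five genes must account for under additivity.
remaining = prediction - base_value - explained
print(round(explained, 4), round(remaining, 4))
```

Under these assumptions, CKS1B and GAPDH together account for −0.457 of the total shift of −0.5411, leaving about −0.084 for the other genes.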
Correlation analysis of core genes and immune cell infiltration in NSCLC
We applied the CIBERSORT algorithm to estimate the composition of 22 immune cell types in normal tissues and NSCLC lesions and to examine their correlations with the seven core genes. We found significant differences in the infiltration of B cells (naive and memory), plasma cells, CD4+ memory T cells (resting and activated), follicular helper T cells, regulatory T cells (Tregs), resting NK cells, monocytes, M1 macrophages, dendritic cells (resting and activated), mast cells (resting and activated), eosinophils, and neutrophils (Fig. 7A, B). We also observed specific correlations between immune cell types (Fig. 7C); for example, resting CD4+ memory T cells were negatively correlated with CD8+ T cells. Fig. 7D presents the associations between the seven core genes and the immune cell types, indicating that the expression of each core gene correlates with distinct immune cell populations.
Molecular docking
To assess the potential interactions between the seven air pollutants (CO, NO, NO₂, SO₂, BaA, BaP, and 3-MC) and the proteins encoded by the seven core genes (CKS1B, GAPDH, TYMS, AURKA, CCNE1, PARP1, and MGLL), we conducted molecular docking analyses. The results indicated that all seven core proteins could bind the pollutants spontaneously (Supplementary Fig. 4). Based on this analysis, we selected the six complexes with the most favorable binding energies to visualize their binding conformations: 3-MC with GAPDH and MGLL; BaA with AURKA, CCNE1, and PARP1; and BaP with GAPDH (Fig. 8).
Leu-203 and Ala-238 of GAPDH form alkyl and pi-alkyl hydrophobic interactions with 3-MC, while Thr-237, Pro-236, and Asn-284 engage in van der Waals (VDW) interactions with 3-MC (Fig. 8A).
His-279, Leu-194, Ala-61, Leu-215, Ile-189, Leu-251, and Val-280 of MGLL form pi-pi T-shaped, alkyl, and pi-alkyl hydrophobic interactions with 3-MC, while Ser-132 and Tyr-204 participate in VDW interactions with 3-MC (Fig. 8B).
Ala-213, Ala-160, Leu-263, Val-147, and Leu-139 of AURKA form pi-alkyl hydrophobic interactions with BaA, while Tyr-212, Asn-261, and Leu-210 form VDW interactions with BaA (Fig. 8C).
Leu-134, Phe-80, Ala-144, Ile-10, Ala-31, and Val-64 of CCNE1 form pi-sigma, pi-pi T-shaped, and pi-alkyl hydrophobic interactions with BaA, while Lys-33, Lys-89, and Glu-81 engage in VDW interactions with BaA (Fig. 8D).
Trp-589 of PARP1 forms pi-pi stacked hydrophobic interactions with BaA, while Phe-44, Phe-638, and Arg-587 participate in VDW interactions with BaA (Fig. 8E).
Leu-203 and Ala-238 of GAPDH form pi-alkyl hydrophobic interactions with BaP, while Gln-204, Pro-236, and Asn-287 form VDW interactions with BaP (Fig. 8F).
According to established guidelines in molecular docking, a binding energy below −5.0 kcal/mol indicates a strong binding affinity [15]. The poses presented in Fig. 8 show energetically favorable docking orientations for each of the air pollutant-protein complexes. Taken together, these results provide structural support for direct molecular interactions between the seven air pollutants and the core targets associated with NSCLC that were identified through our machine learning methodology.
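The −5.0 kcal/mol criterion can be applied to a table of docking scores as in the sketch below. The energies are hypothetical placeholders and not the values in Supplementary Fig. 4; `strong_binders` is an illustrative helper name.

```python
STRONG_BINDING_CUTOFF = -5.0  # kcal/mol, threshold cited in the text


def strong_binders(energies, cutoff=STRONG_BINDING_CUTOFF):
    """Return (ligand, protein) pairs with energy below the cutoff, most favorable first."""
    hits = [(pair, energy) for pair, energy in energies.items() if energy < cutoff]
    return [pair for pair, _ in sorted(hits, key=lambda item: item[1])]


# Hypothetical docking scores in kcal/mol (illustration only).
example = {
    ("3-MC", "GAPDH"): -9.1,
    ("CO", "GAPDH"): -1.8,
    ("BaA", "AURKA"): -8.4,
    ("SO2", "TYMS"): -3.0,
}
ranked = strong_binders(example)
print(ranked)
```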
Molecular dynamics simulation
The six complexes shown in Fig. 8 were subsequently subjected to molecular dynamics simulations to further explore the stability of the protein-ligand interactions. The binding free energies (ΔGbind) and their component energies under equilibrium conditions are presented in Table 2. Among these terms, ΔEele represents the electrostatic interaction between the small molecule and the protein, ΔEvdw represents the van der Waals interaction, ΔEpol corresponds to the polar solvation energy related to electrostatic interactions, and ΔEnonpol corresponds to the non-polar solvation energy associated with hydrophobic effects. ΔEMMPBSA is the sum of ΔEele, ΔEvdw, ΔEpol, and ΔEnonpol, and ΔGbind is the sum of ΔEMMPBSA and −TΔS. Because −TΔS is subject to significant uncertainty, it is often excluded when comparing binding energies, and ΔEMMPBSA is used directly as an approximation of the binding energy. The calculation of ΔEpol can also involve considerable error, so it is advisable to focus on the other energy components: ΔEele, ΔEvdw, and ΔEnonpol.
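Written out, the energy decomposition described above takes the following form. The notation simply restates the definitions in the text with explicit subscripts.

```latex
\begin{align}
\Delta E_{\mathrm{MMPBSA}} &= \Delta E_{\mathrm{ele}} + \Delta E_{\mathrm{vdw}} + \Delta E_{\mathrm{pol}} + \Delta E_{\mathrm{nonpol}} \\
\Delta G_{\mathrm{bind}} &= \Delta E_{\mathrm{MMPBSA}} - T\Delta S
\end{align}
```

When the entropic term is dropped, comparisons between complexes rest on the first line alone, that is, on $\Delta G_{\mathrm{bind}} \approx \Delta E_{\mathrm{MMPBSA}}$.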
The 3-MC-GAPDH complex (Fig. 9) had a lower ΔGbind than the 3-MC-MGLL, BaA-AURKA, BaA-PARP1, BaA-CCNE1, and BaP-GAPDH complexes (Supplementary Fig. 5–9), indicating a relatively more stable protein-ligand interaction in a physiological environment. Analyses of the root mean square deviation (RMSD), radius of gyration (Rg), root mean square fluctuation (RMSF), docking site-ligand distance, buried solvent accessible surface area (bSASA), conformational overlap, and electrostatic (ELE) and VDW interactions indicate that the small molecules bind stably to the protein binding sites. Moreover, binding fluctuations decreased over the simulation time, indicating that the complexes reached equilibrium.
In addition, we decomposed ΔEMMPBSA of the six complexes to determine the contribution of each amino acid residue to the overall binding energy, thereby identifying key amino acid residues in each protein. The residues that contribute significantly to the binding energy in each protein are shown in Fig. 9H and Supplementary Fig. 5H–9H.
Because hydrogen bonds are related to electrostatic interactions, an important force in protein-ligand binding, and can reflect their strength, we further analyzed the number of hydrogen bonds formed in the complexes. None of the small molecules formed hydrogen bonds with the proteins. Overall, VDW interactions play the major role in the six complexes, hydrophobic interactions a secondary role, and ELE interactions a supplementary role.
Evaluation of the toxicological effects of seven air pollutants
Table 1 summarizes the molecular weights, SMILES (Simplified Molecular Input Line Entry System) structures, and carcinogenicity assessments of the seven air pollutants (CO, NO, NO₂, SO₂, BaA, BaP, and 3-MC). Carcinogenic potential was predicted using two independent platforms, ADMETlab 3.0 and ProTox-3. A pollutant was classified as toxic if either platform predicted it to be carcinogenic. By this criterion, all seven air pollutants were classified as toxic.
Discussion
NSCLC poses a significant challenge because it is one of the leading causes of cancer-related mortality worldwide [2]. This disease includes a heterogeneous group of carcinomas, mainly consisting of adenocarcinoma, squamous cell carcinoma, and large cell carcinoma, which together account for about 80% of all lung cancer cases [16–18]. Although some progress has been made in the research on new therapeutic targets and small molecule inhibitors for NSCLC, such as the study of c-MYC G-quadruplexes and their inhibitors [19, 20], the prognosis for patients with NSCLC is often poor due to late-stage diagnosis and inherent resistance to conventional therapies such as chemotherapy and targeted treatments [21, 22]. This situation highlights the critical need to improve our understanding of the disease’s underlying mechanisms and risk factors.
This study aims to investigate the association between exposure to air pollution and NSCLC by applying transcriptomic data analysis, machine learning, and molecular docking. By identifying key genes linked to air pollutants and exploring their potential as diagnostic markers for NSCLC, we seek to uncover novel insights into the molecular pathways driving cancer progression related to environmental exposure. The findings presented here are expected to contribute to the development of improved diagnostic and therapeutic strategies, thereby addressing existing gaps in NSCLC management.
A comprehensive analysis of gene expression in NSCLC identified 30 target genes related to exposure to air pollutants, highlighting significant molecular alterations associated with the disease. This finding aligns with previous literature indicating that the tumor microenvironment, which is directly affected by air pollutants, can induce substantial changes in gene expression profiles and thereby contribute to cancer progression [23]. Furthermore, pathway enrichment analysis revealed critical pathways such as the cell cycle, cellular senescence, and p53 signaling, all previously implicated in NSCLC tumorigenesis [25, 26]. These enriched pathways not only provide insights into the mechanisms underlying NSCLC development but also highlight potential therapeutic targets, emphasizing the need for further preclinical and clinical studies of targeted therapies against these pathways.
The immune response analysis performed using the CIBERSORT algorithm demonstrated significant variations in immune cell infiltration within NSCLC tissues, particularly in B cells, CD4 + T cells, and macrophages. These findings suggest a pronounced role of the immune microenvironment in influencing tumor behavior and treatment response, aligning with recent studies that emphasize the impact of immune cell composition on NSCLC prognosis [27–29]. Additionally, the correlations observed between core genes and immune cell types provide a framework for understanding how environmental factors, such as air pollution, might modulate immune dynamics in NSCLC. This understanding could potentially inform future immunotherapeutic approaches.
In recent years, advances in molecular biology and immunology have transformed NSCLC diagnosis and treatment, particularly through biological markers such as circRNA, lncRNA, miRNA, metabolomic profiles, and immunotherapy predictors. CircBRIP1 shows high expression in NSCLC tissues and plasma, correlating with tumor stage, metastasis, and differentiation, and offers strong diagnostic potential [30]. LncRNA dysregulation is linked to chemotherapy resistance, highlighting its value as a non-invasive biomarker, although large-scale validation and mechanistic studies are lacking [31]. miRNAs such as miR-486, miR-7, miR-34, miR-21, and miR-224 are associated with tumor staging, prognosis, and targeted therapy [32]. Metabolomic studies have identified 46 key metabolites involved in glucose, amino acid, lipid, and nucleotide metabolism, revealing metabolic dysfunction in NSCLC [33]. PD-L1 and TMB remain key immunotherapy predictors, but limited efficacy due to immune tolerance or hyperprogression persists [34, 35]. Emerging research integrates hematological markers, gene mutations, and multi-omics data to enhance precision in immunotherapy and patient selection. Future studies should focus on comprehensive multi-omics integration with clinical data for improved accuracy and clinical applicability [35].
The machine learning-based discovery of seven core genes (CKS1B, GAPDH, TYMS, AURKA, CCNE1, PARP1, and MGLL) is a step toward identifying diagnostic biomarkers associated with air pollution exposure in NSCLC, a disease whose pathogenesis is influenced by environmental factors. This approach supports more precise and reliable selection of potential biomarkers that reflect environmental impacts on disease progression. The robustness of the machine learning models, demonstrated by AUC values exceeding 0.95 in validation datasets, highlights the potential of these genes as biomarkers of disease status. Given the influence of air pollution, clinical validation across diverse populations, with their differing genetic backgrounds and environmental exposures, is necessary to confirm the utility of these genes in clinical settings.
Lastly, the molecular docking and dynamics simulations provided structural insights into the binding interactions between identified core genes and air pollutants. The finding that all seven core genes are capable of spontaneously binding to air pollutants suggests possible molecular mechanisms through which environmental exposures may influence NSCLC development and progression. This integrative approach, combining computational methods with biological insights, highlights the significance of interdisciplinary research in unraveling the complex interplay between environmental factors and cancer biology. Overall, this study not only contributes to the understanding of NSCLC etiology but also paves the way for future research aimed at improving diagnostic and therapeutic strategies for NSCLC.
The limitations of this study warrant careful consideration. First, the absence of experimental validation for the identified core genes restricts the translational potential of our findings, emphasizing the need for future experimental confirmation in biological systems. Additionally, the relatively small sample sizes within the utilized datasets may introduce variability that influences the robustness of our results. These limitations highlight the necessity for subsequent studies to incorporate larger, more diverse populations and experimental validations to enhance the reliability of the observed associations.
NSCLC poses a significant challenge because it is one of the leading causes of cancer-related mortality worldwide [2]. This disease includes a heterogeneous group of carcinomas, mainly consisting of adenocarcinoma, squamous cell carcinoma, and large cell carcinoma, which together account for about 80% of all lung cancer cases [16–18]. Although some progress has been made in the research on new therapeutic targets and small molecule inhibitors for NSCLC, such as the study of c-MYC G-quadruplexes and their inhibitors [19, 20], the prognosis for patients with NSCLC is often poor due to late-stage diagnosis and inherent resistance to conventional therapies such as chemotherapy and targeted treatments [21, 22]. This situation highlights the critical need to improve our understanding of the disease’s underlying mechanisms and risk factors.
This study aims to investigate the association between exposure to air pollution and NSCLC by applying advanced methodologies, including transcriptomic data analysis, machine learning, and molecular docking. We focus on identifying key genes linked to air pollutants and exploring their potential as diagnostic markers for NSCLC, we seek to uncover novel insights into the molecular pathways driving cancer progression related to environmental exposure. The findings presented here are expected to contribute to the development of improved diagnostic and therapeutic strategies, thereby addressing existing gaps in NSCLC management.
A comprehensive analysis of gene expression in NSCLC identified 30 target genes related to exposure to air pollutants, highlighting significant molecular alterations associated with the disease. This finding aligns with previous literature indicating that the tumor microenvironment, which is directly affected by air pollutants, can induce substantial changes in gene expression profiles and thereby contribute to cancer progression [23]. Furthermore, pathway enrichment analysis revealed critical pathways such as the cell cycle, cellular senescence, and p53 signaling, all previously implicated in NSCLC tumorigenesis [25, 26]. These enriched pathways not only provide insights into the mechanisms underlying NSCLC development but also highlight potential therapeutic targets, emphasizing the need for further preclinical and clinical studies of targeted therapies against these pathways.
The immune response analysis performed using the CIBERSORT algorithm demonstrated significant variations in immune cell infiltration within NSCLC tissues, particularly in B cells, CD4 + T cells, and macrophages. These findings suggest a pronounced role of the immune microenvironment in influencing tumor behavior and treatment response, aligning with recent studies that emphasize the impact of immune cell composition on NSCLC prognosis [27–29]. Additionally, the correlations observed between core genes and immune cell types provide a framework for understanding how environmental factors, such as air pollution, might modulate immune dynamics in NSCLC. This understanding could potentially inform future immunotherapeutic approaches.
In recent years, advances in molecular biology and immunology have transformed NSCLC diagnosis and treatment, particularly through biological markers like circRNA, lncRNA, miRNA, metabolomic profiles, and immunotherapy predictors. CircBRIP1 shows high expression in NSCLC tissues and plasma, correlating with tumor stage, metastasis, and differentiation, offering strong diagnostic potential [30]. LncRNA dysregulation links to chemotherapy resistance, highlighting its value as a non-invasive biomarker-though large-scale validation and mechanistic studies are lacking [31]. miRNAs such as miR-486, miR-7, and miR-34 and miR-21, miR-224 are associated with tumor staging, prognosis, and targeted therapy [32]. Metabolomic studies identify 46 key metabolites involved in glucose, amino acid, lipid, and nucleotide metabolism, revealing NSCLC metabolic dysfunction [33]. PD-L1 and TMB remain key immunotherapy predictors, but limited efficacy due to immune tolerance or hyperprogression persists [34, 35]. Emerging research integrates hematological markers, gene mutations, and multi-omics data to enhance precision in immunotherapy and patient selection. Future studies should focus on comprehensive, multi-omics integration with clinical data for improved accuracy and clinical applicability [35].
The machine learning-based discovery of seven core genes (CKS1B, GAPDH, TYMS, AURKA, CCNE1, PARP1, and MGLL) marks a significant advancement in identifying diagnostic biomarkers associated with air pollution exposure in NSCLC, a disease whose pathogenesis is influenced by environmental factors. This approach improves both the precision and reliability of selecting potential biomarkers that reflect environmental impacts on disease progression. The robustness of the machine learning models, demonstrated by AUC values exceeding 0.95 in validation datasets, highlights the potential of these genes as biomarkers of disease status. Given the influence of air pollution, these genes require clinical validation across diverse populations, taking into account differences in genetic backgrounds and environmental exposures, to confirm their utility in clinical settings.
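For readers less familiar with the AUC metric cited here, the following is a minimal sketch of how an area under the ROC curve can be computed from validation-set labels and model scores. It uses the rank-based definition (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one). The labels, scores, and the function name `roc_auc` are hypothetical and are not the study's data or pipeline.

```python
from typing import Sequence

def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical validation set: 1 = tumor, 0 = normal, with classifier scores.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.92, 0.81, 0.77, 0.40, 0.45, 0.30, 0.22, 0.10]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of samples, which is acceptable for illustration. For large cohorts a sort-based implementation or an established library routine would be the usual choice.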
Lastly, the molecular docking and dynamics simulations provided structural insights into the binding interactions between the proteins encoded by the identified core genes and air pollutants. The finding that the products of all seven core genes can bind air pollutants spontaneously suggests possible molecular mechanisms through which environmental exposures may influence NSCLC development and progression. This integrative approach, combining computational methods with biological insights, highlights the significance of interdisciplinary research in unraveling the complex interplay between environmental factors and cancer biology. Overall, this study not only contributes to the understanding of NSCLC etiology but also paves the way for future research aimed at improving diagnostic and therapeutic strategies for NSCLC.
The limitations of this study warrant careful consideration. First, the absence of experimental validation for the identified core genes restricts the translational potential of our findings, emphasizing the need for future experimental confirmation in biological systems. Additionally, the relatively small sample sizes within the utilized datasets may introduce variability that influences the robustness of our results. These limitations highlight the necessity for subsequent studies to incorporate larger, more diverse populations and experimental validations to enhance the reliability of the observed associations.
Conclusions
This study presents an integrative framework that links exposure to air pollutants with NSCLC through network toxicology and machine learning. By synthesizing environmental epidemiology signals with mechanistic network analyses, we identify pollutant-biomarker-pathway associations. These associations provide strong mechanistic hypotheses for NSCLC risk and progression. Furthermore, the approach demonstrates robustness across datasets and provides translational insights for biomarker discovery and targeted interventions. While computational evidence supports well-founded mechanisms, causal inferences require further in vitro, in vivo, and clinical experimental validation. Overall, our integrative strategy advances understanding of air pollution-NSCLC biology and highlights avenues for precision prevention and therapeutics.
Supplementary Information
Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original article when quoting.