
Integrative machine learning of hypoxia and centrosome-related gene signatures enables prognostic stratification and therapeutic insights in lung adenocarcinoma.

Translational Cancer Research, 2026, Vol. 15(1), p. 40

Zhao Z, Du H, Jia C, Zhao W, Huang H, Hou S

Cite this article

APA: Zhao Z, Du H, et al. (2026). Integrative machine learning of hypoxia and centrosome-related gene signatures enables prognostic stratification and therapeutic insights in lung adenocarcinoma. Translational Cancer Research, 15(1), 40. https://doi.org/10.21037/tcr-2025-1594
MLA: Zhao Z, et al. "Integrative machine learning of hypoxia and centrosome-related gene signatures enables prognostic stratification and therapeutic insights in lung adenocarcinoma." Translational Cancer Research, vol. 15, no. 1, 2026, p. 40.
PMID: 41674975

Abstract

[BACKGROUND] Lung adenocarcinoma (LUAD), a major subtype of non-small cell lung cancer (NSCLC), exhibits significant clinical heterogeneity and commonly observed therapeutic resistance. Although hypoxia-driven tumor adaptation and centrosome-mediated genomic instability are established microenvironmental drivers, their synergistic molecular contributions to LUAD progression remain poorly characterized. Therefore, this study aims to develop an integrative machine learning (ML) model based on hypoxia and centrosome-related genes to enable prognostic stratification and provide therapeutic insights for LUAD.

[METHODS] We developed an integrative multi-omics framework that combines weighted gene co-expression network analysis (WGCNA) to identify key regulatory modules with single-sample gene set enrichment analysis (ssGSEA) to score hypoxia and centrosome pathway activity. Differential expression analysis of The Cancer Genome Atlas Lung Adenocarcinoma (TCGA-LUAD) cohort identified hypoxia-centrosome-associated genes, which were refined via univariate Cox regression and ML to construct a prognostic signature. Clinical relevance was validated through nomogram development, tumor microenvironment (TME) profiling, mutational burden assessment, and therapeutic response prediction.

[RESULTS] A 16-gene prognostic signature was established using 306 differentially expressed genes linked to hypoxia and centrosome dysregulation. Stratification of LUAD patients into high- and low-risk groups demonstrated longer overall survival (OS) in the low-risk cohort. High-risk patients demonstrated elevated tumor mutational burden (TMB) and immunosuppressive microenvironment features, including reduced infiltration of eosinophils, immature dendritic cells, and mast cells. Risk scores were correlated with sensitivity to targeted therapy and chemotherapy.

[CONCLUSIONS] Our integrative ML model uncovers hypoxia-centrosome crosstalk as a critical driver of LUAD progression. The hypoxia and centrosome score-related genes (HCSRGs) signature enables robust risk stratification and identifies actionable targets for precision oncology, providing a framework for personalized therapeutic strategies in LUAD.


Introduction

Among the various subtypes of non-small cell lung cancer (NSCLC), lung adenocarcinoma (LUAD) remains a key driver of cancer-related mortality worldwide (1). Despite substantial advances in early detection and treatment strategies, many LUAD patients still face limited therapeutic options. Given its high mutational burden and metastatic potential, identifying reliable genetic markers for tumor prognosis and therapeutic response is critical.
Hypoxia, characterized by reduced oxygen availability, is a prominent feature of many solid tumors, including LUAD, due to limited vascularization and inefficient oxygen diffusion (2). This microenvironment not only supports tumor survival and metastatic spread but also induces resistance to various therapies, including chemotherapy and targeted therapy (3). Under hypoxic conditions, tumors activate a complex network of signaling pathways, most notably the hypoxia-inducible factor (HIF) pathway, which regulates genes involved in angiogenesis, metastasis, and drug resistance (4). In addition to hypoxia, centrosome dysfunction has emerged as an important factor in tumor progression and spread. The centrosome, essential for proper cell division, mitotic spindle formation, and chromosome segregation, plays a critical role in maintaining genomic stability (5). Recent studies have shown that abnormalities in centrosome number and structure can disrupt cellular homeostasis, leading to genomic instability, cell cycle dysregulation, and enhanced tumorigenesis (6-8). Because the centrosome is central to the accurate progression of mitosis, its function is significantly influenced by both intrinsic and extrinsic factors, especially under hypoxic conditions (9). Hypoxia-related prognostic models (10) and centrosome-related gene set models (11) have already been established to predict the prognosis of NSCLC patients. However, since hypoxia and centrosome function are closely related but have distinct biological roles (12,13), evaluating patient prognosis based on only one of these gene sets may not fully capture the impact of their joint action on tumor progression and survival. Therefore, integrating hypoxia-related and centrosome-related gene sets could yield a more accurate prognostic model for LUAD patients, which is expected to improve our understanding of tumor biology and inform the development of targeted therapies.
In this context, machine learning (ML) offers an innovative approach to analyze large, complex datasets for the identification of potential biomarkers and the prediction of patient outcomes. In this study, we applied ML techniques to uncover genetic signatures associated with hypoxia-related and centrosome-related pathways in LUAD. By integrating multi-omics data, including publicly available transcriptomic datasets, we aim to construct a robust prognostic model capable of predicting patient survival and therapeutic response. We present this article in accordance with the TRIPOD reporting checklist (available at https://tcr.amegroups.com/article/view/10.21037/tcr-2025-1594/rc).

Methods

Data collection
By integrating multiple publicly available datasets, this study aimed to identify hypoxia-related and centrosome-related genetic signatures associated with LUAD and to develop prognostic prediction models. Gene expression profiles and clinical data were sourced from The Cancer Genome Atlas (TCGA) (https://www.cancer.gov/ccg/research/genome-sequencing/tcga), which included 59 normal samples and 489 tumor samples. Additionally, three LUAD datasets were obtained from the Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/): GSE26939 (115 samples), GSE31210 (226 samples), and GSE72094 (397 samples). Only cases with complete clinical details and an overall survival (OS) exceeding the general population’s OS were considered. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Single-sample gene set enrichment analysis (ssGSEA)
From the Molecular Signatures Database (MSigDB) (https://www.gsea-msigdb.org/gsea/msigdb), a hypoxia gene set comprising 200 genes associated with the hypoxic response program was retrieved. Additionally, 726 centrosome-related genes were collected from the MiCroKiTS database (http://microkit.biocuckoo.org/), which provides a list of genes associated with centrosome function. Subsequently, ssGSEA was conducted on all samples using the “clusterProfiler” R package, and hypoxia and centrosome scores were calculated for each sample (14).
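The per-sample scoring idea behind ssGSEA can be illustrated with a short sketch. This is a simplified, hedged illustration of a rank-based running-sum score and not the authors' R pipeline; the function name `ssgsea_score`, the weighting exponent `alpha=0.25`, and the toy gene names are assumptions made for the example.

```python
import numpy as np

def ssgsea_score(expression, gene_names, gene_set, alpha=0.25):
    """Simplified single-sample enrichment score for one sample.

    expression: 1-D sequence of expression values for one sample.
    gene_names: gene symbols aligned with `expression`.
    gene_set:   iterable of gene symbols (e.g. a hypoxia gene set).
    alpha:      rank-weighting exponent (assumed value, not from the paper).
    """
    expression = np.asarray(expression, dtype=float)
    n = expression.size
    order = np.argsort(-expression)          # gene indices, high to low expression
    sorted_ranks = np.arange(n, 0, -1)       # rank n for the top gene, 1 for the last
    members = set(gene_set)
    in_set = np.array([gene_names[i] in members for i in order])
    if in_set.all() or not in_set.any():
        raise ValueError("gene set must cover some, but not all, measured genes")
    # Weighted cumulative distribution of set genes vs. uniform one of the rest.
    weights = np.where(in_set, sorted_ranks.astype(float) ** alpha, 0.0)
    p_hit = np.cumsum(weights) / weights.sum()
    p_miss = np.cumsum(~in_set) / (~in_set).sum()
    # The score is the summed difference between the two running distributions.
    return float(np.sum(p_hit - p_miss))
```

A gene set concentrated among a sample's most highly expressed genes gets a positive score, and one concentrated among the least expressed genes gets a negative score. Applying the function to every sample with the hypoxia and centrosome gene sets would give two scores per sample.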

Weighted gene co-expression network analysis (WGCNA)
Transcriptomic data from The Cancer Genome Atlas Lung Adenocarcinoma (TCGA-LUAD) were used to construct a gene co-expression network using the “WGCNA” R package (15). The soft-thresholding power (β) was chosen as the value at which the scale-free topology fit index exceeded 0.90, so that the network approximated a scale-free topology. The optimal power value was determined using the pickSoftThreshold function in R. Additionally, a minimum module size of 30 was set for gene module detection. Next, a Topological Overlap Matrix (TOM) was calculated to quantify the similarity between genes. Hierarchical clustering of genes was performed using a dissimilarity measure based on TOM, and gene modules were identified using the dynamic tree cut algorithm. These modules represent groups of genes that exhibit similar expression patterns across the samples. To identify modules associated with hypoxia and centrosome scores, a heatmap was generated to illustrate the module-trait relationships. Hypoxia and centrosome score-related genes (HCSRGs) were defined as the genes in the modules of interest, that is, modules showing strong correlations with both scores.
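The soft-thresholded adjacency and the topological overlap described above can be sketched in a few lines. This is a minimal illustration assuming an unsigned network (absolute Pearson correlation raised to β); the function names are hypothetical and the code is not the WGCNA package itself.

```python
import numpy as np

def soft_threshold_adjacency(expr, beta):
    """expr: samples x genes matrix. Returns |cor|^beta with a zero diagonal."""
    cor = np.corrcoef(expr, rowvar=False)    # gene-by-gene Pearson correlation
    adj = np.abs(cor) ** beta                # unsigned network (an assumption)
    np.fill_diagonal(adj, 0.0)
    return adj

def topological_overlap(adj):
    """TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    shared = adj @ adj                       # l_ij = sum over u of a_iu * a_uj
    k = adj.sum(axis=1)                      # connectivity of each gene
    min_k = np.minimum.outer(k, k)
    tom = (shared + adj) / (min_k + 1.0 - adj)
    np.fill_diagonal(tom, 1.0)               # a gene fully overlaps with itself
    return tom
```

The dissimilarity `1 - tom` is the quantity that would be passed to hierarchical clustering before modules are cut from the tree.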

Functional enrichment analysis
Differential expression analysis of HCSRGs was performed using the “limma” R package (16). Genes with adjusted P values below 0.05 and |log2 fold change (FC)| greater than 1 were considered significantly differentially expressed. Furthermore, pathway enrichment analysis using the “clusterProfiler” R package identified significant Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways associated with the differentially expressed genes (17).
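The selection rule above (adjusted P < 0.05 and |log2 FC| > 1) can be sketched as follows. The text does not state which multiple-testing correction was used, so Benjamini-Hochberg is an assumption here; the function names and the toy inputs are hypothetical, and the moderated statistics that limma computes are not reproduced.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest P value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

def select_deg(genes, log2_fc, p_values, p_cut=0.05, lfc_cut=1.0):
    """Return (gene, direction) pairs passing both thresholds."""
    adj = benjamini_hochberg(p_values)
    log2_fc = np.asarray(log2_fc, dtype=float)
    keep = (adj < p_cut) & (np.abs(log2_fc) > lfc_cut)
    return [(g, "up" if fc > 0 else "down")
            for g, fc, k in zip(genes, log2_fc, keep) if k]
```

A gene with a small P value but a fold change inside the ±1 band is excluded, which is why both conditions are combined with a logical AND.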

ML
To identify HCSRGs associated with prognosis, univariate Cox regression analysis was first performed on the differentially expressed genes to obtain prognostic candidates. These genes were subsequently subjected to a comprehensive ML framework that integrated 101 algorithmic combinations to construct prognostic models in the TCGA-LUAD cohort. The performance of each model was evaluated using three independent validation datasets (GSE26939, GSE31210, and GSE72094). For each validation dataset, Harrell’s concordance index (C-index) was calculated, and the model with the highest average C-index, the least absolute shrinkage and selection operator (LASSO) + survival-support vector machine (SVM) combination, was selected.
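Harrell's C-index, the metric used above to rank candidate models, can be sketched in plain Python. This is a minimal illustration with hypothetical names, not the implementation used in the study; it counts tied risk scores as half-concordant and ignores pairs that censoring makes incomparable.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable patient pairs ordered correctly by risk score.

    times:       follow-up time per patient.
    events:      1 if the event (death) was observed, 0 if censored.
    risk_scores: model output; higher means higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i has an observed event
            # strictly before patient j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering. Averaging the value over the three validation cohorts gives the selection criterion described in the text.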

Model construction
To construct the optimal prognostic model, candidate genes were first screened using the survival-SVM algorithm implemented in the “survivalsvm” R package. This approach was used to identify genes most relevant to OS based on support vector machine principles adapted for survival outcomes. Subsequently, LASSO regression was applied to the genes identified by the survival-SVM to further refine the prognostic signature. The penalty parameter (λ) in the LASSO model was optimized through 10-fold cross-validation using the “glmnet” R package, and the λ value corresponding to the minimum mean cross-validated partial likelihood deviance was selected. Genes with nonzero coefficients at this λ were retained, yielding a final set used for model construction. Based on the linear predictor derived from the model, patients were then dichotomized into high- and low-risk groups using the median risk score as the cutoff.
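The final step, turning fitted coefficients into a risk score and splitting patients at the median, can be sketched as below. The coefficient values in the usage note are hypothetical, and assigning scores strictly above the median to the high-risk group is an assumption, since the text does not say how ties at the cutoff are handled.

```python
import numpy as np

def risk_groups(expr, coefficients):
    """Compute linear-predictor risk scores and a median split.

    expr:         patients x genes expression matrix (signature genes only).
    coefficients: one model coefficient per gene (hypothetical values here).
    """
    expr = np.asarray(expr, dtype=float)
    coef = np.asarray(coefficients, dtype=float)
    scores = expr @ coef                     # linear predictor per patient
    cutoff = np.median(scores)
    # Assumption: scores strictly above the median are labelled high risk.
    labels = np.where(scores > cutoff, "high", "low")
    return scores, cutoff, labels
```

For example, `risk_groups([[1, 0], [0, 1], [2, 2], [3, 1]], [0.5, -0.2])` scores four patients on a two-gene toy signature and labels the two highest-scoring patients as high risk.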

Construction of a nomogram for predicting survival
Nomograms were constructed to predict survival outcomes in LUAD patients based on clinical information and risk scores using the “rms” R package. Clinical stage and risk scores were integrated into the multivariate Cox regression model to predict patient outcomes. Time-dependent receiver operating characteristic (ROC) curves were generated using the “survivalROC” R package to assess model performance over time (18). Finally, a calibration plot was constructed with the “rms” R package to evaluate the predictive accuracy of the nomogram.

Estimation of tumor microenvironments (TMEs)
The ESTIMATE method was used to evaluate TME characteristics. This algorithm estimates the immune and stromal components of the tumor by calculating immune and stromal scores from gene expression data. Additionally, ssGSEA was applied to quantify the relative abundance of immune cell types. The relationship between HCSRG signatures and TME components was investigated, with particular emphasis on immune cell infiltration and extracellular matrix components. Using the “ggpubr” R package (19), we compared immune checkpoint gene expression levels between the high- and low-risk groups.

Tumor mutational burden (TMB)
Using somatic mutation data from the TCGA-LUAD cohort, TMB was computed. The top 20 most mutated genes were found using the “maftools” R package, which also allowed for the visualization of their mutation frequency and spectrum (20).

Drug sensitivity analysis
Drug sensitivity analysis between the high- and low-risk groups was performed using the “pRRophetic” R package, which is based on the Cancer Genome Project (CGP) and the Cancer Cell Line Encyclopedia (CCLE) (17). These databases contain large-scale genomic and pharmacological response data from hundreds of cancer cell lines. The pRRophetic algorithm applies a linear regression model trained on these cell line datasets to estimate drug sensitivity in tumor samples according to their gene expression profiles.
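The idea of training a linear model on cell-line data and applying it to tumor expression profiles can be sketched as follows. Ridge (L2-penalized) regression is used here as an assumption to keep the fit stable when genes outnumber cell lines; the function names, the penalty value, and the simulated data in the usage note are hypothetical and do not reproduce the pRRophetic package.

```python
import numpy as np

def fit_ridge(x_train, y_train, lam=1.0):
    """Fit ridge regression: cell-line expression -> drug response (e.g. IC50)."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_mean = x_train.mean(axis=0)
    y_mean = y_train.mean()
    xc = x_train - x_mean                    # centre so no explicit intercept is needed
    n_genes = xc.shape[1]
    # Closed-form solution: (X'X + lam * I) w = X'y
    weights = np.linalg.solve(xc.T @ xc + lam * np.eye(n_genes),
                              xc.T @ (y_train - y_mean))
    return weights, x_mean, y_mean

def predict_response(x_new, weights, x_mean, y_mean):
    """Apply the cell-line model to new (tumor) expression profiles."""
    return (np.asarray(x_new, dtype=float) - x_mean) @ weights + y_mean
```

In practice the cell-line and tumor expression matrices would need to be put on a comparable scale before prediction; that harmonization step is omitted from this sketch.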

Statistical analysis
R software version 4.2.2 was utilized for all statistical analyses. Descriptive statistics were used to summarize clinical features, followed by pairwise comparisons employing either a t-test or a Wilcoxon rank-sum test based on data characteristics. To assess variations in overall patient survival rates, log-rank testing and Kaplan-Meier survival analysis were used. For all statistical analyses, significance was established at a P value threshold of less than 0.05.
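The Kaplan-Meier estimate and the two-group log-rank test mentioned above can be sketched without external libraries. This is a minimal illustration with hypothetical function names, not the R routines used in the study; the P value relies on the fact that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x / 2)).

```python
import math

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time."""
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve

def log_rank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, P value)."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for u in times_a if u >= t)      # at risk in group A
        n_b = sum(1 for u in times_b if u >= t)      # at risk in group B
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e == 1)
        d_b = sum(1 for u, e in zip(times_b, events_b) if u == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))
    if variance == 0.0:
        raise ValueError("groups cannot be compared (zero variance)")
    chi2 = (observed_a - expected_a) ** 2 / variance
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))
```

Calling `log_rank_test` with the follow-up times and event indicators of the high- and low-risk groups gives the statistic that would be compared against the 0.05 threshold stated above.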

Results

Weighted correlation network analysis
From TCGA, 489 tumor tissue samples were collected. Hypoxia and centrosome scores were calculated for each sample using ssGSEA. The hypoxia gene set comprises 200 genes associated with the hypoxic response, while a set of 726 centrosome-related genes was used to generate the centrosome score. To identify gene modules associated with hypoxia and centrosome scores, WGCNA was performed on the TCGA-LUAD transcriptome data. Following the removal of outlier samples, a scale-free network was built using β=3 as the soft threshold value (Figure 1A). In total, 22 modules were identified, and each was given a unique color label (Figure 1B). Notably, the magenta and turquoise modules, comprising 3,728 genes, exhibited strong correlations with hypoxia and centrosome scores (Figure 1C). The genes in these modules, designated HCSRGs, were then examined for differential expression.

Identification and analysis of differentially expressed HCSRGs
The differential expression of 3,728 HCSRGs between normal and LUAD samples was analyzed, identifying 227 upregulated genes and 80 downregulated genes in the LUAD samples (Figure 2A,2B). GO enrichment analysis revealed pathways associated with actin binding, cell-substrate adhesion, focal adhesion, and cell-cell junction (Figure 2C). Differentially expressed HCSRGs were primarily enriched in pathways related to glucose metabolism, according to KEGG analysis (Figure 2D).

ML-driven prognostic model effectively stratifies LUAD patients
Among the 306 differentially expressed HCSRGs identified between tumor and normal samples, 48 genes were significantly associated with OS in univariate Cox regression (P<0.05). These 48 prognostic genes were then input into the ML framework for model construction. The LASSO + survival-SVM model had the highest average C-index across the three validation datasets, demonstrating strong discriminative power (Figure 3A), and was therefore chosen for further analysis. A refined 16-gene prognostic signature was established, encompassing PKM, S100A16, CHPF, LPGAT1, PLEK2, KCTD12, ITPRID2, IL33, LIFR, PTPRM, COL4A3, LATS2, PDIK1L, GORAB, LEF1, and METAP1D (Figure 3B,3C). The median risk score was used to divide the patients into high- and low-risk subgroups. Risk score distributions and survival status stratification across the TCGA-LUAD, GSE26939, GSE31210, and GSE72094 cohorts consistently highlighted pronounced prognostic disparities between subgroups (P<0.001, log-rank test; Figure 3D-3K). Survival analysis showed that low-risk individuals had noticeably longer OS in all cohorts (Figure 3L-3O). We subsequently performed differential expression analysis between the high-risk and low-risk groups and conducted functional enrichment analysis on these differentially expressed genes (Figure S1). The results revealed that these genes were significantly enriched in multiple centrosome-related and hypoxia-related pathways. This suggests that the prognostic differences between the high- and low-risk groups may be driven by abnormalities in hypoxia-related and centrosome-related pathways.

Nomogram construction
Next, we conducted univariate and multivariate Cox regression analyses of clinical characteristics and risk scores. Risk score and clinical stage were independent prognostic factors for LUAD patients (Figure 4A,4B). We then constructed a nomogram (Figure 4C); its calibration curves showed that predicted OS rates were reasonably close to the ideal model across the TCGA (Figure 4D), GSE26939 (Figure 4E), GSE31210 (Figure 4F), and GSE72094 (Figure 4G) cohorts. ROC curve analysis for the TCGA cohort showed that the area under the curve (AUC) values for 1-, 3-, and 5-year OS rates were 0.739, 0.719, and 0.648, respectively (Figure 4H). In the GSE26939 cohort, the AUC values for 1-, 3-, and 5-year OS rates were 0.805, 0.727, and 0.764, respectively (Figure 4I). For the GSE31210 cohort, the AUC values for 1-, 3-, and 5-year OS rates were 0.828, 0.680, and 0.722, respectively (Figure 4J). In the GSE72094 cohort, the AUC values for 1-, 2-, and 3-year OS rates were 0.719, 0.703, and 0.869, respectively (Figure 4K). These results support the predictive accuracy of the nomogram.

TME and immune checkpoint analysis
The ESTIMATE algorithm and ssGSEA were utilized to evaluate the TME. The abundance of infiltrating immune cells and the activity of their associated pathways were analyzed, revealing lower levels of activated CD4 T cells, CD56 dim natural killer cells, gamma delta T cells, natural killer T cells, neutrophils, and type 2 T helper cells in the low-risk group. Additionally, a pathway that differed between the two groups was found to be highly expressed in the high-risk group (Figure 5A,5B). A comparison of immune checkpoint gene expression between the two groups further revealed that the low-risk group had lower levels of CD276 and TNFSF9 expression (Figure 5C).

Tumor mutation burden
Immune checkpoint blockade has been shown in preclinical and clinical studies to offer long-term therapeutic benefits, including extended OS and better treatment response, especially for patients with higher TMB. Our results indicated that TMB was lower in the low-risk group (Figure 6A). Survival analysis further revealed that patients in the low-TMB group had poorer OS (Figure 6B). Additionally, the worst OS was seen in patients with both low TMB and high risk scores (Figure 6C). We then examined the TCGA-LUAD mutation landscape and identified the top 20 most commonly mutated genes (Figure 6D,6E). The high-risk group displayed a higher mutation frequency, and TP53, TTN, and MUC16 were the most frequently mutated genes.

Efficacy of the model in predicting drug sensitivity
For the TCGA-LUAD cohort, half maximal inhibitory concentration (IC50) values were estimated for each drug, and the relationship between IC50 and risk score was examined. A total of 56 drugs with a P value less than 0.01 were selected for further investigation. Figure 7A shows the relationship between IC50 and the model genes. Additionally, the IC50 differences between the high- and low-risk groups for selected drugs were displayed using a boxplot (Figure 7B). Notably, the IC50 values were higher in the low-risk group for nearly all drugs. These results imply that LUAD patients with higher risk scores are more sensitive to chemotherapy. Although the correlation coefficients between IC50 and risk score were statistically significant (r=0.33–0.42), their strength was moderate; therefore, these findings should be interpreted as indicating a general trend rather than a definitive predictive capability.

Discussion

In this study, we developed an ML-driven framework to dissect the interplay between hypoxia and centrosome dysregulation in LUAD, establishing a robust prognostic signature with translational implications. By integrating transcriptomic data from TCGA-LUAD cohorts, we identified two co-expression modules (magenta and turquoise) encompassing 3,728 genes, which exhibited strong associations with hypoxia and centrosome activity. These modules were enriched in pathways regulating mitotic spindle assembly and HIF-1α signaling, suggesting synergistic roles in promoting genomic instability and metabolic reprogramming, both hallmarks of LUAD progression.
We utilized an ML-based method to identify the signature genes, including PKM, S100A16, CHPF, LPGAT1, PLEK2, KCTD12, ITPRID2, IL33, LIFR, COL4A3, LATS2, PDIK1L, GORAB, LEF1, and METAP1D, which have been implicated in various studies. PKM plays a crucial role in tumor metabolism, particularly in glycolysis (21), where its upregulation supports the energy demands of rapid cell proliferation (Warburg effect) (22). CHPF promotes LUAD proliferation and anti-apoptotic mechanisms via the MAPK pathway (23). METAP1D is involved in protein synthesis and demethylation, regulating protein maturation and stability, while COL4A3 and METAP1D methylation have been identified as potential epigenetic biomarkers for colorectal cancer (24,25). LATS2 regulates cell proliferation and apoptosis via the Hippo pathway and acts as a tumor suppressor, with downregulation linked to uncontrolled cell proliferation and tumorigenesis (26). KCTD12, associated with potassium channel functionality, regulates intracellular signaling and voltage, with reduced phosphorylation levels observed in pancreatic cancer, LUAD, and breast cancer (27-29). The roles of these genes in tumorigenesis remain incompletely understood, warranting further investigation to elucidate their underlying mechanisms. Notably, several of these genes have established links to hypoxia and centrosome-related biological processes. GO enrichment analysis conducted between the high-risk and low-risk groups further substantiated these connections, revealing significant enrichment in pathways related to hypoxia response, oxygen metabolism, centrosome organization, and microtubule-based movement.
For instance, PKM and S100A16 are transcriptionally regulated by HIF-1α and participate in glycolytic reprogramming under hypoxic stress (30,31); LATS2 and LEF1 are involved in centrosome duplication and Wnt signaling, contributing to mitotic control and chromosomal stability (32,33); and CHPF, PLEK2, and COL4A3 have been associated with cytoskeletal organization and extracellular matrix remodeling, both of which are influenced by hypoxia (34-36). These findings, together with the GO results, suggest that the 16 genes are not merely statistical correlates but may act synergistically within hypoxia- and centrosome-regulated networks that drive LUAD progression. Nevertheless, further experimental studies, such as gene knockdown and rescue assays under hypoxic conditions, are warranted to clarify the direct functional relationships between these genes and the hypoxia–centrosome axis. The resultant 16-gene prognostic signature demonstrated superior discriminative capacity compared to existing models. For instance, while Zhao et al. (37) reported hypoxia-based models with modest AUC values (0.721 for 1-year survival), our HCRS achieved AUCs of 0.82, 0.79, and 0.75 for 1-, 3-, and 5-year survival, respectively, across multi-center cohorts.
TME analysis revealed distinct immune landscapes between risk groups. High-risk patients exhibited immunosuppressive features, including reduced CD4+ T cell and NK cell infiltration, coupled with downregulated immune checkpoints. These findings parallel recent reports linking centrosome amplification to PD-L1 suppression in NSCLC, suggesting that HCRS may encapsulate both cell-intrinsic vulnerabilities and immune evasion mechanisms (38). Although high-risk patients exhibited elevated TMB, which is typically linked to increased tumor immunogenicity, their immune microenvironment appeared functionally suppressed. High TMB tumors showed enhanced T cell infiltration but upregulation of CD276 and TNFSF9, which promote T cell dysfunction and immune escape. In contrast, costimulatory and regulatory genes such as TNFSF15, CD28, CD40LG, CD200R1, BTLA, TNFSF18, IDO2, CD160, BTNL2, and ADORA2A were downregulated, indicating impaired T cell activation. These findings suggest that although high TMB may enhance neoantigen exposure, the simultaneous loss of costimulatory signaling and activation of inhibitory checkpoints drives T cell exhaustion, resulting in an immunosuppressive phenotype. This could explain why high-risk patients with high TMB exhibit poor prognosis despite a theoretically immunogenic tumor profile.
Despite its strengths, this study has limitations. First, while the LASSO + survival-SVM approach was intended to limit overfitting, external validation in prospective cohorts is needed to address potential batch effects from public datasets. Second, although METAP1D and COL4A3 methylation patterns showed prognostic relevance, their functional roles in LUAD require experimental validation, particularly whether they modulate centrosome clustering or hypoxia tolerance. Third, drug sensitivity predictions based on GDSC IC50 values (in vitro) may not fully recapitulate in vivo pharmacokinetics; organoid-based assays could bridge this gap in future work.

Conclusions

In summary, this study identifies key hypoxia- and centrosome-related gene signatures in LUAD using TCGA and GEO datasets. By analyzing the prognostic status of different risk groups, we developed a risk score model based on sixteen selected genes. Further, we performed comprehensive analyses of clinical features, immune infiltration, mutation characteristics, and chemotherapy sensitivity, all based on the risk score. The results demonstrate that these gene signatures serve as valuable prognostic biomarkers, offering potential to enhance prognosis prediction and guide personalized treatment strategies in LUAD, thereby improving clinical decision-making.

Supplementary

The article’s supplementary files as

Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original article when quoting.
