Dynamic screening initiation using 16 plasma protein biomarkers with polygenic risk and PLCOm2012: a precision prevention framework for lung cancer.
APA
Wang R, Ye Z, et al. (2025). Dynamic screening initiation using 16 plasma protein biomarkers with polygenic risk and PLCOm2012: a precision prevention framework for lung cancer. Journal of translational medicine, 24(1), 218. https://doi.org/10.1186/s12967-025-07468-1
MLA
Wang R, et al. "Dynamic screening initiation using 16 plasma protein biomarkers with polygenic risk and PLCOm2012: a precision prevention framework for lung cancer." Journal of translational medicine, vol. 24, no. 1, 2025, pp. 218.
PMID
41430273
Abstract
[BACKGROUND] Low-dose computed tomography (LDCT) screening strategies for lung cancer primarily target high-risk groups defined by smoking history and age, leading to over-screening of low-risk individuals while delaying timely detection in high-risk groups. This study aimed to develop a precision screening framework integrating plasma proteomics, a polygenic risk score (PRS), and the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial 2012 model (PLCOm2012) to enable dynamic and optimized screening strategies.
[METHODS] A multicenter, multimodal study was conducted across two prospective cohorts. Plasma protein profiling identified 16 biomarkers, which were combined with PRS and PLCOm2012 to develop a prediction model. Protein score, PRS, and PLCOm2012 score were used to develop a combined risk score (CRS). Risk advancement period (RAP) analysis quantified personalized screening initiation ages based on 10-year cumulative risk thresholds.
[RESULTS] The CRS demonstrated superior predictive accuracy, with 3-, 5-, and 10-year AUCs of 0.85 (95% CI: 0.80–0.93), 0.88 (95% CI: 0.83–0.92), and 0.85 (95% CI: 0.81–0.87), respectively. High-risk individuals had an HR of 3.18 (95% CI: 2.34–4.31) and reached the risk threshold 11.71 years earlier than medium-risk individuals (RAP: −11.71, 95% CI: −15.58 to −6.16), while low-risk individuals could delay screening by 11.78 years (RAP: 11.78, 95% CI: 6.95 to 15.08). Risk stratification revealed distinct differences in cumulative lung cancer incidence across groups. Personalized screening initiation ages suggested that high-risk individuals should begin screening before age 40, while low-risk individuals could postpone screening until age 55 or later.
[CONCLUSION] This study developed a precision lung cancer screening framework by integrating plasma proteomics, genetic profiling, and clinical data. This approach enhanced early detection in high-risk individuals while safely delaying screening for low-risk groups.
[TRIAL REGISTRATION] Clinical trial number: NCT06422637.
[SUPPLEMENTARY INFORMATION] The online version contains supplementary material available at 10.1186/s12967-025-07468-1.
Background
Lung cancer remains the most frequently diagnosed cancer and the leading cause of cancer-related death worldwide [1]. Low-dose computed tomography (LDCT) screening has proven effective in detecting early-stage lung cancer and reducing mortality among high-risk populations, such as long-term smokers [2, 3]. However, several challenges still impede the optimal implementation and effectiveness of current LDCT screening programs. First, risk assessment relies mainly on smoking and family history, potentially missing high-risk individuals, especially those with early-onset lung cancer [4, 5]. Second, current guidelines initiate LDCT based solely on age, leading to “companion screening” of low-risk individuals, which wastes valuable resources and causes unnecessary radiation exposure [6]. This calls for precision-guided screening protocols that prioritize earlier detection for high-risk individuals while postponing screening for those at low risk. Third, false positives often lead to additional tests and interventions, adding financial burden for patients and healthcare systems, particularly in resource-limited settings [7]. Given these limitations, alternative methods for precision risk stratification and personalized screening decisions are warranted.
Emerging evidence highlights that proteins enter the peripheral circulatory system via active secretion or cellular leakage, reflecting real-time physiological and pathological processes [8]. Several studies have identified blood proteins as promising biomarkers for lung cancer risk [9–11], yet most rely on case-control designs with limited samples and follow-up, restricting both generalizability and predictive accuracy. In contrast, population-based longitudinal studies provide a stronger framework for detecting early molecular signatures of disease, enabling more precise risk stratification [12].
Genetic factors also play a crucial role in lung cancer development [13, 14]. Integrating blood protein biomarkers with genetic risk profiles could improve the accuracy of lung cancer risk prediction models, allowing for more personalized and effective screening strategies that address the limitations of current LDCT protocols.
In this study, we developed a multicenter, multimodal risk stratification framework for lung cancer screening by integrating plasma proteomics, a polygenic risk score (PRS), and the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial 2012 model (PLCOm2012). We hypothesized that this multimodal framework would achieve better predictive and clinical performance than single-modality prediction models. Furthermore, by providing clearer risk stratification and quantifying the risk advancement period (RAP), it would offer personalized guidance for lung cancer screening strategies tailored to different risk groups.
Materials & methods
Cohort recruitment and participants selection
This study employed a multi-cohort design combining a hospital-based, prospective, case-control cohort and a population-based prospective cohort to strengthen methodological rigor.
Participants in the case-control cohort were nested within the Guangzhou Medical University (GMU) cohort, a multi-center, prospective study that enrolled 2,757 individuals with diverse respiratory diseases from April 2018 to January 2021. We selected two independent sub-cohorts of the GMU cohort, the First Affiliated Hospital of Guangzhou Medical University (1STGMU) cohort and the Tianjin Key Laboratory of Clinical Multi-omics (TKLCM) cohort, yielding 188 lung cancer patients and 155 healthy controls. The healthy controls were followed up before the database lock to ensure that none had progressed to lung cancer. All case and control plasma samples were randomized and analyzed together within the same proteomic assay batches. Detailed information about cohort construction and participant exclusion can be found in our previous study [15].
The population cohort used in this study was the UK Biobank (UKBB), a prospective cohort comprising both lung cancer cases and healthy controls, recruited between 2006 and 2012 [16]. Participants were excluded if they lacked protein measurements, genotype data, or baseline information, or if they had been diagnosed with lung cancer at enrollment. This resulted in a final analytical cohort of 33,456 participants.
Outcomes definition
The primary outcome of this study was lung cancer diagnosis. In the 1STGMU/TKLCM cohort, diagnoses were confirmed through histopathological examination by clinicians and pathologists. In the UKBB cohort, lung cancer was defined using ICD-10 code C34. The follow-up period was defined as the time from participant enrollment in UKBB to the earliest occurrence of lung cancer, loss to follow-up, death, or the end of available registry follow-up.
Sample collection and protein quantification
In the 1STGMU/TKLCM cohort, 5 mL of peripheral blood was collected per participant and processed under a harmonized standard operating procedure: samples were delivered within 4 h, centrifuged at 1600 g for 15 min at 4 °C to obtain plasma, and stored at −80 °C. To mitigate masking by high-abundance proteins, the zeolite NaY nanomaterial that we reported previously was used for enrichment [17]. Dual QC was implemented: pooled-plasma controls in each batch to assess preparation consistency, and a long-lasting peptide QC injected after every 20 cohort samples to monitor MS stability. After spiking indexed retention time standards, peptides were analyzed on a Thermo UltiMate 3000 UHPLC coupled to an Orbitrap Q Exactive HF in DIA mode using a C18 column (150 µm × 30 cm, 1.9 µm, 120 Å), a 600 nL/min flow rate, a 60 min gradient (overall 3–90% B), 2.0 kV positive ESI, and a 320 °C ion transfer tube. MS1 scans covered m/z 350–1500 at a resolution of 60,000, with HCD MS2 at 30,000 and a normalized collision energy of 28. DIA data were processed with DIA-NN against the human UniProt database using the trypsin/P rule with a 1% FDR. Post-QC protein quantifications were normalized across samples using the “Aquantile” method in the limma package and log-transformed for downstream analyses. Approximately 3,900 proteins were quantified; the overall proteomic measurement, processing, and QC framework has been detailed previously [15].
In the UKBB cohort, a total of 2,923 proteins were quantified using the Olink Explore 3072 platform; details regarding sample selection, proteomic measurement, processing, and QC are provided elsewhere [18].
Identification and validation of protein biomarkers in lung cancer
Differentially expressed proteins (DEPs) were identified using the “limma” package after platform-specific preprocessing and QC [19]; proteins with an absolute log2 fold change (log2FC) greater than 1 and an adjusted P-value less than 0.05 were considered differentially expressed. For DEPs measurable in UKBB, prospective associations with incident lung cancer were tested using a multivariable Cox proportional hazards model. Proteins showing directionally concordant discovery log2FC and UKBB HR with p < 0.05 were considered validated.
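The two-stage rule described above (discovery thresholds, then directional concordance in UKBB) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the record fields (`log2fc`, `adj_p`, `hr`, `p`) and all example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProteinResult:
    name: str
    log2fc: float        # discovery-cohort log2 fold change (case vs control)
    adj_p: float         # discovery-cohort adjusted P-value
    hr: Optional[float]  # UKBB Cox hazard ratio; None if not measured in UKBB
    p: Optional[float]   # UKBB Cox P-value; None if not measured in UKBB


def is_dep(r: ProteinResult) -> bool:
    """Discovery rule: |log2FC| > 1 and adjusted P < 0.05."""
    return abs(r.log2fc) > 1 and r.adj_p < 0.05


def is_validated(r: ProteinResult) -> bool:
    """Validation rule: a DEP whose UKBB HR points the same way, with p < 0.05."""
    if not is_dep(r) or r.hr is None or r.p is None:
        return False
    concordant = (r.log2fc > 0 and r.hr > 1) or (r.log2fc < 0 and r.hr < 1)
    return concordant and r.p < 0.05


# Hypothetical example records (not data from the study).
results = [
    ProteinResult("P_up", 1.4, 0.001, 1.30, 0.01),
    ProteinResult("P_down", -1.2, 0.020, 0.80, 0.03),
    ProteinResult("P_discordant", 1.6, 0.001, 0.85, 0.02),
    ProteinResult("P_weak", 0.5, 0.001, 1.50, 0.001),
    ProteinResult("P_unmeasured", 2.0, 0.001, None, None),
]
validated = [r.name for r in results if is_validated(r)]
print(validated)
```

Only the first two hypothetical proteins pass: the third is discordant in direction, the fourth fails the fold-change threshold, and the fifth has no UKBB measurement.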
To address multicollinearity and improve prediction accuracy, we applied LASSO regression and a random survival forest (RSF) model using the R packages “glmnet” [20] and “randomForestSRC” [21]. LASSO regression used Cox regression with 10-fold cross-validation to determine the optimal λ. The RSF was built with 1,000 trees and log-rank splitting, and proteins with importance scores above zero were selected. The final protein biomarker panel comprised the intersection of LASSO-retained and RSF-retained proteins for downstream analyses. GO and KEGG enrichment analyses were conducted using the “clusterProfiler” package [22].
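The intersection step can be illustrated with a short sketch. It assumes that "LASSO-retained" means a non-zero coefficient at the chosen λ and that "RSF-retained" means a variable importance above zero; the protein names and values below are hypothetical placeholders, and the model fitting itself (done in R by the authors) is not reproduced here.

```python
# Hypothetical outputs of the two selection methods.
lasso_coef = {"PROT_A": 0.21, "PROT_B": 0.0, "PROT_C": -0.08, "PROT_D": 0.05}
rsf_importance = {"PROT_A": 0.012, "PROT_B": 0.004, "PROT_C": -0.001, "PROT_D": 0.007}

# Assumption: LASSO retains proteins with non-zero coefficients.
lasso_kept = {name for name, beta in lasso_coef.items() if beta != 0.0}
# RSF retains proteins with importance above zero, as stated in the text.
rsf_kept = {name for name, vimp in rsf_importance.items() if vimp > 0}

# Final panel: proteins retained by both methods.
panel = sorted(lasso_kept & rsf_kept)
print(panel)
```

Requiring agreement between a penalized linear model and a tree ensemble is a conservative choice: a protein enters the panel only if both methods support it.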
Construction of prostate, lung, colorectal, and ovarian cancer screening trial 2012 model (PLCOm2012)
The PLCOm2012 model was used in this study because baseline characteristics of the UKBB cohort closely aligned with those of the original model. A previous study demonstrated that PLCOm2012 showed high discriminatory power and relatively good calibration compared with other widely used risk models [23]. Here, PLCOm2012 was constructed using participants' baseline information, including age, gender, body mass index (BMI), race, education level, smoking history, personal history of COPD, personal history of cancer, and family history of lung cancer. Details of the PLCOm2012 model specification are provided in Supplementary Material 1.
Genetic data acquisition and polygenic risk scores (PRS) calculation
To assess the genetic risk associated with lung cancer, PRS were computed using UKBB genotype data. We used 23 high-penetrance single nucleotide polymorphisms (SNPs) identified by Zhu et al. [24], which have shown significant associations with lung cancer across large cohorts involving different populations [25, 26] and explained more variance (R²) than other PRSs. SNPs were extracted using bgenix [27]. QC measures excluded SNPs with low minor allele frequency, poor imputation quality, or ambiguous palindromic alleles. PLINK 2.0 was used for genotype conversion and PRS calculation, aligning allele effects with published references [28]. Details of data preparation, QC procedures, and the computational framework are described in Supplementary Material 2, and the distribution of missing rates for SNPs is shown in Supplementary Table 1.
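For readers unfamiliar with PRS arithmetic, the sketch below shows the common weighted-sum form, in which each SNP contributes its published effect weight times the dosage of the effect allele. It is an illustration under stated assumptions: the source does not say whether a sum or an average was used, the SNP identifiers and weights are hypothetical, and `aligned_dosage` assumes biallelic SNPs.

```python
def aligned_dosage(dosage: float, counted_allele: str, effect_allele: str) -> float:
    """Return the dosage of the effect allele.

    Assumes a biallelic SNP: if the genotype file counted the other allele,
    the effect-allele dosage is 2 minus the counted dosage.
    """
    return dosage if counted_allele == effect_allele else 2.0 - dosage


def polygenic_risk_score(dosages: dict, weights: dict) -> float:
    """Weighted sum of effect-allele dosages over SNPs present in both inputs."""
    shared = dosages.keys() & weights.keys()
    return sum(weights[snp] * dosages[snp] for snp in shared)


# Hypothetical weights (log odds ratios) and one person's genotype.
weights = {"rs_hyp_1": 0.10, "rs_hyp_2": -0.05, "rs_hyp_3": 0.20}
person = {
    "rs_hyp_1": aligned_dosage(2.0, "A", "A"),  # already counts the effect allele
    "rs_hyp_2": aligned_dosage(1.0, "C", "C"),
    "rs_hyp_3": aligned_dosage(2.0, "G", "T"),  # flipped: 2 - 2 = 0 effect alleles
}
prs = polygenic_risk_score(person, weights)
print(round(prs, 4))
```

The flipping step mirrors the allele alignment mentioned in the text: a weight is only meaningful when the dosage counts the same allele that the published effect refers to.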
Development and validation of risk prediction models for lung cancer
Protein score (PS) and PLCOm2012 score (PLCOS) were calculated based on variable weights using the “caret” package [29]. The combined risk score (CRS) was then calculated using the PS, PRS, and PLCOS.
Participants in the UKBB cohort were stratified and randomly allocated into training and testing subsets in a 7:3 ratio. In the training subset, we fitted Cox proportional hazards models for the PS, PRS, and PLCOS. During this process, the baseline hazard h0(t) and the regression coefficients β for the PS and PLCOS were estimated within the UKBB training set. For absolute risk estimation, model calibration, and risk advancement period (RAP) analyses, we used the baseline hazard h0(t) estimated from the UKBB training subset. We then integrated PS, PRS, and PLCOS to construct a combined risk score for predicting incident lung cancer and visualized the prediction with a nomogram. Model performance was evaluated on the testing set using the area under the receiver operating characteristic curve (AUC) with two-sided 95% confidence intervals (CIs). The CIs were estimated by a nonparametric bootstrap with 1,000 resamples, using the “pROC” [30] and “timeROC” [31] packages. Model fit was evaluated with the Nagelkerke R² using the “rms” package [32].
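The role of the baseline hazard in absolute risk estimation can be made concrete with the standard Cox relationship: the predicted risk by time t is 1 − exp(−H0(t) · exp(β·x)), where H0(t) is the cumulative baseline hazard. The sketch below applies that formula; the cumulative hazard, coefficients, and covariate values are hypothetical, and covariates are assumed to be centered the same way as when the model was fitted.

```python
import math


def absolute_risk(cum_baseline_hazard: float, betas: list, covariates: list) -> float:
    """Cox-model absolute risk by time t: 1 - exp(-H0(t) * exp(beta . x))."""
    linear_predictor = sum(b * x for b, x in zip(betas, covariates))
    return 1.0 - math.exp(-cum_baseline_hazard * math.exp(linear_predictor))


# Hypothetical 10-year cumulative baseline hazard and coefficients for
# (PS, PRS, PLCOS); none of these values come from the study.
H0_10 = 0.01
betas = [0.04, 1.0, 0.03]

risk_low = absolute_risk(H0_10, betas, [-10.0, -0.5, -20.0])
risk_high = absolute_risk(H0_10, betas, [20.0, 1.0, 40.0])
print(f"low: {risk_low:.4f}, high: {risk_high:.4f}")
```

Because the baseline hazard is estimated once in the training subset, the same H0(t) converts any participant's linear predictor into an absolute risk, which is what calibration and the screening-age analyses operate on.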
Risk stratification and personalized screening initiation for lung cancer
All participants were classified into high-, medium-, and low-risk groups using prespecified cut points from the CRS score distribution, yielding a 1:1:8 ratio; parallel strata were also defined for PS, PRS, and PLCOS. Cumulative lung cancer incidence was estimated using Kaplan-Meier curves and differences across groups were compared via log-rank test.
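A percentile-based assignment consistent with a 1:1:8 split might look like the sketch below. The exact cut points are not given in this section, so the mapping used here (top 10% of scores as high risk, next 10% as medium risk, remaining 80% as low risk) is an assumption, and the scores are synthetic.

```python
def assign_risk_groups(scores: list, high_frac: float = 0.10, medium_frac: float = 0.10) -> list:
    """Label each score 'high', 'medium', or 'low' by rank (highest scores first)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    n_high = round(n * high_frac)
    n_medium = round(n * medium_frac)
    groups = ["low"] * n
    for rank, i in enumerate(order):
        if rank < n_high:
            groups[i] = "high"
        elif rank < n_high + n_medium:
            groups[i] = "medium"
    return groups


# Synthetic combined risk scores for 20 hypothetical participants.
scores = [0.1 * k for k in range(20)]
groups = assign_risk_groups(scores)
print({g: groups.count(g) for g in ("high", "medium", "low")})
```

The same function can be applied to PS, PRS, or PLCOS alone to build the parallel strata mentioned above.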
RAP analysis was used to quantify the risk levels of the high-risk and low-risk groups relative to the reference group (medium-risk group). Age, gender, PS, PRS, and the top 10 genetic principal components (PCs) associated with lung cancer were included as covariates in the multivariable Cox proportional hazards (CPH) model to derive point estimates. The 95% CIs were estimated using 1,000 stratified bootstrap replicates.
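In its usual definition, the RAP is the ratio of the Cox coefficient for the exposure (here, risk group versus the medium-risk reference) to the Cox coefficient for age per year, which expresses the group effect in "years of age". The sketch below computes that ratio from hypothetical hazard ratios. Note that the abstract reports the high-risk RAP with a negative sign (threshold reached earlier), which appears to be the opposite sign convention to this ratio.

```python
import math


def risk_advancement_period(beta_group: float, beta_age: float) -> float:
    """RAP in years: group log-hazard ratio divided by the per-year age log-hazard ratio."""
    if beta_age == 0:
        raise ValueError("beta_age must be non-zero to express risk in years of age")
    return beta_group / beta_age


# Hypothetical hazard ratios (not the study's estimates).
hr_high_vs_medium = 3.0
hr_low_vs_medium = 1.0 / 3.0
hr_per_year_of_age = 1.1

rap_high = risk_advancement_period(math.log(hr_high_vs_medium), math.log(hr_per_year_of_age))
rap_low = risk_advancement_period(math.log(hr_low_vs_medium), math.log(hr_per_year_of_age))
print(f"high vs medium: {rap_high:.2f} years; low vs medium: {rap_low:.2f} years")
```

A bootstrap CI, as described in the text, would repeat the model fit on resampled data and take percentiles of the resulting ratios; that step is omitted here.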
To further individualize screening for patients with varying risk levels, we calculated each participant’s 10-year cumulative lung cancer risk to derive personalized screening initiation ages across risk groups; the 10-year horizon was chosen to match the cohort’s follow-up window, align with common decision windows in screening policy, and remain consistent with prior equal risk and RAP studies [33, 34]. According to current NCCN guidelines, which recommend starting screening at age 50, we defined the risk-adapted screening initiation age as the age at which individuals with a specific lung cancer risk reach a 10-year cumulative risk equivalent to that of the general population at age 50.
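The definition of the risk-adapted initiation age amounts to a threshold lookup: take the general population's 10-year risk at age 50 as the benchmark, then find the youngest age at which a given risk group reaches it. The sketch below does this on tabulated risk curves. The curves come from a toy exponential model with arbitrary parameters, used only to make the example runnable; they are not the study's estimates.

```python
import math


def risk_adapted_start_age(group_risk_by_age: dict, population_risk_by_age: dict,
                           reference_age: int = 50):
    """Youngest age at which the group's 10-year risk reaches the population's
    10-year risk at the reference age; None if it never does in the table."""
    threshold = population_risk_by_age[reference_age]
    for age in sorted(group_risk_by_age):
        if group_risk_by_age[age] >= threshold:
            return age
    return None


def toy_ten_year_risk(age: int, shift_years: int) -> float:
    """Toy risk curve: risk rises with age; shift_years moves the curve earlier."""
    return 1.0 - math.exp(-0.0005 * math.exp(0.1 * (age - 40 + shift_years)))


ages = range(30, 81)
population = {a: toy_ten_year_risk(a, 0) for a in ages}
high_risk = {a: toy_ten_year_risk(a, 12) for a in ages}   # curve shifted 12 years earlier
low_risk = {a: toy_ten_year_risk(a, -12) for a in ages}   # curve shifted 12 years later

start_high = risk_adapted_start_age(high_risk, population)
start_low = risk_adapted_start_age(low_risk, population)
print(start_high, start_low)
```

In this toy setting a group whose curve is shifted 12 years earlier reaches the benchmark 12 years before age 50, which is the intuition linking the RAP to the recommended starting age.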
Results
Cohort demographics and characteristics
Figure 1 illustrates the research strategy of this study. In the 1STGMU/TKLCM cohort, the 343 participants had a mean age of 59.10 years (range 18–85 years), and 162 (47.23%) were female. In the UKBB cohort, the mean age of the 33,456 participants was 55.78 years (range 39–70 years), and 18,004 (53.81%) were female. Detailed baseline characteristics are presented in Supplementary Table 2 and Supplementary Table 3.
Identification and validation of proteomic profiles in lung cancer
In the 1STGMU/TKLCM cohort, a total of 1,227 DEPs were identified between lung cancer cases and controls, with 656 proteins upregulated and 571 proteins downregulated (Fig. 2A, and Supplementary Material 3). Enrichment analysis of these DEPs revealed that the protein sets were predominantly associated with immune response, cell differentiation, translation and ribosome regulation, as well as developmental processes and signal transduction (Fig. 2B). These findings suggested a significant involvement of these biological pathways in lung cancer.
In the UKBB cohort, with a median follow-up time of 13.24 ± 1.23 years, 318 participants were diagnosed with lung cancer (Supplementary Table 3). Of the 1,227 DEPs, 307 were also measured in the UKBB cohort, and 46 were ultimately validated, with 29 DEPs positively associated with incident lung cancer (hazard ratio (HR) > 1, p < 0.05) and 16 proteins inversely associated with incident lung cancer (HR < 1, p < 0.05) (Fig. 2C and Supplementary Table 4).
Protein selection and construction of multi-modal risk scores
We subsequently utilized a combination of RSF model and LASSO regression combined methods to selected protein signatures associated with lung cancer risk (Fig. 2D and Supplementary Figure 1A - C). Consequently, 16 protein biomarkers (CHRDL1, CST3, CTSH, GPC1, IGFBP2, ITGB2, KITLG, LTBP2, NDUFS6, NECTIN2, PI3, PTN, SIAE, SIRPB1, BAG6, TFRC) were ultimately identified for the predictive model (Supplementary Table 5). The PS calculated by these proteins were significantly associated with the incidence of lung cancer incidence, with a HR of 1.04 (95% CI: 1.03–1.05). These 16-protein biomarkers represented a robust panel for predicting lung cancer risk, as confirmed in both the 1STGMU/TKLCM and UKBB cohort. The corresponding details were provided in Supplementary Material 4.
We also calculated PRS and PLCOS, both of which were significantly associated with lung cancer risk. The PRS demonstrated an HR of 2.71 (95% CI: 0.73–10.1), while the PLCOS showed an HR of 1.03 (95% CI: 1.03–1.03). These findings suggested that genetic and clinical models can complement each other in predicting lung cancer risk.
Assessment of prediction models based on multi-modal risk scores
Prediction models based on PS, PRS, and PLCOS were developed using the CPH model. In the validation set, the time-dependent AUCs for PS at 3, 5, and 10 years were 0.83 (95% CI: 0.70–0.89), 0.77 (95% CI: 0.68–0.85), and 0.75 (95% CI: 0.69–0.80), respectively (Fig. 3A). The corresponding AUCs for PRS were 0.65 (95% CI: 0.54–0.68), 0.60 (95% CI: 0.54–0.68), and 0.58 (95% CI: 0.53–0.64) (Fig. 3B), and those for PLCOS were 0.85 (95% CI: 0.80–0.90), 0.84 (95% CI: 0.79–0.89), and 0.83 (95% CI: 0.79–0.88) (Fig. 3C). Combining the three scores into a CRS improved predictive performance relative to the single-modality models: the time-dependent AUCs were 0.85 (95% CI: 0.80–0.93) at 3 years, 0.88 (95% CI: 0.83–0.92) at 5 years, and 0.85 (95% CI: 0.81–0.87) at 10 years, matching PLCOS at 3 years and exceeding every single-modality model at 5 and 10 years (Fig. 3D and Table 1). A nomogram based on PS, PRS, and PLCOS was used to visualize the predicted 3-, 5-, and 10-year lung cancer risk (Fig. 3E). The calibration curves comparing predicted and observed lung cancer risks for the CRS model showed good agreement across prediction scenarios, further supporting the robustness of the model (Supplementary Figure 2A–2B). The Nagelkerke R² values for the PS, PRS, PLCOS, and CRS models were 0.64, 0.47, 0.78, and 0.65, respectively, indicating a good degree of fit. Supplementary Figure 3 shows the distribution of Shapley additive explanations values for PS, PRS, and PLCOS, highlighting the varying influence of these features on the model’s predictions.
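To make the AUC figures above concrete, the following is a minimal sketch of how discrimination at a fixed horizon can be computed from a risk score. It is a simplification and not the authors' procedure: it ignores inverse-probability-of-censoring weights, drops anyone censored before the horizon, and uses made-up scores and follow-up times. The function name `auc_at_horizon` is hypothetical.

```python
def auc_at_horizon(scores, times, events, horizon):
    """Simplified cumulative/dynamic AUC at a fixed horizon.

    Cases: event observed at or before the horizon.
    Controls: still under follow-up after the horizon.
    Subjects censored before the horizon are excluded (no IPCW correction).
    """
    cases = [s for s, t, e in zip(scores, times, events) if e == 1 and t <= horizon]
    controls = [s for s, t in zip(scores, times) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    # Probability that a random case scores higher than a random control;
    # ties count as half.
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical toy data: risk score, follow-up time in years, event indicator.
scores = [0.90, 0.80, 0.30, 0.20, 0.60, 0.85]
times = [2.0, 4.0, 12.0, 11.0, 1.0, 10.0]
events = [1, 1, 0, 0, 0, 0]

print(round(auc_at_horizon(scores, times, events, horizon=5.0), 3))
```

In this toy example two cases are compared with three controls, and one control outranks one case, so five of the six case-control pairs are ordered correctly.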
Risk stratification and cumulative risk assessment across lung cancer risk groups
We subsequently stratified participants into low-, medium-, and high-risk groups based on their PS, PRS, PLCOS, and CRS. In the overall population, individuals in the high-risk group had a higher risk of developing lung cancer than those in the medium-risk group. Specifically, the high-risk group showed a 1.90-fold increase in risk based on PS (HR: 1.90, 95% CI: 1.41–2.55), a 1.47-fold increase based on PRS (HR: 1.47, 95% CI: 0.95–2.28), a 2.95-fold increase based on PLCOS (HR: 2.95, 95% CI: 2.17–4.01), and a 3.18-fold increase based on CRS (HR: 3.18, 95% CI: 2.34–4.31). Conversely, individuals in the low-risk group had lower risks than the medium-risk group: a 0.23-fold risk based on PS (HR: 0.23, 95% CI: 0.17–0.31), a 0.86-fold risk based on PRS (HR: 0.86, 95% CI: 0.60–1.23), a 0.23-fold risk based on PLCOS (HR: 0.23, 95% CI: 0.16–0.32), and a 0.21-fold risk based on CRS (HR: 0.21, 95% CI: 0.15–0.30) (Supplementary Table 6). For the PRS, the confidence intervals of both comparisons included 1.
To further evaluate clinical utility, we calculated the negative predictive value (NPV) and the sensitivity of the models. The CRS model demonstrated good clinical utility. The NPVs at 3, 5, and 10 years were 0.98 (95% CI: 0.97–0.99), 0.99 (95% CI: 0.99–0.99), and 0.99 (95% CI: 0.99–0.99), respectively. The sensitivities at 3, 5, and 10 years were 0.86 (95% CI: 0.70–0.97), 0.84 (95% CI: 0.76–0.97), and 0.81 (95% CI: 0.72–0.92), respectively (Supplementary Tables 7 and 8). Notably, 759 non-smoking participants were identified as being at medium or high risk, demonstrating the effectiveness of the CRS-based risk stratification approach.
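For readers less familiar with these two metrics, here is a minimal sketch of how sensitivity and NPV follow from a 2x2 classification table. The counts are hypothetical and are not taken from the study; they only illustrate that a very high NPV is expected whenever the outcome is rare.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of people who developed the disease that the model flagged."""
    return true_pos / (true_pos + false_neg)


def negative_predictive_value(true_neg: int, false_neg: int) -> float:
    """Share of people the model did not flag who stayed disease-free."""
    return true_neg / (true_neg + false_neg)


# Hypothetical counts for a rare outcome (50 cases among 10,000 people).
tp, fn = 43, 7        # cases flagged / cases missed
tn, fp = 6930, 3020   # non-cases not flagged / non-cases flagged

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"NPV         = {negative_predictive_value(tn, fn):.3f}")
```

With only 7 missed cases among 6,937 unflagged people, the NPV is close to 1 even though almost a third of non-cases were flagged, which is why NPV is best read alongside sensitivity rather than on its own.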
Kaplan-Meier (K-M) cumulative incidence curves for the three risk groups, stratified by each of the four risk scores, are shown in Fig. 4A–4D. Significant overall differences were observed across the risk groups, with log-rank tests yielding p < 0.001 for the PS-, PLCOS-, and CRS-based stratifications and p = 0.002 for the PRS-based stratification. Under the PRS stratification, however, the pairwise comparisons between the low-, medium-, and high-risk groups did not reach statistical significance in the log-rank test. In contrast, all pairwise comparisons under the other scoring systems were statistically significant.
These findings demonstrated the efficacy of the combined model in distinguishing between high and low-risk individuals, enabling more accurate stratification for lung cancer risk.
Risk advancement period and personalized screening strategies
To further quantify the time discrepancy in reaching the same lung cancer risk threshold across different groups, we performed a RAP analysis. Participants in the high PS group reached an equivalent lung cancer risk 11.08 years earlier than the medium PS group (RAP: −11.08, 95% CI: −15.73 to −6.05), while those in the low PS group reached this risk 7.65 years later (RAP: 7.65, 95% CI: 1.40 to 13.13). Similarly, the high PRS group reached this risk 1.61 years earlier than the medium PRS group (RAP: −1.61, 95% CI: −4.83 to 1.27), whereas the low PRS group reached it 3.42 years later (RAP: 3.42, 95% CI: −0.65 to 7.49). Lastly, for the CRS, individuals in the high-risk group reached the same risk 11.71 years earlier than the medium group (RAP: −11.71, 95% CI: −15.58 to −6.16), while the low-risk group reached it 11.78 years later (RAP: 11.78, 95% CI: 6.95 to 15.08) (Table 2).
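As an illustration of the idea behind a RAP, the sketch below uses one common definition: the log hazard ratio for the risk group divided by the log hazard ratio per year of age, both taken from the same Cox model. This section does not state the exact estimator or adjustment set the authors used, so the definition, the function name `risk_advancement_period`, and the input values are all assumptions. The sign also differs from the text above: here a positive value means risk is reached earlier, whereas the paper reports earlier attainment as a negative RAP.

```python
import math


def risk_advancement_period(hr_group: float, hr_per_year_of_age: float) -> float:
    """Years of ageing that carry the same excess hazard as group membership.

    Positive: the group reaches a given risk level that many years earlier
    than the reference group. Negative: that many years later.
    """
    if hr_group <= 0 or hr_per_year_of_age <= 0 or hr_per_year_of_age == 1:
        raise ValueError("hazard ratios must be positive and the age HR must differ from 1")
    return math.log(hr_group) / math.log(hr_per_year_of_age)


# Hypothetical inputs: a group HR of 3.0 and a 10% higher hazard per year of age.
print(round(risk_advancement_period(3.0, 1.10), 2))   # risk reached earlier
print(round(risk_advancement_period(0.5, 1.10), 2))   # risk reached later
```

Because the denominator is the age effect from the same model, the RAP depends on how strongly age predicts the outcome in that cohort, which is one reason RAP estimates can carry wide confidence intervals.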
For lung cancer screening, we calculated the 10-year cumulative lung cancer risk to estimate personalized screening initiation ages for individuals in different risk groups; the detailed cumulative lung cancer risks are shown in Supplementary Material 5. Compared to the general population, which is recommended to begin screening at age 50, individuals in the low-, medium-, and high-risk groups based on the PS would reach the equivalent cumulative risk threshold at ages 55.77, 40.43, and < 40 years, respectively (Fig. 4E). Individuals in the low-, medium-, and high-risk groups based on the PRS would reach the equivalent threshold at ages 51.76, 48.02, and 46.19 years, respectively (Fig. 4F). Based on the CRS, individuals in the low-, medium-, and high-risk groups would reach the equivalent threshold at ages 58.12, 40.72, and < 40 years, respectively (Fig. 4G).
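The risk-equivalent starting age described above can be sketched as a threshold-crossing problem: take the 10-year cumulative risk of the reference population at age 50 as the threshold, then find the age at which each group's risk curve reaches it. The code below is an illustrative sketch with invented risk curves and simple linear interpolation; the authors' actual estimation procedure is not given in this section, and the name `age_at_threshold` is hypothetical.

```python
def age_at_threshold(ages, risks, threshold):
    """Age at which a non-decreasing 10-year risk curve first reaches the threshold.

    Returns the youngest tabulated age if the threshold is already met there
    (the source reports this situation as '< 40'), and None if the curve never
    reaches the threshold within the tabulated ages.
    """
    if risks[0] >= threshold:
        return float(ages[0])
    for i in range(1, len(ages)):
        r0, r1 = risks[i - 1], risks[i]
        if r0 < threshold <= r1:
            # Linear interpolation between the two bracketing ages.
            fraction = (threshold - r0) / (r1 - r0)
            return ages[i - 1] + fraction * (ages[i] - ages[i - 1])
    return None


# Hypothetical 10-year cumulative risks by age (not study data).
ages = [40, 45, 50, 55, 60]
reference_threshold = 0.010            # assumed reference risk at age 50
low_risk_curve = [0.001, 0.002, 0.004, 0.008, 0.014]
high_risk_curve = [0.012, 0.020, 0.031, 0.045, 0.060]

print(age_at_threshold(ages, low_risk_curve, reference_threshold))
print(age_at_threshold(ages, high_risk_curve, reference_threshold))
```

The low-risk curve crosses the threshold between ages 55 and 60, while the high-risk curve already exceeds it at the youngest tabulated age, mirroring the "< 40" entries reported for the high-risk groups.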
These findings suggested that integrating risk stratification into screening protocols could optimize early detection, with high-risk individuals benefiting from earlier screening initiation.
Comparison with age-based eligibility and exploratory resource and cost analysis
The multimodal risk stratification framework allows more precise recommendations in real-world lung cancer screening settings. CRS-guided screening reduced the scan volume to 248.7 examinations per 1,000 persons, compared with 730.5 per 1,000 under the current age-50 threshold. When we aligned the total LDCT volume under the CRS strategy with that of the age-50 threshold, 100% of the high-risk group and 100% of the intermediate-risk group (3,314 individuals), as well as 67.3% of the low-risk group (17,811 individuals), would be included for LDCT. These findings suggested that the CRS could substantially lower the equipment and budgetary demands of LDCT screening and that, under an equivalent scanning workload, risk-adapted triage conferred a clear efficiency advantage.
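The size of the reduction implied by the two scan volumes reported above can be checked with a short calculation. The two inputs are the figures quoted in the text; the helper name `relative_reduction` is only illustrative.

```python
def relative_reduction(baseline: float, alternative: float) -> float:
    """Fractional reduction of `alternative` relative to `baseline`."""
    return 1.0 - alternative / baseline


AGE_BASED_SCANS_PER_1000 = 730.5   # age-50 threshold, as reported in the text
CRS_GUIDED_SCANS_PER_1000 = 248.7  # CRS-guided strategy, as reported in the text

saved = AGE_BASED_SCANS_PER_1000 - CRS_GUIDED_SCANS_PER_1000
reduction = relative_reduction(AGE_BASED_SCANS_PER_1000, CRS_GUIDED_SCANS_PER_1000)

print(f"scans avoided per 1,000 persons: {saved:.1f}")
print(f"relative reduction: {reduction:.1%}")
```

On these figures, CRS-guided screening avoids 481.8 examinations per 1,000 persons, roughly a two-thirds reduction in scan volume.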
Discussion
This study used multimodal prospective data, including plasma proteins, PRS, and PLCOm2012, to construct a combined predictive model and perform risk stratification. The combined model demonstrated excellent overall discriminative performance, enabling personalized screening recommendations based on risk-adapted starting ages. Under CRS-based stratification, the high-risk group reached the same level of lung cancer risk 11.71 years earlier than the medium-risk group, whereas the low-risk group reached it 11.78 years later.
Our study introduced key technical innovations in proteomics. Using Zeolite NaY nanomaterials in the 1STGMU/TKLCM cohort, we enhanced detection of low-abundance plasma proteins by reducing interference from high-abundance proteins [17]. This improved the identification of valuable biomarkers. Additionally, the UKBB cohort employed the Olink Explore 3072 platform, which leverages proximity signal amplification technology to precisely detect fragile, early-stage protein signals. Olink’s high-throughput, standardized, cross-platform approach further supports its clinical applicability.
The enhanced proteomic coverage achieved through Zeolite NaY and Olink technologies generated high-dimensional data, necessitating robust feature selection. To address this, we employed two machine learning methods for feature selection. LASSO employs L1 regularization to shrink irrelevant coefficients to zero, effectively eliminating redundant features. This approach is particularly suited to high-dimensional proteomic data, balancing sparsity and model interpretability [35]. RSF constructs multiple decision trees to capture nonlinear relationships among proteins and provides Gini-based importance scores that objectively measure each protein’s contribution to classification. By taking the intersection of proteins selected by both methods, we minimized false positives, balanced model complexity with interpretability, and enhanced the potential for clinical translation [36].
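The two-step selection logic described above can be sketched in a few lines. The example is illustrative only: it uses the soft-thresholding operator, which is the closed-form LASSO solution for a single coefficient under standardized, orthogonal predictors, as a stand-in for a full penalized Cox fit, and it treats the RSF importance scores as given. All protein names, coefficients, importance values, the penalty `LAMBDA`, and the cut-off `TOP_K` are hypothetical.

```python
def soft_threshold(z: float, lam: float) -> float:
    """Shrink a coefficient toward zero by lam; small coefficients become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients and RSF importance scores.
unpenalized = {
    "protein_A": 0.42,
    "protein_B": -0.05,
    "protein_C": 0.31,
    "protein_D": 0.08,
    "protein_E": -0.27,
}
rsf_importance = {
    "protein_A": 0.030,
    "protein_B": 0.021,
    "protein_C": 0.002,
    "protein_D": 0.001,
    "protein_E": 0.018,
}

LAMBDA = 0.10  # assumed L1 penalty strength
TOP_K = 3      # assumed number of top-ranked RSF features to keep

# Step 1: features whose coefficient survives L1 shrinkage.
lasso_selected = {name for name, z in unpenalized.items() if soft_threshold(z, LAMBDA) != 0.0}
# Step 2: features ranked highest by RSF importance.
rsf_selected = set(sorted(rsf_importance, key=rsf_importance.get, reverse=True)[:TOP_K])
# Step 3: keep only features chosen by both methods.
final_panel = sorted(lasso_selected & rsf_selected)

print(final_panel)  # ['protein_A', 'protein_E']
```

Taking the intersection is the conservative choice: protein_C survives shrinkage but ranks low in the forest, and protein_B ranks high in the forest but is shrunk to zero, so neither enters the final panel.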
Following the combined screening process, we ultimately identified 16 plasma proteins which were independently associated with lung cancer incidence and demonstrated consistent predictive performance across various time points. Most of these proteins have previously been reported to be involved in the pathogenesis and progression of lung cancer [37–47]. For instance, Ou et al. [37] conducted a pan-cancer and LUAD-focused study showing that CHRDL1, a secreted antagonist of BMPs (a TGF-β-family arm), was broadly downregulated in tumors; in LUAD, higher CHRDL1 expression aligned with better survival, more favorable immune infiltration, and lower tumor-stemness scores. Functional assays demonstrated that CHRDL1 overexpression in A549/H1299 cells suppressed proliferation, migration, and invasion, and reduced tumor growth in nude mice. Deng et al. [38] showed that the secreted BMP antagonist CHRDL1 was downregulated in lung adenocarcinoma; lower expression correlated with higher T/N stage, TP53 mutation, and poorer overall survival, whereas higher expression remained an independent favorable prognostic factor. Pathway and immune analyses suggested involvement in cell-cycle control and tumor immune infiltration, supporting CHRDL1 as a prognostic biomarker and potential therapeutic target in LUAD. Kleeman et al. [40] showed that CST3 was induced by glucocorticoid signaling in lung cancer–relevant systems. Glucocorticoid receptor binding at a CST3 enhancer increased transcription and extracellular secretion; tumor-derived CST3 then reshaped the protease–inhibitor milieu and, together with tumor-associated macrophages, fostered an immunosuppressive microenvironment. In vivo, CST3 knockout markedly depleted TREM2+ macrophages and restrained tumor growth; clinically, high CST3 expression characterized an “immunologically quiet” state and correlated with poorer responses to checkpoint blockade. Luyapan et al. [39] found that variation in the surfactant pathway gene CTSH was associated with lung-cancer risk and replicated across datasets; genetically predicted CTSH expression was also positively associated with risk. Because CTSH encodes cathepsin H that processes pro-SP-B into mature SP-B, the study linked alveolar surfactant homeostasis to lung-cancer susceptibility and highlighted CTSH as a potential biomarker. Salomonsson et al. [44] reported that in NSCLC roughly one third of tumors were KITLG protein positive by IHC (21/68, ~31%), with similar rates across adenocarcinoma, squamous cell carcinoma, and LCNEC (~30%, 31%, and 33%; p > 0.05), using ≥10% stained tumor cells as the positivity threshold. Ando et al. [45] reported that in lung adenocarcinoma, high expression of the transmembrane adhesion protein NECTIN2 was associated with poorer overall and recurrence-free survival; in cell models, NECTIN2 knockout increased apoptosis and reduced proliferation and migration, supporting it as an adverse prognostic biomarker and potential therapeutic target. Importantly, to our knowledge, this was the first study to validate these 16 plasma biomarkers across different ethnic populations, particularly among East Asian individuals, highlighting the novelty of our findings.
The multifactorial nature of cancer pathogenesis necessitates multi-omics approaches for precise risk stratification. To address this, we integrated a PRS derived from genome-wide association data with clinical predictors and proteomic profiles. In this study, we adopted a focused PRS model based on 23 high-penetrance SNPs identified by Zhu et al. [24], which have been shown to exhibit stronger genetic effects than other loci and strong associations with lung cancer across large cohorts involving different populations [25, 48]. Zhu et al. [24] established stringent criteria for SNP selection, excluding those with a minor allele frequency below 0.5%, poor genotype quality, or palindromic sites, thereby ensuring the reliability and accuracy of the final 23 SNPs. In a benchmarking analysis of published PRSs by Boumtje et al. [26], the 23-SNP score used in this study showed the highest variance explained (R²) among the compared scores and was the closest in explanatory power to a genome-wide PRS incorporating 1.1 million variants. At the same time, routine use of PRS in population screening must meet clear ethical safeguards: explicit, genetics-specific informed consent; privacy-by-design data governance with minimization, encryption, role-based access, and time-limited retention; access to genetic counseling and transparent result disclosure; compliance with local insurance and anti-discrimination regulations; and attention to equity to avoid widening disparities. We view these as prerequisites for future implementation.
For clinical risk assessment, we incorporated the validated PLCOm2012 model, which integrates demographic and behavioral variables to estimate individualized lung cancer probability. Consistent with large-scale evaluations by Robbins et al. [23], PLCOm2012 demonstrated robust predictive accuracy in our cohort (3-, 5-, and 10-year AUCs: 0.85, 0.84, and 0.83), reaffirming its suitability as a foundational clinical predictor.
Integrating multimodal data improved lung cancer risk prediction over single modalities. The combined model achieved 3-, 5-, and 10-year AUCs of 0.85, 0.88, and 0.85, respectively, matching or exceeding each individual model at every time point. Time-dependent analysis confirmed sustained predictive accuracy, highlighting the robustness of this approach for short- and long-term risk assessment. Although the CRS model also demonstrated good fit, with a Nagelkerke R² of 0.65, the Nagelkerke R² is a pseudo-R² whose magnitude depends heavily on data characteristics such as censoring and length of follow-up. This result should therefore be interpreted with caution.
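For reference, the standard definition of the Nagelkerke R² makes this dependence explicit. The notation below is introduced here for illustration and is not taken from the paper: $L_0$ and $L_1$ denote the likelihoods of the null and fitted models (partial likelihoods in a Cox setting) and $n$ the sample size. How the authors handled $n$ under censoring is not stated in this section.

```latex
% Cox-Snell pseudo-R^2 and its Nagelkerke rescaling
R^2_{\mathrm{CS}} = 1 - \left(\frac{L_0}{L_1}\right)^{2/n},
\qquad
R^2_{\mathrm{N}} = \frac{R^2_{\mathrm{CS}}}{1 - L_0^{2/n}}
                 = \frac{1 - \left(L_0 / L_1\right)^{2/n}}{1 - L_0^{2/n}} .
```

The denominator rescales the Cox-Snell statistic so that its maximum is 1. Because both $L_0$ and $n$ enter the formula directly, the value shifts with the number of events and the amount of censoring, so it is not comparable to the explained variance of a linear model.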
The implementation of risk-adapted screening strategies, informed by multimodal risk stratification, holds transformative potential for precision oncology. Pioneering studies in other malignancies have demonstrated this approach’s utility [33, 34]. However, in lung cancer screening, previous risk models have predominantly relied on single-modality approaches, limiting their capacity to address the disease’s complex etiology. To bridge this gap, our study integrated proteomic biomarkers, PRS, and clinical predictors into a unified lung cancer screening framework. This multimodal strategy not only achieved superior risk discrimination but also enabled precise, biologically grounded screening recommendations. High-risk individuals based on the CRS reached the equivalent risk level 11.71 years earlier than the medium-risk group, supporting earlier LDCT initiation that could capture early-stage tumors that conventional age-based protocols might miss. Notably, 759 participants who never smoked were identified as being at medium or high risk, a group for whom earlier initiation of LDCT screening may be considered. By integrating multiple modalities, this framework identified potentially high-risk individuals who might have been overlooked by traditional risk factors, thereby enabling more precise and individualized screening guidance.
We position the CRS as a pre-LDCT triage tool that prioritizes those who truly warrant screening and reduces unnecessary screening among low-risk individuals. Initiation before age 40 is conditional and risk-adapted: it applies only when the 10-year absolute risk exceeds a prespecified threshold. This threshold, and thus the starting age, depends on the baseline risk of the general population, with higher baselines leading to earlier attainment. Because earlier initiation may increase cumulative radiation exposure, implementation requires dose stewardship with dose-optimized LDCT, moderately extended follow-up intervals (about 18–24 months), and nodule management based on volume and volume doubling time to limit downstream imaging.
Although a formal cost-effectiveness analysis was essential, it was not the primary objective or scope of this study. The analyses presented here were exploratory resource estimations that did not incorporate treatment-stage transitions or quality-adjusted life years (QALYs). A standardized cost-effectiveness evaluation would serve as a key focus for future health economics research. It was important to note that cost considerations were highly context specific. In China, national medical insurance and government-supported screening programs could substantially reduce individual financial burdens, thereby improving the feasibility of implementing risk-adapted early screening. This study did not advocate universal earlier screening initiation; instead, it supported a threshold-based, risk-adapted strategy combined with shared decision-making and smoking cessation support. In all implementation contexts, formal health economic modeling based on CRS risk distributions should precede any policy adoption.
Based on our results, individuals classified as low risk by the CRS were advised to start LDCT about eight years later than the general population. This recommendation was conditional and would be updated dynamically through annual CRS reassessment using peripheral blood. Immediate LDCT would be triggered by new or worsening respiratory symptoms, increased smoking exposure, a newly reported family history, or incidental pulmonary nodules. For those with very low short- to medium-term absolute risk, delaying screening would reduce over-screening, false positives, overdiagnosis, and cumulative radiation while prioritizing resources for high-risk groups. Screening age adjustments should be made through shared decision-making; if earlier LDCT is chosen, a low-dose protocol and nodule management based on volume and doubling time are recommended.
In the subgroup analysis, RAP confidence intervals were relatively wide. This was statistically interpretable: small event counts or heavy censoring reduced the effective information, and shallow cumulative-risk slopes near the chosen threshold amplified minor risk-estimation errors into larger age uncertainty. Differences in subgroup composition and baseline risk, threshold sensitivity, measurement variability (proteomic and genetic), and competing risks could also widen intervals. These patterns reflected limited information rather than intrinsic model instability. With larger sample sizes, longer follow-up, and multicenter data, RAP intervals are expected to narrow, enabling more precise subgroup estimates; accordingly, we offered actionable recommendations where evidence was sufficient and maintained caution and transparency where it was limited.
Previous studies have laid an important foundation for multimodal lung cancer risk stratification. Davies et al. [49] applied the Olink platform to the Liverpool Lung Project cohort, identifying and externally validating protein signatures in a prospective population. Their results demonstrated good discrimination within a 1–3-year prediction window and suggested potential predictive value for longer-term follow-up, underscoring the role of plasma proteins in pre-screening triage at the population level. Similarly, Albanes et al. [9] characterized the circulating proteomic landscape within three years before diagnosis across multiple cohorts, providing a transferable molecular basis for near-term risk identification. In addition, Fahrmann et al. [11] integrated a four-protein panel with PLCOm2012 in the PLCO prospective cohort and achieved an AUC of 0.85 for one-year pre-diagnostic risk discrimination, establishing a paradigm for “biomarker + clinical risk” joint decision-making. Against this background, our study delivered three advances. First, it offers broader multimodal integration: a single framework combining 16 plasma proteins, a PRS, and PLCOm2012, validated in a Chinese hospital cohort and the UK Biobank, and particularly effective for identifying high-risk never-smokers. Second, it provides longer prediction horizons with decision-oriented outputs: absolute risks at 3, 5, and 10 years and RAP-derived individualized screening ages, with time-dependent AUCs of 0.85/0.88/0.85 and high-risk individuals reaching equivalent 10-year risk 11.7 years earlier than the medium-risk group. Third, it supports implementation-focused triage before LDCT, with fewer scans under matched workloads and a threshold-based protocol for initiation and reassessment, improving early detection while balancing radiation exposure and over-screening. Collectively, these advances deepened multimodal integration, widened the temporal scope, and strengthened clinical implementability.
This study has limitations. First, although this study incorporated two cohorts from China and the United Kingdom, providing broad geographic and ancestry representation, external validation in more diverse populations remained necessary to further strengthen the generalizability of the CRS model. Second, despite the systematic and rigorous quality control and batch handling applied in this study, batch effects and inter laboratory variability cannot be fully eliminated. Third, despite employing multiple approaches to reduce the risk of overfitting and performing extensive external validation, potential residual overfitting could not be completely ruled out.
This study used multimodal prospective data, including plasma proteins, a PRS, and PLCOm2012, to construct a combined predictive model and perform risk stratification. The combined model demonstrated excellent overall discriminative performance and enabled personalized screening recommendations based on risk-adapted starting ages. Under CRS-based stratification, the high-risk group reached the same level of lung cancer risk 11.71 years earlier than the medium-risk group, whereas the low-risk group reached it 11.78 years later.
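As a rough illustration of how a risk advancement period (RAP) relates to hazard ratios, the sketch below uses a commonly cited formulation in which RAP is the log hazard ratio for a risk group divided by the log hazard ratio per one-year increase in age. This is an assumption about the general technique, not necessarily the exact procedure used in this study, which derived RAP from 10-year cumulative risk thresholds. The function name and the per-year age hazard ratio of 1.10 are hypothetical; only the group hazard ratio of 3.18 comes from the reported results.

```python
import math


def risk_advancement_period(hr_group: float, hr_per_year_age: float) -> float:
    """Return RAP in years as ln(HR for the group) / ln(HR per year of age)."""
    if hr_group <= 0 or hr_per_year_age <= 0:
        raise ValueError("hazard ratios must be positive")
    if hr_per_year_age == 1:
        raise ValueError("age must carry a non-null effect (HR per year != 1)")
    return math.log(hr_group) / math.log(hr_per_year_age)


# 3.18 is the reported HR for high- vs medium-risk individuals.
# 1.10 is a made-up per-year age HR used only for illustration.
rap_years = risk_advancement_period(3.18, 1.10)
print(f"Illustrative RAP: {rap_years:.1f} years")
```

The function returns a positive number of years when the group hazard ratio exceeds 1; the study reports earlier attainment with a negative sign (RAP: -11.71).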
Our study introduced key technical innovations in proteomics. Using Zeolite NaY nanomaterials in the 1STGMU/TKLCM cohort, we enhanced detection of low-abundance plasma proteins by reducing interference from high-abundance proteins [17], which improved the identification of valuable biomarkers. Additionally, the UKBB cohort employed the Olink Explore 3072 platform, which leverages proximity signal amplification technology to precisely detect fragile, early-stage protein signals. Olink’s high-throughput, standardized cross-platform approach further supports its clinical applicability.
The enhanced proteomic coverage achieved through Zeolite NaY and Olink technologies generated high-dimensional data, necessitating robust feature selection. To address this, we employed two machine learning methods for feature selection. LASSO employs L1 regularization to shrink irrelevant coefficients to zero, effectively eliminating redundant features. This approach is particularly suited to high-dimensional proteomic data, balancing sparsity and model interpretability [35]. RSF constructs multiple decision trees to capture nonlinear relationships among proteins and provides Gini-based importance scores that objectively measure each protein’s contribution to classification. By taking the intersection of proteins selected by both methods, we minimized false positives, balanced model complexity with interpretability, and enhanced the potential for clinical translation [36].
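The intersection step described above can be sketched as follows. This is a minimal illustration that assumes the LASSO coefficients and RSF importance scores have already been computed elsewhere; the numeric values, the `top_k` cutoff for the RSF ranking, and the function name are hypothetical, since the text does not state how the RSF-selected set was thresholded.

```python
def select_consensus(lasso_coefs: dict, rsf_importance: dict, top_k: int) -> list:
    """Keep proteins with a non-zero LASSO coefficient AND a top-k RSF importance."""
    # LASSO: L1 shrinkage sets coefficients of dropped features exactly to zero.
    lasso_selected = {name for name, coef in lasso_coefs.items() if coef != 0.0}
    # RSF: rank proteins by importance score and keep the top_k (assumed rule).
    ranked = sorted(rsf_importance, key=rsf_importance.get, reverse=True)
    rsf_selected = set(ranked[:top_k])
    return sorted(lasso_selected & rsf_selected)


# Protein names appear in the paper; all numbers below are made up.
lasso_coefs = {"CHRDL1": -0.21, "CST3": 0.18, "CTSH": 0.0, "KITLG": 0.09, "NECTIN2": 0.0}
rsf_importance = {"CHRDL1": 0.040, "CST3": 0.031, "CTSH": 0.027, "KITLG": 0.004, "NECTIN2": 0.012}

consensus = select_consensus(lasso_coefs, rsf_importance, top_k=3)
print(consensus)  # ['CHRDL1', 'CST3']
```

Requiring agreement between a linear, sparsity-driven selector and a nonlinear, tree-based one is a conservative choice: a protein survives only if both methods flag it.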
Following the combined screening process, we ultimately identified 16 plasma proteins that were independently associated with lung cancer incidence and demonstrated consistent predictive performance across various time points. Most of these proteins have previously been reported to be involved in the pathogenesis and progression of lung cancer [37–47]. For instance, Ou et al. [37] conducted a pan-cancer and LUAD-focused study showing that CHRDL1, a secreted antagonist of BMPs (a TGF-β-family arm), was broadly downregulated in tumors; in LUAD, higher CHRDL1 expression aligned with better survival, more favorable immune infiltration, and lower tumor-stemness scores. Functional assays demonstrated that CHRDL1 overexpression in A549/H1299 cells suppressed proliferation, migration, and invasion, and reduced tumor growth in nude mice. Deng et al. [38] likewise showed that CHRDL1 was downregulated in lung adenocarcinoma; lower expression correlated with higher T/N stage, TP53 mutation, and poorer overall survival, whereas higher expression remained an independent favorable prognostic factor. Pathway and immune analyses suggested involvement in cell-cycle control and tumor immune infiltration, supporting CHRDL1 as a prognostic biomarker and potential therapeutic target in LUAD. Kleeman et al. [40] showed that CST3 was induced by glucocorticoid signaling in lung cancer–relevant systems. Glucocorticoid receptor binding at a CST3 enhancer increased transcription and extracellular secretion; tumor-derived CST3 then reshaped the protease–inhibitor milieu and, together with tumor-associated macrophages, fostered an immunosuppressive microenvironment. In vivo, CST3 knockout markedly depleted TREM2+ macrophages and restrained tumor growth; clinically, high CST3 expression characterized an “immunologically quiet” state and correlated with poorer responses to checkpoint blockade. Luyapan et al. [39] found that variation in the surfactant pathway gene CTSH was associated with lung cancer risk and replicated across datasets; genetically predicted CTSH expression was also positively associated with risk. Because CTSH encodes cathepsin H, which processes pro-SP-B into mature SP-B, the study linked alveolar surfactant homeostasis to lung cancer susceptibility and highlighted CTSH as a potential biomarker. Salomonsson et al. [44] reported that roughly one third of NSCLC tumors were KITLG protein-positive by IHC (21/68, ~31%), with similar rates across adenocarcinoma, squamous cell carcinoma, and LCNEC (~30%, 31%, and 33%; p > 0.05), using ≥10% stained tumor cells as the positivity threshold. Ando et al. [45] reported that in lung adenocarcinoma, high expression of the transmembrane adhesion protein NECTIN2 was associated with poorer overall and recurrence-free survival; in cell models, NECTIN2 knockout increased apoptosis and reduced proliferation and migration, supporting it as an adverse prognostic biomarker and potential therapeutic target. Importantly, to our knowledge, this was the first study to validate these 16 plasma biomarkers across different ethnic populations, particularly among East Asian individuals, highlighting the novelty of our findings.
The multifactorial nature of cancer pathogenesis necessitates multi-omics approaches for precise risk stratification. To address this, we integrated a PRS derived from genome-wide association data with clinical predictors and proteomic profiles. In this study, we adopted a focused PRS model based on 23 high-penetrance SNPs identified by Zhu et al. [24], which have been shown to exhibit stronger genetic effects than other loci and strong associations with lung cancer across large cohorts involving different populations [25, 48]. Zhu et al. [24] established stringent criteria for SNP selection, excluding those with a minor allele frequency below 0.5%, poor genotype quality, or palindromic sites, thereby ensuring the reliability and accuracy of the final 23 SNPs. In a benchmarking analysis of published PRSs by Boumtje et al. [26], the 23-SNP score used in this study showed the highest variance explained (R²) and was the closest in explanatory power to a genome-wide PRS incorporating 1.1 million variants. At the same time, routine use of PRS in population screening must meet clear ethical safeguards: explicit, genetics-specific informed consent; privacy-by-design data governance with minimization, encryption, role-based access, and time-limited retention; access to genetic counseling and transparent result disclosure; compliance with local insurance and anti-discrimination regulations; and attention to equity to avoid widening disparities. We view these as prerequisites for future implementation.
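For readers unfamiliar with how a SNP-based PRS is typically assembled, the sketch below shows the standard weighted-sum form: each risk-allele dosage (0, 1, or 2) is multiplied by a per-SNP weight, usually the log odds ratio from a GWAS, and the products are summed. The SNP identifiers, weights, and function name are hypothetical placeholders and are not the 23 SNPs or effect sizes used in this study.

```python
def polygenic_risk_score(dosages: dict, weights: dict) -> float:
    """Weighted sum of risk-allele dosages: PRS = sum(weight_i * dosage_i)."""
    missing = set(weights) - set(dosages)
    if missing:
        raise KeyError(f"missing genotypes for: {sorted(missing)}")
    for snp in weights:
        if not 0 <= dosages[snp] <= 2:
            raise ValueError(f"dosage for {snp} must lie between 0 and 2")
    return sum(weights[snp] * dosages[snp] for snp in weights)


# Hypothetical per-SNP weights (log odds ratios) and one person's dosages.
weights = {"snp_1": 0.12, "snp_2": 0.08, "snp_3": 0.25}
dosages = {"snp_1": 2, "snp_2": 0, "snp_3": 1}

prs = polygenic_risk_score(dosages, weights)
print(round(prs, 2))  # 0.49
```

Dosages may be fractional when genotypes are imputed, which is why the check allows any value between 0 and 2 rather than integers only.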
For clinical risk assessment, we incorporated the validated PLCOm2012 model, which integrates demographic and behavioral variables to estimate individualized lung cancer probability. Consistent with large-scale evaluations by Robbins et al. [23], PLCOm2012 demonstrated robust predictive accuracy in our cohort (3-, 5-, and 10-year AUCs: 0.85, 0.84, and 0.83), reaffirming its suitability as a foundational clinical predictor.
Integrating multimodal data significantly improved lung cancer risk prediction over single modalities. The combined model achieved 3-, 5-, and 10-year AUCs of 0.85, 0.88, and 0.85, respectively, consistently outperforming individual models. Time-dependent analysis confirmed sustained predictive accuracy, highlighting the robustness of this approach for short- and long-term risk assessment. Although the CRS model also demonstrated excellent goodness of fit, with a Nagelkerke R² of 0.65, the Nagelkerke R² is a pseudo-R² whose magnitude depends heavily on data characteristics such as censoring and length of follow-up. This result should therefore be interpreted with caution.
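To make the caveat about the Nagelkerke R² concrete, the sketch below computes it from the usual definition: the Cox-Snell pseudo-R² rescaled by its maximum attainable value. The log-likelihood values and sample sizes are invented for illustration and are not taken from this study. For censored survival data, the choice of `n` (all participants versus number of events) is itself a convention, which is one reason the statistic is sensitive to censoring; that choice is treated here as an assumption.

```python
import math


def nagelkerke_r2(ll_null: float, ll_model: float, n: int) -> float:
    """Nagelkerke pseudo-R2 from null and fitted (partial) log-likelihoods."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Cox-Snell pseudo-R2.
    cox_snell = 1.0 - math.exp(-(2.0 / n) * (ll_model - ll_null))
    # Upper bound of Cox-Snell, reached by a perfectly fitting model.
    max_cox_snell = 1.0 - math.exp((2.0 / n) * ll_null)
    return cox_snell / max_cox_snell


# Made-up log-likelihoods; identical fit, two different conventions for n.
r2_all_participants = nagelkerke_r2(ll_null=-500.0, ll_model=-400.0, n=1000)
r2_events_only = nagelkerke_r2(ll_null=-500.0, ll_model=-400.0, n=300)
print(round(r2_all_participants, 3), round(r2_events_only, 3))
```

With the same likelihood improvement, the two conventions give clearly different values, which illustrates why a single pseudo-R² figure should not be read like an ordinary explained-variance proportion.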
The implementation of risk-adapted screening strategies, informed by multimodal risk stratification, holds transformative potential for precision oncology. Pioneering studies in other malignancies have demonstrated this approach’s utility [33, 34]. However, in lung cancer screening, previous risk models have predominantly relied on single-modality approaches, limiting their capacity to address the disease’s complex etiology. To bridge this gap, our study integrated proteomic biomarkers, PRS, and clinical predictors into a unified lung cancer screening framework. This multimodal strategy not only achieved superior risk discrimination but also enabled precise, biologically grounded screening recommendations. High-risk individuals based on the CRS were recommended to initiate LDCT screening 11.71 years earlier than current guideline thresholds, potentially capturing early-stage tumors that conventional age-based protocols might miss. Notably, 759 participants who had never smoked were classified as high-risk, and earlier initiation of LDCT screening was recommended for them. The multimodal framework thus identified high-risk individuals who might have been overlooked by traditional risk factors, enabling more precise and individualized screening guidance.
We position the CRS as a pre-LDCT triage tool that prioritizes those who truly warrant screening and reduces unnecessary screening among low-risk individuals. Initiation before age 40 is conditional and risk-adapted: it applies only when the 10-year absolute risk exceeds a prespecified threshold. This threshold, and thus the starting age, depends on the baseline risk of the general population, with higher baselines leading to earlier attainment. Because earlier initiation may increase cumulative radiation exposure, implementation requires dose stewardship with dose-optimized LDCT, moderately extended follow-up intervals (about 18–24 months), and nodule management based on volume and volume doubling time to limit downstream imaging.
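The threshold-based rule can be sketched as a lookup on an age-specific risk curve: the recommended starting age is the first age at which the 10-year absolute risk reaches the prespecified threshold. The code below is a minimal illustration under stated assumptions; the risk curves, the threshold of 1.2%, and the function name are all hypothetical, and linear interpolation between tabulated ages is a simplification rather than the study's method.

```python
from typing import Optional, Sequence


def age_at_threshold(ages: Sequence[float], risks: Sequence[float],
                     threshold: float) -> Optional[float]:
    """First age at which 10-year risk reaches the threshold, or None if never."""
    if len(ages) != len(risks):
        raise ValueError("ages and risks must have the same length")
    for i, (age, risk) in enumerate(zip(ages, risks)):
        if risk >= threshold:
            if i == 0:
                return float(age)  # already at or above threshold at first age
            prev_age, prev_risk = ages[i - 1], risks[i - 1]
            # Linear interpolation between the two bracketing grid points.
            frac = (threshold - prev_risk) / (risk - prev_risk)
            return prev_age + frac * (age - prev_age)
    return None


# Hypothetical 10-year absolute risks by age for three CRS groups.
ages = [35, 40, 45, 50, 55, 60]
high = [0.008, 0.014, 0.022, 0.035, 0.050, 0.070]
medium = [0.002, 0.004, 0.007, 0.012, 0.020, 0.032]
low = [0.001, 0.002, 0.003, 0.005, 0.008, 0.011]
threshold = 0.012  # made-up value

high_age = age_at_threshold(ages, high, threshold)
medium_age = age_at_threshold(ages, medium, threshold)
low_age = age_at_threshold(ages, low, threshold)
print(high_age, medium_age, low_age)
```

In this toy example the high-risk curve crosses the threshold before age 40, the medium-risk curve at about age 50, and the low-risk curve not at all within the tabulated range, so no starting age is returned for that group.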
Although a formal cost-effectiveness analysis is essential, it was beyond the scope of this study. The analyses presented here were exploratory resource estimations that did not incorporate treatment-stage transitions or quality-adjusted life years (QALYs). A standardized cost-effectiveness evaluation will be a key focus for future health economics research. Cost considerations are also highly context-specific. In China, national medical insurance and government-supported screening programs could substantially reduce individual financial burdens, thereby improving the feasibility of implementing risk-adapted early screening. This study does not advocate universal earlier screening initiation; instead, it supports a threshold-based, risk-adapted strategy combined with shared decision-making and smoking cessation support. In all implementation contexts, formal health economic modeling based on CRS risk distributions should precede any policy adoption.
Based on our results, individuals classified as low-risk by the CRS were advised to start LDCT about eight years later than the general population. This recommendation was conditional and updated dynamically through annual CRS reassessment using peripheral blood. Immediate LDCT was triggered by new or worsening respiratory symptoms, increased smoking exposure, a newly reported family history, or incidental pulmonary nodules. For those with very low short- to medium-term absolute risk, delaying screening reduced over-screening, false positives, overdiagnosis, and cumulative radiation while prioritizing resources for high-risk groups. Screening age adjustments were made through shared decision-making; if earlier LDCT was chosen, a low-dose protocol and nodule management based on volume and doubling time were recommended.
In the subgroup analysis, RAP confidence intervals were relatively wide. This was statistically interpretable: small event counts or heavy censoring reduced the effective information, and shallow cumulative-risk slopes near the chosen threshold amplified minor risk-estimation errors into larger age uncertainty. Differences in subgroup composition and baseline risk, threshold sensitivity, measurement variability (proteomic/genetic), and competing risks could also widen intervals. These patterns reflected limited information rather than intrinsic model instability. With larger sample sizes, longer follow-up, and multicenter data, RAP intervals are expected to narrow, enabling more precise subgroup estimates; accordingly, we offered actionable recommendations where evidence was sufficient and maintained caution and transparency where it was limited.
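The amplification effect described above follows from a first-order argument: if the cumulative-risk curve rises by a given amount per year of age near the threshold, an error in estimated risk translates into an age shift of roughly that error divided by the slope. The sketch below illustrates this with made-up numbers; the function name, the risk error, and both slopes are hypothetical.

```python
def age_shift_from_risk_error(risk_error: float, slope_per_year: float) -> float:
    """Approximate shift in threshold-crossing age: risk error / local slope."""
    if slope_per_year <= 0:
        raise ValueError("the risk curve must be increasing near the threshold")
    return risk_error / slope_per_year


risk_error = 0.001  # same absolute error in estimated risk for both cases

# Steep curve: risk rises by 0.2 percentage points per year of age.
steep_shift = age_shift_from_risk_error(risk_error, slope_per_year=0.002)
# Shallow curve: risk rises by only 0.02 percentage points per year of age.
shallow_shift = age_shift_from_risk_error(risk_error, slope_per_year=0.0002)

print(f"steep: about {steep_shift:.1f} y, shallow: about {shallow_shift:.1f} y")
```

A tenfold shallower slope turns the same risk error into a tenfold larger age uncertainty, which is consistent with wide RAP intervals in subgroups whose risk curves are flat near the threshold.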
Previous studies laid an important foundation for multimodal lung cancer risk stratification. Davies et al. [49] applied the Olink platform to the Liverpool Lung Project cohort, identifying and externally validating protein signatures in a prospective population. Their results demonstrated good discrimination within a 1–3-year prediction window and suggested potential predictive value for longer-term follow-up, underscoring the role of plasma proteins in pre-screening triage at the population level. Similarly, Albanes et al. [9] characterized the circulating proteomic landscape within three years before diagnosis across multiple cohorts, providing a transferable molecular basis for near-term risk identification. In addition, Fahrmann et al. [11] integrated a four-protein panel with PLCOm2012 in the PLCO prospective cohort and achieved an AUC of 0.85 for one-year pre-diagnostic risk discrimination, establishing a paradigm for “biomarker + clinical risk” joint decision-making. Against this background, our study delivered several advances. First, it achieved broader multimodal integration: a single framework combining 16 plasma proteins, a PRS, and PLCOm2012, validated in a Chinese hospital cohort and the UK Biobank, and particularly effective for identifying high-risk never-smokers. Second, it provided longer prediction horizons with decision-oriented outputs: absolute risks at 3, 5, and 10 years and RAP-derived individualized screening ages, with time-dependent AUCs of 0.85/0.88/0.85 and high-risk individuals reaching equivalent 10-year risk 11.7 years earlier than the medium-risk group. Third, it supported implementation-focused triage before LDCT, with fewer scans under matched workloads and a threshold-based protocol for initiation and reassessment, improving early detection while balancing radiation exposure and over-screening. Collectively, these advances deepened multimodal integration, widened the temporal scope, and strengthened clinical implementability.
This study has limitations. First, although it incorporated two cohorts from China and the United Kingdom, providing broad geographic and ancestry representation, external validation in more diverse populations remains necessary to further strengthen the generalizability of the CRS model. Second, despite the systematic and rigorous quality control and batch handling applied in this study, batch effects and inter-laboratory variability cannot be fully eliminated. Third, despite employing multiple approaches to reduce the risk of overfitting and performing extensive external validation, residual overfitting cannot be completely ruled out.
Conclusion
This study developed a multimodal framework integrating proteomics, PRS, and clinical factors for precise lung cancer screening. Using diverse cohorts and novel methods, we identified biologically relevant protein biomarkers. The combined model improved risk stratification, addressing current LDCT limitations by minimizing unnecessary screenings and prioritizing high-risk individuals. This approach highlights the value of multidimensional risk assessment in enhancing early detection and optimizing healthcare resources.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original article when quoting.