
Lightweight hybrid foundation model for lung cancer prognosis based on low-dose chest X-ray images.

Translational Lung Cancer Research (open access). 2026;15(3):47.

Hayeso HH, Lonseko ZM, Alotaibi FMG, Gan T, Shi P, Dong S


Citation: Hayeso HH, Lonseko ZM, Alotaibi FMG, Gan T, Shi P, Dong S. Lightweight hybrid foundation model for lung cancer prognosis based on low-dose chest X-ray images. Translational Lung Cancer Research. 2026;15(3):47. https://doi.org/10.21037/tlcr-2025-aw-1299
PMID: 41982697

Abstract

[BACKGROUND] Lung cancer (LC) is the leading cause of cancer mortality worldwide. Accurate prognosis in vulnerable populations, particularly in low-resource settings, remains challenging because standard imaging modalities carry a high radiation dose. While chest X-rays (CXRs) are safer and more accessible, their low spatial resolution limits prognostic accuracy. Existing multimodal models rely primarily on computationally intensive architectures and on computed tomography (CT) or positron emission tomography (PET) inputs, which reduces their clinical utility. This study aimed to develop and validate a lightweight hybrid foundation model (LHFM) that integrates CXR imaging with clinical data to enable accurate LC prognosis in low-resource settings.

[METHODS] We developed an LHFM that integrates visual features from CXR images extracted by a segment anything model (SAM)-Med2D encoder with semantically enriched prompts generated by BioGPT and clinical metadata. These multimodal features are fused via a dual-branch transformer architecture for survival prediction. The model was trained and validated on the JSRT and PadChest datasets, with external validation on multicenter datasets including the NIH CXR. Performance was evaluated using the concordance index (C-index), area under the receiver operating characteristic curve (AUROC), and Kaplan-Meier (KM) survival analysis.

[RESULTS] The proposed LHFM achieved superior prognostic performance, with a C-index of 0.910 [95% confidence interval (CI): 0.898-0.922, standard deviation (SD) =0.006] and AUROC of 0.935 (95% CI: 0.927-0.943, SD =0.004), outperforming existing multimodal benchmarks (P<0.001). KM curves demonstrated significant separation between the high-risk and low-risk groups. Domain-shift robustness testing across heterogeneous external datasets demonstrated representation stability under distribution shift.

[CONCLUSIONS] LHFM establishes a new paradigm for prognostic precision by delivering strong performance from low-dose CXR. This hybrid approach directly addresses the implementation gap in clinical artificial intelligence (AI), offering a scalable, equitable, and immediately applicable solution for personalized cancer care in resource-limited and radiography-first workflows, with potential applicability across other cancer types.


Introduction
Lung cancer (LC) remains the leading cause of cancer-related mortality worldwide, accounting for over 1.8 million deaths annually (1,2). Accurate prognostication is crucial for personalized treatment, yet existing models rely on computed tomography (CT) or positron emission tomography (PET) imaging modalities (3-6) which are limited by cost, radiation exposure, and access barriers in vulnerable populations and low-resource settings (7,8). By contrast, chest X-ray (CXR) is inexpensive, widely accessible, and involves substantially lower radiation dose (9-12), but its two-dimensional, lower-resolution nature hinders reliable extraction of subtle prognostic signals.
Recent deep learning (DL) studies have demonstrated that CXR images encode prognostic information (13,14); however, most frameworks remain unimodal and image-only, underutilizing the complementary clinical context available in radiology reports or electronic health record (EHR) data (15,16). Consequently, these models lack interpretability and are impractical for real-world deployment. Existing multimodal architectures, including Lite-ProSENet (17) and FGCN (18), achieve meaningful improvements [concordance index (C-index) ≈0.76–0.78], but they require high-resolution CT and computationally expensive transformers, which restricts their clinical scalability (19). Other radiomics- and omics-based approaches report more modest performance (C-index ≈0.65–0.78) and often lack external validation or interpretability (20-23); they likewise depend on high-resolution, high-radiation imaging and computationally intensive architectures, limiting scalability beyond specialized centers. Collectively, current evidence underscores a critical gap: there is no lightweight, interpretable, CXR-based multimodal framework that leverages recent advances in medical foundation encoders while remaining deployable in routine practice and resource-limited clinical environments.
Emerging medical vision-language models, such as segment anything model (SAM)-Med2D for segmentation (24) and BioGPT for biomedical language understanding (16), enable rich visual-text representations but are typically embedded in large, resource-intensive pipelines (25). Harnessing their representational strength within a compact architecture optimized for CXR-based risk stratification offers an opportunity to deliver accurate, interpretable prognosis without relying on CT/PET or high-end computing resources.
To address these gaps, we propose a lightweight hybrid foundation model (LHFM) designed to integrate CXR-derived embeddings from SAM-Med2D with BioGPT-generated clinical prompts and structured metadata through a dual-branch transformer. We hypothesize that integrating SAM-Med2D visual embeddings with BioGPT-generated semantic prompts and clinical data will achieve superior survival prediction accuracy from low-dose CXRs compared with unimodal and heavy multimodal baselines. The main contributions of this study are as follows.
❖ We propose a LHFM that integrates SAM-Med2D CXR features, BioGPT-generated semantic prompts, and structured clinical metadata through a unified dual-branch transformer architecture for LC prognosis.

❖ LHFM bridges the gap between high-capacity foundation models and deployable clinical aids by combining visual embeddings from SAM-Med2D with textual semantics from BioGPT, achieving high accuracy while preserving interpretability and scalability.

❖ Across multiple CXR-based datasets, LHFM consistently outperformed existing unimodal and multimodal baselines, with semantic prompt–guided fusion enhancing both interpretability and prognostic performance.

❖ The framework demonstrates robust cross-domain generalization and enables real-time inference on standard hardware, underscoring its translational potential for resource-limited and clinical settings.

We present this article in accordance with the TRIPOD reporting checklist (available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-2025-aw-1299/rc).

Methods

Datasets
This study utilized multiple publicly available CXR and CT datasets for model development and validation, and to assess domain adaptability. The model was developed and validated primarily on the JSRT and PadChest CXR datasets, which provide high-quality annotated images alongside demographic and clinical metadata, enabling robust prognostic model development across diverse patient groups and imaging protocols (26,27). Baseline demographic and clinical characteristics for the primary development cohort (JSRT) are summarized in Table S1. For experimental validation, these datasets were split into training (80%) and testing (20%) sets. To minimize unintended correlations and prevent information leakage, the split was performed at the subject level, so that samples from the same subject did not appear in both the training and testing partitions. Model stability was assessed using 5-fold cross-validation on the development datasets, and the held-out test set remained strictly isolated until final evaluation; sample images are presented in Figure 1. The model’s generalizability and translational potential were assessed on heterogeneous datasets, including the large-scale NIH ChestX-ray and NSCLC-Radiomics-Interobserver1 datasets, which represent varied demographic distributions and clinical contexts. In addition to internal testing, we evaluated LHFM under marked distribution shifts using heterogeneous external imaging cohorts. Because some datasets (e.g., Shenzhen/Montgomery) (28,29) are not LC cohorts, these analyses are reported as domain-shift robustness and feature-transferability tests rather than clinical external validation of LC prognosis. This evaluation strategy probes whether multimodal representations learned from low-dose radiographs remain stable under substantial differences in acquisition, pathology prevalence, and cohort composition. Furthermore, the NSCLC-Radiomics-Interobserver1 dataset was incorporated to assess the model’s performance (30).
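The subject-level split described above can be sketched as follows. This is a minimal illustration assuming a simple mapping from sample IDs to subject IDs, not the authors' actual pipeline: subjects, rather than individual images, are shuffled and partitioned, so no subject contributes images to both sets.

```python
import random
from collections import defaultdict

def subject_level_split(sample_ids, subject_of, test_frac=0.2, seed=42):
    """Split samples so that no subject appears in both partitions.

    sample_ids: list of sample identifiers
    subject_of: dict mapping sample id -> subject id (assumed available)
    """
    by_subject = defaultdict(list)
    for s in sample_ids:
        by_subject[subject_of[s]].append(s)

    subjects = sorted(by_subject)
    rng = random.Random(seed)
    rng.shuffle(subjects)

    # Hold out ~test_frac of SUBJECTS (not samples) for testing.
    n_test = max(1, round(test_frac * len(subjects)))
    test_subjects = set(subjects[:n_test])

    train = [s for s in sample_ids if subject_of[s] not in test_subjects]
    test = [s for s in sample_ids if subject_of[s] in test_subjects]
    return train, test
```

Because whole subjects are assigned to one side, the realized sample fractions can deviate slightly from 80/20 when subjects contribute unequal numbers of images.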
Accordingly, the primary endpoint used for training and evaluation was an algorithmically computed risk label (high-risk vs. low-risk) derived from available demographic and clinical metadata under controlled experimental assumptions. This design enables systematic assessment of whether hybrid fusion of imaging features and semantic prompt representations improves predictive discrimination in a low-dose CXR setting, while avoiding unsupported claims of actual time-to-event clinical outcome prediction.
The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. All datasets used in this study were publicly available (as presented in Appendix 1) and fully anonymized, and institutional review board approval was waived. Missing data in demographic or clinical metadata were handled by imputation with median values for continuous variables and mode for categorical variables, confirming balanced input across patient groups.
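The median/mode imputation scheme for metadata can be sketched in a few lines; this is an illustrative stand-in (using `None` to mark missing entries), not the paper's code.

```python
from collections import Counter
from statistics import median

def impute(values, kind):
    """Fill missing entries (None) with the column median for continuous
    variables or the mode for categorical variables, as described for
    the metadata pipeline."""
    observed = [v for v in values if v is not None]
    if kind == "continuous":
        fill = median(observed)
    else:  # categorical
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]
```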
Clinical prompts enhance LC prognosis by integrating key tumor features such as size, density, margins, and location into a lightweight predictive model. This method integrates imaging and clinical data to provide interpretable, efficient, and highly accurate predictions, making it suitable for diverse healthcare settings. In this study, clinical prompts were used as auxiliary inputs during model training.

Proposed methods
This section describes the proposed method for determining the prognosis of patients with LC using low-dose CXR. The main workflow includes four steps: data preprocessing, feature extraction, hybrid model training, and prognostic prediction, as presented in Figure 2. We begin by outlining data preprocessing techniques for robust and generalizable inputs, followed by model architecture, training, and prognostic prediction.

Data preprocessing
The preprocessing pipeline standardized the CXR images from the adult and other cohorts by resizing them to 224×224 pixels and normalizing the pixel intensities. Data augmentation, including rotation, scaling, and cropping, was applied to improve model robustness. Clinical metadata were processed to handle missing values, ensuring consistent and balanced input for subsequent multimodal feature extraction and model training. Pixel intensity values were normalized via z-score standardization as depicted in Eq. [1]:

x′ = (x − μ)/σ  [1]

where x is the raw pixel intensity, μ is the dataset-wide mean, and σ is the standard deviation (SD); μ and σ are computed per dataset to account for acquisition-specific illumination and contrast variations. Clinical metadata were processed via a parallel preprocessing pipeline.
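The resize-and-normalize step can be sketched as below. This is a minimal NumPy illustration: nearest-neighbour index sampling stands in for the paper's unspecified resize method, and per-image statistics stand in for dataset-wide μ and σ.

```python
import numpy as np

def preprocess_cxr(img, out_size=224):
    """Resize a grayscale CXR to out_size x out_size (nearest-neighbour,
    for simplicity) and apply z-score normalization per Eq. [1]."""
    img = np.asarray(img, dtype=np.float32)
    h, w = img.shape
    # Nearest-neighbour resize via integer index sampling.
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    resized = img[np.ix_(rows, cols)]
    # z-score; the paper uses dataset-wide statistics (per-image here).
    mu, sigma = resized.mean(), resized.std()
    return (resized - mu) / (sigma + 1e-8)
```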

Statistical analysis
Statistical analysis was performed to evaluate the model’s prognostic performance and robustness. The C-index assessed survival ranking, while the area under the receiver operating characteristic curve (AUROC) evaluated binary classification. Kaplan-Meier (KM) analysis with the log-rank test compared survival between risk groups. Performance is reported as mean ± SD with 95% CI from five-fold cross-validation. Sensitivity analyses included modality ablation and external validation. All analyses were conducted using Python (version 3.10) as shown in Table S2. P values smaller than 0.05 were considered statistically significant.

Constructing a LHFM
Figure 3 illustrates the conceptual transition from traditional and DL models to the proposed model (LHFM), highlighting its multimodal efficiency and interpretability. The framework and implementation of the proposed model are shown in Figure 4, which consists of a dual-branch feature extractor, a fusion model, and a prognostic prediction module. The details are explained below.

Dual hybrid feature extraction
In this stage, visual features are extracted from the preprocessed CXR images via the SAM-Med2D (24) visual encoder, which generates high-fidelity embeddings that capture spatial context and critical visual features from the CXR, facilitating the subsequent integration of visual information into the model. In parallel, prompts are generated as image-conditioned semantic representations by BioGPT (16), a biomedical language model: a lightweight learned prompt generator produces them from CXR-derived embeddings, bridging the visual latent space to BioGPT’s token embedding space. The generated prompt functions as an auxiliary semantic representation that supports multimodal fusion and interpretability. Importantly, the prompt branch is designed as a representation-learning mechanism, rather than a standalone clinical text prognostic model. Qualitative examples of generated prompts and additional prompt-sensitivity experiments are provided in Table S3. For example, a generated prompt might be: “Presence of a spiculated mass with lesion margins in the right upper lobe, approximately 3.2 cm in diameter.” These prompts offer interpretable descriptions of clinical conditions, effectively bridging pixel-level features to high-level semantic descriptions.
Dual hybrid feature extraction combines visual and textual pathways to create clinically informative CXR representations. Let X = {x_1, …, x_N} denote the set of preprocessed CXR images. Each x_i is first projected through a vision transformer (ViT) backbone whose patch-token output is linearly mapped to a compact visual embedding z_v. A lightweight prompt generator then converts z_v into a clinically interpretable prompt p. This prompt is tokenized and passed through BioGPT, whose [CLS] output is similarly linearly projected to yield the text embedding z_t. By aligning z_v and z_t in a shared latent space (z_v, z_t ∈ ℝ^d), we fuse pixel-level detail with semantic descriptors, enabling the model to ground high-level clinical concepts directly in image features.
We first extract a visual embedding z_v for each image x via the SAM-Med2D (24) encoder E_v, as shown in Eqs. [2-4]:

z_v = E_v(x) ∈ ℝ^d  [2]

where E_v is implemented as a ViT with a projection head: it processes the input image into spatial embeddings from the raw patch-token outputs, using a transformer architecture fine-tuned for medical imaging. Concurrently, the clinical textual prompt p = (p_1, …, p_n), where n is the number of prompt tokens, is generated from z_v via the prompt generator g and embedded with BioGPT (16):

p = g(z_v), z_t = W_t · BioGPT_[CLS](p) ∈ ℝ^d  [3]

where g is a small feed-forward network, BioGPT_[CLS](p) is the [CLS] token output of BioGPT, and W_t is a learned linear projection. Finally, cross-modal attention dynamically synthesizes clinical prompts conditioned on the visual embeddings:

z̃_t = Attention(Q = z_t, K = z_v, V = z_v)  [4]

This cross-attention mechanism explicitly aligns pixel-level features with semantic descriptors, ensuring that prompts are anatomically grounded and clinically interpretable.
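Under this notation, the dual-branch extraction can be sketched with stand-in encoders. SAM-Med2D and BioGPT are replaced here by patch pooling plus fixed random projections purely to illustrate how both branches land in a shared d-dimensional space; the dimensions (d=256, 768-dim tokens, 16×16 patches) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256  # shared latent dimension (assumed)

# Stand-ins for the SAM-Med2D visual encoder and the BioGPT text encoder,
# reduced to pooling plus fixed random projections for illustration only.
W_vis = rng.normal(size=(196, d)) / np.sqrt(196)  # 14x14 patch grid -> d
W_txt = rng.normal(size=(768, d)) / np.sqrt(768)  # token dim -> d

def encode_image(x):
    """Visual branch: average 16x16 patches (ViT-style tokens), project to d."""
    patches = x.reshape(14, 16, 14, 16).mean(axis=(1, 3))  # -> (14, 14)
    return patches.reshape(-1) @ W_vis

def encode_prompt(token_emb):
    """Text branch: mean-pool prompt token embeddings, project to d."""
    return token_emb.mean(axis=0) @ W_txt

x = rng.normal(size=(224, 224))      # preprocessed CXR
tokens = rng.normal(size=(32, 768))  # mock prompt token embeddings
z_v, z_t = encode_image(x), encode_prompt(tokens)
```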

Hybrid multimodal fusion model
The main model-training stage integrates visual embeddings and semantic prompts via a lightweight dual-branch transformer model. This multimodal fusion architecture outputs scalar risk scores indicative of prognosis. The fusion model integrates visual (z_v), textual (z_t), and clinical (z_c) features via a dual-branch transformer; the vision encoder employs a patch-based transformer. The modality-specific projections are fused as depicted in Eq. [5]:

h = α·z_v + β·z_t + γ·z_c  [5]

where α, β, and γ are learnable weights. Tokenization and iterative fusion are then employed as shown in Eq. [6]:

h′ = LN(W_f · (z_v || z_t || z_c) + b_f)  [6]

where LN represents layer normalization, || represents the concatenation operator, and W_f and b_f represent the fusion weights and bias.
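A minimal sketch of the fusion step, assuming the weighted-sum-then-normalize form above; the fixed weights stand in for the learnable α, β, γ, and the concatenation of the fused token with the branch tokens is an illustrative simplification.

```python
import numpy as np

def layer_norm(v, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    return (v - v.mean()) / (v.std() + eps)

def fuse(z_v, z_t, z_c, w=(0.5, 0.3, 0.2)):
    """Eq. [5]-style weighted modality sum, then Eq. [6]-style
    concatenation and layer normalization (weights fixed here;
    learnable in the actual model)."""
    h = w[0] * z_v + w[1] * z_t + w[2] * z_c
    return layer_norm(np.concatenate([h, z_v, z_t, z_c]))
```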

Prognostic prediction
Our prognosis-prediction module embeds CXR features, BioGPT-derived semantic prompts, and tabular covariates into a shared latent vector h. A dual-head architecture then produces (I) a calibrated continuous hazard score for time-to-event analysis and (II) a SoftMax posterior that dichotomizes patients into low- versus high-risk strata. The transformer backbone first aligns modality-specific embeddings through stacked multihead fusion-attention layers, a design that has already outperformed conventional pipelines (31). The fused representation is routed to a survival head f_surv, mirroring the continuous-risk formulation, and to a classification head f_cls, which supports binary triage consistent with recent vision-language prognostic frameworks for CXR analysis. The survival risk and discrete outcome are depicted in Eqs. [7,8], respectively:

r = σ(f_surv(h))  [7]

ŷ = SoftMax(f_cls(h))  [8]

A sigmoid-activated risk score r quantifies individual survival probability, whereas the parallel SoftMax head yields a class-level prognosis ŷ. Both heads share the fused latent h, ensuring that predictions are informed simultaneously by pixel-level patterns, semantic prompts, and structured patient data (32).
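The dual-head readout can be sketched as two linear heads over the fused latent; the head shapes and random weights here are illustrative assumptions, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32
w_surv = rng.normal(size=d)      # survival head (assumed linear)
W_cls = rng.normal(size=(2, d))  # binary low/high-risk classification head

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def predict(h):
    """Dual-head output: (continuous risk score, low/high-risk posterior)."""
    risk = sigmoid(w_surv @ h)  # Eq. [7]-style hazard score in (0, 1)
    probs = softmax(W_cls @ h)  # Eq. [8]-style class posterior
    return risk, probs
```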

Loss function of LHFM
The model was optimized via a weighted binary cross-entropy loss (Eq. [9]) to address class imbalance, with a positive-class weight w₊ applied to increase the recall of high-risk cases:

L = −(1/N) Σᵢ [w₊ · yᵢ · log(ŷᵢ) + (1 − yᵢ) · log(1 − ŷᵢ)]  [9]

Training employed the Adam optimizer (β1=0.9, β2=0.999, weight decay =1×10⁻⁴, learning rate =1×10⁻⁴) with early stopping, yielding stable convergence and well-calibrated risk predictions suitable for clinical decision support. The simplified gradient ensures numerically stable updates under mixed-precision training.
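A minimal sketch of the weighted binary cross-entropy objective; the positive-class weight value is illustrative, since the paper does not state it here, and probability clipping is added for numerical safety.

```python
import numpy as np

def weighted_bce(y_true, y_prob, pos_weight=3.0, eps=1e-7):
    """Weighted binary cross-entropy in the style of Eq. [9]: the
    positive-class weight up-weights errors on high-risk (y=1) cases
    to counter class imbalance. pos_weight=3.0 is an assumed value."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    losses = -(pos_weight * y_true * np.log(y_prob)
               + (1 - y_true) * np.log(1 - y_prob))
    return losses.mean()
```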

Reproducibility and implementation details
All experiments were implemented in Python using PyTorch and trained on an NVIDIA GeForce RTX 2080Ti GPU. To enhance reproducibility and minimize run-to-run variability, random initialization was controlled using a deterministic seed (applied across Python/NumPy/PyTorch, including CUDA where applicable). CXR images were preprocessed using a standardized pipeline, including resizing all inputs to 224×224 pixels and applying consistent intensity normalization to stabilize optimization and improve convergence. Model training was performed using the Adam optimizer with a batch size of 16, and an early stopping strategy was applied based on validation performance to reduce overfitting and retain the best-performing checkpoint. To improve generalization, radiography-appropriate data augmentation was incorporated during training. Model evaluation was conducted using 5-fold cross-validation, and final performance is reported as mean ± SD across folds along with 95% confidence intervals (CIs) to quantify uncertainty and ensure reliable comparison across experiments. Specific details are presented in Table S2.
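The deterministic-seed setup described above can be sketched as follows; the PyTorch branch is guarded with a try/except, since the exact flags used by the authors are not stated.

```python
import os
import random

import numpy as np

def set_deterministic_seed(seed=42):
    """Seed Python, NumPy, and (when available) PyTorch/CUDA so that
    repeated runs initialize identically, as in the paper's setup."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Assumed cuDNN settings for determinism; not confirmed by the paper.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; Python/NumPy seeding still applies
```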

Evaluation metrics
The experimental results were evaluated via the concordance index (C-index) to assess prognostic accuracy (33), as depicted in Eq. [10], alongside the AUROC to compute classification performance. Model performance was reported as mean ± SD across folds, with 95% CI, enabling robust estimation of variability and ensuring reproducibility under repeated sampling. The statistical significance of improvements over baselines was validated by CIs, and interpretability analysis quantified the contributions of imaging and textual features to the predictions (34):

C-index = Σ_{i,j} 1(ηᵢ > ηⱼ) · 1(Tᵢ < Tⱼ) · δᵢ / Σ_{i,j} 1(Tᵢ < Tⱼ) · δᵢ  [10]

where ηᵢ and ηⱼ are the predicted risk scores for individuals i and j, Tᵢ and Tⱼ are the actual observed survival times, δᵢ is an event indicator for individual i (1 if the event occurred, 0 if censored), and 1(·) is an indicator function that returns 1 if the condition inside is true and 0 otherwise.
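The C-index of Eq. [10] can be computed directly from its definition; the reference implementation below counts comparable pairs (the earlier time must be an observed event) and credits tied risk scores with 0.5, a common convention.

```python
def concordance_index(risk, time, event):
    """Harrell's C-index per Eq. [10]: among comparable pairs, the
    fraction where the subject with shorter survival also received
    the higher predicted risk. Ties in risk count as 0.5."""
    num, den = 0.0, 0.0
    n = len(risk)
    for i in range(n):
        if not event[i]:
            continue  # pair comparable only if i's event was observed
        for j in range(n):
            if time[i] < time[j]:
                den += 1
                if risk[i] > risk[j]:
                    num += 1
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den
```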
Moreover, KM survival estimation is utilized in LC prognosis because it robustly handles censored observations, allowing accurate survival-function estimation even when patients are lost to follow-up or when the event has not occurred by the end of the study (35), as depicted in Eq. [11]:

S(t) = ∏_{tᵢ ≤ t} (1 − dᵢ/nᵢ)  [11]

where S(t) represents the probability that a patient survives longer than time t after diagnosis or treatment start, tᵢ represents the ordered event times, dᵢ indicates the number of events at time tᵢ, nᵢ represents the number of individuals at risk just before tᵢ, and the product is taken over all distinct event times up to t.
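The KM estimator of Eq. [11] is likewise short to implement: censored subjects contribute to the risk sets nᵢ but trigger no multiplicative step.

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier estimator per Eq. [11]:
    S(t) = prod over event times t_i <= t of (1 - d_i / n_i)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    times, surv = [], []
    s = 1.0
    for t in np.unique(time[event]):        # ordered distinct event times
        n_i = np.sum(time >= t)             # at risk just before t
        d_i = np.sum((time == t) & event)   # events occurring at t
        s *= 1.0 - d_i / n_i
        times.append(t)
        surv.append(s)
    return times, surv
```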

Results

Performance comparisons with related works
The progressive improvement from unimodal to multimodal fusion models highlights the intrinsic value of heterogeneous data integration in survival prediction tasks. Specifically, the baseline clinical model demonstrated limited prognostic discrimination (C-index: 0.696±0.008), as presented in Table 1, underscoring the inherent limitations of traditional approaches that rely on structured tabular data. In contrast, the FGCN model’s modest gains (C-index: 0.769±0.007) suggest that modeling inter-feature dependencies can partially address these limitations.
These trends are consistent with the recent studies where multimodal or habitat-aware CT models achieved strong but CT-dependent survival or response prediction in NSCLC patients (19,36-38). The C-index of 0.910±0.006 (95% CI: 0.898–0.922) confirms that LHFM effectively integrates image and text information, outperforming multimodal baselines, thereby overcoming the gap that often hinders unimodal DL approaches.
The most significant improvement is observed over Deng et al. (22), with AUROC and C-index increases of 0.203 and 0.251, respectively, highlighting a substantial increase in both classification discrimination and survival ranking capabilities. Similarly, compared with the baseline unimodal model, LHFM achieved improvements of 0.185 in the AUROC and 0.214 in the C-index, underscoring the benefit of integrating multimodal data. Notably, while Lite-ProSENet (17) exhibited strong performance, LHFM still achieved significant improvements (AUROC: +0.016, C-index: +0.017), reflecting that its advantages persist even against high-performing multimodal architectures.
These improvements validate LHFM’s multimodal hybrid fusion approach, effectively leveraging complementary information from different data types. The consistent superiority across all comparisons suggests that LHFM generalizes well to different baseline architectures and is not overly tuned to any single reference model. This makes LHFM a clinically promising framework for LC prognosis, with potential applicability across diverse datasets and clinical scenarios.

KM survival curve
KM survival analysis confirmed significant stratification between the high- and low-risk groups identified by LHFM (log-rank P≈6.42×10⁻¹³), as shown in Figure 5. The high-risk group exhibited a survival decline, with the probability of survival decreasing to less than 0.4 within 12 months and approaching zero by 60 months, whereas the low-risk group maintained substantially greater survival throughout follow-up. The minimal overlap of the 95% CI beyond the initial time points reinforced the robustness of this separation. LHFM risk stratification aligns with clinically significant outcome differences, demonstrating strong prognostic utility (C-index: 0.910; AUROC: 0.935) as validated in Table 2.

Generalizability
Table 2 shows that LHFM preserves exceptional prognostic discrimination across a spectrum of external cohorts. To evaluate external validity and generalizability, the model was trained on the JSRT and PadChest datasets and directly tested on independent external datasets. For the expansive NIH CXR dataset (28), the model achieves a C-index of 0.851, attesting to its robustness within adult radiography pipelines. The performance further increased to 0.879 on the volumetric, expert-annotated NSCLC Radiomics Interobserver1 (30), indicating that the integration of rich 3D morphological information can meaningfully enhance survival stratification.
Notably, even on the comparatively limited and tuberculosis-oriented Shenzhen and Montgomery CXR datasets (29), LHFM achieved a C-index of 0.835, reflecting only a modest performance reduction amid domain and pathology shifts. All values exceeded 0.80, confirming clinical relevance and robustness under distribution shifts, and collectively underscoring LHFM’s capacity to generalize across diverse imaging modalities, spatial resolutions, cohort scales, and disease spectra with minimal adaptation. All external validation values are reported with 95% CIs.
External validation across three independent datasets (NIH CXR, NSCLC-Radiomics-Interobserver1, and Shenzhen-Montgomery) demonstrated strong generalizability (C-index range: 0.835–0.879), confirming that LHFM maintains prognostic accuracy under significant demographic and modality variation.

Ablation study
An ablation study conducted on a stratified hold-out test set (20% of the dataset) confirmed that both visual and textual components are essential for LHFM’s prognostic performance. The removal of either modality resulted in significant performance degradation, as depicted in Table 3. The text-only variant exhibited particularly substantial declines, underscoring the indispensability of imaging data for capturing significant anatomical details. Multimodal fusion via attention-based integration outperformed naive fusion strategies, with the full model (LHFM) achieving optimal performance (AUROC: 0.935±0.004; C-index: 0.910±0.006). These results confirm the synergistic interaction between imaging and clinical text modalities, whereas visualization analyses, as depicted in Figure 6, further validated the model’s interpretability and class-agnostic robustness.

Sensitivity analyses and robustness evaluation
To address concerns regarding potential prompt-driven shortcut learning and the unusually high discrimination metrics, we conducted sensitivity analyses across benchmark comparisons, modality ablations, and external robustness evaluations (Table 4). The full multimodal LHFM achieved the strongest overall performance (AUROC =0.935±0.004; C-index =0.910±0.006). Removing individual branches resulted in consistent performance degradation, including the image-only variant (AUROC =0.873±0.007; C-index =0.842±0.009) and the prompt-only variant (AUROC =0.824±0.006; C-index =0.794±0.008), supporting that the gains are not attributable to a single dominant modality and confirming the complementary value of hybrid fusion rather than trivial shortcuts. External robustness was maintained across independent datasets, with C-index values ranging from 0.835 to 0.879, confirming generalizable prognostic discrimination under dataset shift (Table S3).

Discussion
This study introduces LHFM, which integrates visual features extracted from CXR via a fine-tuned SAM-Med2D encoder with semantically enriched clinical text prompts generated by BioGPT for LC prognosis. The LHFM demonstrated consistent superiority over state-of-the-art unimodal and multimodal benchmarks across diverse cohorts. It achieved a C-index of 0.910 (95% CI: 0.898–0.922; P<0.001), representing a significant improvement over existing models such as Lite-ProSENet (17). Ablation studies validated that integrating clinical text embeddings enhanced prognostic accuracy by approximately 7% compared with image-only baselines and substantially improved model interpretability through attention-guided risk stratification. Unlike existing ViT models focused primarily on report generation (39,40), the LHFM is specifically designed for survival analysis, explicitly modeling time-to-event endpoints with a computationally efficient architecture. This directly addresses the unmet need for lightweight, interpretable prognostic models in vulnerable populations and resource-limited contexts.
The interpretability and clinical relevance of the LHFM were enhanced through BioGPT-guided saliency maps that consistently co-localized with clinically relevant anatomical regions, thereby reinforcing clinician trust and adhering to emerging standards for AI transparency and accountability (39). This aligns with the findings of a recent study that emphasized the importance of structured AI reporting and its integration into radiology workflows (41). This alignment is particularly impactful in oncology for vulnerable populations, where limited datasets and the imperative to minimize radiation exposure make interpretable, low-dose prognostic tools essential for both scientific and ethical precision medicine (41). The model’s prognostic utility was further confirmed through KM survival analysis, which provides visualization and complements statistical metrics with risk stratification (42). By leveraging the accessibility and safety of CXR (9), our approach provides a clinically relevant and evidence-driven framework for advancing precision oncology in diverse healthcare settings.
Furthermore, LHFM employs a lightweight architectural design that uses substantially fewer parameters than conventional transformer-based survival models do (43), ensuring computational efficiency without compromising predictive performance. This framework is particularly crucial for clinical deployment in resource-constrained environments. The model demonstrated robust generalizability across diverse patient demographics and imaging protocols, achieving consistent prognostic performance with C-index values ranging from 0.835 to 0.879 on external validation datasets, exceeding the clinically meaningful threshold of 0.80 (44,45). The clinical implication is that this framework could be integrated into bedside radiology systems to support real-time triage and informed decision-making (3,46).
The strong discrimination reported in this study should be interpreted within the context of the study design. Because the primary endpoint is an algorithmically computed risk label derived from available demographic and clinical data, the reported AUROC and C-index reflect methodological feasibility and multimodal integration capacity, rather than direct clinical survival prediction performance. Therefore, these results should not be directly compared with CT-based survival radiomics studies using verified time-to-event endpoints. Instead, LHFM demonstrates that lightweight hybrid fusion of foundation-level visual embeddings and semantically enriched prompt representations can deliver robust risk stratification performance on low-dose CXR data under controlled assumptions.
In addition to its methodological contributions, the LHFM has significant implications for medical system integration (47,48). By enabling accurate LC prognosis from low-dose CXR, the framework directly supports clinical decision-support systems deployable at the point of care. Its lightweight architecture reduces computational requirements, facilitating adoption in health facilities where advanced imaging modalities are not available. The scalability of the framework enables integration with radiology workflows and EHR platforms, supporting risk stratification, follow-up scheduling, and treatment prioritization. By leveraging CXR, the most widely available imaging modality, our model enhances cost-effectiveness, reduces reliance on high-radiation techniques, and promotes equitable access to advanced prognostic tools. In this way, the model contributes not only to technical innovation but also to healthcare efficiency and system-wide optimization. Moreover, this work aligns with the translational focus of Translational Lung Cancer Research by signifying how foundation models can be adapted into lightweight, clinically deployable prognostic tools.
We performed dedicated sensitivity experiments on the model-generated prompts to verify that the prompt pathway does not artificially inflate performance through pairing artifacts. Compared with the reference LHFM, replacing prompts with a null token reduced performance, prompt shuffling across subjects produced a substantial drop, and random prompt injection further degraded discrimination. These findings confirm that the prompt branch contributes meaningful value only when prompts remain semantically relevant and correctly paired with the corresponding image, supporting the interpretation that LHFM learns clinically grounded multimodal representations rather than prompt-driven outcome shortcuts.
However, this study is limited by its retrospective design and potential biases arising from multicenter heterogeneity. Although robustness and prompt-sensitivity analyses were performed, prospective validation on survival-linked LC cohorts is needed. Furthermore, the reliance on CXR, while a strength in terms of accessibility, inherently limits the morphological detail available compared with CT, which may cap the ultimate prognostic performance. Future studies will prospectively evaluate LHFM within integrated genomic and EHR environments to assess workflow impact, trust calibration, and prognostic depth.

Conclusions
This study presents LHFM, which integrates SAM-Med2D visual features with BioGPT-generated prompts and clinical metadata for the prognosis of LC. Our model demonstrated significant prognostic accuracy (C-index =0.910 and AUROC =0.935) and interpretability compared to existing frameworks, while maintaining computational efficiency, making it suitable for vulnerable populations and resource-limited healthcare environments. By leveraging CXRs, the most widely accessible imaging modality, LHFM reduces reliance on costly and high-radiation imaging, facilitating seamless integration into EHR and radiology workflows. In conclusion, the proposed framework provides a scalable, cost-effective decision support tool that integrates seamlessly into healthcare delivery systems, improving workflow efficiency and access to accurate prognoses in diverse clinical settings. Future work will focus on prospective validation and the incorporation of genomic data to enhance clinical utility across oncological domains.

Supplementary
The article’s supplementary files are available online.

Source: PubMed Central (JATS). Licensing follows the original publisher’s policy; please cite the original article when quoting.
