
Deep learning-based pulmonary nodule risk assessment outperforms established malignancy risk scores in lung cancer screening.

Radiology Advances, 2026, Vol. 3(1), umag003

Mortani Barbosa EJ, Kim Y, Zhang Y, Setio AAA, Mellot F, Grenier PA


Citation: Mortani Barbosa EJ, Kim Y, et al. (2026). Deep learning-based pulmonary nodule risk assessment outperforms established malignancy risk scores in lung cancer screening. Radiology Advances, 3(1), umag003. https://doi.org/10.1093/radadv/umag003
PMID: 41768120

Abstract

[BACKGROUND] Pulmonary nodules are commonly encountered in lung cancer screening. The risk of malignancy varies widely and is generally estimated using expert consensus guidelines (Lung CT Imaging Reporting and Data Systems [Lung-RADS]).

[PURPOSE] To assess the performance of a deep learning algorithm (Deep Pulmonary Nodule Profiler [DeepPNP]) for pulmonary nodule malignancy risk estimation in a lung cancer screening dataset and the effect of data enrichment in model training.

[MATERIALS AND METHODS] A retrospective analysis was conducted using 3 datasets. DeepPNP is a 3D convolutional network (EfficientNet-B0-based) operating on nodule-centered 3D patches. For the DeepPNP model training and validation, the National Lung Screening Trial (NLST) dataset was combined with 2 independent malignant nodule-only datasets, resulting in a merged dataset of 28 057 nodules, including 2362 malignant nodules. An ablation model (DeepPNP-NLST) was trained on NLST only. The testing was conducted on a held-out dataset from the NLST dataset. Performance metrics, including sensitivity, specificity, precision, F1 score, and accuracy, were analyzed across 3 operating thresholds selected based on specificities of 0.80, 0.85, and 0.90 (selected on the validation set). Benchmarks included Lung-RADS v2022 and the PanCan model.

[RESULTS] On the NLST test set (including 2597 nodules from 1243 CT scans), DeepPNP achieved an area under the receiver operating characteristic curve (ROC AUC) of 0.96 (95% confidence interval [CI], 0.95-0.97), outperforming Lung-RADS (AUC = 0.91; 95% CI, 0.89-0.94; P < .001) and the PanCan model (AUC = 0.93; 95% CI, 0.91-0.95; P < .001). DeepPNP-NLST had an AUC of 0.95 (95% CI, 0.93-0.97; P = .045 vs DeepPNP), indicating a modest gain from positive-only supplementation. Subgroup analyses showed consistent outperformance across nodule sizes and types. Operating-point metrics at 0.80/0.85/0.90 specificity are reported; at 0.80 specificity, DeepPNP achieved sensitivity of 0.94 (100/107; 95% CI, 0.88-0.98) and specificity of 0.88 (2196/2490; 95% CI, 0.87-0.90).

[CONCLUSION] DeepPNP outperformed established malignancy risk models in lung cancer screening. The inclusion of biopsy-confirmed malignant nodules from 2 external datasets provided a measurable performance gain, underscoring the importance of data enrichment during model training.


Introduction
Lung cancer is the leading cause of cancer-related mortality worldwide,1 and early detection is essential for improving survival rates. Low-dose computed tomography (LDCT) screening has been shown to reduce lung cancer mortality by 20% compared to chest radiography.2,3 However, LDCT screening presents challenges, including high false-positive rates and overdiagnosis.4–6 Traditional radiological assessment can be subjective and variable, affecting diagnostic accuracy.7–10 Furthermore, the ever-increasing volume of imaging studies necessitates tools that enhance efficiency, accuracy, and consistency.11
Artificial intelligence (AI), particularly convolutional neural networks (CNNs), has achieved state-of-the-art performance in image recognition tasks, including pulmonary nodule classification.12–19 Studies have applied AI models in lung cancer screening populations, reporting high sensitivity and specificity in distinguishing benign from malignant nodules.16–19 One study developed a 3D deep learning (DL) algorithm achieving performance comparable to radiologists in predicting lung cancer risk.18
The objective of this study was 2-fold: (1) to develop a DL model for malignancy risk estimation in screening-detected pulmonary nodules and to compare its performance with established malignancy risk scores, and (2) to assess the effect of malignant nodule enrichment during model training. The latter involved augmenting the lung cancer screening dataset with additional biopsy-confirmed malignant nodules.

Materials and methods

Study design and ethical considerations
Institutional Review Board (IRB) approvals were obtained from all participating institutions. Given the retrospective nature of the study, the requirement for individual patient consent was waived. The study was conducted in compliance with the Health Insurance Portability and Accountability Act (HIPAA) for U.S.-based data handling. This retrospective study trained, optimized, and validated a DL-based algorithm (DeepPNP) for pulmonary nodule malignancy risk estimation.

Datasets for model development and testing

National Lung Screening Trial (NLST) Dataset2: The NLST dataset comprised low-dose CT scans from 26 722 participants collected between 2002 and 2007 in a multicenter randomized trial. Eligibility included adults aged 55-74 years with at least 30 pack-years of smoking and no prior lung cancer. The screening protocol consisted of a baseline low-dose CT and 2 annual screens (T0-T2); in this study, we included scans from all screening rounds. Malignant nodules were biopsy-confirmed within 1 year, and benign nodules were from participants who remained free of lung cancer during follow-up. All nodules larger than or equal to 3 mm were included, yielding 26 488 nodules (793 malignant). Multiple nodules could originate from the same participant. Nodules were analyzed as independent observations, with no adjustment for within-participant clustering.

Dataset A: This dataset comprised 1132 retrospectively and consecutively collected chest CT scans from patients with biopsy-confirmed lung cancer, acquired at a tertiary referral center in Europe between 2010 and 2020. Scans included both low-dose and standard-dose protocols and were performed without intravenous contrast. Radiologists identified and labeled malignant nodules ≥3 mm. This dataset contributed 998 biopsy-confirmed malignant nodules to the training set.

Dataset B20: This dataset comprised CT scans from 608 patients with biopsy-confirmed lung cancer, sourced via a data broker across multiple U.S. centers during 2014-2022. The data broker aggregates de-identified imaging from a multi-site provider network, and the exact number of contributing centers for this cohort was not available. Cases were identified from institutional pathology/biopsy reports. CT studies included both low-dose and standard-dose scans. Radiologists identified and labeled malignant nodules ≥3 mm. This dataset contributed 571 malignant nodules to the training set.

These 3 datasets are independent. NLST participants were enrolled as part of a prospective screening trial, while the other 2 datasets consisted of biopsy-confirmed malignant nodules retrospectively collected for model enrichment. The participant characteristics and CT acquisition parameters of these 3 datasets are provided in the Supplementary Material. Table 1 summarizes the characteristics of pulmonary nodules from the 3 datasets. For each nodule, size was measured as the longest axial diameter in millimeters on thin-section CT images. Nodule type (solid, part-solid, non-solid, or calcified) and the presence of spiculation were recorded based on radiologist annotations during the truthing process. Information on lung cancer histology and stage was not consistently available across datasets and therefore was not included in the analysis. Figure 1 summarizes the inclusion and exclusion criteria for the 3 datasets and the training, validation, and testing data split.

Data preparation and splitting
Participants from the NLST dataset were randomly divided into training (75%), validation (10%), and test (15%) sets, ensuring a representative distribution of the study sample across subsets. To alleviate the class imbalance between benign and malignant nodules in the NLST dataset, additional malignant nodules from Datasets A and B were incorporated into the training set. These datasets, containing only malignant nodules, strengthened the model’s capacity to identify and learn features associated with malignancy. The NLST validation and test sets retained the original population distribution, making them representative of a screening scenario. Table 2 presents the characteristics of the training, validation, and test sets in terms of nodule size, type, spiculation, and malignancy.
To evaluate the effectiveness of the data enrichment strategy, we trained an additional model, DeepPNP-NLST, using only the screening data (NLST) without incorporating the additional biopsy-confirmed malignant cases.
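Under the proportions stated above, the participant-level split and training-set enrichment can be sketched as follows. This is a minimal illustration: the function name, seed, and shuffling logic are hypothetical, as the study's actual assignment code was not published.

```python
import random

def split_participants(participant_ids, seed=0):
    """Randomly split participants (not nodules) into training (75%),
    validation (10%), and test (15%) sets, as described in the text.
    All nodules from a participant inherit that participant's split."""
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(0.75 * n)
    n_val = int(0.10 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train_ids, val_ids, test_ids = split_participants(range(1000))
# The malignant-only enrichment cases (Datasets A and B) are appended to
# the training set only, leaving validation/test representative of screening.
```

Because the validation and test sets keep the original NLST distribution, metrics computed on them reflect a screening prevalence rather than the enriched training prevalence.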

Nodule patch pre-processing
Across the 3 datasets, 14 radiologists reviewed the CT scans and labeled pulmonary nodules using an internally developed tool; each radiologist annotated a subset of cases. Each nodule was assigned a centroid, and the size was determined as the longest axial diameter. A secondary review was done by a thoracic radiologist with 8 years of experience in thoracic imaging. For the NLST dataset, all nodules >3 mm were annotated. In the 2 malignant datasets, only biopsy-proven malignant nodules were annotated, based on biopsy reports or images used for procedure guidance. For pulmonary nodules smaller than 50 mm in diameter, we cropped a 50 × 50 × 50 mm³ 3D region of interest (ROI) patch centered on the nodule and then resampled it to a 64 × 64 × 64 voxel patch. This 3D patch size was chosen to cover most pulmonary nodules and provide sufficient context around the nodule. For nodules larger than 50 mm in diameter, a larger 3D ROI patch, defined as a bounding box tightly fitting the nodule, was extracted to ensure the entire nodule was included, then resampled to a 64 × 64 × 64 voxel patch as well. The patches were normalized from a window level of -300 Hounsfield units (HU) and a window width of 1400 HU to a value range between 0 and 1, with values exceeding this range being clipped. The resulting preprocessed 3D patches served as model inputs during both training and inference.

Study outcome
The primary outcome of this study was the classification of pulmonary nodules detected through lung cancer screening as either benign or malignant. Malignant nodules were defined as those that were biopsy-confirmed within 1 year of detection in the NLST dataset, or nodules identified as malignant in the 2 external datasets (Dataset A and Dataset B), which consisted entirely of biopsy-proven lung cancers. Benign nodules were defined as nodules detected in the NLST dataset for which patients remained free of lung cancer throughout the 7-year follow-up period.

DL model

Model architecture and training
We implemented a 3D convolutional neural network (DeepPNP) based on the EfficientNet-B0 architecture, adapting it to process volumetric CT imaging data.21 The model architecture was designed to capture complex spatial features associated with pulmonary nodules by replacing the standard 2D convolutional kernels (3 × 3 and 5 × 5) with 3D kernels (3 × 3 × 3 and 5 × 5 × 5). We optimized the model’s hyperparameters using a randomized search, testing 32 configurations and selecting the hyperparameters that performed best on the validation dataset. The final model configuration included a batch size of 16, a learning rate of 2.092e-05, and a dropout rate of 0.223, which helped balance model generalization and overfitting risks. To manage class imbalance, we used focal loss with a gamma parameter set to 2 and the alpha parameter set to 0.25.22 The model was optimized using the Adam optimizer.23 The training was conducted over 50 epochs, each consisting of 500 iteration steps. To enhance the robustness and generalizability of the model, extensive 3D data augmentation techniques were employed during training, incorporating both spatial and intensity-based transformations. Spatial augmentations included random shifts along each axis (−1.0 mm to 1.0 mm), rotations (−30° to 30°), zooming (scaling factors of 0.95 to 1.05), and flipping along axial, coronal, or sagittal planes. Intensity augmentations consisted of adding Gaussian noise (10% of the window width), Gaussian blurring (standard deviation of 0.5 to 1.5), brightness scaling (0.7 to 1.3), and contrast adjustments (0.65 to 1.5). Each augmentation was applied probabilistically: flipping with a 50% probability, Gaussian blurring with a 10% probability, and all other augmentations with a 15% probability. These augmentations were composed sequentially to ensure a diverse training dataset and improve model resilience to variations in nodule size, shape, intensity, and position, while maintaining clinical relevance. 
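The focal loss used above (gamma = 2, alpha = 0.25) can be written as a short sketch. This is a generic NumPy rendition of the binary focal loss of Lin et al., with the paper's hyperparameters, not the authors' training code:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss with gamma=2, alpha=0.25 as in the paper.
    p: predicted malignancy probability; y: label (1 = malignant).
    The (1 - p_t)^gamma factor shrinks the contribution of easy,
    well-classified examples (the large benign majority)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y == 1, p, 1 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
```

An easy benign nodule (p = 0.01, y = 0) contributes a loss several orders of magnitude smaller than a missed malignant one (p = 0.1, y = 1), which is how the loss counteracts the class imbalance described above.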
The system was implemented using the PyTorch framework (version 1.10.1) and Python (version 3.9.2). Model development and testing were conducted on a high-performance supercomputing cluster comprising 200 compute nodes, each equipped with 8 GPUs, including Titan X, V100, A100, and H100 models. The training code will not be made publicly available at this time, as DeepPNP is currently a prototype.

Heatmap
We generated Gradient-weighted Class Activation Mapping (Grad-CAM) saliency maps for DeepPNP predictions.24 For each 3D nodule patch, Grad-CAM was computed and min–max normalized. Visualization used the axial, coronal, and sagittal central slices of the patch, shown both as raw images and as overlays with the corresponding heatmap to highlight image regions most influencing the prediction. Heatmaps were used solely for qualitative interpretation and were not used for training, model selection, or thresholding.
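The min–max normalization and central-slice extraction described above can be sketched as follows; the names are illustrative, and computing the Grad-CAM volume itself (which requires the trained network's activations and gradients) is omitted:

```python
import numpy as np

def heatmap_views(cam):
    """Min-max normalize a 3D Grad-CAM volume to [0, 1] and return its
    axial, coronal, and sagittal central slices for overlay display."""
    cam = cam.astype(float)
    rng = cam.max() - cam.min()
    cam = (cam - cam.min()) / rng if rng > 0 else np.zeros_like(cam)
    z, y, x = (s // 2 for s in cam.shape)
    return {"axial": cam[z], "coronal": cam[:, y], "sagittal": cam[:, :, x]}
```

Each returned 2D slice would be alpha-blended over the corresponding raw CT slice to produce overlays like those in Figure 2.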

Statistical analysis
We compared DeepPNP with DeepPNP-NLST, the Pan-Canadian Early Detection of Lung Cancer (PanCan) model, and Lung CT Imaging Reporting and Data Systems (Lung-RADS).25,26 PanCan malignancy probabilities were estimated from the 9 model variables available in the NLST dataset: age, sex, family history of lung cancer, emphysema, and nodule size, type, location, count, and spiculation. Lung-RADS categories were assigned according to the v2022 criteria based on nodule size and attenuation type.
The performance of the models was evaluated using the area under the receiver operating characteristic curve (ROC AUC), and the DeLong test was used to calculate P values for pairwise ROC AUC comparisons.27 P values were 2-sided, unadjusted for multiple comparisons, with a nominal significance level of 0.05. Subgroup analyses by nodule size ranges and types were performed, and ROC AUCs are reported for each subgroup. The 3 specificity thresholds (0.80, 0.85, 0.90) were selected from operating points on the DeepPNP ROC curve on the NLST validation set to represent clinically relevant trade-offs. Model performance was further analyzed at these 3 operating thresholds, reporting sensitivity, specificity, precision, F1 score, and accuracy, with 95% confidence intervals (CIs) calculated via bootstrapping. Metrics were calculated using the Scikit-learn (version 1.5.0) Python library.28–30 All analyses were performed on the held-out NLST internal test set.
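The operating-point selection and bootstrap CIs can be sketched as below. This assumes higher scores mean higher malignancy risk; the function names are illustrative rather than taken from the study code, and the percentile bootstrap shown is one common choice among several.

```python
import numpy as np

def threshold_for_specificity(scores, labels, target_spec=0.80):
    """Threshold at the target quantile of the benign (label 0) score
    distribution on the validation set; calling score > threshold
    'malignant' then yields specificity close to target_spec."""
    return np.quantile(scores[labels == 0], target_spec)

def bootstrap_ci(values, stat=np.mean, n_boot=2000, seed=0):
    """Percentile bootstrap 95% CI for a metric, resampling cases
    with replacement, as done for the operating-point metrics."""
    rng = np.random.default_rng(seed)
    stats = [stat(rng.choice(values, size=len(values), replace=True))
             for _ in range(n_boot)]
    return np.percentile(stats, [2.5, 97.5])
```

Sensitivity at a fixed operating point, for example, is the mean of the per-case correctness indicator over the malignant cases, so `bootstrap_ci` applied to that indicator vector yields its 95% CI.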

Results

Figure 2 illustrates Grad-CAM heatmaps for 4 representative lung nodule patches from the NLST test set. The first 2 rows depict malignant nodules; the last 2 depict benign nodules. In the spiculated malignant example (row 1), activations are strong and well localized to the nodule, consistent with visually apparent malignant features. In the part-solid malignant example (row 2), activations are present but more diffuse. In the benign examples (rows 3-4), activations are weak and non-focal.
The DeepPNP model demonstrated superior performance in distinguishing malignant from benign pulmonary nodules on the NLST test set. DeepPNP achieved AUC of 0.96 (95% CI, 0.95-0.97), outperforming the Lung CT Screening Reporting and Data System (Lung-RADS v2022) (AUC = 0.91 [95% CI, 0.89-0.94]) and the Pan-Canadian Early Detection of Lung Cancer (PanCan) model (AUC = 0.93 [95% CI, 0.91-0.95]). The DeLong test demonstrated that the DeepPNP model achieved statistically significantly better performance compared to both the Lung-RADS (P < .001) and PanCan models (P < .001) in terms of ROC AUC. DeepPNP also outperformed DeepPNP-NLST (AUC = 0.95 [95% CI, 0.93-0.97]; P = .045), indicating that incorporating additional biopsy-confirmed malignant nodules modestly improved overall discrimination. Figure 3 illustrates the ROC AUCs of the models on the NLST test set.
In subgroup analyses, DeepPNP’s performance was higher for small nodules than for larger ones—AUC 0.93 (95% CI, 0.90-0.96) for 3-10 mm (Figure 3B) vs 0.85 (95% CI, 0.79-0.89) for 10-200 mm (Figure 3C). In both size ranges, DeepPNP significantly outperformed PanCan and Lung-RADS (both P < .001). By nodule type, DeepPNP performed slightly better on sub-solid than solid nodules—AUC 0.98 (95% CI, 0.94-1.00; Figure 3E) vs 0.96 (95% CI, 0.94-0.97; Figure 3D). On solid nodules, DeepPNP outperformed PanCan and Lung-RADS (both P < .001) and DeepPNP-NLST (P = .016). On sub-solid nodules, DeepPNP outperformed Lung-RADS (P = .006) and was comparable with PanCan (P = .74) and DeepPNP-NLST (P = .46).
We evaluated 3 DeepPNP configurations, each selected based on a specific operating point on the validation dataset: specificity at 0.80 (DeepPNP-S80), 0.85 (DeepPNP-S85), and 0.90 (DeepPNP-S90), as illustrated in Figure 3A. Table 3 summarizes the sensitivity, specificity, precision, F1 score, and accuracy of these DeepPNP configurations and, for comparison, reports the corresponding operating points for all competing methods tuned to the same specificity targets. Because Lung-RADS thresholds are discrete, the S80 and S85 targets coincide at Lung-RADS 3, yielding identical Lung-RADS results at these 2 operating points. At the operating point S80, DeepPNP performed significantly better than DeepPNP-NLST, PanCan, and Lung-RADS on specificity, precision, F1 score, and accuracy (all P < .001). Sensitivity was also higher than PanCan (P = .034). At S85, DeepPNP showed significantly higher specificity compared to DeepPNP-NLST (P = .002) and PanCan (P = .006), and higher accuracy than DeepPNP-NLST (P = .008) and PanCan (P = .002). At S90, DeepPNP-S90 achieved higher sensitivity, specificity, and accuracy than Lung-RADS (all P < .001). There were no statistically significant differences between DeepPNP-S90 and DeepPNP-NLST-S90 on any metric: sensitivity (P = .92), specificity (P = .30), precision (P = .44), F1 score (P = .56), and accuracy (P = .40).

Figure 4 presents 2 examples for each decision category (true positives, true negatives, false positives, false negatives) for DeepPNP-S85. True positives were correctly identified by DeepPNP-S85 and the competing methods. True negatives were correctly classified by DeepPNP-S85; the PanCan model misclassified the first case as malignant. False positives illustrate common overcall patterns—borderline PanCan risk and small nodules in smokers whose morphology can mimic malignancy. False negatives were small/subtle with low PanCan scores. Notably, the first false-negative would be correctly classified as malignant at the more sensitive DeepPNP-S80 operating point.

Discussion
We demonstrated that the DeepPNP model outperformed traditional risk models, such as the PanCan logistic regression model and an expert consensus guideline (Lung-RADS v2022), in classifying benign versus malignant pulmonary nodules within a lung cancer screening population. The DeepPNP model achieved the highest ROC AUC on the NLST test set, indicating excellent discriminative ability. Data enrichment by incorporating biopsy-confirmed malignant nodules from 2 additional external datasets during training resulted in measurably improved model performance.
The superior performance of the DeepPNP model on the NLST test set suggests that DL algorithms can effectively capture the relevant imaging features associated with malignancy in a screening population. Moreover, the fact that the DL models can identify high-risk nodules solely based on imaging features suggests that the factors driving AI performance in detecting high-risk nodules are likely correlated with the imaging characteristics that expert physicians use to determine the need for biopsy.
In clinical practice, the DeepPNP algorithm could serve as a decision-support tool to assist radiologists during CT interpretation. By providing automated and quantitative malignancy risk estimates for detected nodules, the model may help prioritize high-risk findings for biopsy or closer follow-up, enhancing consistency and efficiency in lung cancer screening workflows.
Our findings align with prior work showcasing the promise of DL for lung cancer risk assessment of pulmonary nodules based on chest CT imaging.10,18,19,31 Ardila et al18 reported high accuracy in lung cancer risk prediction using a DL system applied to low-dose CT screening, achieving results comparable to expert radiologists. Similar conclusions were reached by Chung et al19 and Setio et al,17 who showed that AI models can match human observers in malignancy risk estimation and nodule characterization. Hendrix et al32 demonstrated that a DL model could accurately distinguish benign from malignant nodules in non-screening chest CTs. Collectively, these studies highlight the potential of DL-based risk estimation. In this context, our results confirm the excellent discriminative ability of DeepPNP in the intended screening setting.
Several limitations should be acknowledged. First, the current model assumes prior nodule localization and therefore cannot function as a fully automated screening pipeline. Second, testing was limited to the NLST set (ie, internal test set), and further validation on external datasets will be important to confirm generalizability. Finally, the study focused on nodule-level analysis without incorporating clinical or demographic variables beyond CT imaging features, which could further improve performance and robustness.
Aside from validation on external datasets, future research should focus on integrating clinical and sociodemographic variables into AI models to further enhance their performance and robustness. Population-specific model adaptation that incorporates comprehensive patient data beyond imaging features will be essential to ensure that AI tools are optimized for the characteristics of the target population and exhibit improved generalizability. Moreover, future investigations should evaluate outcomes by histologic subtype, disease stage, and survival to provide deeper insights into the clinical utility of malignancy risk estimation.
Our study demonstrates that a DL model (DeepPNP) can be effectively developed for malignancy risk estimation in screening-detected pulmonary nodules, achieving superior performance compared to traditional risk assessment tools. By incorporating biopsy-confirmed malignant nodules from 2 external datasets into training, we further assessed the benefit of data enrichment, which improved model performance. These results demonstrate that DL models can improve the accuracy of pulmonary nodule malignancy risk estimation compared with established guideline-based approaches such as Lung-RADS, indicating their future potential to enhance lung cancer screening workflows.

Supplementary Material
umag003_Supplementary_Data

Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original article when quoting.
