Limitations of Large Language Models in Assisting PI-RADS Scoring on Prostate Biparametric MRI Text Reports.
TL;DR
While LLMs demonstrated high sensitivity in detecting PCa and csPCa, they had significant limitations in specificity and PPV, particularly in the transition and peripheral zones, and the superior clinical utility of the PI-RADS ≥4 threshold was confirmed.
PICO automatic extraction (heuristic, conf 3/4)
P · Population (target patients/population)
210 patients who underwent transperineal cognitive fusion-targeted biopsy for clinically suspected prostate cancer between December 2024 and July 2025.
I · Intervention (intervention/procedure)
transperineal cognitive fusion-targeted biopsy for clinically suspected prostate cancer between December 2024 and July 2025
C · Comparison (control/comparison)
Not extracted
O · Outcome (results/conclusion)
Experienced radiologists achieved better diagnostic performance, highlighting the need for cautious clinical application of LLMs. Future research should focus on optimizing LLMs to improve specificity and reliability, and combining them with human radiologists' expertise to enhance diagnostic accuracy and efficiency.
OpenAlex topics
Prostate Cancer Diagnosis and Treatment
Artificial Intelligence in Healthcare and Education
Machine Learning in Healthcare
Key statistics (PSA density as an independent predictor of PCa)
- p-value P<0.001
- 95% CI 14.89-1000.00
- OR 109.49
APA
Siying Zhang, Zhenping Wu, et al. (2026). Limitations of Large Language Models in Assisting PI-RADS Scoring on Prostate Biparametric MRI Text Reports. Academic Radiology, 33(4), 1565-1576. https://doi.org/10.1016/j.acra.2025.12.020
MLA
Siying Zhang, et al. "Limitations of Large Language Models in Assisting PI-RADS Scoring on Prostate Biparametric MRI Text Reports." Academic Radiology, vol. 33, no. 4, 2026, pp. 1565-1576.
PMID
41521112
Abstract
[BACKGROUND] Prostate cancer (PCa) is a significant global health challenge, and the prostate imaging reporting and data system (PI-RADS) is crucial for risk stratification using MRI. However, inter-reader variability, especially in the transition zone and among practitioners with differing experience levels, compromises diagnostic consistency. Large language models (LLMs) show potential in medical image analysis, particularly in standardizing reports to improve diagnostic consistency and efficiency.
[OBJECTIVE] To evaluate the performance of LLMs in assisting PI-RADS scoring based on biparametric MRI text reports and compare them with radiologists of varying experience levels. Additionally, to identify independent predictors of PCa and csPCa using multivariable logistic regression analysis.
[METHODS] This retrospective single-center study included 210 patients who underwent transperineal cognitive fusion-targeted biopsy for clinically suspected prostate cancer between December 2024 and July 2025. Three radiologists and two LLMs (DeepSeek and ChatGPT-4.1) independently reviewed anonymized reports and assigned PI-RADS v2.1 scores. Diagnostic performance was assessed using biopsy pathological results as the gold standard. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and area under the receiver operating characteristic curve (AUC) were calculated at both lesion-level (PI-RADS ≥3 as positive) and participant-level (PI-RADS ≥3 and ≥4 as positive thresholds). Decision curve analysis was performed to evaluate clinical utility. Subgroup analyses were conducted based on lesion location (peripheral zone vs. transition zone). Multivariable logistic regression analysis identified independent predictors of PCa and csPCa.
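The threshold-based metrics named in the methods can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `diagnostic_metrics`, the `Metrics` container, and the ten-case toy dataset are all hypothetical, and the scores and biopsy labels are made up for demonstration.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(scores, labels, threshold) -> Metrics:
    """Treat PI-RADS >= threshold as test-positive and compare with biopsy.

    scores: PI-RADS scores (1 to 5) assigned by one reader.
    labels: 1 if biopsy found cancer, 0 otherwise (the reference standard).
    """
    tp = fp = tn = fn = 0
    for score, has_cancer in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and has_cancer:
            tp += 1
        elif predicted_positive and not has_cancer:
            fp += 1
        elif not predicted_positive and has_cancer:
            fn += 1
        else:
            tn += 1
    return Metrics(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
    )


# Made-up toy data (NOT from the study): one reader's scores and biopsy results.
toy_scores = [2, 3, 3, 4, 4, 5, 5, 3, 2, 4]
toy_labels = [0, 0, 1, 1, 0, 1, 1, 0, 0, 1]

at_3 = diagnostic_metrics(toy_scores, toy_labels, threshold=3)
at_4 = diagnostic_metrics(toy_scores, toy_labels, threshold=4)
print(at_3)  # sensitivity 1.0, specificity 0.4, ppv 0.625, npv 1.0
print(at_4)  # sensitivity 0.8, specificity 0.8, ppv 0.8, npv 0.8
```

On this toy data, raising the cutoff from PI-RADS ≥3 to ≥4 trades some sensitivity for higher specificity and PPV, which is the kind of trade-off the participant-level analysis examines.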
[RESULTS] The senior radiologist demonstrated the highest diagnostic performance, with AUC values of 0.847 for PCa and 0.859 for csPCa. The attending physician achieved perfect sensitivity but had the lowest specificity and PPV. The resident physician had comparable sensitivity but lower specificity and PPV, resulting in the lowest AUC values. Both LLMs exhibited high sensitivity but extremely low specificity, leading to lower PPV than human readers. DeepSeek outperformed ChatGPT-4.1 in AUC but still fell short of the senior radiologist's performance. In region-specific analyses, the senior radiologist significantly outperformed LLMs in the transition zone, while LLMs showed high sensitivity but low specificity in the peripheral zone. At the participant level, raising the threshold to PI-RADS ≥4 substantially improved specificity for all readers. Decision curve analysis confirmed the superior clinical utility of the PI-RADS ≥4 threshold, with the senior radiologist's ratings achieving the highest net benefit. Multivariable logistic regression analysis identified PSA density as the strongest independent predictor for both PCa (OR = 109.49, 95% CI: 14.89-1000.00, P<0.001) and csPCa (OR = 152.16, 95% CI: 21.06-1000.00, P<0.001). Among all PI-RADS ratings, only the senior radiologist's scores retained independent predictive value for both PCa (OR = 17.94, P<0.001) and csPCa (OR = 22.69, P = 0.001).
[CONCLUSION] While LLMs demonstrated high sensitivity in detecting PCa and csPCa, they had significant limitations in specificity and PPV, particularly in the transition and peripheral zones. The optimal utilization strategy involves deploying LLMs as adjuncts for indeterminate cases or when using higher diagnostic thresholds (PI-RADS ≥4). Experienced radiologists achieved better diagnostic performance, highlighting the need for cautious clinical application of LLMs. Future research should focus on optimizing LLMs to improve specificity and reliability, and combining them with human radiologists' expertise to enhance diagnostic accuracy and efficiency.
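The decision curve analysis mentioned in the abstract compares strategies by net benefit. The sketch below uses the standard net-benefit definition, TP/N minus FP/N weighted by p_t/(1 - p_t), where p_t is the threshold probability at which a patient would opt for biopsy. The abstract does not describe the authors' implementation, so the function names and the toy data here are hypothetical.

```python
def net_benefit(scores, labels, score_threshold, p_t):
    """Net benefit of biopsying everyone with PI-RADS >= score_threshold.

    p_t is the threshold probability (0 < p_t < 1); false positives are
    penalised by the odds p_t / (1 - p_t).
    """
    n = len(labels)
    tp = sum(1 for s, y in zip(scores, labels) if s >= score_threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= score_threshold and not y)
    return tp / n - (fp / n) * (p_t / (1 - p_t))


def net_benefit_treat_all(labels, p_t):
    """Reference strategy: biopsy every patient regardless of score."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * (p_t / (1 - p_t))


# Made-up toy data (NOT from the study).
toy_scores = [2, 3, 3, 4, 4, 5, 5, 3, 2, 4]
toy_labels = [0, 0, 1, 1, 0, 1, 1, 0, 0, 1]

for p_t in (0.2, 0.5):
    nb3 = net_benefit(toy_scores, toy_labels, score_threshold=3, p_t=p_t)
    nb4 = net_benefit(toy_scores, toy_labels, score_threshold=4, p_t=p_t)
    nb_all = net_benefit_treat_all(toy_labels, p_t)
    print(f"p_t={p_t}: PI-RADS>=3 {nb3:.3f}, PI-RADS>=4 {nb4:.3f}, biopsy all {nb_all:.3f}")
```

A full decision curve repeats this calculation over a range of p_t values and plots net benefit for each strategy. Which cutoff comes out ahead depends on the data and on p_t, so the toy numbers say nothing about the study's result.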