Limitations of Large Language Models in Assisting PI-RADS Scoring on Prostate Biparametric MRI Text Reports.

Academic radiology, 2026, Vol. 33(4), pp. 1565-1576
TL;DR While LLMs demonstrated high sensitivity in detecting PCa and csPCa, they had significant limitations in specificity and PPV, particularly in the transition and peripheral zones, and the superior clinical utility of the PI-RADS ≥4 threshold was confirmed.

PICO automatic extraction (heuristic, conf 3/4)

P · Population (target patients / population)
210 patients who underwent transperineal cognitive fusion-targeted biopsy for clinically suspected prostate cancer between December 2024 and July 2025.
I · Intervention (intervention / procedure)
transperineal cognitive fusion-targeted biopsy for clinically suspected prostate cancer between December 2024 and July 2025
C · Comparison (control / comparison)
Not extracted
O · Outcome (results / conclusion)
Experienced radiologists achieved better diagnostic performance, highlighting the need for cautious clinical application of LLMs. Future research should focus on optimizing LLMs to improve specificity and reliability, and combining them with human radiologists' expertise to enhance diagnostic accuracy and efficiency.
OpenAlex topics · Prostate Cancer Diagnosis and Treatment; Artificial Intelligence in Healthcare and Education; Machine Learning in Healthcare

Zhang S, Wu Z, Guo M, Liu C, Cui M, Yang S

Key clinical statistics (automatically extracted from the abstract; verification against the original text is recommended)
  • p-value P<0.001
  • 95% CI 14.89-1000.00
  • OR 109.49
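
For context on how an odds ratio and its interval relate to a logistic regression fit, here is a minimal sketch. It assumes a Wald-type interval, which the abstract does not confirm, and the function name, coefficient, and standard error are hypothetical. On this scale, the extracted OR of 109.49 corresponds to a coefficient of ln(109.49) ≈ 4.70.

```python
import math


def odds_ratio_with_wald_ci(coefficient, standard_error, z=1.959964):
    """Convert a logistic-regression coefficient to an OR with a Wald CI.

    The coefficient lives on the log-odds scale, so the OR and both
    interval bounds are obtained by exponentiating.
    """
    odds_ratio = math.exp(coefficient)
    lower = math.exp(coefficient - z * standard_error)
    upper = math.exp(coefficient + z * standard_error)
    return odds_ratio, lower, upper


# Hypothetical coefficient and standard error, not taken from the study.
print(odds_ratio_with_wald_ci(coefficient=1.2, standard_error=0.4))
```

Because the interval is symmetric on the log scale, it is asymmetric around the OR itself, which is one reason intervals for large ORs can look very wide on the upper side.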

Cite this paper

APA Siying Zhang, Zhenping Wu, et al. (2026). Limitations of Large Language Models in Assisting PI-RADS Scoring on Prostate Biparametric MRI Text Reports. Academic radiology, 33(4), 1565-1576. https://doi.org/10.1016/j.acra.2025.12.020
MLA Siying Zhang, et al. "Limitations of Large Language Models in Assisting PI-RADS Scoring on Prostate Biparametric MRI Text Reports." Academic radiology, vol. 33, no. 4, 2026, pp. 1565-1576.
PMID 41521112

Abstract

[BACKGROUND] Prostate cancer (PCa) is a significant global health challenge, and the prostate imaging reporting and data system (PI-RADS) is crucial for risk stratification using MRI. However, inter-reader variability, especially in the transition zone and among practitioners with differing experience levels, compromises diagnostic consistency. Large language models (LLMs) show potential in medical image analysis, particularly in standardizing reports to improve diagnostic consistency and efficiency.

[OBJECTIVE] To evaluate the performance of LLMs in assisting PI-RADS scoring based on biparametric MRI text reports and compare them with radiologists of varying experience levels. Additionally, to identify independent predictors of PCa and csPCa using multivariable logistic regression analysis.

[METHODS] This retrospective single-center study included 210 patients who underwent transperineal cognitive fusion-targeted biopsy for clinically suspected prostate cancer between December 2024 and July 2025. Three radiologists and two LLMs (DeepSeek and ChatGPT-4.1) independently reviewed anonymized reports and assigned PI-RADS v2.1 scores. Diagnostic performance was assessed using biopsy pathological results as the gold standard. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and area under the receiver operating characteristic curve (AUC) were calculated at both lesion-level (PI-RADS ≥3 as positive) and participant-level (PI-RADS ≥3 and ≥4 as positive thresholds). Decision curve analysis was performed to evaluate clinical utility. Subgroup analyses were conducted based on lesion location (peripheral zone vs. transition zone). Multivariable logistic regression analysis identified independent predictors of PCa and csPCa.
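
The threshold-based metrics described in the methods can be illustrated with a minimal sketch. It assumes each reader's PI-RADS score is dichotomized at a cutoff and compared with the biopsy result; the function name, the `Metrics` class, and the toy data are hypothetical and not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def diagnostic_metrics(scores, biopsy_positive, threshold):
    """Dichotomize PI-RADS scores at `threshold` and compare with biopsy."""
    tp = fp = tn = fn = 0
    for score, truth in zip(scores, biopsy_positive):
        predicted = score >= threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Return NaN rather than raising when a denominator is empty.
        return num / den if den else float("nan")

    return Metrics(
        sensitivity=ratio(tp, tp + fn),
        specificity=ratio(tn, tn + fp),
        ppv=ratio(tp, tp + fp),
        npv=ratio(tn, tn + fn),
    )


# Hypothetical toy data, not taken from the study.
scores = [5, 4, 3, 3, 2, 4, 3, 1]
biopsy = [True, True, False, False, False, False, True, False]

for threshold in (3, 4):
    print(threshold, diagnostic_metrics(scores, biopsy, threshold))
```

Running the same function at cutoffs 3 and 4 shows how a single set of scores trades sensitivity against specificity as the positivity threshold moves.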

[RESULTS] The senior radiologist demonstrated the highest diagnostic performance, with AUC values of 0.847 for PCa and 0.859 for csPCa. The attending physician achieved perfect sensitivity but had the lowest specificity and PPV. The resident physician had comparable sensitivity but lower specificity and PPV, resulting in the lowest AUC values. Both LLMs exhibited high sensitivity but extremely low specificity, leading to lower PPV than human readers. DeepSeek outperformed ChatGPT-4.1 in AUC but still fell short of the senior radiologist's performance. In region-specific analyses, the senior radiologist significantly outperformed LLMs in the transition zone, while LLMs showed high sensitivity but low specificity in the peripheral zone. At the participant level, raising the threshold to PI-RADS ≥4 substantially improved specificity for all readers. Decision curve analysis confirmed the superior clinical utility of the PI-RADS ≥4 threshold, with the senior radiologist's ratings achieving the highest net benefit. Multivariable logistic regression analysis identified PSA density as the strongest independent predictor for both PCa (OR = 109.49, 95% CI: 14.89-1000.00, P<0.001) and csPCa (OR = 152.16, 95% CI: 21.06-1000.00, P<0.001). Among all PI-RADS ratings, only the senior radiologist's scores retained independent predictive value for both PCa (OR = 17.94, P<0.001) and csPCa (OR = 22.69, P = 0.001).
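
The net benefit reported from the decision curve analysis can be sketched as follows. The abstract does not state how net benefit was computed, so this assumes the standard definition, NB = TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability at which a patient would opt for biopsy. The function name and the toy data are hypothetical.

```python
def net_benefit(scores, biopsy_positive, score_threshold, threshold_probability):
    """Net benefit of biopsying everyone with score >= score_threshold."""
    if not 0 < threshold_probability < 1:
        raise ValueError("threshold_probability must be strictly between 0 and 1")
    n = len(scores)
    pairs = list(zip(scores, biopsy_positive))
    tp = sum(1 for s, y in pairs if s >= score_threshold and y)
    fp = sum(1 for s, y in pairs if s >= score_threshold and not y)
    # Each false positive is weighted by the odds of the threshold probability.
    weight = threshold_probability / (1 - threshold_probability)
    return tp / n - weight * fp / n


# Hypothetical toy data, not taken from the study.
scores = [5, 4, 3, 3, 2, 4, 3, 1]
biopsy = [True, True, False, False, False, False, True, False]

for pt in (0.2, 0.5):
    for cutoff in (3, 4):
        print(f"pt={pt} cutoff>={cutoff} NB={net_benefit(scores, biopsy, cutoff, pt):.4f}")
```

A decision curve plots this quantity over a range of threshold probabilities for each reader and cutoff; the rule with the highest curve in the clinically relevant range is the one with the greatest net benefit.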

[CONCLUSION] While LLMs demonstrated high sensitivity in detecting PCa and csPCa, they had significant limitations in specificity and PPV, particularly in the transition and peripheral zones. The optimal utilization strategy involves deploying LLMs as adjuncts for indeterminate cases or when using higher diagnostic thresholds (PI-RADS ≥4). Experienced radiologists achieved better diagnostic performance, highlighting the need for cautious clinical application of LLMs. Future research should focus on optimizing LLMs to improve specificity and reliability, and combining them with human radiologists' expertise to enhance diagnostic accuracy and efficiency.
