Zero-Shot PI-RADS Version 2.1 Scoring with ChatGPT-4 Turbo and Llama 3: Diagnostic Performance and Agreement with Abdominal Radiologists.

Firoozeh N; Mastrodicasa D; Behr S; Muglia VF; Westphalen AC

doi:10.1148/rycan.250119


Radiology: Imaging Cancer. 2026;8(1):e250119


Cite this paper


APA Firoozeh N, Mastrodicasa D, et al. (2026). Zero-Shot PI-RADS Version 2.1 Scoring with ChatGPT-4 Turbo and Llama 3: Diagnostic Performance and Agreement with Abdominal Radiologists. Radiology: Imaging Cancer, 8(1), e250119. https://doi.org/10.1148/rycan.250119

MLA Firoozeh N, et al. "Zero-Shot PI-RADS Version 2.1 Scoring with ChatGPT-4 Turbo and Llama 3: Diagnostic Performance and Agreement with Abdominal Radiologists." Radiology: Imaging Cancer, vol. 8, no. 1, 2026, e250119.

PMID 41347921

DOI 10.1148/rycan.250119

Abstract

This retrospective, single-center study aimed to assess the diagnostic performance and agreement of two large language models (LLMs), ChatGPT-4 Turbo (OpenAI) and Llama 3 (Meta AI), in assigning Prostate Imaging Reporting and Data System (PI-RADS) scores to prostate MRI reports and to compare their performance with that of two abdominal radiologists. Structured prostate MRI reports (n = 500) obtained between January and December 2022, with original PI-RADS scores removed, were processed with the LLMs using a standardized prompt to extract PI-RADS version 2.1 scores. Two abdominal radiologists independently assigned scores, with a third adjudicating discrepancies. Prostate biopsy results served as the reference standard for diagnostic performance assessment. Agreement was high between the two models and between the models and the radiologists: ChatGPT-4 Turbo and Llama 3 achieved 97.7% agreement (κ = 0.95), agreement between the LLMs and radiologists ranged from 94.7% to 95.7% (κ = 0.89-0.91), and interradiologist agreement was 94.4% (κ = 0.88). ChatGPT-4 Turbo assigned significantly higher scores than radiologists (P < .005), while differences with Llama 3 were not statistically significant (P = .08). ChatGPT-4 Turbo and the original MRI reports achieved an area under the receiver operating characteristic curve (AUC) of 0.79 for predicting prostate cancer, and radiologists and Llama 3 achieved AUCs of 0.78 each. These results suggest that LLMs could improve prostate MRI reporting through accurate and consistent PI-RADS scoring.

Keywords: Prostate, Oncology, Large Language Model

© RSNA, 2025.
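For readers unfamiliar with the statistics reported in the abstract, the following is a minimal sketch of how percent agreement, Cohen's κ, and AUC could be computed from paired PI-RADS scores. It is not the authors' analysis code. The abstract does not say whether κ was weighted, so unweighted Cohen's κ is assumed here, and the AUC uses the rank-based (Mann-Whitney) formulation. The function names and the ten-case toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of cases in which two raters assigned the same score."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)  # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance from each rater's marginal frequencies.
    categories = count_a.keys() | count_b.keys()
    p_e = sum(count_a[k] * count_b[k] for k in categories) / (n * n)
    if p_e == 1:
        return 1.0  # both raters used a single identical category
    return (p_o - p_e) / (1 - p_e)


def auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative one
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: PI-RADS scores (1-5) from two raters and a binary
# biopsy result (1 = cancer). These are NOT values from the study.
rater_a = [1, 2, 3, 4, 5, 4, 3, 2, 5, 4]
rater_b = [1, 2, 3, 4, 5, 4, 4, 2, 5, 3]
biopsy = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1]

if __name__ == "__main__":
    print("agreement:", percent_agreement(rater_a, rater_b))
    print("kappa:", round(cohen_kappa(rater_a, rater_b), 3))
    print("AUC (rater A):", auc(rater_a, biopsy))
```

Because PI-RADS is an ordinal scale, a weighted κ that penalizes a 2-versus-5 disagreement more than a 3-versus-4 disagreement may be a more natural choice. The unweighted form is used above only because it is the simplest to follow.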

MeSH Terms

Humans; Male; Prostatic Neoplasms; Retrospective Studies; Magnetic Resonance Imaging; Aged; Middle Aged; Radiologists; Prostate; Observer Variation; Generative Artificial Intelligence
