Evaluation of large language models in assigning PI-RADS v2.1 categories for prostate MRI reports.
APA
Akdal Dolek, B., & Besler, M. S. (2026). Evaluation of large language models in assigning PI-RADS v2.1 categories for prostate MRI reports. BMC Urology, 26(1), 27. https://doi.org/10.1186/s12894-025-02038-5
MLA
Akdal Dolek, B., and M. S. Besler. "Evaluation of large language models in assigning PI-RADS v2.1 categories for prostate MRI reports." BMC Urology, vol. 26, no. 1, 2026, p. 27.
PMID
41484874
Abstract
[BACKGROUND] This study aimed to evaluate the performance of large language models (LLMs) in classifying prostate MRI reports according to the Prostate Imaging–Reporting and Data System (PI-RADS) version 2.1, and to validate their use in supporting clinical decisions in prostate cancer treatment.
[METHODS] This retrospective study included 146 patients. Four LLMs — GPT-4o, GPT-o1, Google Gemini 1.5 Pro and Google Gemini 2.0 Experimental Advanced — were tested on standardised, structured prostate MRI reports. Model outputs were compared against a two-radiologist consensus reference standard. Agreement was measured using weighted Cohen's kappa, and accuracy and F1 scores were calculated for three PI-RADS risk groups: low (1–2), intermediate (3) and high (4–5).
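The abstract does not include the authors' analysis code; the following is a minimal illustrative sketch of the metrics described above, written in Python with scikit-learn. The example data, the linear kappa weighting (the paper does not state the weighting scheme) and the to_risk_group helper are assumptions for illustration, not details from the study.

```python
# Illustrative sketch (not the authors' code): weighted Cohen's kappa on
# raw PI-RADS categories, plus per-risk-group F1 after collapsing 1-5
# into the three groups described in METHODS.
from sklearn.metrics import cohen_kappa_score, f1_score

def to_risk_group(pirads: int) -> str:
    """Map a PI-RADS 1-5 category to the study's three risk groups."""
    if pirads <= 2:
        return "low"           # PI-RADS 1-2
    if pirads == 3:
        return "intermediate"  # PI-RADS 3
    return "high"              # PI-RADS 4-5

# Hypothetical example data; the study used 146 reports.
reference = [2, 3, 4, 5, 1, 3, 4]  # two-radiologist consensus
model     = [2, 4, 4, 5, 1, 3, 5]  # LLM-assigned categories

# Weighted Cohen's kappa on the 1-5 categories (linear weights assumed).
kappa = cohen_kappa_score(reference, model, weights="linear")

# Per-group F1 after collapsing to low / intermediate / high.
ref_groups = [to_risk_group(p) for p in reference]
mod_groups = [to_risk_group(p) for p in model]
f1_per_group = f1_score(ref_groups, mod_groups,
                        labels=["low", "intermediate", "high"],
                        average=None)

print(f"weighted kappa = {kappa:.3f}")
for label, f1 in zip(["low", "intermediate", "high"], f1_per_group):
    print(f"F1 ({label}) = {f1:.2f}")
```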
[RESULTS] Performance varied by model. GPT-o1 achieved the highest agreement with the radiologist consensus (κ = 0.867), followed by GPT-4o (κ = 0.743), Gemini 1.5 Pro (κ = 0.728) and Gemini 2.0 Experimental Advanced (κ = 0.664). GPT-o1 also achieved the highest F1 scores for the low-risk (0.93) and high-risk (1.00) groups, with more moderate performance for the PI-RADS 3 group (0.75). All models performed worst on PI-RADS 3 (F1 range: 0.54–0.75). Notably, none of the models produced invalid outputs outside the PI-RADS 1–5 range.
[CONCLUSION] LLMs show potential for automating PI-RADS classification from MRI reports, with GPT-o1 demonstrating the best overall performance. However, the models' weaker performance on PI-RADS 3 lesions indicates that multicentre validation, larger datasets and multimodal data integration are needed before they can be used clinically for prostate cancer diagnosis and urological decision-making.
[TRIAL REGISTRATION] Not applicable. This retrospective study did not involve a clinical trial.