Evaluation of large language models for PI-RADS score extraction from free-text prostate MRI reports: a comparative study with human readers.

Frontiers in Oncology, 2026, Vol. 16, p. 1743096 (Open Access)

Authors

Wen J, Qian Y, Chen X, Cui Y, Hao J, Yuan X



Cite this paper

APA: Wen J, Qian Y, et al. (2026). Evaluation of large language models for PI-RADS score extraction from free-text prostate MRI reports: a comparative study with human readers. Frontiers in Oncology, 16, 1743096. https://doi.org/10.3389/fonc.2026.1743096
MLA: Wen J, et al. "Evaluation of large language models for PI-RADS score extraction from free-text prostate MRI reports: a comparative study with human readers." Frontiers in Oncology, vol. 16, 2026, p. 1743096.
PMID: 42038363

Abstract

[OBJECTIVE] This study aimed to evaluate the ability of GPT-4o and Gemini 2.5 Pro to extract and assign PI-RADS v2.1 scores from free-text prostate MRI reports, and to compare their performance with that of human readers of varying experience.

[METHODS] Three radiologists with differing levels of experience (resident, fellow, expert) independently reviewed the reports and assigned PI-RADS v2.1 scores. The same reports were processed with prompts by GPT-4o and Gemini 2.5 Pro. Inter-rater agreement was evaluated using Gwet's AC1 coefficient, and diagnostic performance was assessed using sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC).
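For readers unfamiliar with Gwet's AC1, the coefficient corrects observed agreement for chance using the average category prevalences: AC1 = (pa - pe) / (1 - pe), with pe = 1/(q-1) · Σk πk(1 - πk). A minimal sketch follows; the function name and the sample PI-RADS scores are illustrative, not data from the study.

```python
from collections import Counter

def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1 chance-corrected agreement between two raters.

    AC1 = (pa - pe) / (1 - pe), where pa is the observed agreement and
    pe = 1/(q-1) * sum_k pi_k * (1 - pi_k), with pi_k the average
    proportion of items either rater assigned to category k.
    """
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    q = len(categories)
    # Observed agreement: fraction of items given identical scores.
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Pooled classification probability per category across both raters.
    counts = Counter(ratings_a) + Counter(ratings_b)
    pe = sum((counts[k] / (2 * n)) * (1 - counts[k] / (2 * n))
             for k in categories) / (q - 1)
    return (pa - pe) / (1 - pe)

# Hypothetical PI-RADS 1-5 scores from two readers (illustrative only):
expert = [3, 4, 4, 2, 5, 3, 1, 4, 5, 2]
llm    = [3, 4, 3, 2, 5, 4, 1, 4, 5, 3]
print(round(gwet_ac1(expert, llm), 3))  # → 0.628
```

Unlike Cohen's kappa, AC1 remains stable when category prevalences are skewed, which is why it is often preferred for ordinal clinical scores such as PI-RADS.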

[RESULTS] Inter-rater agreement was highest between the expert and fellow readers (Gwet's AC1 = 0.68, 95% CI 0.61-0.75), significantly higher than between the two LLMs (Gwet's AC1 = 0.52, 95% CI 0.44-0.59, P = 0.004). Agreement between the expert and GPT-4o (Gwet's AC1 = 0.42, 95% CI 0.34-0.51) was lower than between the expert and Gemini (Gwet's AC1 = 0.49, 95% CI 0.41-0.57), although this difference was not statistically significant (P = 0.17). The AUCs for the resident, fellow, and expert readers were 0.81 (95% CI 0.76-0.87), 0.86 (95% CI 0.81-0.91), and 0.89 (95% CI 0.85-0.93), and those for GPT-4o and Gemini were 0.85 (95% CI 0.81-0.90) and 0.84 (95% CI 0.80-0.89), respectively.

[CONCLUSION] LLMs demonstrated promising performance in assigning PI-RADS scores from free-text prostate MRI reports, with accuracy and agreement approaching those of general radiologists; however, they are not yet ready to replace expert interpretation in high-stakes clinical settings. Nevertheless, these findings support their potential as a supplementary tool for report standardization and trainee education.
