Evaluation of large language models for PI-RADS score extraction from free-text prostate MRI reports: a comparative study with human readers.

Frontiers in Oncology, 2026, Vol. 16, p. 1743096 (Open Access)

Authors

Wen J, Qian Y, Chen X, Cui Y, Hao J, Yuan X



Cite this paper

APA: Wen J, Qian Y, et al. (2026). Evaluation of large language models for PI-RADS score extraction from free-text prostate MRI reports: a comparative study with human readers. Frontiers in Oncology, 16, 1743096. https://doi.org/10.3389/fonc.2026.1743096
MLA: Wen J, et al. "Evaluation of large language models for PI-RADS score extraction from free-text prostate MRI reports: a comparative study with human readers." Frontiers in Oncology, vol. 16, 2026, p. 1743096.
PMID: 42038363

Abstract

[OBJECTIVE] This study aimed to evaluate the ability of GPT-4o and Gemini 2.5 Pro to extract and assign PI-RADS v2.1 scores from free-text prostate MRI reports, and to compare their performance with that of human readers of varying experience.

[METHODS] Three radiologists with differing levels of experience (resident, fellow, expert) independently reviewed the reports and assigned PI-RADS v2.1 scores. The same reports were processed with prompts by GPT-4o and Gemini 2.5 Pro. Inter-rater agreement was evaluated using Gwet's AC1 coefficient, and diagnostic performance was assessed using sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC).
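For readers unfamiliar with Gwet's AC1, the coefficient corrects observed agreement for chance using the average category prevalences: AC1 = (pa - pe) / (1 - pe), with pe = 1/(q-1) · Σk πk(1 - πk). A minimal sketch follows; the function name and the sample PI-RADS scores are illustrative, not data from the study.

```python
from collections import Counter

def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1 chance-corrected agreement between two raters.

    AC1 = (pa - pe) / (1 - pe), where pa is the observed agreement and
    pe = 1/(q-1) * sum_k pi_k * (1 - pi_k), with pi_k the average
    proportion of items either rater assigned to category k.
    """
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    q = len(categories)
    # Observed agreement: fraction of items given identical scores.
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Pooled classification probability per category across both raters.
    counts = Counter(ratings_a) + Counter(ratings_b)
    pe = sum((counts[k] / (2 * n)) * (1 - counts[k] / (2 * n))
             for k in categories) / (q - 1)
    return (pa - pe) / (1 - pe)

# Hypothetical PI-RADS 1-5 scores from two readers (illustrative only):
expert = [3, 4, 4, 2, 5, 3, 1, 4, 5, 2]
llm    = [3, 4, 3, 2, 5, 4, 1, 4, 5, 3]
print(round(gwet_ac1(expert, llm), 3))  # → 0.628
```

Unlike Cohen's kappa, AC1 remains stable when category prevalences are skewed, which is why it is often preferred for ordinal clinical scores such as PI-RADS.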

[RESULTS] Inter-rater agreement was highest between the expert and fellow readers (Gwet's AC1 = 0.68, 95% CI 0.61-0.75), significantly higher than between the two LLMs (Gwet's AC1 = 0.52, 95% CI 0.44-0.59, P = 0.004). Agreement between the expert and GPT-4o (Gwet's AC1 = 0.42, 95% CI 0.34-0.51) was lower than between the expert and Gemini (Gwet's AC1 = 0.49, 95% CI 0.41-0.57), although this difference was not statistically significant (P = 0.17). The AUCs for the resident, fellow, and expert readers were 0.81 (95% CI 0.76-0.87), 0.86 (95% CI 0.81-0.91), and 0.89 (95% CI 0.85-0.93), and those for GPT-4o and Gemini were 0.85 (95% CI 0.81-0.90) and 0.84 (95% CI 0.80-0.89), respectively.

[CONCLUSION] LLMs demonstrated promising performance in assigning PI-RADS scores from free-text prostate MRI reports, with accuracy and agreement approaching those of general radiologists; however, they are not yet ready to replace expert interpretation in high-stakes clinical settings. Nevertheless, these findings support their potential as a supplementary tool for report standardization and trainee education.
