
The In-depth Comparative Analysis of Four Large Language AI Models for Risk Assessment and Information Retrieval from Multi-Modality Prostate Cancer Work-up Reports.

The World Journal of Men's Health, 2025, Vol. 43(4), pp. 918-933


Yuan LH, Huang SW, Chou D, Tsai CY


Cite this paper

APA Yuan LH, Huang SW, et al. (2025). The In-depth Comparative Analysis of Four Large Language AI Models for Risk Assessment and Information Retrieval from Multi-Modality Prostate Cancer Work-up Reports. The World Journal of Men's Health, 43(4), 918-933. https://doi.org/10.5534/wjmh.240173
MLA Yuan LH, et al. "The In-depth Comparative Analysis of Four Large Language AI Models for Risk Assessment and Information Retrieval from Multi-Modality Prostate Cancer Work-up Reports." The World Journal of Men's Health, vol. 43, no. 4, 2025, pp. 918-933.
PMID 39743220
DOI 10.5534/wjmh.240173

Abstract

[PURPOSE] Information retrieval (IR) and risk assessment (RA) from multi-modality imaging and pathology reports are critical to prostate cancer (PC) treatment. This study aims to evaluate the performance of four general-purpose large language models (LLMs) in IR and RA tasks.

[MATERIALS AND METHODS] We conducted a study using simulated text reports from computed tomography, magnetic resonance imaging, bone scans, and biopsy pathology on stage IV PC patients. We assessed four LLMs (ChatGPT-4-turbo, Claude-3-opus, Gemini-Pro-1.0, ChatGPT-3.5-turbo) on three RA tasks (LATITUDE, CHAARTED, TwNHI) and seven IR tasks. These tasks included TNM staging and the detection and quantification of bone and visceral metastases, providing a broad evaluation of the models' capabilities in handling diverse clinical data. We queried the LLMs with multi-modality reports using zero-shot chain-of-thought prompting via an application programming interface. With the consensus of three adjudicators as the gold standard, the models' performance was assessed through repeated single-round queries and ensemble voting methods, using six outcome metrics.

[RESULTS] Among 350 stage IV PC patients with simulated reports, 115 (32.9%), 128 (36.6%), and 94 (26.9%) belonged to the LATITUDE, CHAARTED, and TwNHI high-risk groups, respectively. Ensemble voting, based on three repeated single-round queries, consistently enhanced accuracy, with a higher likelihood of achieving non-inferior results compared to a single query. The four models showed minimal differences in IR tasks, with high accuracy (87.4%-94.2%) and consistency (ICC>0.8) in TNM staging. However, there were significant differences in RA performance, with the models ranking as follows: ChatGPT-4-turbo, Claude-3-opus, Gemini-Pro-1.0, and ChatGPT-3.5-turbo. ChatGPT-4-turbo achieved the highest accuracy (90.1%, 90.7%, 91.6%) and consistency (ICC 0.86, 0.93, 0.76) across the three RA tasks.
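
The ensemble voting over three repeated single-round queries described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `majority_vote` and `accuracy`, the "high"/"low" labels, the tie-breaking rule, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among repeated query results.

    Ties are broken in favor of the earliest answer in the list
    (an assumption; the abstract does not state a tie-breaking rule).
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:
        if counts[answer] == top:
            return answer


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold-standard labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical toy data: three repeated answers per simulated patient.
repeated_answers = [
    ["high", "low", "high"],
    ["low", "low", "high"],
    ["high", "high", "low"],
]
gold_labels = ["high", "low", "low"]  # stands in for the adjudicator consensus

voted = [majority_vote(answers) for answers in repeated_answers]
print(voted, accuracy(voted, gold_labels))
```

With three queries and a binary label a strict majority always exists, so the tie-breaking rule only matters for tasks with more than two possible answers.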

[CONCLUSIONS] ChatGPT-4-turbo demonstrated satisfactory accuracy and outcomes in RA and IR for stage IV PC, suggesting its potential for clinical decision support. However, the risks of misinterpretation impacting decision-making cannot be overlooked. Further research is necessary to validate these findings in other cancers.
