
Evaluating the Clinical Competence of Large Language Models in Prostate Cancer Management: A Comparative Study of DeepSeek-R1 and ChatGPT.

Annals of Surgical Oncology, 2026, Vol. 33(2), pp. 1858-1869

Li R, Zhao A, Peng L, Shi H, Zhao J, Li Z

Cite this paper

APA: Li R, Zhao A, et al. (2026). Evaluating the Clinical Competence of Large Language Models in Prostate Cancer Management: A Comparative Study of DeepSeek-R1 and ChatGPT. Annals of Surgical Oncology, 33(2), 1858-1869. https://doi.org/10.1245/s10434-025-18492-2
MLA: Li R, et al. "Evaluating the Clinical Competence of Large Language Models in Prostate Cancer Management: A Comparative Study of DeepSeek-R1 and ChatGPT." Annals of Surgical Oncology, vol. 33, no. 2, 2026, pp. 1858-1869.
PMID: 41094286

Abstract

[BACKGROUND] Large language models (LLMs) have gained prominence in medical applications, yet their performance in specialized clinical tasks remains underexplored. Prostate cancer, a complex malignancy requiring guideline-based management, presents a rigorous testbed for evaluating artificial intelligence (AI)-assisted decision-making. This study compared the clinical accuracy, reasoning ability, and language quality of DeepSeek-R1 and ChatGPT variants in addressing prostate cancer diagnosis and treatment.

[METHODS] A dataset of 98 prostate cancer multiple-choice questions from MedQA, MedMCQA, and China's National Medical Licensing Examination was constructed, alongside three real-world clinical cases. Responses were generated by five LLMs (DeepSeek-V3, DeepSeek-R1, ChatGPT-4o, ChatGPT-o3, and ChatGPT-o4-mini) and evaluated for accuracy across three repeated runs. For the case-based simulations, only R1 and o3 were compared with practicing urologists. A Clinical Decision Quality Assessment Scale (CDQAS) assessed outputs across four domains: readability, medical knowledge accuracy, diagnostic test appropriateness, and logical coherence. Blinded scoring was performed by senior urologic oncologists. Statistical analyses used one-way ANOVA in GraphPad Prism v10.1.2 (GraphPad Software, Boston, MA, USA).
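
To make the evaluation pipeline concrete, here is a minimal Python sketch of the accuracy comparison. The per-run scores are hypothetical (the abstract does not publish them), and `scipy.stats.f_oneway` stands in for the one-way ANOVA the authors ran in GraphPad Prism.

```python
# Minimal sketch of the described accuracy analysis. Per-run values are
# made up for illustration; the authors used GraphPad Prism, not Python.
from scipy import stats

# Hypothetical fraction of the 98 MCQs answered correctly in each of the
# three repeated runs, one list per model.
runs = {
    "DeepSeek-V3": [0.88, 0.87, 0.89],
    "DeepSeek-R1": [0.97, 0.96, 0.97],
    "ChatGPT-4o": [0.84, 0.85, 0.83],
    "ChatGPT-o3": [0.91, 0.90, 0.92],
    "ChatGPT-o4-mini": [0.86, 0.85, 0.87],
}

# Mean accuracy per model across the three repeated runs.
for model, accs in runs.items():
    print(f"{model}: mean accuracy = {sum(accs) / len(accs):.2%}")

# One-way ANOVA across the five models, mirroring the reported analysis.
f_stat, p_value = stats.f_oneway(*runs.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```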

[RESULTS] DeepSeek-R1 achieved the highest accuracy (96.60%) on the multiple-choice tasks, significantly outperforming the other models (p < 0.05 to p < 0.0001). In the simulated case evaluations, both R1 and o3 performed comparably with physicians in overall readability and diagnostic appropriateness. Whereas R1 demonstrated superior guideline compliance and evidence-based reasoning, o3 showed advantages in workflow clarity, sequencing, and response fluency; o3 also generated fewer explicit errors than R1. Human clinicians retained strengths in terminology precision and logical reasoning.
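
The pairwise significance range above (p < 0.05 to p < 0.0001) implies post-hoc comparisons following the ANOVA. The abstract does not name the post-hoc test, so the sketch below assumes Tukey's HSD (a common choice after one-way ANOVA, available as `scipy.stats.tukey_hsd` in SciPy >= 1.8) and reuses the hypothetical per-run accuracies from the previous sketch.

```python
# Hedged sketch: pairwise post-hoc comparisons after the one-way ANOVA.
# The abstract does not name the post-hoc test; Tukey's HSD is assumed.
from scipy import stats

runs = {  # same hypothetical per-run accuracies as in the earlier sketch
    "DeepSeek-V3": [0.88, 0.87, 0.89],
    "DeepSeek-R1": [0.97, 0.96, 0.97],
    "ChatGPT-4o": [0.84, 0.85, 0.83],
    "ChatGPT-o3": [0.91, 0.90, 0.92],
    "ChatGPT-o4-mini": [0.86, 0.85, 0.87],
}

names = list(runs)
result = stats.tukey_hsd(*runs.values())  # pairwise p-value matrix
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: p = {result.pvalue[i, j]:.4g}")
```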

[CONCLUSION] DeepSeek-R1 and ChatGPT-o3 exhibit complementary strengths in prostate cancer clinical decision-making, with R1 favoring factual accuracy and o3 excelling in expressive clarity. Although both models approach human-level performance in structured evaluations, human oversight and continued domain-specific optimization remain essential for their safe and effective integration into clinical workflows.
