Evaluating the Clinical Competence of Large Language Models in Prostate Cancer Management: A Comparative Study of DeepSeek-R1 and ChatGPT.
APA
Li R, Zhao A, et al. (2026). Evaluating the Clinical Competence of Large Language Models in Prostate Cancer Management: A Comparative Study of DeepSeek-R1 and ChatGPT. Annals of Surgical Oncology, 33(2), 1858-1869. https://doi.org/10.1245/s10434-025-18492-2
MLA
Li R, et al. "Evaluating the Clinical Competence of Large Language Models in Prostate Cancer Management: A Comparative Study of DeepSeek-R1 and ChatGPT." Annals of Surgical Oncology, vol. 33, no. 2, 2026, pp. 1858-1869.
PMID
41094286
Abstract
[BACKGROUND] Large language models (LLMs) have gained prominence in medical applications, yet their performance in specialized clinical tasks remains underexplored. Prostate cancer, a complex malignancy requiring guideline-based management, presents a rigorous testbed for evaluating artificial intelligence (AI)-assisted decision-making. This study compared the clinical accuracy, reasoning ability, and language quality of DeepSeek-R1 and ChatGPT variants in addressing prostate cancer diagnosis and treatment.
[METHODS] A dataset of 98 prostate cancer multiple-choice questions was constructed from MedQA, MedMCQA, and China's National Medical Licensing Examination, alongside three real-world clinical cases. Responses were generated by five LLMs (DeepSeek-V3, DeepSeek-R1, ChatGPT-4o, ChatGPT-o3, and ChatGPT-o4-mini) and evaluated for accuracy across three repeated runs. For case-based simulations, only R1 and o3 were compared with practicing urologists. A Clinical Decision Quality Assessment Scale (CDQAS) assessed outputs across four domains: readability, medical knowledge accuracy, diagnostic test appropriateness, and logical coherence. Blinded scoring was performed by senior urologic oncologists. Statistical analyses used one-way ANOVA in GraphPad Prism v10.1.2 (GraphPad Software, Boston, Massachusetts, USA).
[RESULTS] DeepSeek-R1 achieved the highest accuracy (96.60%) on multiple-choice tasks, significantly outperforming the other models (p < 0.05 to p < 0.0001). In simulated case evaluations, both R1 and o3 performed comparably with physicians in overall readability and diagnostic appropriateness. Whereas R1 demonstrated superior guideline compliance and evidence-based reasoning, o3 showed advantages in workflow clarity, sequencing, and response fluency. However, o3 generated fewer explicit errors than R1. Human clinicians maintained strengths in terminology precision and logical reasoning.
[CONCLUSION] DeepSeek-R1 and ChatGPT-o3 exhibit complementary strengths in prostate cancer clinical decision-making, with R1 favoring factual accuracy and o3 excelling in expressive clarity. Although both models approach human-level performance in structured evaluations, human oversight and continued domain-specific optimization remain essential for their safe and effective integration into clinical workflows.
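The accuracy comparison described in the methods (per-run accuracy for each model, then a one-way ANOVA across models) can be illustrated with a minimal sketch. This is not the authors' analysis, which was run in GraphPad Prism. The function names, the model labels, and every number below are hypothetical and chosen only to show the arithmetic; the sketch computes the F statistic and degrees of freedom but not the p-value.

```python
from statistics import mean


def accuracy(answers, key):
    """Fraction of multiple-choice questions answered correctly in one run."""
    if len(answers) != len(key):
        raise ValueError("answers and key must have the same length")
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of observations per group
    (here: one list of per-run accuracies per model).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    if ss_within == 0:
        raise ValueError("within-group variance is zero; F is undefined")

    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical per-run accuracies (three runs per model); NOT study data.
    runs = {
        "model_a": [0.96, 0.97, 0.97],
        "model_b": [0.90, 0.91, 0.89],
        "model_c": [0.85, 0.86, 0.84],
    }
    f_stat, df_b, df_w = one_way_anova_f(list(runs.values()))
    print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

With only three runs per model, the within-group degrees of freedom are small, so the F statistic is sensitive to run-to-run variation; pairwise differences between models would additionally need a post hoc test.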