Evaluation of Large Language Models for Radiologists' Support in Multidisciplinary Breast Cancer Teams: Comparative Study.

Jiang H; Yang C; Zhou W; Yin CL; Zhou S; He R; Ran G; Wang W; Wu M; Yu J

doi:10.2196/68182

JMIR Medical Informatics 2026 Vol. 14, p. e68182

PMID 41628437

Abstract

[BACKGROUND] Artificial intelligence tools, particularly large language models (LLMs), have shown considerable potential across various domains. However, their performance in the diagnosis and treatment of breast cancer remains unknown.

[OBJECTIVE] This study aimed to evaluate the performance of LLMs in supporting radiologists within multidisciplinary breast cancer teams, with a focus on their roles in facilitating informed clinical decisions and enhancing patient care.

[METHODS] A set of 50 questions covering radiological and breast cancer guidelines was developed to assess knowledge of breast cancer diagnosis and treatment. These questions were posed to 9 popular LLMs and to clinical physicians, with each expected to give a direct "Yes" or "No" answer along with supporting analysis. The performance of the 9 models (ChatGPT-4, ChatGPT-4o, ChatGPT-4o mini, Claude 3 Opus, Claude 3.5 Sonnet, Gemini 1.5 Pro, Tongyi Qianwen 2.5, ChatGLM, and Ernie Bot 3.5) was evaluated against that of radiologists with varying experience levels (resident physicians, fellow physicians, and attending physicians). Responses were assessed for accuracy, confidence, and consistency based on alignment with the 2024 National Comprehensive Cancer Network Breast Cancer Guidelines and the 2013 American College of Radiology Breast Imaging-Reporting and Data System recommendations.

[RESULTS] Claude 3 Opus and ChatGPT-4 achieved the highest confidence scores (2.78 and 2.74, respectively), while ChatGPT-4o had the highest accuracy score (2.92). For response consistency, Claude 3 Opus and Claude 3.5 Sonnet scored highest at 3.0, followed closely by ChatGPT-4o, Gemini 1.5 Pro, and ChatGPT-4o mini, all of which scored above 2.9. ChatGPT-4o mini achieved the top score of 3.0 in clinical diagnostics among all LLMs, which was also higher than the scores of all physician groups; however, no statistically significant differences were observed between it and any physician group (all P>.05). ChatGPT-4 also scored higher than the physician groups, but the differences were not statistically significant (P>.05). Across radiological diagnostics, clinical diagnosis, and overall performance, ChatGPT-4o mini and the Claude models achieved higher mean scores than all physician groups, but these differences were statistically significant only in comparison with fellow physicians (P<.05). In contrast, ChatGLM and Ernie Bot 3.5 underperformed across diagnostic areas, scoring lower than all physician groups, although without statistically significant differences (all P>.05). Among physician groups, attending physicians and resident physicians had comparably high scores in radiological diagnostic performance, whereas fellow physicians scored somewhat lower, though the difference was not statistically significant (P>.05).

[CONCLUSIONS] LLMs such as ChatGPT-4o and Claude 3 Opus showed potential in supporting multidisciplinary teams for breast cancer diagnostics and therapy. However, they cannot fully replicate the intricate decision-making processes honed through clinical experience, particularly in complex cases. This highlights the need for ongoing artificial intelligence refinement to ensure robust clinical applicability.

MeSH Terms

Humans; Breast Neoplasms; Female; Radiologists; Patient Care Team; Artificial Intelligence; Large Language Models
