
St. Gallen International Breast Cancer Consensus-Based Clinical Decision Validation: Concordance Assessment Between Deep Large Language Model Outputs and Global Expert Panel Recommendations.

Annals of Surgical Oncology, 2026, Vol. 33(5), pp. 4518-4529 · Cited by 1 · Open access
TL;DR DeepSeek models showed moderate concordance with the consensus of the breast cancer expert panel and significantly higher answer robustness than the comparison models, suggesting that DeepSeek has considerable potential for supporting clinical decision-making in breast cancer.
OpenAlex topics: Radiomics and Machine Learning in Medical Imaging; Explainable Artificial Intelligence (XAI); Artificial Intelligence in Healthcare and Education

Pan Y, Duan C, Du J, Zhang J, Du K, Zhang C, Liu Z, Zhang W, Wang B, Ren Y, Sun Z, Zhu L

Key clinical statistics (automatically extracted from the abstract; verification against the original is recommended)
  • p-value p < 0.001
  • p-value p = 0.005

Cite this paper
APA Yi Pan, Chenglong Duan, et al. (2026). St. Gallen International Breast Cancer Consensus-Based Clinical Decision Validation: Concordance Assessment Between Deep Large Language Model Outputs and Global Expert Panel Recommendations. Annals of Surgical Oncology, 33(5), 4518-4529. https://doi.org/10.1245/s10434-026-19176-1
MLA Yi Pan, et al. "St. Gallen International Breast Cancer Consensus-Based Clinical Decision Validation: Concordance Assessment Between Deep Large Language Model Outputs and Global Expert Panel Recommendations." Annals of Surgical Oncology, vol. 33, no. 5, 2026, pp. 4518-4529.
PMID 41667891

Abstract

[BACKGROUND] The newly developed large language model (LLM) DeepSeek has shown potential for application in other medical fields. However, few systematic studies have assessed its concordance with international expert consensus or compared its performance with leading models such as Gemini 2.0 Pro and ChatGPT-4o in breast cancer.

[MATERIALS AND METHODS] A total of 139 consensus questions from the 19th St. Gallen International Breast Cancer Conference (SG-BCC) were included in the analysis. Each model was trained to answer each consensus question five times. The DeepSeek model was compared with the expert panel consensus in terms of concordance rate, answer robustness, the Pearson correlation coefficient r for non-binary questions, and the absolute proportion difference for binary questions. A head-to-head comparison was also made with the earlier LLMs Gemini 2.0 Pro and ChatGPT-4o.
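
The four metrics named above can be illustrated with a minimal Python sketch. The abstract does not give their exact definitions, so everything here is an assumption: robustness is taken as the share of the five repeated answers that match the most frequent answer, concordance as agreement between that most frequent answer and the panel majority, and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import mean


def modal_answer(answers):
    """Most frequent answer among repeated runs (ties go to the first seen)."""
    return Counter(answers).most_common(1)[0][0]


def robustness(answers):
    """ASSUMED definition: share of repeated runs equal to the modal answer."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def concordance_rate(model_runs, panel_majority):
    """ASSUMED definition: fraction of questions whose modal model answer
    equals the expert panel's majority answer."""
    hits = sum(
        modal_answer(model_runs[q]) == panel_majority[q] for q in panel_majority
    )
    return hits / len(panel_majority)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def abs_proportion_diff(model_yes_share, panel_yes_share):
    """Absolute difference between two 'yes' proportions (binary question)."""
    return abs(model_yes_share - panel_yes_share)


# Hypothetical toy data: three questions, five repeated answers each.
model_runs = {
    "Q1": ["yes", "yes", "yes", "yes", "no"],
    "Q2": ["A", "B", "A", "A", "A"],
    "Q3": ["no", "no", "no", "no", "no"],
}
panel_majority = {"Q1": "yes", "Q2": "B", "Q3": "no"}

concordance = concordance_rate(model_runs, panel_majority)
mean_robustness = mean(robustness(runs) for runs in model_runs.values())

# Non-binary question: compare vote shares per answer option (toy numbers).
r_q2 = pearson_r([0.8, 0.2, 0.0], [0.3, 0.6, 0.1])
# Binary question: model said "yes" in 4 of 5 runs; panel voted 65% "yes".
diff_q1 = abs_proportion_diff(4 / 5, 0.65)

print(f"concordance={concordance:.4f} robustness={mean_robustness:.4f}")
print(f"pearson_r(Q2)={r_q2:.4f} abs_diff(Q1)={diff_q1:.4f}")
```

Under these assumed definitions a model can be highly robust yet discordant: in the toy data, Q2 is answered "A" in four of five runs while the panel majority is "B".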

[RESULTS] The overall concordance rate between DeepSeek-V3 and the expert panel consensus was 63.31%, and the average answer robustness (i.e., its self-consistency across repeated queries) of DeepSeek-V3 was 86.69%. In addition, DeepSeek-V3 performed similarly to Gemini 2.0 Pro and ChatGPT-4o in terms of concordance rate of the most frequent answers (p = 0.849). In terms of model robustness, there were significant statistical differences among the models (p < 0.001), with DeepSeek-V3 significantly outperforming Gemini 2.0 Pro (p = 0.005) and ChatGPT-4o (p < 0.001).
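
As a rough sanity check, the reported concordance rate can be converted back into a question count. The sketch below assumes the 63.31% is a simple per-question proportion over all 139 questions, which the abstract does not state explicitly.

```python
TOTAL_QUESTIONS = 139   # from the Materials and Methods
REPORTED_RATE = 63.31   # percent, from the Results

# Find every whole-number count that rounds to the reported percentage.
matches = [
    k
    for k in range(TOTAL_QUESTIONS + 1)
    if round(100 * k / TOTAL_QUESTIONS, 2) == REPORTED_RATE
]
print(matches)
```

Under that assumption, the only count consistent with the reported figure is 88 of 139 questions.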

[CONCLUSIONS] DeepSeek models showed moderate concordance with the consensus of the breast cancer expert panel and significantly higher answer robustness than the comparison models, suggesting that DeepSeek has considerable potential for supporting clinical decision-making in breast cancer.

MeSH Terms

Humans; Breast Neoplasms; Female; Consensus; Clinical Decision-Making; Deep Learning; Practice Guidelines as Topic; Large Language Models

Highly cited papers by the same first author (5)