St. Gallen International Breast Cancer Consensus-Based Clinical Decision Validation: Concordance Assessment Between Deep Large Language Model Outputs and Global Expert Panel Recommendations.
TL;DR
DeepSeek models showed moderate concordance with the breast cancer expert panel consensus and a significant advantage in answer robustness, suggesting that DeepSeek has considerable potential for clinical decision-making in breast cancer.
OpenAlex topics
Radiomics and Machine Learning in Medical Imaging
Explainable Artificial Intelligence (XAI)
Artificial Intelligence in Healthcare and Education
- p-value p < 0.001
- p-value p = 0.005
APA
Yi Pan, Chenglong Duan, et al. (2026). St. Gallen International Breast Cancer Consensus-Based Clinical Decision Validation: Concordance Assessment Between Deep Large Language Model Outputs and Global Expert Panel Recommendations. Annals of Surgical Oncology, 33(5), 4518-4529. https://doi.org/10.1245/s10434-026-19176-1
MLA
Yi Pan, et al. "St. Gallen International Breast Cancer Consensus-Based Clinical Decision Validation: Concordance Assessment Between Deep Large Language Model Outputs and Global Expert Panel Recommendations." Annals of Surgical Oncology, vol. 33, no. 5, 2026, pp. 4518-4529.
PMID
41667891
Abstract
[BACKGROUND] The newly developed large language model (LLM) DeepSeek has shown potential for application in other medical fields. However, few systematic studies have assessed its concordance with international expert consensus or compared its performance with leading models such as Gemini 2.0 Pro and ChatGPT-4o in breast cancer.
[MATERIALS AND METHODS] A total of 139 consensus questions from the 19th St. Gallen International Breast Cancer Conference (SG-BCC) were included in the analysis. Each model was prompted to answer each consensus question five times. The DeepSeek model was compared with the expert panel consensus in terms of concordance rate, answer robustness, Pearson correlation coefficient r for non-binary questions, and absolute proportion difference for binary questions. A side-by-side comparison was also made with the earlier LLMs Gemini 2.0 Pro and ChatGPT-4o.
[RESULTS] The overall concordance rate between DeepSeek-V3 and the expert panel consensus was 63.31%, and the average answer robustness of DeepSeek-V3 (i.e., its self-consistency across repeated queries) was 86.69%. DeepSeek-V3 performed similarly to Gemini 2.0 Pro and ChatGPT-4o in concordance rate of the most frequent answers (p = 0.849). Robustness differed significantly among the models (p < 0.001), with DeepSeek-V3 significantly outperforming Gemini 2.0 Pro (p = 0.005) and ChatGPT-4o (p < 0.001).
[CONCLUSIONS] DeepSeek models showed moderate concordance with the breast cancer expert panel consensus and a significant advantage in answer robustness, suggesting that DeepSeek has considerable potential for clinical decision-making in breast cancer.
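The metrics named in the methods can be illustrated with a minimal Python sketch. It is not the authors' code: the abstract does not give exact formulas, so the definitions below are assumptions. Concordance is taken as the share of questions where the model's most frequent answer matches the panel's majority answer, robustness as the share of repeated runs that agree with that most frequent answer, and the absolute proportion difference as the gap between the model's and the panel's "yes" shares on a binary question. All question IDs, answers, and panel shares are made-up toy data.

```python
from collections import Counter
from statistics import mean


def modal_answer(answers):
    """Most frequent answer among repeated runs (ties go to the first seen)."""
    return Counter(answers).most_common(1)[0][0]


def robustness(answers):
    """Assumed definition: share of runs that agree with the modal answer."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def concordance_rate(model_runs, panel_majority):
    """Share of questions whose modal model answer equals the panel majority."""
    hits = sum(
        modal_answer(model_runs[q]) == panel_majority[q] for q in panel_majority
    )
    return hits / len(panel_majority)


def abs_proportion_difference(answers, panel_yes_share, positive="yes"):
    """Assumed definition for binary questions: |model 'yes' share - panel 'yes' share|."""
    model_yes_share = sum(a == positive for a in answers) / len(answers)
    return abs(model_yes_share - panel_yes_share)


# Toy data (hypothetical): four questions, five repeated runs each.
model_runs = {
    "Q1": ["yes", "yes", "yes", "yes", "yes"],
    "Q2": ["yes", "yes", "no", "yes", "no"],
    "Q3": ["A", "B", "A", "A", "A"],
    "Q4": ["no", "no", "no", "no", "no"],
}
panel_majority = {"Q1": "yes", "Q2": "no", "Q3": "A", "Q4": "no"}

concordance = concordance_rate(model_runs, panel_majority)
mean_robustness = mean(robustness(runs) for runs in model_runs.values())
q2_gap = abs_proportion_difference(model_runs["Q2"], panel_yes_share=0.45)

print(f"concordance rate: {concordance:.2%}")
print(f"mean robustness:  {mean_robustness:.2%}")
print(f"Q2 absolute proportion difference: {q2_gap:.2f}")
```

In the toy data, Q2 is the only question where the model's most frequent answer disagrees with the panel, so three of four questions are concordant. For scale, the reported 63.31% concordance is consistent with 88 of the 139 questions (88/139 ≈ 0.6331).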
MeSH Terms
Humans; Breast Neoplasms; Female; Consensus; Clinical Decision-Making; Deep Learning; Practice Guidelines as Topic; Large Language Models
Highly cited papers by the same first author (5)
- Pathologic response and nodal status guide adjuvant immunotherapy in non-small cell lung cancer after neoadjuvant chemoimmunotherapy: An eastern Asian cohort study.
- A screening strategy based on machine learning for diagnostic biomarkers in small cell lung cancer.
- Multimodal treatment of radiation-associated laryngeal angiosarcoma: A case report and literature review.
- p38 inhibition restores chemosensitivity of tumor cells by disrupting oligomerized breast cancer resistance protein membrane trafficking.
- Inhibition of glycosphingolipid synthesis overcomes the steric hindrance of CD30 N-glycans to augment CD30-targeted immunotherapeutic efficacy.