Promise and pitfalls of AI chatbots in complex decision-making for thyroid nodules and papillary thyroid cancer.
Survey
OpenAlex topics:
Artificial Intelligence in Healthcare and Education
Clinical Reasoning and Diagnostic Skills
Machine Learning in Healthcare
APA
Grigoris Effraimidis, Athanasios Kasotas, et al. (2026). Promise and pitfalls of AI chatbots in complex decision-making for thyroid nodules and papillary thyroid cancer. European Thyroid Journal, 15(2). https://doi.org/10.1530/ETJ-25-0385
MLA
Grigoris Effraimidis, et al. "Promise and pitfalls of AI chatbots in complex decision-making for thyroid nodules and papillary thyroid cancer." European Thyroid Journal, vol. 15, no. 2, 2026.
PMID
41885289
Abstract
[INTRODUCTION] Artificial intelligence (AI) chatbots are increasingly used in medicine, but their reliability in scenarios with multiple management options is unclear. Indeterminate thyroid nodules and low- and low-to-intermediate-risk papillary thyroid carcinoma (PTC) represent such cases.
[METHODS] In a nationwide web-based survey, 201 members of the Hellenic Endocrine Society evaluated 12 clinical vignettes on indeterminate thyroid nodules and low- and low-to-intermediate-risk PTC. Their responses were compared with those generated by four conversational AI models (ChatGPT, Gemini, Copilot, and DeepSeek) at two time points, 11 months apart. DeepSeek was assessed only at the second time point. Chatbot outputs were assessed for agreement with endocrinologists' predominant answers, concordance with the most guideline-consistent options (American and European Thyroid Association recommendations), temporal stability, and inter-model agreement.
[RESULTS] Alignment between chatbots and endocrinologists' predominant responses was limited, reaching at most 25% across scenarios. In contrast, concordance with the most guideline-consistent options was higher, up to 83% (10/12 scenarios), depending on the model and time point. Across 12 scenarios, ChatGPT, Gemini, and Copilot changed their responses in 4, 7, and 5 scenarios, respectively, with some updates moving closer to, and others further from, guideline-based answers. Inter-model agreement ranged from 33 to 67%, indicating substantial variability among chatbots.
[CONCLUSION] AI chatbots show evolving but inconsistent performance in complex thyroid management scenarios. While guideline concordance can be relatively high, substantial variability across models, limited temporal reproducibility, and poor alignment with clinical practice highlight the need for ongoing longitudinal evaluation before safe integration into clinical decision-making.
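The percentages in the results are simple proportions over the 12 scenarios. The Python sketch below shows that arithmetic under stated assumptions: `percent_agreement` is a hypothetical helper, not the authors' code; "temporal stability" is taken here to mean the share of scenarios in which a model's answer did not change between the two time points; and the two short answer lists are made-up placeholders, not study data. Only the counts 10/12 and 4, 7, and 5 changed scenarios come from the abstract.

```python
def percent_agreement(answers_a, answers_b):
    """Percentage of scenarios in which two answer lists choose the same option."""
    if len(answers_a) != len(answers_b) or not answers_a:
        raise ValueError("answer lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(answers_a, answers_b))
    return 100.0 * matches / len(answers_a)


N_SCENARIOS = 12  # number of clinical vignettes reported in the abstract

# Reported in the abstract: scenarios in which each model changed its answer
# between the two time points.
changed = {"ChatGPT": 4, "Gemini": 7, "Copilot": 5}

# Assumption: stability = share of scenarios with an unchanged answer.
stability = {
    model: round(100 * (N_SCENARIOS - n_changed) / N_SCENARIOS)
    for model, n_changed in changed.items()
}

# Reported in the abstract: best guideline concordance was 10 of 12 scenarios.
best_guideline_concordance = round(100 * 10 / N_SCENARIOS)

# Made-up placeholder answers (NOT study data), only to show the helper in use.
toy_model = ["A", "B", "C", "A"]
toy_reference = ["A", "B", "D", "A"]

if __name__ == "__main__":
    print(stability)
    print(best_guideline_concordance)
    print(percent_agreement(toy_model, toy_reference))
```

Under that assumed definition, the reported change counts correspond to unchanged answers in 8, 5, and 7 of the 12 scenarios for ChatGPT, Gemini, and Copilot, respectively.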
Related open-access articles (matched on this paper's MeSH terms and keywords):
- A Phase I Study of Hydroxychloroquine and Suba-Itraconazole in Men with Biochemical Relapse of Prostate Cancer (HITMAN-PC): Dose Escalation Results.
- Self-management of male urinary symptoms: qualitative findings from a primary care trial.
- Clinical and Liquid Biomarkers of 20-Year Prostate Cancer Risk in Men Aged 45 to 70 Years.
- Diagnostic accuracy of Ga-PSMA PET/CT versus multiparametric MRI for preoperative pelvic invasion in the patients with prostate cancer.
- Comprehensive analysis of androgen receptor splice variant target gene expression in prostate cancer.
- Clinical Presentation and Outcomes of Patients Undergoing Surgery for Thyroid Cancer.