
Promise and pitfalls of AI chatbots in complex decision-making for thyroid nodules and papillary thyroid cancer.

European Thyroid Journal, 2026, Vol. 15(2). Open access.
Sources: PubMed, DOI, PMC, OpenAlex (record last updated 2026-05-01)
OpenAlex topics: Artificial Intelligence in Healthcare and Education; Clinical Reasoning and Diagnostic Skills; Machine Learning in Healthcare

Effraimidis G, Kasotas A, Varsami S, Sazakli E, Karapanou O, Saltiki K

Cite this paper

APA: Grigoris Effraimidis, Athanasios Kasotas, et al. (2026). Promise and pitfalls of AI chatbots in complex decision-making for thyroid nodules and papillary thyroid cancer. European Thyroid Journal, 15(2). https://doi.org/10.1530/ETJ-25-0385
MLA: Grigoris Effraimidis, et al. "Promise and pitfalls of AI chatbots in complex decision-making for thyroid nodules and papillary thyroid cancer." European Thyroid Journal, vol. 15, no. 2, 2026.
PMID: 41885289
DOI: 10.1530/ETJ-25-0385

Abstract

[INTRODUCTION] Artificial intelligence (AI) chatbots are increasingly used in medicine, but their reliability in scenarios with multiple management options is unclear. Indeterminate thyroid nodules and low- and low-to-intermediate-risk papillary thyroid carcinoma (PTC) represent such cases.

[METHODS] In a nationwide web-based survey, 201 members of the Hellenic Endocrine Society evaluated 12 clinical vignettes on indeterminate thyroid nodules and low- and low-to-intermediate-risk PTC. Their responses were compared with those generated by four conversational AI models (ChatGPT, Gemini, Copilot, and DeepSeek) at two time points, 11 months apart. DeepSeek was assessed only at the second time point. Chatbot outputs were assessed for agreement with endocrinologists' predominant answers, concordance with the most guideline-consistent options (American and European Thyroid Association recommendations), temporal stability, and inter-model agreement.

[RESULTS] Alignment between chatbots and endocrinologists' predominant responses was limited, reaching at most 25% across scenarios. In contrast, concordance with the most guideline-consistent options was higher, up to 83% (10/12 scenarios), depending on the model and time point. Across 12 scenarios, ChatGPT, Gemini, and Copilot changed their responses in 4, 7, and 5 scenarios, respectively, with some updates moving closer to, and others further from, guideline-based answers. Inter-model agreement ranged from 33 to 67%, indicating substantial variability among chatbots.
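
The figures reported here are simple proportions over the 12 scenarios (for example, 10/12 is about 83%, and a model that changed its answer in 4 scenarios kept it in 8/12, about 67%). The following is a minimal sketch of how such agreement percentages could be computed, assuming each scenario's answer is coded as a single option letter. The answer strings are hypothetical toy data, not the study's responses, and the function names are illustrative only.

```python
from itertools import combinations


def percent_agreement(a, b):
    """Percentage of scenarios in which two answer sequences match."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("answer sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def pairwise_agreement(answers_by_model):
    """Percent agreement for every unordered pair of models."""
    return {
        (m1, m2): percent_agreement(answers_by_model[m1], answers_by_model[m2])
        for m1, m2 in combinations(sorted(answers_by_model), 2)
    }


# Hypothetical toy data: 12 scenarios, one option letter per scenario.
guideline = "ABCABCABCABC"    # assumed guideline-consistent options
model_x_t1 = "ABCABCABCAAA"   # hypothetical model, first time point
model_x_t2 = "BACABCABCABC"   # same hypothetical model, second time point

# Guideline concordance at the first time point (10 of 12 scenarios match).
concordance_t1 = percent_agreement(model_x_t1, guideline)

# Temporal stability: scenarios where the answer changed between time points.
changed = sum(x != y for x, y in zip(model_x_t1, model_x_t2))
stability = percent_agreement(model_x_t1, model_x_t2)

# Inter-model agreement, here treating the two answer sets as two "models".
pairs = pairwise_agreement({"x_t1": model_x_t1, "x_t2": model_x_t2})

print(f"guideline concordance: {concordance_t1:.0f}%")
print(f"changed scenarios: {changed}, stability: {stability:.0f}%")
```

Plain percent agreement is used because the abstract reports raw proportions; whether the authors also applied a chance-corrected statistic is not stated.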

[CONCLUSION] AI chatbots show evolving but inconsistent performance in complex thyroid management scenarios. While guideline concordance can be relatively high, substantial variability across models, limited temporal reproducibility, and poor alignment with clinical practice highlight the need for ongoing longitudinal evaluation before safe integration into clinical decision-making.
