Automating the Observer OPTION-5 measure of shared decision making: Assessing validity by comparing large language models to human ratings.
- p-value: p < 0.01
- Study design: randomized controlled trial
APA
Selvaraj, S. P., Yen, R. W., et al. (2026). Automating the Observer OPTION-5 measure of shared decision making: Assessing validity by comparing large language models to human ratings. Patient Education and Counseling, 142, 109362. https://doi.org/10.1016/j.pec.2025.109362
MLA
Selvaraj, S. P., et al. "Automating the Observer OPTION-5 Measure of Shared Decision Making: Assessing Validity by Comparing Large Language Models to Human Ratings." Patient Education and Counseling, vol. 142, 2026, p. 109362.
PMID
41016196
Abstract
[OBJECTIVES] Observer-based measures of shared decision making rely on human raters, which is resource-intensive and limits routine assessment and improvement. Generative artificial intelligence could increase the speed and accuracy of observer-based evaluation while reducing the burden. This study aimed to assess the performance of large language models (LLMs) from the Gemini, GPT, and LLaMA families in evaluating the extent of shared decision making between clinicians and women considering surgery for early-stage breast cancer.
[METHODS] LLM-generated scores were compared with those of trained human raters from a randomized controlled trial using the 5-item Observer OPTION-5 measure. We analyzed 287 anonymized transcripts of breast cancer consultations. A series of prompts was tested across models, assessing correlations with human scores. We also evaluated the ability of LLMs to distinguish high- from low-scoring encounters and the impact of inter-rater agreement on performance.
[RESULTS] The Observer OPTION-5 item scores generated by GPT-4o and Gemini-1.5-Pro-002 correlated with human ratings (Pearson r ≈ 0.6, p < 0.01), representing approximately 75-80% of the correlation observed between human raters themselves (r = 0.77). Providing detailed descriptions and examples improved the models' performance. The results also confirm that the models could distinguish high- from low-scoring encounters, with an independent-samples t-test showing a large and significant separation between the two groups (t > 10, p < 0.01).
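For readers who want to see the comparison logic in concrete terms, a minimal sketch follows. It is not the authors' code: the per-transcript scores, the scoring scale, and the median split used to separate high- from low-scoring encounters are illustrative assumptions, but the statistics mirror those reported above (Pearson correlation between LLM and human ratings, and an independent-samples t-test between the two groups).

```python
# Minimal sketch (assumed data, not the study's code): compare LLM-generated
# Observer OPTION-5 totals with human-rater totals for the same transcripts.
import numpy as np
from scipy.stats import pearsonr, ttest_ind

# Hypothetical per-transcript total scores; in the study these would come from
# prompting each model with a transcript and from trained human raters.
human_scores = np.array([20, 35, 50, 65, 80, 30, 55, 70, 25, 60])
llm_scores = np.array([25, 30, 55, 60, 75, 35, 50, 65, 30, 55])

# 1) Agreement between LLM and human ratings (Pearson correlation).
r, p = pearsonr(llm_scores, human_scores)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")

# 2) Can LLM scores separate high- from low-scoring encounters, as judged
#    by humans? Here the split is a simple median threshold (assumption).
threshold = np.median(human_scores)
high = llm_scores[human_scores >= threshold]
low = llm_scores[human_scores < threshold]
t, p_t = ttest_ind(high, low)
print(f"t = {t:.2f} (p = {p_t:.3f})")
```

In practice the first step, turning each transcript into item-level scores, is where the prompt design matters most; the abstract notes that adding detailed item descriptions and examples to the prompt improved agreement with human raters.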
[CONCLUSIONS] Based on the breast cancer surgery dataset we explored, LLMs can evaluate aspects of clinician-patient dialog using existing measures, providing the basis for the development and fine-tuning of prompts. Future work should focus on generalizability, larger datasets, and improving model performance.
[PRACTICE IMPLICATIONS] The prospect of being able to automate the assessment of shared decision-making opens the door to rapid feedback as a means for reflective practice improvement.
🏷️ Related articles by keyword / MeSH (free full text), based on this paper's MeSH terms and keywords
- Scoring Physician Risk Communication in Prostate Cancer Using Large Language Models.
- Real-world Treatment Selection and Shared Decision-making in De Novo Metastatic Castration-sensitive Prostate Cancer in Japan.
- Association of Patient Comorbidities With Treatment Regret Among Patients With Localized Prostate Cancer - Results From a Population-Based Cohort.
- DualPG-DTA: A Large Language Model-Powered Graph Neural Network Framework for Enhanced Drug-Target Affinity Prediction and Discovery of Novel CDK9 Inhibitors Exhibiting In Vivo Anti-Leukemia Activity.
- Automating the segmentation, date extraction, and classification of multi-report PDFs in outside medical records using optical character recognition and generative artificial intelligence.
- Genetic underpinnings of type-2 diabetes (T2D) with colorectal cancer (CRC): In-silico discovery of common molecular signatures, pathogenetic processes and therapeutic candidates.