
Automating the Observer OPTION-5 measure of shared decision making: Assessing validity by comparing large language models to human ratings.

Patient education and counseling, 2026, Vol. 142, p. 109362
Source

Selvaraj SP, Yen RW, Forcino R, Elwyn G



Cite this paper

APA Selvaraj SP, Yen RW, et al. (2026). Automating the Observer OPTION-5 measure of shared decision making: Assessing validity by comparing large language models to human ratings. Patient education and counseling, 142, 109362. https://doi.org/10.1016/j.pec.2025.109362
MLA Selvaraj SP, et al. "Automating the Observer OPTION-5 measure of shared decision making: Assessing validity by comparing large language models to human ratings." Patient education and counseling, vol. 142, 2026, p. 109362.
PMID 41016196 ↗

Abstract

[OBJECTIVES] Observer-based measures of shared decision making rely on human raters, which is resource-intensive and limits routine assessment and improvement. Generative artificial intelligence could increase the speed and accuracy of observer-based evaluation while reducing the burden. This study aimed to assess the performance of large language models (LLMs) from the Gemini, GPT, and LLaMA families of models in evaluating the extent of shared decision making between clinicians and women considering surgery for early-stage breast cancer.

[METHODS] LLM-generated scores were compared with those of trained human raters from a randomized controlled trial using the 5-item Observer OPTION-5 measure. We analyzed 287 anonymized transcripts of breast cancer consultations. A series of prompts was tested across models, assessing correlations with human scores. We also evaluated the ability of LLMs to distinguish high- from low-scoring encounters and the impact of inter-rater agreement on performance.

[RESULTS] The scores for Observer OPTION-5 items generated by GPT-4o and Gemini-1.5-Pro-002 correlated with human ratings (Pearson r ≈ 0.6, p < 0.01), representing ≈ 75-80 % of the correlation observed between human raters themselves (r = 0.77). Providing detailed descriptions and examples improved the models' performance. The results also confirm that the models could distinguish high- from low-scoring encounters, with an independent-samples t-test showing a large and significant separation between the two groups (t > 10, p < 0.01).
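The two validation checks described above — correlating LLM scores with human ratings, and testing whether LLM scores separate high- from low-scoring encounters — can be sketched as follows. This is an illustrative sketch on synthetic data, not the study's data or code; the score arrays, scales, and median split are assumptions for demonstration only.

```python
# Illustrative sketch of the two analyses from the abstract, on synthetic data.
import numpy as np
from scipy.stats import pearsonr, ttest_ind

rng = np.random.default_rng(0)

# Simulated total Observer OPTION-5 scores (0-100 scale) for 287 transcripts.
n = 287
human = rng.uniform(10, 80, size=n)
# LLM scores tracking human scores plus noise (placeholder, not real model output).
llm = 0.8 * human + rng.normal(0, 12, size=n)

# Concurrent validity: Pearson correlation between LLM and human ratings.
r, p = pearsonr(human, llm)
print(f"Pearson r = {r:.2f}, p = {p:.3g}")

# Known-groups check: do LLM scores separate high- from low-scoring encounters?
# (Here the groups are defined by a median split on human scores, an assumption.)
median = np.median(human)
t, p_t = ttest_ind(llm[human >= median], llm[human < median])
print(f"t = {t:.1f}, p = {p_t:.3g}")
```

In the study, r ≈ 0.6 against an inter-human benchmark of r = 0.77, and the group separation gave t > 10; the synthetic numbers here will differ.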

[CONCLUSIONS] Based on the breast cancer surgery dataset we explored, LLMs can evaluate aspects of clinician-patient dialog using existing measures, providing the basis for the development and fine-tuning of prompts. Future work should focus on generalizability, larger datasets, and improving model performance.

[PRACTICE IMPLICATIONS] The prospect of being able to automate the assessment of shared decision-making opens the door to rapid feedback as a means for reflective practice improvement.
