Evaluating the Quality and Reliability of Large Language Models for Plastic Surgery Patient Education: A Comparative Analysis of ChatGPT and OpenEvidence.
Abstract
[BACKGROUND] Concerns about the accuracy of information produced by general-purpose large language models have prompted the search for alternative tools. OpenEvidence has emerged as a healthcare-focused large language model trained exclusively on peer-reviewed medical literature.
[OBJECTIVES] This study compared the quality, accuracy, and readability of aesthetic surgery patient education materials generated by OpenEvidence and ChatGPT.
[METHODS] A standardized prompt requesting comprehensive postoperative discharge instructions for 20 of the most common aesthetic surgery procedures was entered into OpenEvidence and ChatGPT-5. Outputs were evaluated using 4 validated assessment tools: the DISCERN instrument for information quality (1-5), the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P) for information understandability and actionability (0-100), the Flesch-Kincaid scale for estimated grade level (fifth grade to professional level) and reading ease (0-100), and a Likert scale for citation accuracy (1-4).
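Both Flesch measures are deterministic formulas over sentence, word, and syllable counts. As a point of reference, a minimal sketch of computing them in Python, assuming the open-source textstat package; the abstract does not state which software was used:

```python
# Minimal sketch: scoring a generated discharge-instruction text with the two
# Flesch readability measures. Assumes the open-source `textstat` package
# (pip install textstat); the study does not specify its readability software.
import textstat

instructions = (
    "Keep the incision clean and dry for 48 hours. "
    "Avoid strenuous activity for two weeks."
)

# Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words);
# higher scores (0-100) indicate easier text.
ease = textstat.flesch_reading_ease(instructions)

# Flesch-Kincaid Grade Level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59;
# approximates the US school grade needed to understand the text.
grade = textstat.flesch_kincaid_grade(instructions)

print(f"Reading ease: {ease:.1f}, grade level: {grade:.1f}")
```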
[RESULTS] OpenEvidence scored significantly higher than ChatGPT-5 on DISCERN (3.3 ± 0.4 vs 1.7 ± 0.4, P < .001) and the citation accuracy scale (2.4 ± 1.3 vs 1.5 ± 0.7, P = .007). Scores were comparable between the two tools for PEMAT-P understandability (71 ± 5 vs 69 ± 0, P = .3) and actionability (52 ± 12 vs 54 ± 5, P = .6), as well as on the Flesch-Kincaid Grade Level (9.3 ± 1.0 vs 9.2 ± 0.6, P = .8) and the Flesch Reading Ease Score (40.0 ± 6.6 vs 41.0 ± 5.5, P = .6).
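For readers wanting to sanity-check the reported comparisons, the group means and standard deviations are enough to reconstruct an unpaired test. A sketch using the DISCERN figures above; this assumes a Welch t-test with n = 20 procedures per tool, which the abstract does not confirm (a paired or nonparametric test across the 20 procedures may have been used instead):

```python
# Illustrative only: reproducing a P value from reported summary statistics.
# Assumes an unpaired Welch t-test over n = 20 procedures per tool; the
# abstract does not specify the statistical test actually used.
from scipy.stats import ttest_ind_from_stats

# DISCERN scores from the abstract: OpenEvidence 3.3 ± 0.4, ChatGPT-5 1.7 ± 0.4
t, p = ttest_ind_from_stats(
    mean1=3.3, std1=0.4, nobs1=20,
    mean2=1.7, std2=0.4, nobs2=20,
    equal_var=False,  # Welch's correction for unequal variances
)
print(f"t = {t:.2f}, P = {p:.2e}")  # P << .001, consistent with the report
```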
[CONCLUSIONS] OpenEvidence generated materials of significantly higher quality and reliability than ChatGPT, suggesting it may be the better-suited tool for patient education in aesthetic surgery practice.
MeSH Terms
Humans; Patient Education as Topic; Reproducibility of Results; Comprehension; Language; Plastic Surgery Procedures; Surgery, Plastic; Large Language Models; Generative Artificial Intelligence