Generative artificial intelligence for patient education material on gastric cancer prevention.
APA
Rizkala, T., Muench, N., et al. (2026). Generative artificial intelligence for patient education material on gastric cancer prevention. Endoscopy. https://doi.org/10.1055/a-2780-0664
MLA
Rizkala, T., et al. "Generative Artificial Intelligence for Patient Education Material on Gastric Cancer Prevention." Endoscopy, 2026.
PMID
41688051
Abstract
[BACKGROUND] This study assessed the effectiveness of large language models (LLMs) in generating lay summaries for patient education on the management of precancerous lesions and early neoplasia in the stomach.
[METHODS] In this pilot study, we used a two-period, crossover, blinded design to compare a ChatGPT-4o-generated summary with a Digestive Cancers Europe (DiCE) summary. Two panels rated the materials: expert physicians and members of the DiCE Patient Advisory Committee. Experts scored accuracy, completeness, comprehensibility, and satisfaction across five sections; patients rated overall completeness, comprehensibility, and satisfaction. Paired comparisons used mixed-effects estimates. Readability was assessed with the Flesch-Kincaid grade level (FKGL) and the SMOG index.
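The abstract states that paired comparisons used mixed-effects estimates but gives no modeling detail. Below is a minimal sketch of one plausible setup, a linear mixed model with a fixed effect for material and a random intercept per rater, fitted with statsmodels on synthetic data; all column names and values are hypothetical, and treating the 1-6 Likert scores as continuous is a simplification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical long-format ratings: one row per rater x material x section.
# The rater/material/section/score columns are illustrative, not the
# study's actual dataset.
raters = [f"R{i}" for i in range(10)]
sections = list("ABCDE")
rows = [
    {"rater": r, "material": m, "section": s,
     "score": int(np.clip(rng.normal(4.2 if m == "gpt" else 4.0, 0.8), 1, 6))}
    for r in raters for m in ("gpt", "dice") for s in sections
]
df = pd.DataFrame(rows)

# Fixed effect for material; random intercept per rater captures the
# paired (within-rater) structure of the comparison.
model = smf.mixedlm("score ~ material", df, groups=df["rater"])
fit = model.fit()
print(fit.summary())
```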
[RESULTS] Median expert ratings were similar between the two materials across all metrics. For the overall summary, median (range; IQR) scores were: accuracy 5 (4-6; 1) for ChatGPT-4o vs. 5 (3-6; 1) for DiCE (P = 0.10); completeness 4 (3-5; 1) vs. 4 (2-5; 1; P = 0.27); comprehensibility 4 (3-5; 1) vs. 4 (2-5; 1; P = 0.33); and satisfaction 4 (2-5; 1) vs. 3 (1-5; 2; P = 0.53). Patient ratings closely mirrored those of the experts. Neither summary met guideline readability recommendations on either the FKGL or the SMOG index.
[CONCLUSION] ChatGPT-4o produced patient education material comparable to the DiCE summary, but both require readability optimization; a human-in-the-loop workflow and further testing across prompts and models are warranted.
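The FKGL and SMOG scores against which both summaries fell short are computed from standard published formulas. The sketch below shows how each index is derived from sentence, word, and syllable counts; the syllable counter is a crude vowel-group heuristic (the study presumably used validated tooling), and the sample sentence is illustrative only.

```python
import re
import math

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; real tools use pronunciation dictionaries."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1  # drop a silent trailing 'e'
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (FKGL, SMOG) for a plain-text passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]

    n_sent, n_words = len(sentences), len(words)
    n_syll = sum(syllables)
    polysyllables = sum(1 for s in syllables if s >= 3)

    # Flesch-Kincaid grade level: higher means harder to read.
    fkgl = 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59
    # SMOG index: the formula assumes a sample of at least 30 sentences.
    smog = 1.0430 * math.sqrt(polysyllables * (30 / n_sent)) + 3.1291
    return fkgl, smog

print(readability("Doctors remove small growths before they can turn into cancer."))
```

Patient education guidelines typically target roughly a sixth-grade reading level on such indices, which is the benchmark both summaries reportedly exceeded.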