A comparative evaluation of large language models for simplifying prostate cancer pathology reports: ChatGPT and Gemini.

Zeng H; Yuan Y; Wu X; Ye Z; Yuan H; Luo S; Zhang K; Wang L; Liu H; Yang H

doi:10.1097/JS9.0000000000004454

International journal of surgery (London, England) 2026

Cite this paper

APA Zeng H, Yuan Y, et al. (2026). A comparative evaluation of large language models for simplifying prostate cancer pathology reports: ChatGPT and Gemini. International journal of surgery (London, England). https://doi.org/10.1097/JS9.0000000000004454

MLA Zeng H, et al. "A comparative evaluation of large language models for simplifying prostate cancer pathology reports: ChatGPT and Gemini." International journal of surgery (London, England), 2026.

PMID 41632012

Abstract

[OBJECTIVES] To evaluate the application value of three ChatGPT versions and Gemini in pathology report simplification tasks for prostate cancer.

[METHODS] This retrospective study assessed GPT-3.5, GPT-4.0, GPT-4o, and Gemini on pathology reports from 228 prostate cancer patients across two institutions. Data were split into internal (center 1, n = 171) and external (center 2, n = 57) cohorts. Using specific prompts, models generated simplified texts. Outputs were evaluated along three main dimensions: (1) human scoring by patients, clinicians, and pathologists; (2) readability scores; and (3) BERT-based semantic similarity scores. Statistical comparisons employed paired t-tests or Wilcoxon signed-rank tests. Agreement between raters was assessed using squared weighted kappa, the intraclass correlation coefficient ICC(3,1), and percent agreement, with 95% confidence intervals calculated for all metrics.
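As an illustration of the rater-agreement metrics named in the methods, the following is a minimal sketch of squared (quadratic) weighted kappa and percent agreement for two raters. It is not the authors' code: the function names, the ordinal rating scale, and the example ratings are hypothetical, and the 95% confidence intervals mentioned in the abstract are not computed here.

```python
def quadratic_weighted_kappa(rater_a, rater_b, categories):
    """Cohen's kappa with squared-distance disagreement weights.

    `categories` lists the ordinal scale in order (hypothetical example:
    [1, 2, 3, 4, 5]). Returns NaN when chance disagreement is zero.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Joint count table: observed[i][j] = items rated i by A and j by B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0  # weighted observed disagreement
    weighted_expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j] / n
            weighted_expected += weight * row[i] * col[j] / (n * n)

    if weighted_expected == 0:
        return float("nan")  # kappa is undefined in this degenerate case
    return 1.0 - weighted_observed / weighted_expected


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which both raters gave the identical score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


# Hypothetical ratings on a 3-point scale, for illustration only.
scores_a = [1, 2, 3, 3]
scores_b = [1, 2, 3, 2]
kappa = quadratic_weighted_kappa(scores_a, scores_b, [1, 2, 3])
agreement = percent_agreement(scores_a, scores_b)
```

The squared weights penalize a two-point disagreement four times as heavily as a one-point disagreement, which is why this variant suits ordinal rating scales better than unweighted kappa.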

[RESULTS] GPT-4o (Few-Shot) achieved the highest accuracy and comprehensiveness scores from pathologists, while Gemini demonstrated the best understandability. Patient and clinician understandability ratings were consistently high across models. Mean Reading Grade Level scores varied between internal and external datasets, with GPT-4o Few-Shot performing best overall. BERT-based semantic similarity scores demonstrated distinct trends across models, reflecting differences in text simplification strategies.
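The abstract does not state which formula underlies the reported reading grade levels. Purely as an assumption, the sketch below uses the Flesch-Kincaid Grade Level, one widely used grade-level formula, with a crude vowel-group syllable counter. The function names and the two sample sentences are hypothetical and are not taken from the study.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, discount a trailing silent 'e'."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


# Hypothetical report line and a hypothetical simplified rewrite.
original_text = ("Prostatic adenocarcinoma, Gleason score 4+3=7, "
                 "with perineural invasion identified.")
simplified_text = ("The test found cancer in the prostate. "
                   "It has grown near a nerve.")

original_grade = flesch_kincaid_grade(original_text)
simplified_grade = flesch_kincaid_grade(simplified_text)
```

A lower grade level indicates text that should be easier to read. Because the syllable counter is only a heuristic, absolute values from this sketch may differ from those of dedicated readability tools.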

[CONCLUSION] LLMs adopt distinct trade-off strategies between simplifying pathology reports and preserving their structure and logic, influenced by prompt design and textual style. Their application shows potential to enhance patient comprehension and clinical communication. Future work should focus on domain-specific fine-tuning to ensure safe and reliable clinical integration.
