
Evaluating the performance of large language models on the ASPS In-Service Examination: A comparative analysis with resident norms.

Journal of Plastic, Reconstructive & Aesthetic Surgery (JPRAS), 2026, Vol. 113, p. 164-167. Cited by 3.

TL;DR: Modern LLMs demonstrate consistent, high-level performance on the PSITE, frequently exceeding the median performance of plastic surgery residents and practitioners.

OpenAlex topics: Artificial Intelligence in Healthcare and Education; Diversity and Career in Medicine; Social Media in Health Education

Shekouhi R, Holohan MM, Mirzalieva O, Byrd B, Guidry MF, Palines PA, Chim H


Abstract

The emergence of large language models (LLMs) has raised critical questions about their potential roles in surgical education. This study evaluates the accuracy and comparative performance of three leading LLMs (ChatGPT 4.0, DeepSeek V3, and Gemini 2.5) on the American Board of Plastic Surgery Plastic Surgery In-Service Training Examination (PSITE) across a 20-year period. ChatGPT achieved the highest overall accuracy (75.0%), followed closely by DeepSeek (74.8%) and Gemini (74.5%), with no significant differences between models (p > 0.05). When benchmarked against normative data, DeepSeek reached the highest percentile ranks (81st among residents, 89th among practitioners), followed by ChatGPT (78th and 84th) and Gemini (72nd and 90th), without significant differences in rankings across LLMs (p > 0.05). In conclusion, modern LLMs demonstrate consistent, high-level performance on the PSITE, frequently exceeding the median performance of plastic surgery residents and practitioners.
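The abstract reports that the three accuracies do not differ significantly but does not name the test or the number of questions. As a rough illustration only, the sketch below runs a chi-square test of homogeneity on three proportions. The function name `chi_square_three_groups` and the question count of 1000 per model are hypothetical, chosen so that the counts reproduce the reported 75.0%, 74.8%, and 74.5%; they are not taken from the study.

```python
import math


def chi_square_three_groups(correct, totals):
    """Chi-square test of homogeneity for exactly three groups (df = 2).

    Returns (chi2, p_value). For df = 2 the chi-square survival function
    has the closed form exp(-chi2 / 2), so no external library is needed.
    """
    if len(correct) != 3 or len(totals) != 3:
        raise ValueError("this sketch only handles exactly three groups")

    pooled_rate = sum(correct) / sum(totals)
    if not 0 < pooled_rate < 1:
        raise ValueError("pooled accuracy must be strictly between 0 and 1")

    chi2 = 0.0
    for k, n in zip(correct, totals):
        expected_correct = n * pooled_rate
        expected_wrong = n * (1 - pooled_rate)
        chi2 += (k - expected_correct) ** 2 / expected_correct
        chi2 += ((n - k) - expected_wrong) ** 2 / expected_wrong
    return chi2, math.exp(-chi2 / 2)


# HYPOTHETICAL counts: the abstract gives percentages only, so 1000
# questions per model is an assumption made purely for illustration.
N_QUESTIONS = 1000
correct_counts = [750, 748, 745]  # ChatGPT, DeepSeek, Gemini

chi2, p_value = chi_square_three_groups(correct_counts, [N_QUESTIONS] * 3)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

Under these assumed counts the p-value is far above 0.05, which is in line with the abstract's statement. Because all three models answer the same questions, a paired test could be more appropriate in practice; the unpaired version here is just the simplest illustration.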

Extracted medical entities (NER)

Type	English term	UMLS CUI	Source	Occurrences
Drug	`DeepSeek V3`		scispacy	1
Drug	`Gemini`		scispacy	1
Drug	`ChatGPT`		scispacy	1
Drug	`84th`		scispacy	1
Disease	`ASPS`	C0206293 Asp snake	scispacy	1
Other	`ChatGPT`		scispacy	1

MeSH Terms

Humans; Internship and Residency; Surgery, Plastic; Educational Measurement; Clinical Competence; Language; United States; Education, Medical, Graduate; Large Language Models

Highly cited papers by the same first author (3)

Preoperative Tranexamic Acid Use in Free Flap Breast Reconstruction: A Propensity-Matched Analysis of Postoperative Outcomes.
Microsurgery 2026
Older age is a predictor for hardware failure in open lower extremity fractures requiring free flap coverage.
Journal of hand and microsurgery 2025
Diagnostic Accuracy of Artificial Intelligence Models for Predicting Postoperative Complications Following Free Flap Reconstruction: A Systematic Review and Meta-Analysis.
Microsurgery 2025