Evaluating the Performance of Different Large Language Models on Plastic and Aesthetic Surgery: A Cross-Sectional Blinded Study.
Abstract
[BACKGROUND] Large language models (LLMs) have demonstrated potential in various medical fields. However, their application in aesthetic plastic surgery remains largely unexplored, particularly in clinical decision support and patient consultations. Given that plastic surgery integrates medical knowledge, aesthetic judgment, and doctor-patient communication, a systematic evaluation of LLM performance is needed.
[OBJECTIVES] This study assessed the capabilities of three widely used LLMs, GPT-4o (OpenAI), DeepSeek R1 (DeepSeek), and Claude 3.5 (Anthropic), in aesthetic plastic surgery, including facial aesthetics, body contouring, and nonsurgical interventions, with the goals of providing evidence-based recommendations for model selection across clinical contexts and informing the future design and optimization of domain-specific language models.
[METHODS] A total of 125 questions were designed, covering multiple-choice examinations, clinical case analyses, expert guideline adherence, and patient consultation scenarios. Responses from each model were evaluated by three blinded plastic surgery experts against predefined criteria, including accuracy, comprehensiveness, readability, humanistic care, and ethical considerations.
[RESULTS] DeepSeek R1 demonstrated performance that was superior to or at least comparable to GPT-4o and Claude 3.5 in multiple aspects, particularly in comprehensiveness (P = 0.04), readability (P < 0.001), and humanistic care (P < 0.001). While all models maintained reasonable safety and ethical standards, Claude 3.5 showed lower scores in trustworthiness and comprehensiveness, limiting its reliability in clinical decision support.
[CONCLUSIONS] Among the three evaluated LLMs, DeepSeek R1 excelled in comprehensiveness, readability, and humanistic care; GPT-4o performed well in scientific accuracy and safety; and Claude 3.5 showed relative strengths in logical coherence.
[NO LEVEL ASSIGNED] This journal requires that authors assign a level of evidence to each submission to which Evidence-Based Medicine rankings are applicable. This excludes Review Articles, Book Reviews, and manuscripts that concern Basic Science, Animal Studies, Cadaver Studies, and Experimental Studies. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors at www.springer.com/00266.
MeSH Terms
Humans; Surgery, Plastic; Plastic Surgery Procedures; Cross-Sectional Studies; Female; Language; Male; Adult; Esthetics; Middle Aged; Single-Blind Method; Large Language Models