
Key Measures for Evaluating Diagnostic Accuracy in Multi-Class Classification: An Overview and Simulation-Based Comparison.

Korean Journal of Radiology, 2026, Vol. 27(4), pp. 344-355

Ryu L, Han K, Jung I, Park YW, Ahn SS, Hwang D


Cite this article

APA Ryu L, Han K, et al. (2026). Key Measures for Evaluating Diagnostic Accuracy in Multi-Class Classification: An Overview and Simulation-Based Comparison. Korean Journal of Radiology, 27(4), 344-355. https://doi.org/10.3348/kjr.2025.1447
PMID 41914484

Abstract

Recent advancements in artificial intelligence have led to increased interest in predictive modeling across various domains, including medicine. Although numerous metrics have been established for binary classification, the growing adoption of multi-class systems necessitates robust evaluation measures. However, comprehensive simulation studies comparing the performance of existing multi-class metrics under diverse data conditions remain limited. In this study, we first provide a concise overview of commonly used accuracy metrics for multi-class classification. Then, we report a simulation study that systematically evaluates several diagnostic accuracy measures under a wide range of scenarios, including three- and five-class settings, balanced and imbalanced sample sizes, and different distributional assumptions for predictors. We assessed each metric's performance in terms of bias and 95% confidence interval coverage. Under balanced conditions, most metrics demonstrated stable and unbiased performance, closely approximating the true values. However, under imbalanced conditions, greater bias was observed, with the M-index and polytomous discrimination index exhibiting comparatively more stable performance across various scenarios. The micro-averaged receiver operating characteristic curve area consistently showed higher bias under class imbalance. Finally, we applied these metrics to a glioma tumor grading task using external datasets. This study provides a systematic comparison of commonly used metrics and offers practical guidance for selecting appropriate measures in multi-class classification tasks.
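The micro-averaged ROC AUC discussed in the abstract pools all one-vs-rest class indicators and their predicted probabilities into a single binary problem before computing the AUC, which is why frequent classes dominate it under imbalance. A minimal NumPy sketch of that pooling (the function names `binary_auc` and `micro_ovr_auc` are illustrative, not from the paper; ties in scores are not specially handled):

```python
import numpy as np

def binary_auc(y, s):
    # Mann-Whitney U formulation: probability that a randomly chosen
    # positive receives a higher score than a randomly chosen negative.
    order = np.argsort(s)
    ranks = np.empty(len(s), dtype=float)
    ranks[order] = np.arange(1, len(s) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def micro_ovr_auc(y_true, y_score):
    # One-vs-rest binarization: row i, column c is 1 iff sample i belongs
    # to class c. Flattening indicators and scores together pools every
    # (sample, class) pair into one binary problem (micro-averaging).
    k = y_score.shape[1]
    y_bin = (y_true[:, None] == np.arange(k)).astype(int)
    return binary_auc(y_bin.ravel(), y_score.ravel())

y_true = np.array([0, 1, 2, 0, 1, 2])
y_score = np.eye(3)[y_true]          # perfectly confident predictions
print(micro_ovr_auc(y_true, y_score))  # -> 1.0
```

Because all (sample, class) pairs are pooled, each sample contributes one positive and k-1 negatives, so classes with many samples contribute proportionally more pairs; a macro average (mean of per-class one-vs-rest AUCs) weights each class equally instead.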

MeSH Terms

Humans; Computer Simulation; ROC Curve; Artificial Intelligence; Glioma; Brain Neoplasms