
Diagnostic performance of deep learning for predicting glioma isocitrate dehydrogenase and 1p/19q co-deletion in MRI: a systematic review and meta-analysis.

European Radiology, 2026, Vol. 36(2), pp. 1562-1591

Farahani S, Hejazi M, Tabassum M, Di Ieva A, Mahdavifar N, Liu S


Cite this article:
Farahani S, Hejazi M, Tabassum M, Di Ieva A, Mahdavifar N, Liu S. (2026). Diagnostic performance of deep learning for predicting glioma isocitrate dehydrogenase and 1p/19q co-deletion in MRI: a systematic review and meta-analysis. European Radiology, 36(2), 1562-1591. https://doi.org/10.1007/s00330-025-11898-2
PMID: 40817944

Abstract

[OBJECTIVES] We aimed to evaluate the diagnostic performance of deep learning (DL)-based radiomics models for the noninvasive prediction of isocitrate dehydrogenase (IDH) mutation and 1p/19q co-deletion status in glioma patients using MRI sequences, and to identify methodological factors influencing accuracy and generalizability.

[MATERIALS AND METHODS] Following PRISMA guidelines, we systematically searched major databases (PubMed, Scopus, Embase, Web of Science, and Google Scholar) up to March 2025, screening studies that utilized DL to predict IDH and 1p/19q co-deletion status from MRI data. We assessed study quality and risk of bias using the Radiomics Quality Score and the QUADAS-2 tool. Our meta-analysis employed a bivariate model to compute pooled sensitivity and specificity, and meta-regression to assess interstudy heterogeneity.

[RESULTS] Among the 1517 unique publications, 104 were included in the qualitative synthesis, and 72 underwent meta-analysis. Pooled estimates for IDH prediction in test cohorts yielded a sensitivity of 0.80 (95% CI: 0.77-0.83) and specificity of 0.85 (95% CI: 0.81-0.87). For 1p/19q co-deletion, sensitivity was 0.75 (95% CI: 0.65-0.82) and specificity was 0.82 (95% CI: 0.75-0.88). Meta-regression identified the tumor segmentation method and the extent of DL integration into the radiomics pipeline as significant contributors to interstudy variability.

[CONCLUSION] Although DL models demonstrate strong potential for noninvasive molecular classification of gliomas, clinical translation requires several critical steps: harmonization of multi-center MRI data using techniques such as histogram matching and DL-based style transfer; adoption of standardized and automated segmentation protocols; extensive multi-center external validation; and prospective clinical validation.

[KEY POINTS]
Question: Can DL-based radiomics using routine MRI noninvasively predict IDH mutation and 1p/19q co-deletion status in gliomas, and what factors affect diagnostic accuracy?
Findings: Meta-analysis showed 80% sensitivity and 85% specificity for predicting IDH mutation, and 75% sensitivity and 82% specificity for 1p/19q co-deletion status.
Clinical relevance: MRI-based DL models demonstrate clinically useful accuracy for noninvasive glioma molecular classification, but data harmonization, standardized automated segmentation, and rigorous multi-center external validation are essential for clinical adoption.


Introduction
Gliomas, the most common and lethal primary central nervous system (CNS) tumors, show significant histological and molecular variability, making accurate diagnosis essential [1]. Key genetic markers, including isocitrate dehydrogenase (IDH) mutation and 1p/19q co-deletion, guide classification and treatment decisions [2]. Traditional biopsies are invasive and limited by tumor heterogeneity [3]. MRI is central to noninvasive glioma assessment, supported by the European Association of Neuro-Oncology (EANO) guidelines [4]. However, interpreting MRI data can be challenging due to human limitations and radiological “mimics,” which make distinguishing gliomas from conditions such as inflammatory diseases, stroke, and infections difficult [5].
Advancements in radiomics have begun to address these challenges by extracting intricate features from medical images [6]. Radiomics analysis involves two primary methodologies: feature-engineered and deep learning (DL)-based radiomics modeling [7]. The former involves processes such as image segmentation, feature extraction, and statistical analysis, each of which significantly influences subsequent outcomes, particularly in MRI models [8]. The subjectivity of handcrafted features spurred the integration of DL into radiomics, where DL can replace individual pipeline steps or operate end-to-end for direct classification [7, 9].
Since the introduction of DL into radiomics, numerous studies have predicted IDH and 1p/19q co-deletion [10–12]. Given the extensive research, there is a critical need for a systematic review to synthesize and thoroughly quantify existing data. Current reviews often focus on conventional radiomics, primarily analyzing radiomic features with machine learning methods. Additionally, some works concentrate solely on specific glioma grades or particular MRI modalities (e.g., dynamic susceptibility contrast (DSC) MR perfusion imaging and T2-FLAIR mismatch) for predicting either IDH mutation or 1p/19q co-deletion, often neglecting the simultaneous prediction of these biomarkers across various glioma grades and imaging techniques [13–17]. To address this gap, our study conducts a comprehensive systematic review and meta-regression to evaluate the accuracy and reliability of DL-based models in predicting IDH mutations and 1p/19q co-deletion using MRI, thereby consolidating evidence on their effectiveness.

Methods
We performed a PRISMA-guided systematic review and meta-analysis (PROSPERO: CRD42024542505) [18].

Search strategy and study selection
We systematically searched PubMed, Scopus, Embase, Web of Science, and Google Scholar for DL-based radiomics studies in glioma up to March 28, 2025, with no time or language restrictions (Supplementary Section 1). We also screened the bibliographies of relevant articles to identify further studies. The inclusion criteria were studies of gliomas (any World Health Organization grade) that predicted IDH and/or 1p/19q co-deletion status using MRI and incorporated DL algorithms in their radiomics workflow. For inclusion in the meta-analysis, studies had to report sufficient information to allow reconstruction of a 2 × 2 diagnostic table. Those without sufficient validation metrics were restricted to qualitative synthesis. Non-original and non-human studies were excluded. Records were managed via Zotero software (version 6.0.36). Two reviewers (S.F. and M.T.) independently screened the abstracts and full texts in two rounds, resolving disagreements through discussion.

Data extraction
Two reviewers (S.F. and M.T.) independently collected data on study design, patient characteristics, datasets used, MRI sequences, data augmentation techniques, and computational methodologies using a standardized form (Supplementary Section 2). Performance metrics for constructing the diagnostic confusion matrix were obtained from both internal validation methods (e.g., k-fold and leave-one-out cross-validation) and test datasets, prioritizing external validation cohorts when available, or otherwise using held-out test sets.
When diagnostic table counts were not explicitly reported, we first contacted the corresponding authors by email. If no data were provided, we reconstructed these values from reported sensitivity, specificity, and total sample size using standard formulas. When only receiver-operating characteristic (ROC) curves were available, we extracted sensitivity and specificity at the point closest to the top-left corner (Youden index) using WebPlotDigitizer v4.7. All imputed counts were rounded to the nearest whole number. No imputation was performed for missing clinical or demographic covariates; these data were excluded from subgroup analyses. This process aligns with Cochrane Handbook guidance for transparency and reproducibility [19]. In publications reporting multiple DL models or MRI modalities, the best-performing model was selected for the meta-analysis. However, the full range of results was included and analyzed separately in the subgroup analyses.
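To make the reconstruction step concrete, the sketch below rebuilds a 2 × 2 table from reported sensitivity, specificity, and per-class sample sizes, and selects the digitized ROC point closest to the top-left corner. All function names and numbers are hypothetical illustrations, not values from any included study.

```python
# Hypothetical sketch of the 2 x 2 reconstruction described above.
# Assumes the per-class counts (e.g., IDH-mutant vs. wildtype) can be
# derived from the reported totals; values are illustrative only.
import math

def reconstruct_2x2(sens: float, spec: float, n_pos: int, n_neg: int) -> dict:
    """Rebuild TP/FN/TN/FP, rounding imputed counts to whole numbers."""
    tp = round(sens * n_pos)
    tn = round(spec * n_neg)
    return {"TP": tp, "FN": n_pos - tp, "TN": tn, "FP": n_neg - tn}

def closest_to_top_left(roc_points):
    """From digitized (FPR, TPR) pairs, take the point nearest (0, 1)."""
    return min(roc_points, key=lambda p: math.hypot(p[0], 1 - p[1]))

# Example: sens = 0.80, spec = 0.85 in 60 mutant / 40 wildtype patients.
print(reconstruct_2x2(0.80, 0.85, 60, 40))
# -> {'TP': 48, 'FN': 12, 'TN': 34, 'FP': 6}
```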

Quality assessment
The risk of bias and applicability concerns were evaluated via a modified QUADAS-2 tool [20], which incorporates relevant items from the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) and the radiomics quality score (RQS). Key considerations included clarity in imaging protocols, appropriate data selection and missing data handling, use of reliable reference standards, and avoidance of severe genotype imbalances. Additionally, the index test evaluation assessed the use of multiple segmentations and the robustness of the model predictions. Concerns about applicability, particularly regarding validation on external datasets, were addressed to ensure generalizability across diverse clinical settings. If the data were insufficient, we contacted the authors for clarification via email. Moreover, the methodologies, strengths, limitations, quality, and translatability of studies were evaluated using RQS, which assesses each study on 16 components, with cumulative scores ranging from −8 to 36 [21]. Three reviewers (S.F. and N.M. for QUADAS-2 and S.F. and M.T. for RQS) independently conducted assessments, resolving discrepancies through discussion (Supplementary, Sections 3 and 4).

Statistical analysis
A bivariate random-effects model was used to pool sensitivity and specificity with 95% confidence intervals (CIs) across studies (≥ 5 studies per analysis) and to construct summary receiver operating characteristic (SROC) curves. Heterogeneity was evaluated using Cochran's Q-test, the I² statistic, prediction intervals (p < 0.05), and the Spearman correlation coefficient (SCC) between sensitivity and the false positive rate (a threshold effect indicated by an SCC > 0.6) [22, 23]. Subgroup analyses explored sources of heterogeneity in instances with enough studies [24]. A leave-one-out meta-analysis assessed each study's impact on effect size. Publication bias was evaluated using funnel plots and Egger's test. The Trim and Fill method of Duval and Tweedie was applied to adjust pooled sensitivity and specificity estimates in the presence of asymmetry. Statistical power was also calculated across effect sizes [25]. Analyses were performed with the R packages 'mada', 'metameta', and 'metafor' (R v4.4.1), and MetaBayesDTA (v1.5.2) [26].
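The pooling itself was run in R ('mada' and related packages); purely as an illustration of the idea, the Python sketch below performs a univariate DerSimonian-Laird random-effects pool of logit-transformed sensitivities on toy counts. This is a deliberate simplification of the bivariate (Reitsma-type) model, which additionally models the correlation between sensitivity and specificity.

```python
# Univariate random-effects pooling on the logit scale (toy data);
# a simplified stand-in for the bivariate model used in the paper.
import numpy as np

def pool_logit(events, totals):
    """DerSimonian-Laird pool of proportions such as TP / (TP + FN)."""
    p = (events + 0.5) / (totals + 1.0)          # continuity correction
    y = np.log(p / (1 - p))                      # logit proportions
    v = 1 / (events + 0.5) + 1 / (totals - events + 0.5)  # within-study var
    w = 1 / v
    y_fe = np.sum(w * y) / np.sum(w)             # fixed-effect mean
    q = np.sum(w * (y - y_fe) ** 2)              # Cochran's Q
    df = len(y) - 1
    tau2 = max(0.0, (q - df) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
    w_re = 1 / (v + tau2)                        # random-effects weights
    y_re = np.sum(w_re * y) / np.sum(w_re)
    se = np.sqrt(1 / np.sum(w_re))
    i2 = 100 * max(0.0, (q - df) / q) if q > 0 else 0.0
    expit = lambda x: 1 / (1 + np.exp(-x))
    return expit(y_re), (expit(y_re - 1.96 * se), expit(y_re + 1.96 * se)), i2

tp = np.array([48, 60, 35, 70, 22])              # hypothetical TP counts
n_pos = np.array([60, 80, 40, 95, 30])           # hypothetical diseased totals
sens, ci, i2 = pool_logit(tp, n_pos)
print(f"pooled sensitivity = {sens:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f}), I2 = {i2:.0f}%")
```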

Results

Study characteristics
A total of 1517 unique publications were initially identified through primary searches and relevant study bibliographies. Following screening and full-text reviews, 104 studies were eligible for qualitative analysis, of which 72 were included in the meta-analysis (Fig. 1). One study [27] was excluded because it served solely for the external validation of another study [28]. Details on the inclusion and exclusion status of studies in the meta-analysis are summarized in Supplementary Table 4.
Our analysis revealed that China and the USA dominate global research in this field, significantly outpacing other countries' publication volume (Fig. 2A). Additionally, the surveyed studies spanned various sample sizes, ranging from 41 [29] to 2776 patients [30]. Over 23% of the included studies employed genotyping methods, such as immunohistochemistry, DNA sequencing, or fluorescence in situ hybridization, to directly assess genetic variants. In contrast, 44% used molecular typing, which classifies tumors into clinically relevant subgroups based on their genotyping results [2]. Approximately 20% of studies combined both approaches by integrating multiple cohorts, whereas 12% did not report the reference standard used to determine biomarker status (Table 1).
In our qualitative analysis, we identified three primary imaging data sources: private (in-house), public, and a combination of both (Fig. 2B). In-house collections accounted for 35% of the data, whereas 26% relied solely on public datasets, especially The Cancer Imaging Archive (TCIA). Approximately 38% of the studies combined both sources for enhanced research robustness and data diversity. Furthermore, around 63% of the studies implemented data augmentation techniques—either conventional methods [12, 31–35] or generative adversarial networks (GANs) [36–38]—to mitigate overfitting and address class imbalance related to genotype distributions (Supplementary Table 5).
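For readers unfamiliar with the terminology, the "conventional" augmentation reported by most of these studies means simple geometric transforms of the input slices; a minimal hypothetical sketch follows (the flip probability and rotation range are illustrative, not taken from any included study).

```python
# Minimal conventional augmentation of a 2D MRI slice (illustrative).
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(seed=0)

def augment_slice(img: np.ndarray) -> np.ndarray:
    """Randomly flip and apply a small random rotation to a slice."""
    if rng.random() < 0.5:
        img = np.fliplr(img)                     # horizontal flip
    angle = rng.uniform(-10, 10)                 # degrees
    return rotate(img, angle, reshape=False, mode="nearest")

slice_2d = rng.normal(size=(240, 240))           # stand-in for a T1CE slice
augmented = augment_slice(slice_2d)
```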
The extensive use of public datasets influenced MRI sequence choices, with conventional methods employed in 81% of the studies. The combination of T1, T1CE, T2, and T2-FLAIR sequences was most prevalent, accounting for 40% of cases. Notably, T1CE was the most frequently utilized, appearing in 26% of studies, followed by T2-FLAIR and T2. Advanced imaging techniques, such as diffusion- and perfusion-weighted imaging, were less commonly employed, appearing in 8% [29, 33, 39–44] and 3% [45–47] of studies, respectively, or in 5% [48–50] when used in combination (Fig. 2C).
DL methods, predominantly based on convolutional neural networks (CNNs), were used for tumor segmentation in 43% of the studies, followed by manual segmentation (26%) and semi-automatic methods (12%) (Fig. 2D). Twenty studies did not incorporate precisely delineated tumors into their prediction models; among these, 55% relied on whole preprocessed MRI slices, 30% utilized regions of interest (ROIs) confined to bounding boxes, and 15% used cropped tumor-bearing regions. Feature extraction primarily relied on CNN-based models, such as ResNet [35, 43, 51–57], DenseNet [32, 58–60], and Inception [61, 62], which were employed in more than half of the studies. Hybrid CNN–radiomics approaches [63–67] and transformer-based models [68–75] followed, appearing in approximately 12% and 7% of cases, respectively. Less commonly used methods included hybrid DL models [36, 41, 64, 65, 76–86], convolutional autoencoders (CAEs) [87, 88], graph neural networks (GNNs) [89], and recurrent neural networks (RNNs) [45] (Fig. 2E). More details on DL model architectures are provided in Supplementary Table 5.
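A pattern common to many of these CNN-based extractors is to take a pretrained backbone and strip its classification head; the sketch below shows this with torchvision's ResNet-18, purely as an assumed illustration — the included studies used a variety of backbones, input dimensionalities, and fine-tuning schemes.

```python
# Pretrained CNN as a feature extractor (illustrative configuration).
import torch
import torchvision.models as models

resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()          # remove the classifier head
resnet.eval()

with torch.no_grad():
    # e.g., three MRI sequences stacked as the three input channels
    x = torch.randn(1, 3, 224, 224)
    features = resnet(x)                 # 512-dimensional feature vector
print(features.shape)                    # torch.Size([1, 512])
```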
Our review highlights the rise in DL-based radiomics research since 2016 [90]. Initially, CNNs dominated exclusively, making up 100% of methodologies from 2016 to 2018. However, diversification has since increased. By 2020, CNNs still led at 60%, with CAEs [87] and hybrid models combining CNNs with attention mechanisms [36] emerging as alternative models. Transformers, introduced in recent years, peaked at 25% in 2024 (Fig. 2G), indicating a shift to more complex architectures. These networks were primarily integrated into prediction frameworks in an end-to-end manner [33, 41, 51, 52, 62, 71–73, 75, 77, 82, 84, 87, 91–94]. In some cases, DL was specifically applied for tumor segmentation [95–97], image preprocessing [98], or classification [48, 99] within the radiomics pipeline.
Pretrained models were used in approximately 38% of the studies (Supplementary Table 5). Among these, 75% fine-tuned the pretrained weights on their own training data, while the remaining studies applied them “as-is,” primarily for tumor segmentation. Notably, several studies that did not fine-tune the segmentation models incorporated expert review and manual correction of the automatically generated ROIs to ensure accuracy [29, 35, 44, 81, 96, 100]. Additionally, clinical parameters, mainly age and sex, were incorporated in 23% of the studies [12, 28, 31, 33, 42, 43, 51, 52, 59, 63, 67, 70, 96, 101–105] (Supplementary Table 5). Regarding model development and evaluation, 37 studies performed external validation, while 65 studies relied solely on internal validation. Figure 2F summarizes the internal validation strategies, highlighting the predominance of the held-out test set approach, followed by K-fold cross-validation.

Quality assessment
The median RQS was 15 (41.67%), ranging from 7 (19.44%) to 22 (61.11%) out of 36. In Domain 1 (mean score: 2.47 ± 0.92), most studies reported image protocols, but none included multiple time points or phantom studies; however, 71 studies conducted multiple segmentations. Domain 2 scored the highest (mean score: 5.65 ± 1.60), with 35% of studies validating their findings on external datasets. In Domain 3 (mean score: 2.60 ± 0.76), 29% of studies included multivariable analyses incorporating non-radiomic features, and 22% explored biological correlates. Domain 4 had a mean score of 2.74 ± 0.61, with most studies conducting statistical analysis. More than half of the studies used resampling techniques, though only three reported calibration statistics. All studies were retrospective and lacked prospective validation or cost-effectiveness analysis. For Domain 6 (average score: 1.51 ± 1.12), 64% of the studies used open-source data, but only 22% made their code available (Fig. 2H and Supplementary Section 4).
According to the QUADAS-2, the overall risk of bias was high in 48 studies and low in 53 studies, mainly due to limited segmentation methods or the lack of resampling techniques to mitigate overfitting. Additionally, 55 studies raised applicability concerns primarily due to a lack of validation on external datasets (Fig. 2I, J and Supplementary Section 3).

Publication bias and statistical power
Funnel plot asymmetry and Egger’s test indicated potential publication bias in IDH studies for both internal validation and test sets (p < 0.05), whereas no significant bias was detected in 1p/19q studies (p > 0.05). To account for the potential bias in IDH prediction, we applied the Trim and Fill method by Duval and Tweedie to adjust the pooled estimates of sensitivity and specificity (Supplementary Sections 7 and 9). The statistical power analysis revealed a high detection capability for larger effect sizes in most included studies but relatively lower power for detecting smaller sensitivity and specificity measures (< 0.3) in some studies [49, 87, 106, 107] (Supplementary Section 11).
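Egger's test regresses the standardized effect size on precision and inspects the intercept for evidence of small-study effects. The toy sketch below illustrates the computation only; the actual analysis was performed in R, and all numbers here are invented.

```python
# Toy Egger's regression test for funnel-plot asymmetry.
import numpy as np
from scipy import stats

effects = np.array([1.2, 0.9, 1.5, 0.7, 1.1, 1.8])   # e.g., logit sensitivities
ses = np.array([0.30, 0.22, 0.45, 0.18, 0.25, 0.50]) # their standard errors

# Regress standardized effects on precision; a non-zero intercept
# suggests asymmetry consistent with publication bias.
res = stats.linregress(1 / ses, effects / ses)
t_int = res.intercept / res.intercept_stderr
p_int = 2 * stats.t.sf(abs(t_int), df=len(effects) - 2)
print(f"Egger intercept = {res.intercept:.2f}, p = {p_int:.3f}")
```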

IDH mutation
Most models primarily targeted IDH mutation, either alone in 60% of the studies or alongside 1p/19q prediction in 36%. Over 60% of the studies focused on Grades 2, 3, and 4 gliomas. Grade 4 gliomas were exclusively studied in 12% of the experiments, while Grade 2 gliomas were addressed in only four studies. In the meta-analysis, 75% of studies used DL-based features and 25% relied on conventional radiomics; among the latter, 10 studies applied DL solely for tumor segmentation and 3 for classification.

Meta-analysis
In both the internal validation and test cohorts, there was no significant correlation between sensitivity and specificity for IDH prediction, with SCCs of 0.04 (95% CI: −0.27 to 0.33) and 0.01 (95% CI: −0.26 to 0.28), respectively. In the test cohorts, the bivariate model estimated a pooled sensitivity of 80.4% (95% CI: 77.5–83.0%) and specificity of 84.6% (95% CI: 81.1–87.5%), with 95% prediction intervals ranging from 0.62 to 0.92 for sensitivity and 0.55 to 0.96 for specificity. Although unadjusted heterogeneity was moderate (I² = 38.1–69.4%, p < 0.001), it was markedly reduced to 2.9–3.5% after adjusting for sample size. Similar performance was achieved for internal validation cohorts (Table 2 and Supplementary Figs. 13 and 15). These results are illustrated in the SROC curves (Fig. 4A, B), which demonstrate strong overall diagnostic performance, with an area under the curve (AUC) of 0.88 for test cohorts and 0.93 for internal validation cohorts. Separate analyses of studies employing DL-based and conventional radiomic features are provided in Table 2. Additionally, forest plots with pooled estimates, including original and imputed studies using the Duval & Tweedie Trim-and-Fill method, are presented in Supplementary Section 9. Sensitivity analyses are reported in Supplementary Section 8.

Subgroup analysis
We restricted the subgroup analysis to test cohorts (Table 3). Except for the segmentation method and the level of DL integration within the radiomics pipeline, none of the between-group differences reached statistical significance. Semi-automatic segmentation yielded the highest sensitivity, followed by DL-based and manual approaches. End-to-end DL pipelines outperformed those using DL only for feature extraction.

1p/19q Co-deletion
Approximately 5% of the studies focused only on 1p/19q co-deletion, whereas 34% addressed both 1p/19q co-deletion and IDH prediction, mainly in Grades 2 and 3 gliomas. The diagnostic performance of 1p/19q co-deletion in the internal validation and test cohorts (Fig. 3C, D) showed no significant correlation between sensitivity and specificity, with SCCs of 0.08 (95% CI: −0.47 to 0.58) and 0.03 (95% CI: −0.49 to 0.54), respectively. Meta-analysis of test datasets yielded a pooled sensitivity of 74.6% (95% CI: 64.9–82.3%) and specificity of 82.2% (95% CI: 74.8–87.8%) across fourteen experiments. Significant heterogeneity was observed (I² = 45.9–67.5%, p < 0.001), as shown in the SROC curves (Fig. 4C, D) by the wide, non-overlapping 95% confidence and prediction regions. However, heterogeneity was notably reduced to 5.0–5.8% following sample-size adjustment using the Holling method. Internal validation cohorts demonstrated higher predictive performance (Table 2 and Supplementary Section 9). One study used conventional radiomic features by employing DL solely for image segmentation [29]. Furthermore, sensitivity analyses are detailed in Supplementary Section 8.

Subgroup analysis
Due to the limited number of studies per subgroup, meta-regression was not feasible for most covariates. As detailed in Table 4, studies using only in-house datasets demonstrated higher sensitivity but lower specificity compared to those trained and validated on a combination of in-house and public datasets, though the differences were not statistically significant.

Discussion
Our systematic review and meta-analysis critically evaluated the diagnostic performance of MRI-based DL models for predicting IDH mutation and 1p/19q co-deletion in glioma patients. Consistent with prior research [17, 108–110], our findings demonstrate promising overall performance but reveal substantial between-study heterogeneity. Notably, heterogeneity declined markedly after adjusting for sample size, indicating that most of the observed variability in sensitivity and specificity stems from sampling error rather than systematic study differences. Subsequent subgroup analyses confirmed the stability of our pooled estimates, as most examined covariates had no significant effect on model performance. Moreover, our statistical power analysis shows that while some studies had low power for small changes, most were sufficiently powered to detect pooled estimates.
Our meta-regression analysis highlighted tumor segmentation as a major source of variability. Since most features are extracted from defined ROIs, variations in segmentation methods can significantly impact feature reproducibility [111, 112]. We also refined our QUADAS-2 assessment to include an evaluation of segmentation methods, identifying one-third of the studies as unclear or high risk due to inadequate segmentation approaches. To mitigate this, future studies should standardize and streamline the segmentation process. Utilizing robust automated segmentation tools or well-validated semi-automated pipelines can reduce inter-observer variability. Where manual segmentation is unavoidable, having multiple raters and using consensus or average segmentations might improve reliability [113]. Furthermore, several strategies have been proposed to enhance automatic segmentation. For example, applying small dilations and erosions to masks during model training can improve tolerance to boundary shifts [114]. Uncertainty in segmentation can also be estimated using ensemble methods or Monte Carlo dropout [115]. Finally, segmentation-free DL approaches offer a way to bypass manual ROI delineation entirely, potentially avoiding this source of heterogeneity. Adopting these strategies can enhance the reliability of model performance, independent of the segmentation method used.
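As a concrete illustration of the Monte Carlo dropout strategy referenced above, the PyTorch sketch below keeps dropout layers active at inference and aggregates repeated stochastic passes; the segmentation network, input shape, and pass count are placeholders, not details from any included study.

```python
# Monte Carlo dropout for per-voxel segmentation uncertainty (sketch).
import torch

def mc_dropout_predict(model: torch.nn.Module, x: torch.Tensor, n_passes: int = 20):
    """Average repeated stochastic forward passes with dropout active."""
    model.eval()
    for m in model.modules():            # re-enable only the dropout layers
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(model(x)) for _ in range(n_passes)])
    return probs.mean(dim=0), probs.std(dim=0)  # mean map + uncertainty map

# Usage (hypothetical segmentation net and MRI batch):
# mean_map, uncertainty = mc_dropout_predict(seg_net, mri_batch)
```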
Studies employing DL in an end-to-end approach outperformed those using DL solely for feature extraction in radiomics workflows. This direct method minimizes potential errors, enhances reproducibility, and improves predictive accuracy. Previous studies have indicated that DL, particularly CNNs, bypasses traditional complexities associated with radiomics workflows, leading to more robust feature extraction [28, 116]. However, our analysis did not reveal any significant differences in predictive performance between radiomic and DL-based features. Importantly, no variation in performance was observed across different DL architectures, including CNNs, GNNs, and transformers. It is worth noting, though, that the limited sample size in some subgroups, such as GNN-based studies with only 304 cases, may affect the reliability of these findings.
Although not statistically significant, consistent trends were observed across both IDH and 1p/19q co-deletion studies regarding data sources. Models trained and validated on the same dataset demonstrated higher pooled sensitivity compared to those using multi-center datasets. In-house datasets, with standardized imaging protocols, typically offer more uniform data quality. In contrast, multi-center datasets introduce greater diversity in scanner vendors, MRI protocols, and patient populations, which can challenge models and lead to apparent performance drops, but ultimately confer more robustness. To reduce scanner- or site-specific biases, various harmonization methods have been developed [117]. At the feature level, techniques like ComBat help align radiomic feature distributions across different scanners by correcting for batch-related effects [118]. On the image level, DL–based approaches such as cycle-consistent generative adversarial networks (CycleGANs) and style transfer can be used to standardize image appearance across datasets [117, 119]. Additionally, fundamental preprocessing steps, such as correcting for bias field inhomogeneity, applying noise reduction filters, and normalizing intensities through methods like z-score scaling or histogram matching, may reduce image heterogeneity at its source [120].
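Two of the image-level steps mentioned here, z-score intensity normalization and histogram matching, are straightforward to sketch. The example below uses scikit-image's match_histograms on placeholder volumes; it is illustrative only and not the harmonization pipeline of any included study.

```python
# Image-level harmonization: z-score normalization + histogram matching.
import numpy as np
from skimage.exposure import match_histograms

def zscore(volume: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Normalize intensities using statistics from the brain mask."""
    vals = volume[mask > 0]
    return (volume - vals.mean()) / vals.std()

moving = np.random.rand(64, 64, 64)      # scan from site A (placeholder)
reference = np.random.rand(64, 64, 64)   # reference-style scan from site B
harmonized = match_histograms(moving, reference)
```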
The quality assessments in our systematic review revealed several areas for improvement and current limitations in the field. Consistent with previous reviews [116, 121], the median RQS score of 15 (41.67%) indicates moderate methodological quality, with deficiencies across several domains. Many studies detailed image protocols but lacked multiple time points or phantom studies, reducing reproducibility. Although over 70% of studies evaluated their models on unseen data, nearly half did not use external test datasets. This raises concerns about real-world applicability, as reflected in the RQS and QUADAS-2 assessments. Recent Food and Drug Administration (FDA) guidance on AI-enabled devices highlights that models can inadvertently overfit to features unique to a particular scanner or site [122]. To address this, it is essential to include multi-center training and external validation, and when performance declines across datasets, strategies such as domain adaptation should be employed. Foundation-based DL models offer a promising way forward. Pretrained on large, multi-institutional datasets, they tend to capture more stable and biologically meaningful features, making them more robust to variations in input data [123]. Moreover, to promote fairness across varied populations, models should undergo rigorous testing on diverse subgroups during the development and validation phases. This need is underscored by the fact that fewer than 4% of FDA-approved AI devices report race or ethnicity data [124].
Demonstrating technical performance is only one step; prospective clinical validation under real-world conditions is indispensable to bridging the gap to clinical adoption. In our review, all studies were retrospective. Prospective validation through real-time studies or clinical trials is crucial to show that DL models not only achieve high diagnostic accuracy but also improve patient outcomes compared to standard care. Unlike retrospective studies, prospective validation captures the full clinical workflow—data acquisition, model inference, and clinician decision-making—without hindsight bias, providing a more realistic assessment of the model [125]. Prospective validation also builds the case for regulatory approval and clinical acceptance, as required by related standards such as International Organization for Standardization (ISO) 13485 [126] and International Electrotechnical Commission (IEC) 62304 [127]. For instance, ISO 13485 involves risk management, documentation of design processes, and predefined acceptance criteria for performance. Collaborating with clinical partners to test the model prospectively in a workflow-simulated environment can generate the clinical evidence needed for eventual translation. Finally, incorporating DL into clinical workflows demands compatibility with electronic health records, clinician training, and robust IT infrastructure to support continuous model updates and real-time data integration. It incurs hardware, software, staffing, and maintenance costs that hospitals must weigh against potential benefits [128]. Overcoming these challenges is essential to moving DL models from research into practice and advancing personalized oncology care.
This systematic review has several limitations. We focused on top-performing DL models and categorized them broadly due to a scarcity of articles. Nevertheless, we considered variations such as including clinical data, radiomic features, or different MRI sequences within a single study as separate experiments for more detailed analysis. However, these findings are observational rather than causal because randomization did not occur between studies, which is typical in most meta-analyses [22]. There may be other confounding variables influencing these results. Although reconstructing 2 × 2 tables increased the number of studies eligible for meta-analysis, imputation may introduce minor biases. Moreover, we did not assess potential patient overlap across studies. Approximately 26% of included studies relied exclusively on public datasets (mainly TCIA). While this raises the possibility of patient-level overlap, excluding these studies could introduce bias, as model performance on the same dataset can vary considerably depending on the DL framework. Our subgroup analysis further confirmed that the segmentation method and DL integration, rather than dataset origin, were the primary sources of heterogeneity.
In conclusion, our review highlights the promising performance of MRI-based DL models in accurately predicting IDH and 1p/19q co-deletion in glioma patients. To enhance the rigor and facilitate the clinical translation of DL models for glioma molecular diagnosis, we propose the following minimum standards identified by our comprehensive analysis: use validated automated or consensus-based segmentation protocols, harmonize multi-center MRI data through methods such as ComBat or DL-based style transfer, incorporate phantom studies to assess feature stability, perform independent external validations without model retraining, and share data and code openly. The next critical steps are to embed these models in prospective, multi-institutional clinical trials, integrate them into electronic health record workflows, assess diagnostic accuracy, clinical impact, and cost-effectiveness in real time, and gather the regulatory evidence needed for safe and effective routine use in neuro-oncology.

Supplementary information
