Classification of Salivary Gland Tumors on Ultrasound Using Artificial Intelligence: A Systematic Review and Meta-Analysis.
APA
Chau IJ, Kim CH, et al. (2026). Classification of salivary gland tumors on ultrasound using artificial intelligence: a systematic review and meta-analysis. Otolaryngology–Head and Neck Surgery, 174(3), 601-612. https://doi.org/10.1002/ohn.70102
PMID
41485242
Abstract
[OBJECTIVE] Accurate classification of salivary gland tumors is critical to guiding appropriate management. This study evaluates the diagnostic performance of artificial intelligence models in classifying salivary gland tumors on ultrasound.
[DATA SOURCES] A comprehensive search of CINAHL, PubMed, and Scopus was conducted through January 28, 2025.
[REVIEW METHODS] Two independent reviewers screened articles and extracted data following Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. Studies evaluating the diagnostic performance of artificial intelligence in classifying salivary gland tumors were included for fixed and random-effects meta-analyses. The Quality Assessment of Diagnostic Accuracy Studies-2 tool for systematic reviews of diagnostic accuracy studies was used to assess study quality and risk of bias.
[RESULTS] Out of 741 articles identified, 12 studies (N = 4721) met inclusion criteria. Nine studies evaluated artificial intelligence models differentiating benign from malignant tumors, and three studies assessed classification of pleomorphic adenomas versus Warthin tumors. For benign versus malignant tumors, sensitivity was 0.91 (95% CI: 0.86, 0.95), specificity was 0.86 (95% CI: 0.80, 0.92), and accuracy was 0.85 (95% CI: 0.81, 0.90). For pleomorphic adenomas versus Warthin tumors, sensitivity was 0.81 (95% CI: 0.74, 0.89), specificity was 0.88 (95% CI: 0.81, 0.95), and accuracy was 0.84 (95% CI: 0.79, 0.90).
[CONCLUSION] Artificial intelligence models demonstrate strong diagnostic accuracy in ultrasound-based classification of salivary gland tumors. These results highlight the potential of artificial intelligence as a diagnostic tool, though broader validation is needed before routine clinical implementation.
Methods
Search Strategy
Developed using population, intervention, comparison, outcome (PICO) criteria, this systematic review and meta‐analysis aims to assess the diagnostic performance (outcome) of AI models (intervention) in classifying salivary gland tumors on ultrasound compared to traditional diagnostic methods (comparison) in patients with salivary gland tumors (population). A study protocol was registered on PROSPERO (ID CRD420250651068).
The current study was conducted according to Preferred Reporting Items for Systematic Reviews and Meta‐Analyses (PRISMA) guidelines.[31]
Three databases—PubMed (National Library of Medicine—National Institutes of Health), Scopus (Elsevier), and CINAHL (EBSCO)—were searched from inception through January 28, 2025. Key terms used in the search included but were not limited to “artificial intelligence,” “machine learning,” “deep learning,” “salivary gland tumor,” and “parotid tumor.” Our full search strategy is detailed in Supplement S1, available online. Article screening was performed using Covidence (Veritas Health Innovation Ltd.). During the screening process, references of all included and relevant articles were assessed for potential inclusion.
Selection Criteria
Two investigators (IJC and CHK) independently conducted the initial screening of titles and abstracts, reviewed the full texts of the remaining studies, and selected studies for final inclusion. Each step was completed independently, and any conflicts were resolved through discussion or by consulting a third reviewer (SAN). Studies considered for inclusion met the following criteria:
(1) studies that included salivary gland lesions from patients of all ages evaluated by ultrasound;
(2) studies that reported performance outcomes, specifically accuracy, sensitivity, and specificity, or a confusion matrix with true positives, true negatives, false positives, and false negatives such that these outcomes could be calculated;
(3) studies that assessed salivary gland lesions using artificial intelligence in combination with ultrasound to classify tissue.
Cohort studies, randomized controlled trials, and case series were considered for inclusion, while non‐human studies, case reports, studies that did not utilize artificial intelligence, and studies without performance outcomes were excluded. The classification of salivary gland tissue was flexible: the model could be used to distinguish benign from malignant tumors, pleomorphic adenomas from Warthin tumors, or other tumor types if available.
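Criterion (2) allows performance outcomes to be recomputed when a study reports only a confusion matrix. A minimal sketch of that calculation (the counts below are hypothetical, not from any included study):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, and accuracy from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)            # true-positive rate among malignant cases
    specificity = tn / (tn + fp)            # true-negative rate among benign cases
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return sensitivity, specificity, accuracy

# Hypothetical study: 90 TP, 20 FP, 10 FN, 80 TN
print(diagnostic_metrics(90, 20, 10, 80))   # (0.9, 0.8, 0.85)
```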
Data Extraction
For each study, 2 investigators (IJC and CHK) independently extracted data on study characteristics, patient characteristics, AI model used, and outcomes of interest (accuracy, sensitivity, specificity). Following extraction, any discrepancies in extracted data were resolved through discussion or by consulting a third reviewer (SAN).
Critical Appraisal
Level of evidence of each study was assessed using the Oxford Centre for Evidence‐Based Medicine criteria.[32] Risk of bias was assessed using the QUADAS‐2 (Quality Assessment of Diagnostic Accuracy Studies‐2) tool for systematic reviews of diagnostic accuracy studies.[33]
This tool comprises four domains: patient selection, index test, reference standard, and flow and timing. Risk of bias is assessed for all domains, and concerns about applicability (whether the study matches the review question) are assessed for the first three domains. The patient selection domain considers whether the study included patients representative of the population to which the AI model will be applied. The index test domain assesses appropriate use of the AI model and whether its results were interpreted without bias. The reference standard domain evaluates the standard likely to correctly classify the target condition, such as the histopathological diagnosis of the salivary gland tumor, against which the results of the index test (AI model) are compared. Lastly, the flow and timing domain evaluates whether the timing of the reference standard and index test was appropriate. Two investigators (IJC and JDM) independently reviewed each included study and performed risk assessments, judging the risk of bias and concerns about applicability as "low," "high," or "unclear" when insufficient data were reported. This method helped ensure that only high‐quality studies were included in our systematic review.
Statistical Analysis
Meta‐analysis of continuous measures (sensitivity, specificity, and accuracy) was performed with Cochrane Review Manager (RevMan) version 5.4 (The Cochrane Collaboration 2020, UK). A meta‐analysis of proportions (gender) was performed using MedCalc 22.017 (MedCalc Software). Each measure (mean [log(mean)] or proportion [%] with 95% confidence interval [CI]) was weighted according to the number of patients affected. Heterogeneity among studies was assessed using χ² and I² statistics, with fixed‐effects models applied when I² < 50% and random‐effects models when I² > 50%.[34, 35]
In addition, potential publication bias was evaluated by visual inspection of the funnel plot and Egger's regression test, which statistically examines the asymmetry of the funnel plot.[36, 37]
A P < .05 was considered to indicate a significant difference for all statistical tests.
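The fixed‐ versus random‐effects decision based on I² can be sketched in a few lines of Python. This is an illustration only: RevMan's exact weighting scheme differs, the DerSimonian‐Laird estimator used for the random‐effects branch is an assumption, and the inputs are hypothetical study‐level proportions with their sample sizes.

```python
def pool_proportions(props, ns):
    """Inverse-variance pooled proportion, switching from a fixed-effects to a
    DerSimonian-Laird random-effects model when I^2 exceeds 50% (sketch)."""
    # Per-study variance of a proportion p observed in n patients: p(1-p)/n
    variances = [max(p * (1 - p) / n, 1e-8) for p, n in zip(props, ns)]
    w = [1.0 / v for v in variances]
    fixed = sum(wi * p for wi, p in zip(w, props)) / sum(w)
    # Cochran's Q and the I^2 heterogeneity statistic
    q = sum(wi * (p - fixed) ** 2 for wi, p in zip(w, props))
    k = len(props)
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    if i2 < 50:
        return fixed, i2
    # Random effects: inflate each variance by the between-study variance tau^2
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * p for wi, p in zip(w_re, props)) / sum(w_re)
    return pooled, i2

# Hypothetical per-study sensitivities and test-cohort sizes
pooled, i2 = pool_proportions([0.91, 0.85, 0.94], [200, 150, 120])
```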
Results
Search Strategy
Our search strategy across three databases yielded 741 articles, which were uploaded into Covidence for review. After 236 duplicates were removed, 505 studies were screened based on title and abstract. Of these, 116 studies were assessed in full‐text review. Ultimately, 12 studies evaluating the performance of AI models in ultrasound classification of salivary gland tumors met inclusion criteria and were included for analysis. Reasons for exclusion included wrong study outcomes, wrong study design, wrong patient population, and duplicate patient population. The PRISMA flow diagram for this study is shown in Figure 1.[38]
All 12 included studies were retrospective cohort studies with Oxford Level of Evidence (OLE) 2.
Quality Assessment
Critical appraisal of the included studies using the QUADAS‐2 tool found that each study had an acceptably low risk of bias (Figure 2). Regarding the reference standard, all studies consistently used final histopathological diagnosis from surgical specimens as the reference standard confirming the diagnosis of salivary gland cancer. Cheng et al and Li et al also accepted core needle biopsy (CNB) results when postoperative histopathology was not available.[39, 40] In these studies, CNB was considered reliable given prior evidence that CNB achieves sensitivities and specificities above 90% for salivary gland tumors. Potential sources of bias were seen in the flow and timing, index test, and patient selection domains. An Egger's test for publication bias was not significant (0.09, 95% CI: −0.32, 0.51, P = .33), and a funnel plot demonstrated that all studies except Jiang et al were within the funnel with little asymmetry, suggesting low risk of publication bias (Figure 3A). Using only test cohorts, an Egger's test was not significant (0.24, 95% CI: −0.44, 0.93, P = .81), and the funnel plot demonstrated that all studies were within the funnel (Figure 3B).
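Egger's test regresses each study's standardized effect on its precision; an intercept far from zero indicates funnel‐plot asymmetry. A minimal pure‐Python sketch of the intercept computation follows (illustrative only; the significance test reported above would additionally require the intercept's standard error, and the inputs here are hypothetical effects and standard errors):

```python
def eggers_intercept(effects, ses):
    """Egger's regression intercept for funnel-plot asymmetry (sketch).
    Regresses standardized effect (effect/SE) on precision (1/SE) by
    ordinary least squares and returns the intercept."""
    y = [e / s for e, s in zip(effects, ses)]   # standardized effects
    x = [1.0 / s for s in ses]                  # precisions
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx

# Identical effects with varying precision produce a perfectly
# symmetric funnel, so the intercept is zero.
eggers_intercept([2.0, 2.0, 2.0, 2.0], [0.5, 1.0, 0.25, 2.0])
```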
Included Studies
The characteristics of the 12 selected studies, published from 2022 to 2025, are provided in Table 1. These studies comprised a total of 4721 patients, of whom 57.1% were male. There were 907 patients in the test cohorts of included studies. The mean age of the test cohort was 55.2 years (range: 12‐93 years), and 54.2% were male. Mean maximum lesion diameter was 27.4 mm (95% CI: 26.0, 28.8). AI models were used to classify salivary gland tumors as benign or malignant in nine studies[41-49] and as pleomorphic adenomas or Warthin tumors in three studies.[50-52]
These studies used data primarily from China (n = 10) and Taiwan (n = 2), demonstrating the geographic concentration of AI‐integrated ultrasound research. Eight studies used deep learning, two studies used machine learning, one study combined machine and deep learning, and one study used radiomics. All studies used conventional ultrasound (CUS) except for Shan et al, which used contrast‐enhanced ultrasound (CEUS). Additionally, studies used a variety of ultrasound systems (eg, Philips iU‐22, GE LOGIQ E11) with corresponding high‐frequency probes.
AI models in our included studies used both conventional ultrasound descriptors and advanced radiomic or deep learning features to classify salivary gland tumors. For pleomorphic adenoma versus Warthin tumor, models incorporated features such as lesion margins, echogenicity, cystic change, vascularity, size ratios, and textural heterogeneity, with visualization methods (eg, Grad‐CAM, SHAP) confirming emphasis on boundary and internal texture cues. For benign versus malignant classification, models highlighted irregular margins, heterogeneous echotexture, calcifications, posterior acoustic features, vascular and perfusion patterns, lesion depth, and regional lymph node changes. Deep learning approaches often quantified these features automatically and, in some studies, combined them with radiomic or clinical variables to improve performance.
Differentiation of Benign and Malignant Tumors
Nine studies (N = 3743) assessed the performance of AI models in classifying salivary gland tumors as benign or malignant. A total of 772 cases were included in the test cohorts across studies, of which 613 were benign and 159 were malignant. The sensitivity of studies was 0.91 (95% CI: 0.86, 0.95), specificity was 0.86 (95% CI: 0.80, 0.92), and accuracy was 0.85 (95% CI: 0.81, 0.90) in test cohorts. Figure 4 shows the analysis. Because studies exhibited heterogeneity with I² values greater than 50%, random‐effects analyses were used.
The subtypes of benign and malignant salivary gland tumors included in each study are shown in Supplement S2, available online. All 9 studies included pleomorphic adenomas and Warthin tumors, the 2 most common benign tumors. Most studies also included basal cell adenomas and a variety of other benign lesions. Malignant tumors included mucoepidermoid carcinoma, adenocarcinoma, adenoid cystic carcinoma, acinic cell carcinoma, myoepithelial carcinoma, and undifferentiated carcinoma, among others.
Differentiation of Pleomorphic Adenomas and Warthin Tumors
Three studies (N = 978) assessed the performance of AI models in classifying salivary gland tumors as a pleomorphic adenoma or Warthin tumor. Across studies, 135 cases were included in the test cohorts, of which 69 were pleomorphic adenomas and 66 were Warthin tumors. The sensitivity of the studies was 0.81 (95% CI: 0.74, 0.89), specificity was 0.88 (95% CI: 0.81, 0.95), and accuracy was 0.84 (95% CI: 0.79, 0.90) in test cohorts. These results are shown in Figure 5, in which fixed‐effects analyses were conducted as I² values were below 50%.
Diagnostic Performance of Physicians
Five studies that classified benign and malignant tumors[42, 44, 47-49] and three studies that classified pleomorphic adenomas and Warthin tumors[50-52] reported diagnostic performances of both AI models and physicians. Physicians with varying levels of experience, ranging from approximately 2 to over 20 years, were evaluated.
For benign versus malignant tumor classification, Jiang et al found that their image‐based deep learning model, with an AUC of 0.947, outperformed 6 radiologists with 4 to 22 years of experience, whose AUCs ranged from 0.591 to 0.776.[42] However, with model assistance, the diagnostic performance of five radiologists improved to AUCs ranging from 0.653 to 0.852. Similarly, Wei et al evaluated deep learning‐assisted models and found that these models improved the performance of two of three radiologists.[48] In another study, Shan et al demonstrated that their SVM model outperformed clinical diagnosis.[44] Wang et al developed a deep learning model with accuracies ranging from 77% to 81% that outperformed a radiologist with fewer than 5 years of experience but did not outperform two radiologists with 8 or more years of experience.[47] Zhang et al compared deep learning‐based models to two radiologists with 2 years of experience and one with over 5 years of experience.[49] The model achieved a higher diagnostic accuracy (95.8%) than each of the radiologists, whose accuracies ranged from 62.5% to 82.5% with clinical data available.
Of the studies differentiating pleomorphic adenomas from Warthin tumors, Liu et al found that the average diagnostic accuracy of two ultrasound physicians (57.6%) and the accuracy of FNAC (69.9%) were lower than those of five ultrasound image‐based deep learning models, which ranged from 72.2% to 83.3%.[50] Physicians in this study had 5 and 10 years of experience and were blinded to clinical information. Similarly, He et al found that an ultrasound‐based machine learning model incorporating clinical factors, ultrasound factors, and radiomic features outperformed experienced physicians who provided diagnoses based only on ultrasound images.[52] In this study, physicians achieved an AUC of 0.681 (95% CI: 0.486, 0.838), and the model achieved an AUC of 0.833 (95% CI: 0.653, 0.944). Rather than comparing model performance to physician performance, Mao et al evaluated 6 ultrasound physicians before and after using deep learning radiomics assistance.[51] This study found that diagnostic accuracy increased from 68.6% to 83.4% for 2 resident physicians, from 75.7% to 85.7% for 2 attending physicians, and from 87.9% to 91.5% for 2 chief physicians with model assistance.
Discussion
This study evaluated the diagnostic performance of AI models in classifying salivary gland tumors using ultrasound imaging. Overall, AI models were highly accurate in discriminating between benign and malignant tumors and between pleomorphic adenomas and Warthin tumors, with pooled accuracies of approximately 85% for both analyses. Models also achieved high sensitivities (91% for benign versus malignant, 81% for pleomorphic adenomas versus Warthin tumors) and specificities (86% for benign versus malignant, 88% for pleomorphic adenomas versus Warthin tumors). Several of our included studies also compared AI to physician diagnostic performance, demonstrating that AI can enhance the performance of physicians when used as an assistive tool. These applications remain largely investigational, serving as decision‐support tools for lesion segmentation and tumor classification. Comprehensive validation in prospective, multicenter cohorts will be required before such models can be adopted into routine clinical practice. These studies overall highlight the potential utility of AI in classifying salivary gland tumors.
While FNA is considered the gold standard in salivary gland tumor diagnosis, it involves needle insertion, which can cause discomfort and inflammation.[14] Because ultrasonography is non‐invasive, low‐cost, widely available, and free of ionizing radiation, it is advantageous as a first‐line diagnostic tool. To our knowledge, this is the first meta‐analysis to specifically quantify pooled diagnostic performance of AI models in salivary gland tumor classification using ultrasound. In the differentiation of benign and malignant tumors, one meta‐analysis found that conventional ultrasound without AI could achieve a high specificity (92%) but moderate sensitivity (66%).[5]
Based on these values, conventional ultrasound effectively identifies benign tumors but may misclassify malignant tumors as benign. Our study found that integrating AI into ultrasound assessments improved sensitivity to 91% in benign versus malignant tumor classification, suggesting that AI may be particularly useful in correctly identifying malignant tumors. Prioritizing sensitivity reduces the risk that malignant tumors are missed.
AI models differentiated benign from malignant tumors with greater sensitivity than they differentiated pleomorphic adenomas from Warthin tumors. This likely reflects the greater similarity between pleomorphic adenomas and Warthin tumors, which are both benign tumors that share overlapping clinical and radiologic features, making them more challenging to distinguish. Pleomorphic adenomas and Warthin tumors are both painless, slow‐growing masses that can appear lobulated with cystic components.[53] Although Warthin tumors typically exhibit higher‐grade vascularity, mixed central and peripheral perfusion, and more areas of cystic change, these differences may be subtle, contributing to slightly lower model performance.[53, 54]
A main limitation of this meta‐analysis is study heterogeneity. Differences in AI models, training and validation methods, tumor subtypes, and ultrasound equipment may affect performance. Although each study validated its model in an independent testing cohort, many used internal rather than external testing cohorts, which limits the generalizability of model results. Moreover, most studies that met inclusion criteria were from China or Taiwan; this geographic concentration of studies and datasets may further limit the generalizability of these results to populations outside these regions. Lastly, retrospective data collection in our included studies could have introduced selection bias as well as variability in physician technique and ultrasound instruments. Overall, there is a need for the standardization of AI‐aided ultrasonography protocols and for multicenter studies with larger datasets. Reproducing high performance in external testing datasets under uniform ultrasound procedures would strengthen the generalizability of these studies' findings.
In recent years, there has been a rapid increase in the number of studies exploring AI applications in clinical practice, from diagnosis to management. This trend is evident in our meta‐analysis, as all included studies were published within the last four years. By detecting patterns that may be imperceptible to humans, AI models can support clinical judgment and reduce reliance on subjective assessments, providing an additional safeguard to enhance diagnostic precision, especially for physician trainees. AI‐aided ultrasound may help stratify which patients are more likely to benefit from biopsy, reducing unnecessary procedures and capturing potentially missed malignancies. Moreover, AI‐aided ultrasonography may be particularly valuable in low‐resource settings, where access to advanced imaging and specialized radiologists is limited.[55] As Mao et al. demonstrated that radiologists achieved enhanced diagnostic accuracy after incorporating AI, clinicians can similarly benefit from integrating AI into their practice, with its utility further strengthened by their own clinical experience and expertise. While our findings suggest that AI can enhance diagnostic accuracy, these tools remain investigational and should be viewed as adjuncts rather than replacements for established diagnostic protocols. FNA remains the gold standard, and AI‐aided imaging should not be considered sufficient to deter biopsy at this stage.
For effective integration into routine clinical practice, ease of use and interpretability should be considered in the implementation of AI tools. With larger datasets, external testing cohorts, and standardization of AI and ultrasound methodologies, AI‐enhanced ultrasound has the potential to improve first‐line evaluations of salivary gland tumors, guide the need for FNA, and facilitate clinical decision‐making. Broader integration of AI into salivary gland neoplasm diagnosis can be anticipated through the combination of complementary AI approaches, enabling more robust and comprehensive analyses. Such integration will likely become feasible once the generalizability of these methods has been demonstrated across diverse populations, a wide spectrum of pathologies, and varied clinical practice settings.
Conclusion
This study demonstrates that AI models achieve strong diagnostic accuracy, sensitivity, and specificity in ultrasound‐based classification of salivary gland tumors. As ultrasound is a valuable screening tool in the initial evaluation and differentiation of salivary gland masses, the integration of assistive AI into ultrasound protocols may improve diagnostic accuracy and support physicians in diagnosis and treatment planning. While AI‐aided ultrasonography is a promising and rapidly expanding area, further research in diverse, multicenter clinical settings is needed to validate these models and determine their generalizability before routine clinical implementation. Future work will require collaboration among radiologists, pathologists, otolaryngologists, and AI specialists to advance diagnostic precision and translate these models into practice.
Author Contributions
Isabelle J. Chau, design, conduct, analysis, writing, figure generation, and editing; Cory Hyun‐su Kim, design, conduct, analysis, writing, figure generation, and editing; Shaun A. Nguyen, design, conduct, analysis, interpretation, writing, and editing; Lauren R. McCray, conduct, analysis, and editing; Jesse D. Murdaugh, conduct, analysis, and editing; Jason G. Newman, design, interpretation, and editing.
Disclosures
Competing interests
None.
Funding source
None.
Supporting information
AISGT_Supplement_R1.