Machine Learning to Predict Extranodal Extension in Head and Neck Squamous Cell Carcinoma: A Systematic Review and Meta-Analysis.
APA
Aulakh A, Sarafan M, et al. (2026). Machine Learning to Predict Extranodal Extension in Head and Neck Squamous Cell Carcinoma: A Systematic Review and Meta-Analysis. The Laryngoscope, 136(3), 1099-1108. https://doi.org/10.1002/lary.70194
MLA
Aulakh A, et al. "Machine Learning to Predict Extranodal Extension in Head and Neck Squamous Cell Carcinoma: A Systematic Review and Meta-Analysis." The Laryngoscope, vol. 136, no. 3, 2026, pp. 1099-1108.
PMID
41070703
Abstract
[OBJECTIVE] To evaluate the clinical utility of machine learning algorithms (MLAs) in diagnosing extra-nodal extension (ENE) using CT imaging in HNSCC.
[DATA SOURCES] A comprehensive literature search was conducted on MEDLINE (Ovid), EMBASE, Cochrane, Scopus, and Web of Science, from January 1, 2000, to February 12, 2025.
[REVIEW METHODS] Two independent reviewers selected studies reporting the diagnostic accuracy of MLAs in detecting ENE in patients with HNSCC. The review followed PRISMA guidelines. Meta-analysis was performed using MedCalc (23.0.2), with pooled estimates of the area under the curve (AUC) and corresponding 95% confidence intervals (CI) calculated. The Checklist for Artificial Intelligence in Medical Imaging (CLAIM) was used to analyze the methodological quality of the included studies.
[RESULTS] Of 57 articles retrieved, six met inclusion criteria, encompassing 2870 lymph nodes from 1407 patients. MLAs achieved a pooled AUC of 0.92 (95% CI [0.915, 0.923], p < 0.001; fixed-effects) and 0.91 (95% CI [0.882, 0.929], p < 0.001; random-effects), outperforming radiologists who had pooled AUCs of 0.65 (95% CI [0.645-0.654], p < 0.001; fixed-effects) and 0.65 (95% CI [0.591-0.708], p < 0.001; random-effects). Furthermore, MLA achieved a sensitivity ranging from 66.9% to 91.2%, compared to 24% to 96.0% by radiologists. The specificity and accuracy of MLA ranged from 72% to 96.2% and 66% to 92.2%, respectively, compared to that of radiologists, which ranged from 43.0% to 96.0% and 51.5% to 88.6%, respectively.
[CONCLUSION] MLAs demonstrate superior diagnostic performance in predicting ENE in HNSCC and may serve as a valuable adjunct to radiologists in clinical practice.
1
Introduction
Spread of cancer cells outside the lymph node through its capsule into the surrounding connective tissue, termed extranodal extension (ENE), is a clinical scenario with profound prognostic and therapeutic implications in head and neck squamous cell carcinoma (HNSCC) [1]. Pathological ENE (pENE), considered the gold standard of ENE diagnosis, is histologically determined on postoperative surgical specimens, while image-detected ENE (iENE) and clinical ENE (cENE) are evaluated during preoperative workup. It is thought that pENE represents the earliest detectable stage, with iENE and, ultimately, cENE manifesting as cancer progression advances [2]. Robust evidence supports the role of pENE as a critical prognostic factor for non-human papillomavirus (HPV)-associated HNSCC [3, 4]. These findings have led to the inclusion of ENE as a criterion in lymph node staging within the 8th edition of the American Joint Committee on Cancer Staging Manual [1]. Previously, it was thought that ENE played a limited role in HPV-associated cases, but more recent evidence is contesting this assessment [4, 5, 6]. Notably, regardless of HPV status, the presence of pENE in many cases warrants the escalation of adjuvant radiation to include concurrent chemotherapy, significantly increasing acute and chronic morbidity [5, 6, 7, 8, 9, 10]. Therefore, preoperative prediction of ENE is highly desirable for individualized surgical strategies and patient counseling. For one, it could mitigate unnecessary upfront surgery in patients with HPV-associated HNSCC who would need concurrent chemoradiation postoperatively. Moreover, not all patients undergo surgery that would provide pENE information: many receive concurrent chemoradiation due to factors like large unresectable tumors or radiologic suspicion of ENE. These individuals, too, could benefit significantly from valid preoperative identification of ENE.
As such, reliable iENE assessment is a potential independent prognostic indicator alongside pENE and cENE. Radiological diagnosis of iENE is achieved through CT imaging, supported by defined diagnostic criteria [11, 12]. However, inter‐observer variability and suboptimal accuracy have posed significant limitations [13]. Machine learning algorithms (MLAs) have been developed as supportive clinical tools across medical specialties, ranging from radiology and pathology to clinical specialties like dermatology, with early evidence indicating promising results of diagnostic performance similar to that of trained clinicians [14, 15, 16, 17, 18, 19, 20]. Despite this potential, the accuracy of MLAs in predicting ENE in HNSCC remains underexplored and has been evaluated only in isolated studies [21, 22].
This systematic review and meta‐analysis aims to address a critical gap by summarizing the existing literature to characterize how MLAs are applied to predict ENE in patients with HNSCC. It also seeks to evaluate and compare their diagnostic accuracy, sensitivity, specificity, and predictive value. By consolidating current evidence, this study provides a foundation for advancing MLA integration into clinical practice.
2
Methodology
2.1
Protocol and Registration
This systematic review and meta‐analysis was reported as per the PRISMA 2020 (Preferred Reporting Items for Systematic Reviews and Meta‐Analyses) guidelines [23]. The protocol was prospectively registered with the National Institute for Health Research International Registry of Systematic Reviews (PROSPERO ID: CRD42023429550).
2.2
Literature Search
The search strategy was first peer‐reviewed and approved by a medical librarian at the University of British Columbia. After approval, the search strategy was applied to conduct a comprehensive literature search on five databases: MEDLINE (Ovid), EMBASE, Cochrane, Scopus, and Web of Science.
The entire search string for PubMed involved the following: (Machine learning OR Artificial Intelligence OR Deep Learning) AND (Head and neck OR oral OR Oropharyngeal) AND (Cancer OR Carcinoma OR Squamous cell carcinoma) AND (Extranodal extension OR Extracapsular extension OR Nodal extension). Other database searches utilized a similar search strategy with altered search terms to fit database specifications, as shown in Table S1. During the database search, the reviewers also manually analyzed all the reference lists of the included articles to obtain any relevant articles.
All the databases were searched independently from January 1, 2000, to February 12, 2025, by three investigators (AA, MS, AS). The start date of January 2000 was chosen because advancements in machine learning for image processing have primarily occurred over the past two decades [24].
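The boolean search strategy above can also be expressed programmatically, which helps keep the database-specific variants (Table S1) consistent. A minimal sketch, assuming the four concept groups of the PubMed string (the helper name and structure are ours, not part of the study's methods):

```python
# Concept groups from the PubMed search string above; terms within a
# group are OR-ed together, and the groups are AND-ed together.
CONCEPTS = [
    ["Machine learning", "Artificial Intelligence", "Deep Learning"],
    ["Head and neck", "oral", "Oropharyngeal"],
    ["Cancer", "Carcinoma", "Squamous cell carcinoma"],
    ["Extranodal extension", "Extracapsular extension", "Nodal extension"],
]

def build_query(concepts: list[list[str]]) -> str:
    """OR the synonyms inside each concept group, then AND the groups."""
    return " AND ".join("(" + " OR ".join(group) + ")" for group in concepts)
```

Adjusting one synonym list then regenerates the full string for each database without retyping the whole query.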
2.3
Eligibility Criteria
All articles retrieved from the databases were assessed according to the predetermined eligibility criteria using the PICO framework (Patient/Population, Intervention, Comparator/Control, and Outcome) [25]. The study population was defined as patients of any age who developed HNSCC with and without ENE, with an available preoperative CT scan and pathology report. The intervention of interest included all machine learning algorithms used to predict ENE. The comparators of the study were radiologists. The primary outcomes of interest were accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and the area under the curve (AUC) of the predictions made by MLAs or radiologists. Clinical trials, observational studies, or case–control studies were eligible for inclusion.
Studies were excluded from the analysis if they were not published in English, did not include patients with HNSCC, did not investigate the use of machine learning as one of the diagnostic interventions, did not investigate ENE of HNSCC as one of its outcomes, or had incomplete data.
2.4
Study Selection and Data Extraction
Three reviewers (AA, MS, AS) independently conducted the study selection and data extraction. The study selection entailed the removal of duplicate articles, screening of abstracts and titles, and, lastly, screening of available full texts to select those that met the inclusion criteria. After completing the study selection, the reviewers appropriately extracted all the relevant data from the included studies. The data extracted from each study included the author ID, the study design, the type of MLA being investigated, the comparator, the sample size, and the tumor the included patients had. Furthermore, the authors also extracted data regarding the reported outcomes of each of the studies, including the AUC, sensitivity, specificity, accuracy, PPV, and NPV. If necessary, attempts were made to contact study authors to obtain missing or unclear data.
2.5
Quality Assessment
The quality of the studies was measured using the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) [26]. The checklist contains 42 components, distributed as follows: the structure of the articles, abstract, title, and introduction (n = 4), methodology (n = 28), results (n = 5), and discussion (n = 5). Each eligible study was evaluated for every component, with a score of 1 allocated for adherence and 0 for non-compliance. Each study's component scores were then summed and presented as a total (ƩCLAIM score).
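The binary scoring scheme reduces to a per-domain sum capped by each domain's item count. A small sketch of that bookkeeping (the domain labels and helper function are ours; the item counts are those listed above):

```python
# CLAIM item counts per domain, as described above (4 + 28 + 5 + 5 = 42).
CLAIM_DOMAINS = {
    "title/abstract/introduction": 4,
    "methodology": 28,
    "results": 5,
    "discussion": 5,
}

def claim_total(scores: dict[str, int]) -> int:
    """Validate each domain score against its item count, then sum to ƩCLAIM."""
    for domain, score in scores.items():
        if not 0 <= score <= CLAIM_DOMAINS[domain]:
            raise ValueError(f"invalid score for {domain}: {score}")
    return sum(scores.values())
```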
2.6
Statistical Analysis
The MedCalc Statistical Software version 23.0.2 (MedCalc Software Ltd., Ostend, Belgium) was used to pool and compare the AUC outcomes and their respective standard errors (SE). Where authors did not provide SEs for reported AUC values, SEs were estimated from the AUC and sample size using the Hanley and McNeil exponential approximation method [27]. The pooled results were then presented using forest plots. Heterogeneity across the studies was assessed using the I² statistic. Heterogeneity was categorized as low if I² was ≤ 50%, moderate between 51% and 74%, and high if I² was ≥ 75% [28]. Potential publication bias was statistically determined using Egger's regression test [29] and Begg's rank correlation test [30].
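As a concrete illustration of the SE estimation and heterogeneity categorization described above, the Hanley and McNeil exponential approximation can be sketched as follows (a minimal sketch; the function names and example counts are ours, and MedCalc's internal implementation may differ):

```python
import math

def hanley_mcneil_se(auc: float, n_pos: int, n_neg: int) -> float:
    """SE of an AUC via the Hanley & McNeil (1982) exponential approximation.

    n_pos / n_neg are, e.g., the numbers of ENE-positive and ENE-negative nodes.
    """
    q1 = auc / (2 - auc)           # P(two random positives both ranked correctly)
    q2 = 2 * auc ** 2 / (1 + auc)  # P(two random negatives both ranked correctly)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)

def heterogeneity_label(i_squared: float) -> str:
    """Categorize an I^2 value (in %) using the thresholds stated above."""
    if i_squared <= 50:
        return "low"
    if i_squared < 75:
        return "moderate"
    return "high"
```

For example, an AUC of 0.92 with 100 positive and 100 negative nodes yields an SE of roughly 0.02 under this approximation.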
3
Results
3.1
Search Results
Our database searches yielded 83 articles, of which 41 were removed after deduplication. Another 23 articles were excluded after the title and abstract level screening. The full text of the remaining 19 articles was retrieved and assessed; six met our eligibility criteria for inclusion in this systematic review and meta‐analysis. A PRISMA flowchart summarizing the study screening and selection process is outlined in Figure 1.
3.2
Study Characteristics
The characteristics of the included studies are presented in Table 1. Briefly, the included studies were published between 2018 and 2024. Five of the six articles were retrospective [22, 31, 32, 33, 34], and one was prospective [21]. All included studies used CT-based imaging to evaluate for iENE; no studies assessed patients based on cENE. A range of primary HNSCC subsites were investigated. Two studies focused on single subsites: Kann et al. (2023) evaluated HPV-associated oropharyngeal squamous cell carcinoma (OPSCC), while Ariji et al. (2023) evaluated oral cavity carcinoma. Wang et al. and Kann et al. (2020) did not report specific tumor subsites and grouped the study population as a generalized HNSCC group. Lastly, Kann et al. (2018) included patients with HNSCC of the oropharynx, oral cavity, larynx, or salivary gland subsites.
Among the included studies, four [21, 31, 33, 34] used a pure deep learning model to detect ENE, while one study [32] used a radiomics-based model. One study [22] incorporated both models as a means for comparison. Different MLA architectures were used across these studies: Ariji et al. [31] used an MLA with a convolutional neural network architecture (AlexNet), Huang et al. [32] used an evolutionary learning architecture (EL-ENE), and Wang et al. [34] used a Gradient Mapping Guided Explainable Network (GMGENet) framework. Two studies by Kann et al. [22, 33] used the deep neural network-based BoxNet, the smaller-scale SmallNet, and DualNet, which leverages two parallel deep convolutional neural networks to improve accuracy and performance. However, the most recent publication from this group [21] used only the DualNet system.
HNSCC diagnosis within the patient population varied between studies. Ariji et al. [31] included patients with oral cavity squamous cell carcinoma (predominantly HPV‐negative cohort), whereas Kann et al. [21] exclusively included patients with HPV‐mediated oropharyngeal carcinoma. Other included studies encompassed a range of HNSCC subsites, with both HPV‐mediated and HPV‐negative cases [22, 31, 32, 33].
Regarding comparators, five studies compared assessments made by MLA to those of trained neuroradiologists or radiologists [21, 22, 31, 32, 33]. Two studies [31, 32] used predefined imaging criteria to guide radiologists in their evaluation of ENE, while three studies [21, 22, 33] allowed radiologists to evaluate based on their clinical judgment. Across these studies, radiologists manually segmented lymph nodes prior to model training. Most studies used 3D segmentation, except one [31], which used 2D cropped axial images. One study compared assessment using the GMGENet framework with other models and used automated 3D segmentation without manual radiologist input [34].
3.3
Quality Assessment
The studies by Kann et al. [33], Kann et al. [21], and Huang et al. [32] had the highest total CLAIM score of 35, followed by Wang et al. [34] (ƩCLAIM score = 32), Kann et al. [22] (ƩCLAIM score = 30), and Ariji et al. [31] (ƩCLAIM score = 22) (Table 2). The detailed domain‐level CLAIM scores of the included studies are presented in Table S2.
3.4
Primary Outcomes
When comparing MLA to radiologists' assessments, the studies reported improved results for specificity (66%–92.2% versus 51.1%–88.6%), sensitivity (60.0%–91.2% versus 24.0%–96.0%), NPV (62.3%–95.0% versus 50.7%–97%), PPV (48.0%–96% versus 33%–80%), and AUC (0.76–0.92 versus 0.52–0.82) (Table 2).
3.5
Pooled Diagnostic Performance
Four studies [21, 22, 33, 34] provided data for pooled AUC estimation with MLA-based diagnostic assessment, albeit with significant inter-study heterogeneity (I² = 62.48%; p = 0.0461) [24, 25, 35, 36]. MLA-based diagnostic assessment achieved a pooled AUC of 0.92 (95% CI [0.915, 0.923], p < 0.001) in the fixed- and 0.905 (95% CI [0.882, 0.929], p < 0.001) in the random-effects model (Figure 2A, Table S3). In comparison, in a pooled analysis of four studies [21, 22, 31, 33], radiologists achieved a lower pooled AUC of 0.65 (95% CI [0.645, 0.654], p < 0.001) and 0.65 (95% CI [0.591, 0.708], p < 0.001) using the fixed- and random-effects models, with high heterogeneity (I² = 97.90%; p < 0.001) (Figure 2B, Table S3). The pooled AUC in the random-effects model was significantly different between MLA-based and radiologists' assessments (difference = 0.255, SE = 0.0322, p < 0.001).
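The fixed- and random-effects pooling reported above follows standard inverse-variance weighting. A DerSimonian-Laird sketch with hypothetical AUC/SE inputs is shown below (an illustration of the general method only, not a reproduction of MedCalc's computation or of the study-level data):

```python
import math

def pool_auc(estimates: list[float], ses: list[float]):
    """Inverse-variance fixed-effect pooling plus a DerSimonian-Laird
    random-effects adjustment; returns (fixed, random, I^2 in %)."""
    w = [1 / se ** 2 for se in ses]
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    # Cochran's Q, then the between-study variance tau^2
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    df = len(estimates) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    i2 = 100 * max(0.0, (q - df) / q) if q > 0 else 0.0
    # Random-effects weights inflate each study's variance by tau^2
    w_star = [1 / (se ** 2 + tau2) for se in ses]
    random = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    return fixed, random, i2
```

When heterogeneity is present, tau² > 0 flattens the weights and widens the pooled variance, which is why a random-effects CI is generally wider than its fixed-effects counterpart.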
3.6
Publication Bias
No evidence of publication bias was noted in Egger's and Begg's tests for studies reporting data for MLA-based (intercept = −1.25, p = 0.3; Kendall's τ = 0.0, p = 1.0) and radiologists' (intercept = −2.18, p = 0.74; Kendall's τ = −0.33, p = 0.5) assessments.
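Egger's test regresses the standardized effect on precision; a nonzero intercept suggests small-study asymmetry. A bare-bones sketch (ours; the reported p-values additionally require the intercept's standard error and a t-test, which a statistics package provides):

```python
def egger_regression(effects: list[float], ses: list[float]):
    """Fit standardized effect (effect/SE) against precision (1/SE) by
    ordinary least squares; returns (intercept, slope)."""
    x = [1 / se for se in ses]
    y = [e / se for e, se in zip(effects, ses)]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope
```

With perfectly symmetric inputs the intercept is zero; the farther it drifts from zero relative to its standard error, the stronger the suggestion of publication bias.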
4
Discussion
Our systematic review of the available literature on the efficacy of MLAs for the presurgical determination of ENE indicated that they may have better specificity, sensitivity, NPV, PPV, and AUC than conventional image assessment. AUC is the primary performance metric clinicians use to decide whether a prediction model may be used in practice [35]. Here, the pooled data showed a significantly different pooled AUC of 0.91 for MLA versus 0.65 for non-MLA diagnosis. However, our analysis indicated a high level of inter-study heterogeneity. Technical or computational differences among the various MLA architectures used by the included studies, as well as the range of HNSCC subsites analyzed within the studies, likely contributed to the observed heterogeneity. In the current study, four of the five studies [21, 22, 31, 33] that provided data for pooled analysis used a pure deep learning model to detect ENE, while Huang et al. [32] used a radiomics-based model, and Kann et al. [22] incorporated both models. The variety of MLAs used increases heterogeneity and complicates analysis. Furthermore, differences in patient cohorts between studies, specifically HPV involvement, could have influenced diagnostic performance. HPV-positive and HPV-negative HNSCC are known to exhibit distinct patterns of nodal spread and presentation, which may affect MLA detection of iENE [36].
Our results must be interpreted with the limitations of MLA architectures in mind. For example, neural networks commonly overfit their training data and can be brittle: small, imperceptible perturbations to an input image can lead to incorrect predictions [37, 38]. These methodological limitations can be particularly problematic in medical image analysis, which demands high accuracy despite significant variability across imaging devices and patient populations [14, 37, 38, 39]. Given the high inter‐study heterogeneity for MLA‐based diagnostic assessment in our meta‐analysis, the risk of overgeneralization persists.
On the other hand, deep learning models require a large volume of training data, which may not be readily available for specific conditions due to a small number of patient records or restricted access to such data due to privacy concerns [38, 39]. Even if the required training data is available, the decision‐making processes of deep learning models are complex to interpret [38, 39]. This low explainability of deep learning models can be a substantial issue in clinical settings where understanding the reasoning behind a diagnosis is crucial [38, 39].
Affordability should also be considered when adopting MLAs in clinical practice. Kann et al. [22] were the first to report an MLA specifically developed for HNSCC (DualNet), which achieved high diagnostic performance, but they did not discuss the cost of the system. Later, Ariji et al. [31] reported that the DualNet model required an expensive graphics processing unit (GPU) without providing a detailed cost analysis, which motivated the development of a cheaper model requiring less memory and GPU capacity.
Despite their limitations, currently available MLAs have tremendous potential to assist in ENE diagnosis, especially considering that, unlike the studies discussed in this review, which employed radiologists with more than 10 years of practice, many practitioners must make decisions on diagnosing iENE with much less experience. Notably, for conventional image analysis, several studies have demonstrated low inter‐rater concordance for iENE with poor predictive value for pENE [40, 41, 42, 43, 44]. As previously noted by Morey et al. [44], the gold‐standard imaging modality for radiological assessment of ENE in HNSCC is yet to be established, and the clinician's choice of imaging techniques currently includes CT, MRI, PET, and ultrasound, for which the assessment criteria have been largely subjective. In addition, Yan et al. [36] illustrated a significant difference in the diagnostic performance of CT‐detected iENE in evaluating pENE of HPV‐positive and HPV‐negative HNSCC, with lower accuracy in HPV‐positive HNSCC. Although sharing experience among radiologists and consolidating operating definitions have been demonstrated to improve diagnostic reliability [41, 43], the first expert consensus definitions and diagnostic criteria for ENE in HNSCC were only published in mid‐2024 [12]. The included studies employed variable and partially subjective criteria for diagnosing iENE, often relying on practitioner expertise. Ariji et al. [31] characterized iENE based on a minor axis > 11 mm, central necrosis, and irregular nodal borders, whereas Huang et al. [32] defined iENE using irregular nodal enhancement, poorly defined margins, infiltration of adjacent fat planes, central necrosis, and nodal matting. Other studies using iENE diagnosis as a comparator relied on practitioners' judgment rather than pre‐defined criteria.
The most recent expert consensus recommends irregular borders, poorly defined margins, perinodal fat extension, and conglomerate, matted, or coalescent nodes as key diagnostic criteria while discouraging the use of central nodal necrosis [12]. Importantly, all studies included in this review focused exclusively on iENE, emphasizing the clinical relevance of MLA support in interpreting complex image features. The heterogeneity in diagnostic criteria across studies underscores the subjectivity inherent in iENE assessment and highlights the potential role of MLAs in standardizing and augmenting diagnostic accuracy.
While ENE prediction in non‐HPV‐associated HNSCC would currently not significantly change initial therapy, improved accuracy of iENE diagnosis can be important for eliminating the risk of trimodality therapy in HPV‐positive OPSCC patients with a high probability of ENE. This diagnostic dilemma was specifically highlighted by the results of the ECOG‐ACRIN E3311 trial, where the presence of previously undetected ENE was used as one of the primary clinical indications for postoperative radiation therapy, and patients with ENE larger than 1 mm were indicated for more aggressive radiation therapy and adjuvant chemotherapy [10]. Using data from the ECOG‐ACRIN E3311 trial, Kann et al. (2023) demonstrated that MLA outperformed radiologists in predicting iENE, particularly for ENE exceeding 1 mm [21]. Given the prognostic significance of detecting ENE larger than 1 mm in stratifying high‐risk patients and guiding adjuvant therapy, these findings underscore the potential of MLA to inform postoperative chemoradiation decisions. The performance of these models in the context of OPSCC is especially important, as HPV‐positive and HPV‐negative OPSCC exhibit distinct nodal characteristics and biological behavior, which may affect the patterns of iENE diagnosis that MLAs rely on [45]. However, recent studies have shown that MLAs trained predominantly on HPV‐negative OPSCC can generalize effectively to HPV‐associated cases, suggesting their potential applicability across both disease subtypes [21]. Nonetheless, ensuring adequate representation of both HPV‐mediated and HPV‐independent cancers in MLA training datasets will be crucial for maximizing model generalizability. The use of MLAs by practitioners can improve the sensitivity of diagnosing iENE in the preoperative setting, thereby reducing the probability of trimodality therapy in HPV‐associated HNSCC patients at higher risk of pENE.
Moreover, MLAs can assist physicians in optimizing targeted treatment during primary chemoradiation. ENE is widely recognized as a poor prognostic factor for overall survival and regional control in HNSCC [46]. As a result, current guidelines recommend considering higher radiation doses for nodal regions with ENE to ensure adequate treatment [46]. The use of MLAs can enhance the accuracy of iENE diagnosis, enabling more precise identification of nodes requiring more aggressive treatment, which may reduce the risk of recurrence. Furthermore, accurate prediction of the absence of pENE may allow for lower radiation doses, thereby reducing treatment‐related morbidities.
Despite the benefits of MLA in the context of iENE diagnosis, it remains unclear whether the additional ENE identified by MLAs provides true prognostic value. This parallels the ongoing debate surrounding pathological ENE, particularly regarding the appropriate millimeter cut‐off for prognostic and clinical relevance. Similarly, MLA‐identified ENE may overestimate certain cases, potentially classifying patients as high‐risk and driving their treatment towards a non‐surgical route when they may have been appropriate surgical candidates [47]. Furthermore, differences in biological sub‐entities of HNSCC, such as HPV‐associated cancer, might need separate cut‐offs or criteria. Ultimately, an ideal MLA assisting in the diagnosis of iENE would allow for adjustable sensitivity and certainty based on the clinical scenario and the specific needs of the referring clinician.
5 Limitations
First, a limited number of studies were included in the review since research on using MLA to predict ENE in HNSCC is relatively new and still emerging. Second, none of the included studies assessed the impact of radiologist‐specific factors, such as subspecialty training or years of experience, on the accuracy of ENE diagnosis, which may be a significant confounder [48]. Therefore, the current study cannot make inferences on the clinical significance of incorporating MLA into radiologists' clinical practice, which is the main aim of developing MLAs. Third, the developed MLAs are not open source and thus not freely available to clinicians in different areas, making it challenging to validate their performance in the clinical setting; future models should be released in an open‐source format to enable such validation. Fourth, the included studies did not provide sufficient data for a subgroup analysis of the accuracy of MLA in detecting ENE in HPV‐associated HNSCC, even though this is a clinically highly relevant subpopulation.
6 Conclusion
Although research on MLA‐based medical image analysis is still emerging, the current evidence indicates that MLAs perform better than conventional analysis in diagnosing ENE in HNSCC patients. However, limitations of MLAs have so far restricted their application to ENE diagnosis in routine clinical practice. Further prospective or randomized clinical trials should, therefore, be carried out to determine the diagnostic ability of radiologists when they incorporate MLA as an aid in diagnosing ENE within HNSCC. Further cross‐domain research is warranted to improve the accuracy and reliability of MLA‐based iENE diagnosis.
Ethics Statement
This study did not require approval by an institutional review board.
Conflicts of Interest
The authors declare no conflicts of interest.
Supporting information
Table S1: Detailed search strategy for various database searches.
Table S2: Checklist for Artificial Intelligence in Medical Imaging (CLAIM).
Table S3: Meta‐analysis of pooled AUC of MLAs and radiologists in the prediction of ENE.
Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original article.