Performance of artificial intelligence in breast cancer screening programmes: a systematic review.
Jassim G, Otoom O, et al. Performance of artificial intelligence in breast cancer screening programmes: a systematic review. BMJ Open 2025;15(12):e111360. doi:10.1136/bmjopen-2025-111360. PMID: 41475802.
Abstract
[OBJECTIVE] With growing interest in applying artificial intelligence (AI) to population breast cancer screening, the evidence base has expanded rapidly. This review aims to systematically identify and summarise the published evidence on the use of AI in breast cancer screening.
[DESIGN] We conducted a systematic review of primary studies assessing AI for screening mammography, extracting test-accuracy metrics (sensitivity, specificity, recall and cancer detection rates) and workflow outcomes.
[DATA SOURCES] We searched the Cochrane Breast Cancer Group Specialised Register, Cochrane CENTRAL, PubMed, Embase (Elsevier), Scopus, ClinicalTrials.gov and the WHO International Clinical Trials Registry Platform from January 2012 to June 2025; we also screened reference lists of included studies and relevant reviews. No language restrictions were applied.
[ELIGIBILITY CRITERIA] Primary studies evaluating AI for screening mammography (digital mammography or digital breast tomosynthesis) in asymptomatic women, assessing AI as a standalone reader or AI-assisted radiologist workflows versus radiologists alone. Eligible designs included randomised trials, prospective paired reader studies, real-world implementation/registry cohorts, retrospective cohorts and multireader-multicase reader studies conducted in population-based or opportunistic screening settings. Key outcomes included diagnostic accuracy metrics (eg, sensitivity, specificity, Area Under the Curve (AUC)) and/or programme metrics (cancer detection rate (CDR), recall/abnormal interpretation rate, positive predictive value, arbitration/workload). We excluded protocols, pilot/feasibility studies, case reports, editorials and studies without relevant accuracy or screening outcomes.
[DATA EXTRACTION AND SYNTHESIS] Two independent reviewers extracted data and assessed risk of bias. Study quality was appraised with Quality Assessment of Diagnostic Accuracy Studies-2 and an AI-specific critical appraisal tool, and findings were synthesised narratively with stratification by study design and AI integration role.
[RESULTS] 31 studies met the inclusion criteria, encompassing randomised controlled trials, prospective paired-reader studies, registry-based implementations and retrospective simulations, representing more than two million screening examinations across Europe, Asia, North America and Australia. When used as a second reader or within double-reading workflows, AI generally maintained or modestly increased sensitivity (up to +9 percentage points (PP)) while preserving or improving specificity. Triage and decision-referral configurations delivered the greatest operational benefit, reducing reading volumes by 40-90% while maintaining non-inferior cancer detection when thresholds were conservatively calibrated. Stand-alone AI achieved AUC values comparable to radiologists and similar cancer detection in real-world, non-enriched cohorts, although interval-cancer follow-up remains incomplete in several datasets. In prospective randomised evidence, including the Mammography Screening with Artificial Intelligence (MASAI) trial, AI-supported screening achieved higher CDRs (6.4 versus 5.0 per 1000; p=0.0021) with stable or reduced false-positive and recall rates. Across implementation and simulation settings, integration of AI reduced radiologist workload substantially, with triage and band-pass approaches reducing the number of reads by approximately 40-90%. Overall certainty is limited by heterogeneity across study designs, reliance on enriched datasets for some accuracy estimates and incomplete interval-cancer follow-up in several major studies.
[CONCLUSION] Contemporary AI systems show diagnostic performance that is broadly comparable to radiologists and can substantially reduce reading workload, particularly when used as a second reader or triage tool. Emerging prospective evidence supports their safe integration in these roles, although transparent reporting, standardised evaluation and long-term population studies are still required before considering AI as a stand-alone reader. AI may improve workflow efficiency and possibly cancer detection, but definitive evidence on safety, especially interval cancer outcomes, remains essential.
Introduction
Breast cancer is the most common cancer among women. Its incidence continues to rise,1 and the only hope in the absence of an effective cure is early, accurate detection through routine screening, while minimising false positives, which result in patients undergoing unnecessary procedures. Mammography has been the gold standard test for breast cancer screening2; however, the interpretation is largely dependent on the expertise and training of radiologists.3 Furthermore, it is time-consuming as ideally it requires independent double reading to reduce recall rates.4
One possible solution for reducing errors and workload is artificial intelligence (AI) models, which automatically locate abnormalities (detection) and categorise examinations as either normal or abnormal (classification).
Over the past decade, advances in deep learning, a major branch of AI, have transformed the development and deployment of image-based models across multiple sectors, including healthcare. Unlike traditional machine-learning approaches that require explicit feature engineering, deep learning methods learn hierarchical image representations directly from data, enabling substantial gains in performance for tasks involving unstructured inputs such as medical images, video and free text. Early progress in medical imaging was driven largely by convolutional neural networks (CNNs), a class of neural architectures specifically designed for image analysis. CNNs remain widely used in breast imaging because they can identify salient visual patterns without hand-crafted rules, learning decision boundaries from labelled examples and applying them to previously unseen examinations.5
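To make the class of model described above concrete, the sketch below defines a deliberately tiny CNN that maps a single mammographic view to a malignancy score. It is an illustrative assumption for exposition only; the architecture, layer sizes and names do not correspond to any commercial or study-specific system reviewed here.

```python
# Minimal, illustrative CNN for scoring a single mammographic view (PyTorch).
# All architectural choices here are assumptions for exposition, not a
# reconstruction of any system evaluated in the included studies.
import torch
import torch.nn as nn

class TinyMammoCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # greyscale input
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # pool learned feature maps to one vector
            nn.Flatten(),
            nn.Linear(32, 1),         # single malignancy logit
        )

    def forward(self, x):
        return torch.sigmoid(self.head(self.features(x)))  # score in [0, 1]

model = TinyMammoCNN()
dummy_view = torch.randn(1, 1, 256, 256)  # one synthetic CC or MLO view
print(f"suspicion score: {model(dummy_view).item():.3f}")
```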
Although CNNs provided the foundation for modern imaging AI, the field has continued to evolve rapidly. Contemporary systems increasingly incorporate hybrid architectures, including transformers and multimodal models that integrate imaging, clinical and demographic data. These advances reflect both methodological innovation and the growing availability of large-scale digital imaging datasets. Nevertheless, the performance and reliability of any deep-learning model depend critically on the quality, diversity and accuracy of the training data, and the acquisition and expert annotation of medical images remain a major bottleneck in developing robust AI systems for clinical use.6
A small number of randomised/prospective and implementation studies, alongside numerous retrospective simulations, have explored the effectiveness of AI applications in mammogram interpretation, aiming to increase accuracy and to reduce the workload and time required to read large volumes of images, since these algorithms are unaffected by fatigue.7 8
In terms of patient outcomes, AI solutions could result in fewer missed cancers and lower recall rates.9 10 However, they may also result in overdiagnosis and overtreatment, thus altering the balance of benefits and harms, especially if they differentially detect more microcalcifications (associated with lower grade ductal carcinoma in situ).11 In other words, they may result in overdiagnosis of indolent breast cancers that may not affect women in their lifetime. Additionally, algorithms within an AI system are not always transparent or explainable, making communicating to the patient their findings and indications for life-altering surgery potentially problematic. Furthermore, unlike human interpretation, how or why an algorithm has made a decision can be difficult to understand (known as the 'black box' problem).12 Another problem with AI imaging solutions is that algorithms that demonstrate superior performance on a particular dataset of mammograms may perform poorly on another. This highlights the importance of validating an AI model's performance on data obtained from the target population ahead of clinical deployment.
Most studies evaluating the use of AI in the interpretation of mammograms compared it either as a standalone or a second reader in both screening and diagnosis of breast cancer, with many of them demonstrating added value.7 13 However, given the potential risks highlighted above, the true clinical applicability of these solutions remains uncertain.
There is potential that AI models may detect smaller tumours before they become evident to the radiologist, thus reducing interval cancer rates; interval cancers carry a much higher breast-cancer-specific mortality than screen-detected breast cancers. Hence, realising the potential of AI may ultimately improve patients' prognosis and mortality rates.14 Because AI may preferentially surface microcalcifications/Ductal Carcinoma In Situ (DCIS), any apparent gains in detection could increase overdiagnosis under current diagnostic/treatment thresholds; conversely, lower recall/detection might reduce overdiagnosis but risks delaying clinically relevant cases. The net balance is policy- and threshold-dependent and must be monitored prospectively (including interval cancers).14
Additionally, AI systems have been shown to increase the efficiency of radiologists and decrease their workload by identifying and de-prioritising likely negative mammograms so that radiologists’ efforts can be focused on the interpretation of abnormal mammograms.14
The body of evidence supporting the use of AI in breast cancer screening is growing significantly. Meanwhile, real-life implementation of AI in medicine continues to lag behind other industries. Therefore, there is a need to evaluate and synthesise the evidence to translate it into a recommendation for clinical guidelines. Our work aims to complement and address the limitations of previously published reviews that addressed and reported the accuracy of AI in breast cancer screening.7 8 13
To our knowledge, one prior systematic review has examined AI in breast cancer screening using stringent systematic-review methodology, but it excluded non-English studies and thus might have excluded relevant evidence13; several other narrative reviews exist.7 8 14 These reviews addressed the test accuracy of AI applications by comparing their performance (index test) against the gold standard (comparator test). Because of the retrospective nature of the trials included in these reviews, there is a lack of follow-up assessment of the true cancer status of AI-positive, radiologist-negative images.
Additionally, since the publication of these reviews, several prospective trials (where AI is integrated into the actual or simulated screening pathway and is compared with conventional mammogram reading) have been published15–21 and the evidence is yet to be compiled to assess whether the integration of AI in actual screening programmes is an effective strategy.
With the emergence of new evidence from prospective trials of AI in breast screening, population screening programmes and imaging services, the new evidence needs to be compiled to inform guidelines and policy makers. This review will provide scientific justification for planning screening programmes by assessing the effectiveness of AI in various workflows and screen-reading strategies. Further, in this review, we will employ a specific tool to appraise studies that utilised AI for diagnostic purposes and to aid clinical decision support.22
Objective
This review thus aims to systematically identify and summarise the published evidence on the use of AI in breast cancer screening. Specifically, it will evaluate whether incorporating AI, as a standalone reader or as a decision support tool, in mammographic screening within a single- or double-reader workflow improves screening efficacy compared with conventional single or double reading without AI.
Methods
Design
This review was conducted in accordance with the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy23 and is reported following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) extension for diagnostic test accuracy studies (online supplemental table S1) and the complementary Synthesis Without Meta-analysis (SWiM) reporting items (online supplemental table S2).
Data sources and searches
We searched the Cochrane Breast Cancer Group Specialised Register, CINAHL (https://about.ebsco.com/products/research-databases/cinahl-database), the Cochrane Library (https://www.cochranelibrary.com/), PubMed, Medline (http://www.nlm.nih.gov/bsd/pmresources.html), Embase (Elsevier) (https://www.elsevier.com/products/embase), Scopus (https://www.scopus.com/pages/preview), the World Health Organization International Clinical Trials Registry Platform (WHO ICTRP) (https://trialsearch.who.int/) and ClinicalTrials.gov from January 2012 to June 2025. We set January 2012 as the starting point because deep learning-based methods (now the predominant approach in AI for mammography) were not available before this period, and earlier Computer-Aided Diagnosis (CAD) systems differ substantially in architecture, performance characteristics and clinical applicability; restricting inclusion to 2012 onwards ensured relevance to contemporary AI technologies and workflows. No language restrictions were applied. We also screened reference lists of included studies and relevant reviews. The full search terms used in this study are presented in online supplemental table S3. Search results from all sources were imported into Covidence,24 where duplicate records were automatically identified and removed. Titles and abstracts were screened for relevance, followed by full-text review of potentially eligible articles.
Study selection
We included primary studies evaluating AI for screening mammography (digital mammography or digital breast tomosynthesis) in asymptomatic women, assessing AI either as a standalone reader or within AI-assisted radiologist workflows, compared with radiologists alone. Eligible designs included randomised trials, prospective paired reader studies, real-world implementation/registry cohorts, retrospective cohorts and multireader multicase reader studies conducted in population-based or opportunistic screening settings. No language restrictions were applied. Key outcomes included diagnostic accuracy metrics (eg, sensitivity, specificity, AUC) and/or programme metrics (CDR, recall/abnormal interpretation rate, positive predictive value, arbitration/workload). We excluded protocols and pilot/feasibility studies due to small sample sizes, and studies lacking relevant test accuracy or screening performance outcomes.
Data extraction
The titles and abstracts were independently reviewed by two authors (GJ and JH). We retrieved the full text of all studies that potentially met the inclusion criteria. We contacted study authors if data were missing or unclear to make a judgement regarding inclusion. Disagreement was resolved by discussion and, if needed, a third author (OO) resolved any remaining dispute.
We pilot tested the eligibility criteria on a sample of reports (six studies, including ones that are thought to be definitely eligible, definitely not eligible and doubtful). The pilot test was used to refine and clarify the eligibility criteria and to ensure consistency between investigators. Covidence software was used to automate the selection process.24
We used Covidence24 structured data extraction forms to gather pertinent information from included studies and then exported the information to a specially designed data collection sheet, where additional data relating to diagnostic accuracy studies were entered.
These included characteristics of study populations, settings, interventions (index test), comparators, study designs, type of images, type of AI software, ground-truth reference and outcomes. Two authors (GJ and JH) collected data from each study.
Risk of Bias assessment
The quality of included studies was assessed by two review authors (GJ and JH) independently. A third review author (OO) was consulted in case of disagreement. Because most of the studies were retrospective/prospective diagnostic accuracy studies, we used the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool to assess the quality of primary diagnostic accuracy studies.25 This tool comprises four domains: patient selection, index test, reference standard, and flow and timing. Each domain is assessed in terms of risk of bias, and the first three domains are also assessed in terms of concerns regarding applicability. Signalling questions are included to help judge risk of bias.25 Further, we used a specific tool (APPRAISE AI) that appraises studies that utilised AI as an index test in clinical decision support studies.22 When applied in full, the tool awards a score out of 100 based on the assessment of 24 features of the study spanning the introduction, methods, results, discussion and 'other information'. We elected to evaluate only features 13 through 22 (results and discussion), as the remaining features (1–12 and 23–24) either overlapped with QUADAS-2 or were applicable to pilot studies, which are required to report extensively on technical features including hyper-parameter tuning and data splitting that are never reported when commercial AI solutions are utilised, due to IP concerns. The results and discussion sections of APPRAISE AI comprise the following domains: cohort characteristics (max score 4), model specification (max score 3), model evaluation (max score 5), clinical utility assessment (max score 5), bias assessment (max score 6), error analysis (max score 4), model explanation (not scored), critical analysis (max score 5), implementation into practice (max score 1) and limitations (max score 2). We reported this tool in addition to QUADAS-2 to summarise reporting quality and implementation readiness (eg, calibration, fairness, explainability, economic endpoints); it was not used to include/exclude studies or to grade bias.
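For transparency, the subscale arithmetic can be written out explicitly. The sketch below simply records the domain maxima listed above and confirms that they sum to the 35-point scale referred to in the Results; it adds no information beyond the text.

```python
# Domain maxima for the APPRAISE AI results/discussion subscale (features
# 13-22), as listed in the text above; 'model explanation' is not scored.
appraise_ai_max = {
    "cohort characteristics": 4,
    "model specification": 3,
    "model evaluation": 5,
    "clinical utility assessment": 5,
    "bias assessment": 6,
    "error analysis": 4,
    "critical analysis": 5,
    "implementation into practice": 1,
    "limitations": 2,
}
assert sum(appraise_ai_max.values()) == 35  # the subscale total used here
print(sum(appraise_ai_max.values()))
```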
For randomised controlled trials (RCTs) assessing the screening performance of AI versus standard practice in mammogram screening, we also applied version 2 of the Cochrane risk-of-bias tool for randomised trials (RoB 2)26 for the randomisation aspects, in addition to QUADAS-2 and the AI appraisal tool. RoB 2 is structured into a fixed set of bias domains, focusing on different aspects of trial design, conduct and reporting. Within each domain, a series of questions ('signalling questions') aims to elicit information about features of the trial that are relevant to risk of bias. The judgement can be 'low' or 'high' risk of bias, or can express 'some concerns'.
Outcomes
The primary outcomes were measures of diagnostic accuracy relevant to breast cancer screening, including sensitivity, specificity, CDR, recall rate, false-positive rate, positive predictive value and Receiver Operating Characteristic / Area Under the Curve (ROC/AUC). Where available, interval cancer rates and stage or tumour characteristics were extracted to assess downstream safety. Secondary outcomes included workflow and operational metrics, such as reading workload, proportion of examinations triaged or deferred and changes in single- versus double-reading allocation. For studies reporting enriched datasets, accuracy measures were extracted but interpreted separately from programme-level outcomes. All outcomes were recorded as reported in the original studies without re-calculation unless explicitly stated.
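The accuracy metrics above follow standard definitions. The sketch below makes those definitions explicit using an invented 2×2 confusion matrix for 1,000 screens; the counts are illustrative assumptions, not data from any included study.

```python
# Standard screening test-accuracy metrics from a 2x2 confusion matrix.
# Counts are invented for illustration only.
tp, fn = 6, 1        # screen-detected cancers vs missed cancers
fp, tn = 30, 963     # false recalls vs correctly cleared exams
n = tp + fn + fp + tn  # 1,000 screening examinations

sensitivity = tp / (tp + fn)       # proportion of cancers recalled
specificity = tn / (tn + fp)       # proportion of non-cancers cleared
recall_rate = (tp + fp) / n        # abnormal interpretation rate
cdr_per_1000 = 1000 * tp / n       # cancer detection rate per 1,000 screens
ppv = tp / (tp + fp)               # positive predictive value of recall

print(f"Se={sensitivity:.2f} Sp={specificity:.3f} recall={recall_rate:.1%} "
      f"CDR={cdr_per_1000:.1f}/1000 PPV={ppv:.2f}")
```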
Statistical analysis
Given the substantial heterogeneity across studies in terms of population, design, setting, outcome measurement, index test and AI technologies, conducting a quantitative synthesis was deemed inappropriate. Therefore, the findings are presented as a systematic review.
Further, we assessed heterogeneity qualitatively rather than statistically. Differences in population characteristics, imaging modality (digital mammography vs digital breast tomosynthesis, DM/DBT), AI platform, case mix (including the use of enriched datasets) and outcome definitions were examined during data extraction to determine whether pooling would be appropriate. Substantial methodological and clinical heterogeneity, together with variable reporting of key performance measures, precluded meta-analysis. Accordingly, subgroup analyses were planned a priori to explore patterns in performance across (i) study design (randomised trials, prospective cohorts, retrospective programme datasets and enriched reader studies), (ii) the clinical role of AI (concurrent decision support, triage/decision-referral, stand-alone reader) and (iii) AI platform characteristics where reporting permitted. These subgroup analyses were used to structure the narrative synthesis and to avoid inappropriate cross-group pooling of diagnostic accuracy measures.
Certainty of evidence
Certainty of the synthesis findings was evaluated qualitatively, drawing on established domains relevant to diagnostic test accuracy evidence. We considered the risk of bias (QUADAS-2), applicability concerns, consistency of effects across studies, precision of estimates, indirectness related to enriched datasets or simulated workflows and reporting completeness. Because statistical pooling was not undertaken, we did not generate GRADE summary tables; instead, we applied GRADE-DTA principles narratively to judge the overall confidence in the direction and robustness of findings within each subgroup (study design and AI role). Particular weight was given to prospective and population-based evidence, while findings derived from enriched or laboratory-style datasets were rated as lower certainty due to limited generalisability. These assessments informed the interpretation of results and the strength of the conclusions drawn.
Patient and public involvement
None.
Results
Search results
Database searches yielded 507 unique records, of which 473 underwent title and abstract screening; 142 potentially eligible full texts were assessed, and 31 primary studies met the inclusion criteria and were included in this review. The PRISMA flow diagram is provided as online supplemental figure S1.
Characteristics of the included studies
We included 31 studies spanning Europe, Asia, Australia and North America, covering screening programmes in Israel, Korea, France, Sweden, Germany, Denmark, Spain, Norway, Australia, the UK, Austria and the USA. Several included studies were multinational, spanning sites on two continents, for example, UK–USA27 28 or Korea–USA–UK.29 Designs ranged from a single RCT (MASAI, with an interim analysis (Lång)15 and a final report (Hernström)16) to prospective paired-reader non-inferiority studies,17 18 real-world implementation/registry cohorts,19 before–after implementations,21 retrospective cohorts and multireader-multicase (MRMC) reader studies30–32; several conducted retrospective simulations of AI-integration strategies against established double reading.33–36 Most studies were embedded in organised population screening with standard double reading; a minority assessed opportunistic screening in health-check centres.37 Sample sizes ranged from enriched MRMC sets of 240–314 exams to national programme registries with ≥1 million exams. The predominant modality was two-view 2D full-field digital mammography (FFDM) with CC/MLO views; a subset incorporated DBT (eg, Hologic DBT with synthetic 2D) or analysed DM and DBT in parallel (online supplemental tables S4 and S5).
Hernström 2025 and Lång 2023 are two publications from the same randomised MASAI trial in southwest Sweden (same protocol, sites and intervention with Transpara v1.7). Lång 2023 is the earlier, clinical-safety interim analysis (workload reduction, recall/false-positive rate (FPR) safety),15 while Hernström 2025 reports early screening performance and tumour characteristics (eg, higher CDR and positive predictive value (PPV); recall/FPR stable) with a larger/updated dataset.16 We included both publications in our review and counted them as one study, but analysed them separately because of the different/updated dataset, re-randomisation and different outcomes in each publication.
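As a worked illustration of how a CDR contrast such as the 6.4 vs 5.0 per 1,000 quoted in the abstract can be compared, the sketch below runs a simple two-proportion z-test. The arm sizes are assumptions chosen only for illustration, and this test is not the trial's pre-specified analysis; the MASAI publications should be consulted for the actual statistical methods.

```python
# Illustrative two-proportion comparison of cancer detection rates.
# Arm sizes are invented; this does not reproduce the MASAI trial's own
# statistical analysis.
from statsmodels.stats.proportion import proportions_ztest

n_ai, n_std = 53_000, 53_000            # assumed screens per arm
detected = [round(6.4 / 1000 * n_ai),    # AI-supported arm (CDR 6.4/1000)
            round(5.0 / 1000 * n_std)]   # standard double reading (5.0/1000)

stat, p = proportions_ztest(count=detected, nobs=[n_ai, n_std])
print(f"detected={detected}, z={stat:.2f}, p={p:.4f}")
```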
The detailed characteristics, funding and eligibility criteria of each included study can be found in online supplemental tables S4 and S5.
Population, eligibility and withdrawals
Most cohorts comprised consecutive, asymptomatic screening attendees (typically 50–69 years in organised European programmes), with clear exclusion criteria (implants, prior breast cancer/surgery, incomplete DICOM, missing follow-up). Korean cohorts sometimes included younger age ranges (≥34–40 years) in tertiary screening settings. Withdrawals/exclusions were reported with flow diagrams where applicable (eg, technical failures, incomplete workups, data mismatches), and simulation studies described dataset construction (eg, inclusion of all cancers and consensus-forwarded exams plus randomly sampled normals) to enable reweighting against programme distributions.
AI interventions
All studies employed an ensemble of CNN models for feature extraction but varied in backbone architectures and ensembling strategies. Additionally, most utilised subnetworks at lesion, view and exam levels, with averaging of outputs to produce the final score. Vendor diversity was broad (Hologic, GE, Siemens/Philips; single-vendor imaging in a few trials). Commercial AI systems dominated: Lunit INSIGHT (multiple studies across Sweden, Denmark, Korea), Transpara (ScreenPoint; RCTs and implementations), Vara MG (Germany), ProFound AI (iCAD; DBT), SaigeQ (DeepHealth/Paige), CureMetrix, Whiterabbit.ai and programme-specific ensembles (BRAIx). Non-commercial prototypes included IBM/Assuta and Google's multi-network ensemble. Reported outputs were exam-level risk scores (eg, 0–100 or 1–10) and lesion/heatmap markings; operating points were pre-specified in prospective/implementation work, whereas some retrospective studies tuned thresholds within the study dataset. The heterogeneity of the models and of the datasets on which they were evaluated thus makes it difficult to perform head-to-head comparisons or a meta-analysis of the results. Only the Akselrod-Ballin (2019) and Pedemonte (2024) studies28 38 utilised clinical variables as inputs in addition to mammography images, whereas all other models relied solely on imaging inputs. Furthermore, in the Pedemonte (2024) study, the high-level 'metamodel' additionally ingested prior mammograms, providing additional context for the assessment of lesions.28 Notably, none of the studies utilised Vision-Transformer architectures or multimodal models, which are the current state of the art in computer vision; only Akselrod-Ballin (2019) and Pedemonte (2024) used multimodal inputs (images plus clinical/prior data).28 38
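A minimal sketch of the score-aggregation pattern described above, with invented numbers: per-view outputs from ensemble members are averaged into a single exam-level score. Real systems differ in how they pool views and lesions (some take a maximum over breasts); simple averaging is used here as a stated assumption.

```python
# Exam-level score aggregation: average ensemble outputs per view, then pool
# views into one exam score. All values are invented for illustration.
import numpy as np

# rows: ensemble members; columns: views (L-CC, L-MLO, R-CC, R-MLO)
view_scores = np.array([
    [0.12, 0.08, 0.71, 0.64],   # model A
    [0.10, 0.11, 0.66, 0.58],   # model B
])

per_view = view_scores.mean(axis=0)  # ensemble average per view
exam_score = per_view.mean()         # simple pooling assumption across views
print(per_view.round(2), f"exam score = {exam_score:.2f}")
```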
Index test and comparators
Across studies, we observed the following recurring arms:
Standard double reading: Two independent readers; arbitration if discordant.
AI-supported double reading (concurrent decision support): Both readers see AI risk score/marks.
AI triage/normal triaging + safety net: AI autolabels a proportion 'normal' (no second human) and flags a small 'safety net' subset for extra review (eg, Vara).
Single reader + AI (assist): One human reader sees AI output.
Reader replacement/deferral: AI replaces one reader or defers a subset to a single reader.
Standalone AI: AI classifies exams at a fixed operating point without a human confirmatory reader (programme policy permitting).
RCTs and real-world implementations principally compared AI-supported workflows versus standard double reading, while MRMC and simulation studies explored multiple threshold configurations (CDR-optimised, recall-optimised, balanced) against the programme reference.
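To make the triage arithmetic concrete (eg, the 40–90% read reductions reported in the Results), the toy simulation below auto-labels low-scoring exams as normal and routes the rest to human readers. The score distributions, prevalence and threshold are invented assumptions, not parameters from any included study; real programmes calibrate operating points on local data.

```python
# Toy simulation of AI 'normal triage': exams scoring below a threshold are
# auto-labelled normal and skip human reading. All parameters are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
cancer = rng.random(n) < 0.006                  # ~6 cancers per 1,000 screens
score = np.where(cancer,
                 rng.beta(5, 2, n),             # cancers tend to score high
                 rng.beta(2, 5, n))             # normals tend to score low

threshold = 0.30                                # assumed triage operating point
to_humans = score >= threshold

reads_saved = 1 - to_humans.mean()
missed = int((cancer & ~to_humans).sum())
print(f"reads saved: {reads_saved:.0%}; cancers auto-triaged away: "
      f"{missed} of {int(cancer.sum())}")
```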
Reference standards (ground truth)
Positive reference standards were uniformly histopathology for screen-detected cancers (biopsy or surgical specimen). Negative verification typically required registry-based or programme follow-up—most often ≥24 months, though some implementations reported ≥180–200 days interim follow-up at the time of analysis with longer ascertainment pending. Several studies explicitly captured interval cancers within 24 months; others (eg, recent implementations/RCTs) flagged interval cancer as the primary endpoint to be completed at ≥2 years.
Dataset composition and reader expertise
Online supplemental table S6 details how the included investigations assembled their evaluation cohorts and the radiologist panels against which AI systems were benchmarked.
Reader paradigms largely mirrored local screening practice: independent double reading with arbitration was the dominant reference, while opportunistic health-check cohorts typically used a single experienced reader. Some before–after implementations retained the same 10 radiologists across periods; MRMC reader studies explicitly documented expertise (eg, 14 MQSA-qualified radiologists in a fully crossed design). Prospective and randomised trials embedded AI into routine workflow with AI-stratified single versus double reading (eg, MASAI). Several studies also compared breast-subspecialist versus general radiologists in paired reads.
Across studies, dataset construction ranged from population screening cohorts to curated splits. Examples of explicit train/validation/test partitioning include Akselrod-Ballin 2019 (9611/1055/2548 women; 73/8/19% by images),38 Frazer 2024 (random 20% client cohort split into 75% test/25% development; remaining 80% for training),35 Pedemonte 2024 (80/10/10% with an external US hold-out)28 and Leibig 2022 (70/15/15% internal plus two-site external test).39 Several implementation or evaluation-only papers used pretrained AI with no in-study split.15–18 20 21 37 40–43
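The partitioning schemes quoted above follow a common pattern. As a minimal reproducible sketch, a 70/15/15 split (the proportions reported for Leibig 2022) at the woman level is shown below with invented identifiers; splitting by woman rather than by exam, so that no individual contributes to both training and testing, is a standard leakage precaution.

```python
# Reproducible 70/15/15 train/validation/test split at the woman level.
# Identifiers and cohort size are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)
ids = np.arange(10_000)         # invented woman-level identifiers
rng.shuffle(ids)

n = len(ids)
train = ids[: int(0.70 * n)]
val = ids[int(0.70 * n): int(0.85 * n)]
test = ids[int(0.85 * n):]
print(len(train), len(val), len(test))  # 7000 1500 1500
```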
Cancer enrichment varied between deliberately enriched datasets and non-enriched datasets reflecting programme prevalence.15–18 20 21 37 40–44
Reader expertise was typically strong and aligned with screening practice. Prospective/implementation settings reported breast-subspecialist panels, eg, Hernström 2025 (16 radiologists; most >5 years' experience; high annual volumes)16 and Lauritzen 2024 (21 full-time breast radiologists; 19 senior (8–28 years), 2 junior).21 Programme cohorts detailed role-stratified experience, eg, Kühl 2024 (23 radiologists; first readers mainly 0–4 years; second readers/arbitrators largely ≥10 years).40 Reader numbers were large in programme-scale evaluations (the Frazer 2024 weighted mean includes 125 readers),35 while MRMC/reader studies specified panels such as Dang 2022 (12 readers; 8 seniors, 4 juniors).30 Some papers did not report reader counts/experience.38 41
Full numeric details for all splits (train/validation/test), density distributions and country-level datasets are provided in online supplemental table S6.
Outcomes
The diagnostic-accuracy outcomes table condenses hundreds of individual point estimates into three pragmatic columns: (i) stand-alone AI performance, (ii) radiologist performance under conventional reading and (iii) the increment (Δ) when radiologists are supported by AI. For each metric, we list the full range observed across the 31 primary studies, noting where ≥10 studies reported the same statistic. A summary of these outcomes is outlined in table 1, and full details of each outcome are provided in online supplemental table S7.
Risk of Bias assessment
We judged methodological quality with QUADAS-2 (four risk-of-bias domains plus three applicability judgements)25 in addition to an AI-specific critical appraisal tool (APPRAISE AI; 35-point scale).22 A summary of the QUADAS-2 assessment is presented in table 2 and figure 1. Further, a full detailed description and justification of the decision in each domain of each included study is presented in online supplemental table S8. Additionally, the risk-of-bias assessment of the two included publications from the MASAI trial yielded low-risk judgements on all domains of the Cochrane RoB 2 tool.
Quality Assessment of Diagnostic Accuracy Studies, version 2 (QUADAS-2)
Across the included studies, QUADAS-2 judgements showed a mixed but interpretable risk-of-bias profile. Patient selection was the most frequent early threat: high-risk calls typically reflected enriched/case–control or non-consecutive sampling and opportunistic cohorts, whereas consecutive programme cohorts were low risk. Patient applicability was generally good, with concerns limited to diagnostic or health-check settings. Index test bias was mostly acceptable; high-risk judgements arose when operating points were tuned on the test data or software/thresholds varied mid-study. Index applicability was uniformly strong because commercial AI on standard FFDM/DBT mirrored intended screening use.
For the reference standard, most studies were low risk, but partial or abbreviated verification, especially interim analyses without complete 24-month interval cancer capture, triggered high/unclear ratings. The flow and timing domain contributed the largest share of high-risk calls, driven by non-uniform verification windows, exclusions after the index test and operational changes (eg, AI outages/re-randomisation in MASAI's early analysis,16 a mid-study threshold change in Lauritzen21 and multiple viewer/model updates with short follow-up19). In contrast, registry-anchored cohorts with uniform ≥24-month follow-up (eg, Larsen) showed consistently low risk for reference and flow.41 Six studies scored low risk on all domains.15 35 37 42 44 Overall, programme-based prospective/registry studies exhibited the strongest internal validity and applicability, whereas enriched MRMC/simulation designs and interim implementations carried predictable risks from selection, verification and timing.
AI-specific appraisal tool (APPRAISE AI)
Methodological quality of the AI modelling studies (35-point scale), with detailed justification of the score in each domain for each study, is reported in online supplemental table S9. The two 'check-box' domains, implementation into clinical practice (max 1) and limitations explicitly stated (max 2), were satisfied by nearly every paper, so they top the list in proportional terms. The remaining multipoint domains are discussed below.
Cohort characteristics (max 4)
Most programme-based studies scored 3–4/4, reflecting large, consecutive screening cohorts and clear flow diagrams.15 19 34 41 45 Lower scores (1–2/4) clustered in small, enriched/MRMC or single-centre datasets,30 31 and single-vendor or single-system sources tempered representativeness.38
Model specification (max 3)
Reporting was typically 1–2/3 for commercial tools (architecture/weights not disclosed; thresholds described). A few research models earned 3/3 by releasing code/architecture46 or giving fuller system detail (some Lunit/Transpara papers specified versions, scores and operating points).
Model evaluation (max 5)
Prospective/RCT or large registry evaluations commonly scored 3–4/5, with prespecified endpoints, CIs and appropriate comparative tests.16 18 19 Pure simulations or enriched reader studies scored 2–3/5; classic sensitivity/specificity/AUC or calibration reporting was often missing in implementation reports, and many lacked external or prospective validation.
Clinical-utility assessment (max 5)
Many studies quantified CDR, recall/FPR, PPVs and workload (triage, reads saved), yielding 3–4/5. Decision-curve, cost-effectiveness and patient-centred outcomes were rare, limiting perfect scores.
Bias assessment/subgroup analysis (max 6)
Scores centred around 3–4/6. Highest scores occurred in prospective or randomised designs with robust verification.16 29 Lower scores reflected enrichment, same-sample thresholding, abbreviated follow-up for negatives or mid-study updates (typical of interim implementations and MRMC/simulation work).
Error analysis (max 4)
Most papers offered illustrative false negative/false positive (FN/FP) cases but few systematic taxonomies; domain scores were usually 1–2/4. Only a handful approached 3/4 by adding qualitative clusters or representation analyses.46 Formal root-cause matrices or reader-error audits were uncommon.
Model explanation (not scored)
Nearly all studies provided risk scores and lesion markings/heatmaps; formal XAI (eg, SHAP/LIME, human-factors usability, trust calibration) was rare and typically not scored in the rubric.
Critical analysis / discussion quality (max 5)
This was a consistent strength (4/5 typical): most papers benchmarked against prior AI/reader literature, reflected on generalisability and threshold trade-offs, and explicitly framed future prospective work. Wu 2020 alone achieved five points with a thorough contextual critique.46
Across the evidence base, a consistent profile emerges; most AI-mammography studies are strong on who they studied and what they found, but far weaker on reporting how the algorithms were built, tuned and stress-tested.
Implementation into practice (max 1)
A minority documented live deployment within routine screening.16 19 21 39 Many remained simulations/retrospective without in-workflow metrics.
Limitations (max 2)
Most earned 2/2 by explicitly acknowledging design constraints (retrospective/simulation, enrichment, single-vendor/site, incomplete interval-cancer ascertainment, lack of calibration/fairness) and setting out prospective validation needs.
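For readers unfamiliar with the APPRAISE-AI arithmetic, a minimal sketch of the tally follows; domain names are shortened and the example scores are hypothetical, but the maxima match those listed above and sum to the 35-point scale.

```python
# Sketch of the 35-point APPRAISE-AI tally using the domain maxima given above.
# Example scores are hypothetical.
DOMAIN_MAX = {
    "cohort_characteristics": 4, "model_specification": 3, "model_evaluation": 5,
    "clinical_utility": 5, "bias_subgroup": 6, "error_analysis": 4,
    "critical_analysis": 5, "implementation": 1, "limitations": 2,
}
assert sum(DOMAIN_MAX.values()) == 35  # domains sum to the 35-point scale

def total_score(scores: dict) -> int:
    for domain, value in scores.items():
        assert 0 <= value <= DOMAIN_MAX[domain], f"{domain} exceeds its maximum"
    return sum(scores.values())

example = {"cohort_characteristics": 4, "model_specification": 2, "model_evaluation": 4,
           "clinical_utility": 4, "bias_subgroup": 4, "error_analysis": 2,
           "critical_analysis": 4, "implementation": 1, "limitations": 2}
print(total_score(example), "/ 35")
```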
Sources of heterogeneity
Results varied widely across studies for three predictable reasons:
Study design: evidence spans randomised and prospective in-workflow evaluations, large registry cohorts, retrospective programme simulations and enriched MRMC reader studies; sampling (consecutive vs case-enriched), follow-up windows (complete interval cancer capture vs interim analyses), and reading paradigms (single vs double reading with arbitration) differ and affect comparability of CDR/recall/PPV as well as sensitivity/specificity.
AI role: systems were deployed as a concurrent second reader/decision-support tool, as triage/decision referral (normal deferral, band-pass, reader replacement within double reading) or as stand-alone readers; each role targets different operating points and denominators. Programme metrics dominate in in-workflow/triage studies, whereas ROC/AUC is emphasised in reader/MRMC work, so effect sizes are not interchangeable across roles or thresholds.
System/platform: studies used diverse commercial and investigational tools (eg, Transpara, Lunit, Vara, ProFound AI, CureMetrix, Google/BRAIx prototypes) on different imaging stacks (DM vs DBT; vendor hardware; availability of priors), with versioning and threshold setting (prespecified vs post hoc) further shifting observed performance.
To account for this, we were able to conduct meaningful subgroup analyses based on the AI’s clinical role and the study design. However, we were unable to conduct a meaningful comparative analysis based on AI platform/images, because platform-specific effects were not separable from study design, and many evaluations used mixed image types without reporting modality-stratified results.
Sensitivity analysis
We performed a sensitivity analysis stratified by study design (online supplemental table S10). Evidence is summarised in three subsections (Prospective randomised trials and real-world implementations, Prospective or paired-reader cohort studies, Retrospective simulations and reader/MRMC studies), and we present metrics only within the corresponding tier.
Prospective randomised trials and real-world implementations
We identified two publications from one randomised (MASAI) trial and multiple programme-level implementations/before–after studies.
Across the MASAI randomised controlled trial, both the interim and early performance publications showed consistent improvements in programme outcomes when AI was integrated into the screening workflow.15 16 In the interim analysis, AI-supported screening achieved a higher CDR than standard double reading (6.1 vs 5.1 per 1000), accompanied by similar recall rates and a higher positive predictive value of recall, while reducing the number of human reads by more than 40%. The updated early screening report confirmed these findings, with a CDR of 6.4 per 1000 in the AI arm compared with 5.0 per 1000 in the control arm, again without increasing recall. Positive predictive value improved, and human reading workload fell by approximately 44%. Interval cancer follow-up for MASAI remains ongoing.
Findings from before–after implementation studies were directionally similar. In the Danish implementation by Lauritzen and colleagues,21 the introduction of AI-based risk stratification was associated with lower recall and false-positive rates, a higher CDR and a marked improvement in the positive predictive value of recall. Reading workload also decreased by one-third as more examinations were routed to single reading. In the Spanish implementation by Elías-Cabot et al, AI support led to a substantial increase in cancer detection (rising from 5.8 to 9.0 per 1000), alongside modest increases in recall but notable gains in positive predictive value.20
The multicentre real-world non-inferiority implementation reported by Eisemann et al demonstrated the feasibility of integrating AI-supported workflows across 12 screening sites.19 Although specific numerical outcomes were not available in the present extraction, the study broadly supported the non-inferiority of AI-augmented screening pathways in routine practice and highlighted the potential for scalability across diverse clinical environments.
In summary, across prospective RCTs/implementations, AI support increased or maintained CDR, kept recall stable or lower in most settings, improved PPV and reduced reads by 40–45% in randomised evidence; interval cancer outcomes remain incomplete in interim reports.
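To show how a CDR difference of this kind can be tested statistically, below is a hedged sketch of a two-sided two-proportion z-test; the per-arm counts are assumed round numbers chosen only to reproduce rates of roughly 6.4 versus 5.0 per 1000, not the trial’s actual denominators.

```python
from math import sqrt, erf

# Hedged check of a MASAI-style CDR comparison (rates per 1000 screens).
# Arm sizes and counts are hypothetical round numbers, not trial data.
def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for a difference in proportions (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_two_sided = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_two_sided

# e.g. 6.4 vs 5.0 per 1000 with ~53,000 screens per arm (assumed sizes)
z, p = two_proportion_z(339, 53000, 265, 53000)
print(f"z={z:.2f}, p={p:.4f}")
```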
Prospective or paired reader cohort studies
Prospective paired reader, non-enriched (Dembrower 2023b17): Reader 1+AI sensitivity 78.6% (equal to double reading, 78.6%); specificity 93.5% versus 94.4%; abnormal interpretation rate (AIR) 7.0% versus 6.06%; workload relative to double reading +15% at this threshold (vs +107% at a higher-sensitivity threshold). CDR was not reported.
Prospective multicentre cohort, non-enriched (Chang 2025)18: Breast radiologists with AI versus without: CDR 5.70 versus 5.01/1000 (p<0.001); recall 4.53% versus 4.48% (NS); PPV1 12.6% versus 11.2% (p<0.001). Sensitivity/specificity/AUC were not reported.
In summary, in prospective paired reader/cohort designs, AI support either improved CDR and PPV at stable recall or preserved sensitivity/specificity close to double reading, with workload effects dependent on the chosen operating point.
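As a small illustration of how the programme metrics used throughout this tier relate arithmetically, the sketch below computes CDR, recall rate and PPV1 from raw counts; the counts are hypothetical round numbers chosen only to echo the magnitudes reported above.

```python
# Minimal sketch of the programme metrics used throughout this tier
# (CDR per 1000, recall rate, PPV of recall); counts are hypothetical.
def programme_metrics(n_screens, n_recalled, n_screen_detected_cancers):
    cdr_per_1000 = 1000 * n_screen_detected_cancers / n_screens
    recall_rate = 100 * n_recalled / n_screens                  # %
    ppv_recall = 100 * n_screen_detected_cancers / n_recalled   # PPV1, %
    return cdr_per_1000, recall_rate, ppv_recall

cdr, recall, ppv = programme_metrics(n_screens=100_000, n_recalled=4_500,
                                     n_screen_detected_cancers=570)
print(f"CDR {cdr:.2f}/1000, recall {recall:.2f}%, PPV1 {ppv:.1f}%")
```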
Retrospective simulations and reader/MRMC studies (often enriched)
In this tier, we classify datasets as programme representative (non-enriched) when test/evaluation cohorts reflect routine screening prevalence (or are explicitly reweighted to programme prevalence), and as enriched when cancers are oversampled (eg, MRMC/laboratory sets) such that prevalence metrics are not directly interpretable for population screening.27 30 32 35 37 39
Programme representative (non-enriched or reweighted to prevalence)
Across large programme simulations and registries, AI-integrated pathways generally preserved or modestly improved CDR, while lowering recall and raising PPV at balanced operating points. Programme-level decision-referral/band-pass strategies typically reduced recalls by meaningful margins while keeping CDR non-inferior or higher; some configurations traded small CDR decreases for recall reductions, underscoring the importance of operating-point selection and local calibration. In opportunistic single-reader settings, stand-alone AI often achieved higher specificity with lower recall and higher PPV at similar CDR, reflecting improved precision rather than increased yield.
When evaluated against routine double-reading or first-reader references, sensitivity was usually maintained, and specificity was stable or higher at conservative thresholds. Stand-alone or reader-aided AI reported AUCs commonly in the high-0.8s to mid-0.9s within this subcategory (eg, representative values around 0.89–0.97), consistent with non-inferior discrimination relative to radiologists in programme data.
Simulated or measured reading workload fell substantially under AI triage/deferral, frequently in the 40–90% range depending on the pathway (eg, reader replacement vs band-pass) while maintaining screening performance at prespecified safety points; replacing a second reader preserved consensus accuracy with very large second-reader workload reductions.
Where interval cancer outcomes were reported, findings varied with the chosen operating point; high specificity configurations can increase interval cancers relative to standard first readers, reinforcing the need to predefine thresholds and monitor interval cancer rates during rollout.
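A minimal simulation of the band-pass idea follows: exams with AI scores below a low threshold are auto-cleared, those above a high threshold are auto-flagged, and only the middle band goes to human reading. The score distribution and thresholds here are illustrative assumptions, not any vendor’s calibrated operating points.

```python
import random

# Hedged simulation of a band-pass/decision-referral rule on AI exam scores
# (0-100): scores below `low` are auto-cleared, above `high` auto-flagged,
# and only the middle band goes to human reading. The score distribution and
# thresholds are illustrative assumptions, not calibrated operating points.
random.seed(0)
scores = [random.betavariate(1.2, 8) * 100 for _ in range(100_000)]  # right-skewed screen mix

def bandpass_workload(scores, low=5.0, high=90.0):
    human_reads = sum(low <= s <= high for s in scores)
    saved = 1 - human_reads / len(scores)
    return human_reads, saved

reads, saved = bandpass_workload(scores)
print(f"human reads: {reads}, workload saved: {saved:.0%}")
```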
Enriched reader/MRMC and laboratory-like datasets
Because cancer prevalence is inflated in enriched sets, programme metrics (CDR, recall, PPV) are not directly comparable to routine screening and are interpreted qualitatively. Across these datasets, recall frequently rises at high-sensitivity operating points, illustrating threshold trade-offs rather than population-level performance.
Reader assistance on enriched MRMC datasets shows modest but consistent gains: ΔAUC typically +0.02 to +0.10, with sensitivity increases of a few PP and neutral to slightly higher specificity when AI supports readers. Stand-alone AI commonly achieves AUC 0.88–0.95 on enriched test sets; hybrid (AI+reader) configurations outperform either alone on ROC. Triage simulations demonstrate large potential workload savings at stricter thresholds but at the expense of increasing misses; non-inferiority to readers’ AUC is generally preserved until very aggressive triage is applied. These patterns confirm directionality (ie, assistance helps and thresholds matter) but do not establish programme-level absolute effects.
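To make the ΔAUC comparison concrete, the sketch below computes a rank-based AUC for synthetic reader scores and for a naive hybrid that averages reader and AI scores; all scores and labels are invented for illustration and do not come from any included study.

```python
# Sketch of the delta-AUC comparison described above: AUC for reader scores
# alone versus a simple hybrid (mean of reader and AI scores). Synthetic data.
def auc(scores, labels):
    """Rank-based AUC (probability a cancer case outranks a normal case)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0, 0, 0]
reader = [0.8, 0.4, 0.7, 0.3, 0.5, 0.2, 0.1, 0.4]   # synthetic reader scores
ai     = [0.9, 0.6, 0.8, 0.2, 0.3, 0.1, 0.2, 0.5]   # synthetic AI scores
hybrid = [(r + a) / 2 for r, a in zip(reader, ai)]
print(f"reader AUC {auc(reader, labels):.2f}, hybrid AUC {auc(hybrid, labels):.2f}")
```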
The programme representative analyses suggest that AI can reduce recall and workload with non-inferior cancer detection, while enriched MRMC evidence demonstrates robust ROC improvements and clarifies sensitivity–specificity trade-offs as thresholds shift.
In summary, in retrospective programme datasets and MRMC/reader studies (many enriched), ROC/AUC is frequently reported and shows AI parity or gains at selected thresholds, with substantial simulated workload savings in triage/deferral pathways. Effects on real-prevalence CDR/recall vary by operating point and modality; subgroup patterns by breast density and reader experience (greater benefit in dense breasts and for less-experienced readers) are consistent with mechanism.
Subgroup analyses
We conducted a subgroup analysis based on the role of AI (online supplemental tables S11, S12 and S13), classifying studies into three groups: (1) concurrent second reader/decision support, (2) triage/decision-referral/reader replacement and (3) stand-alone AI readers. We summarise findings for each subgroup in turn.
Concurrent second reader/decision support
In this first subgroup, AI systems were used alongside human readers in real time, either concurrently or sequentially, without directly driving triage or automation decisions. This included prospective and implementation studies,18 20 49 50 as well as MRMC experiments in which AI assisted readers (eg, Dang 2022, Rodríguez-Ruiz 2019b and Wu 2020).30 31 32 46 Across these decision-support deployments, reader aid consistently produced modest improvements in discrimination: AUC gains typically ranged from about +0.02 to +0.10. In MRMC reader-aid studies, sensitivity was usually maintained or slightly increased, while specificity improved by a few PP, with a median gain of around three PP. In real-world matched or paired cohorts, AI support generally maintained sensitivity while improving specificity and/or PPV.
Additionally, several consistent trends were observed. Both Chang 2025 and Elías-Cabot 2024 reported improvements in cancer detection and positive predictive value when AI was used alongside radiologists, although Elías-Cabot also observed a modest rise in recall while Chang showed stable recall among breast radiologists.18 20 Kim 2023 similarly demonstrated improved specificity and a substantial reduction in recall, although without a change in CDR.49 In contrast, Letter 2023 did not demonstrate statistically significant differences, though effect estimates modestly favoured AI.50 Taken together, these studies indicate that decision-support AI generally improves specificity and PPV, with neutral or small gains in cancer detection, while recall tends to remain stable or decrease in several settings. Improvements in reader discrimination were echoed in reader-aid experiments such as Dang 2022, where performance increased but workload was unchanged, underscoring that decision support alone does not meaningfully reduce reading time without accompanying triage or routing functions.30
Triage/decision-referral/reader-replacement
In the second subgroup, AI was used to route or replace human reads through normal triage, band-pass/decision-referral, reader replacement within double reading or risk-based allocation of single versus double reading. Studies consistently showed that such AI-guided routing preserved or improved diagnostic performance when conservative thresholds were applied. Both MASAI15 16 23 and Lauritzen21 demonstrated increases in cancer detection with AI-stratified workflows, with MASAI reporting a rise from 5.0 to 6.4 per 1000 and Lauritzen showing gains from 0.70% to 0.82%. These findings aligned with programme-level simulations,34 35 39 44 45 where triage or decision-referral frequently increased sensitivity by roughly 2–4 PP while maintaining or modestly improving specificity.
Across studies reporting recall and PPV, similar patterns were seen. MASAI showed stable recall alongside improved PPV,15 16 while Lauritzen documented decreases in both recall and false-positive rate, with a substantial increase in PPV.21 In simulated screening scenarios, Fisches found small increases in CDR and reductions in recall across German, UK and Swedish cohorts.45 Leibig similarly demonstrated sensitivity gains with preserved specificity, together with large proportions of examinations automatically triaged as low-risk.39 Raya-Povedano reported meaningful recall reductions for both DM and DBT without compromising cancer detection.44
Workload reduction was the most striking and consistent outcome across this subgroup. Real-world implementations showed large reductions, with MASAI15 16 reducing reading volume by approximately 44% and Lauritzen by around one-third.21 Simulations projected even greater savings: Fisches reported reductions ranging from 37% to 95% depending on strategy,45 and Frazer showed that band-pass approaches could reduce human reads by over 80%.35 Rodríguez-Ruiz 2019a demonstrated that workload reductions of up to 90% were feasible at stringent thresholds, although with increased miss rates when thresholds were pushed too far.51 Collectively, these studies indicate that triage and decision-referral approaches deliver the largest operational benefits, with non-inferior cancer detection and improved PPV when thresholds are calibrated appropriately, while overly aggressive triage in simulations may compromise safety.
Stand-alone AI readers
In the third subgroup, AI was evaluated as a full, stand-alone reader, either head-to-head against radiologists or as one arm within a multi-arm simulation, without human involvement in the primary decision. Results across programme-level and head-to-head comparisons showed that contemporary AI systems generally achieved discrimination comparable to radiologists. Programme-like datasets27 38 41 42 43 consistently reported AI AUC values in the range of 0.80–0.97, closely matching radiologist AUCs (approximately 0.78–0.98).
Direct comparisons of sensitivity, specificity and programme metrics reinforced this pattern but also highlighted the impact of threshold selection. In real-world, non-enriched cohorts, Kühl and Kwon37 40 showed that AI could achieve CDRs equivalent to radiologists, with Kwon reporting substantially higher specificity and PPV and a marked reduction in recall when AI was tuned towards specificity.37 Conversely, enriched or outpatient datasets demonstrated increased sensitivity but at the cost of higher recall and lower specificity.48 Threshold-dependent patterns were further illustrated in a study where more conservative AI operating points maintained CDR while markedly reducing recall, whereas sensitivity-maximising thresholds increased recall and reduced specificity.36
Taken together, results from programme-like cohorts (Kühl,40 Kwon37 and Romero-Martín43) indicate that stand-alone AI can match radiologists in cancer detection and often improve PPV when operating at higher specificity, whereas enriched or MRMC-style studies primarily show the potential of AI under idealised conditions rather than real-world prevalence. Despite strong technical performance, stand-alone AI has not yet been adopted as a final reader in population screening programmes, and incomplete interval-cancer follow-up in key datasets limits firm conclusions regarding long-term safety.
Discussion
Across 31 primary studies including two publications from a large, randomised trial and several large real-world implementations spanning millions of screening examinations, deep-learning systems (predominantly CNN-based) delivered diagnostic accuracy that was non-inferior to, and often matched or exceeded, radiologist performance in defined roles. Stand-alone AI reported ROC AUCs generally between 0.74 and 0.97 (median around 0.89), and in many direct, per-case comparisons achieved similar sensitivity to individual readers. When integrated as AI+radiologists (triage/decision-referral, reader-replacement, or band-pass pathways), programmes observed clinically meaningful gains: CDR increased in the prospective RCT and several implementations, recall/false-positive rates were stable or lower, PPV typically improved, and reading workload fell substantially (about 40–50% in prospective trials; 37–95% in simulations depending on threshold and pathway). Effects were often larger in dense breasts and when supporting less-experienced readers, consistent with mechanism. Most prospective/implementation papers prioritised programme metrics (CDR, recall/FPR, PPV, workload) over ROC, whereas reader studies and retrospective validations also reported AUC.
Yet the evidence base remains heterogeneous. A sizeable fraction of studies is retrospective (some cancer-enriched), and several implementation/RCT reports are interim (interval-cancer capture pending). Commercial systems dominate; operating points and versioning are usually described, but internal architecture/training detail is sparse. A minority of research papers provided deeper technical specifications; only a few truly multi-modal models (image+clinical/prior) are represented, and no Vision-Transformer-based mammography models were identified among the included studies.
QUADAS-2 showed a consistent pattern: risk decreases as designs approach routine screening. Patient selection drew the most high-risk calls, chiefly from enriched/case–control or non-consecutive cohorts. Index test bias was mostly low, with high-risk judgements driven by same-sample thresholding or mid-study software/threshold changes. The reference standard was usually appropriate, but high/unclear ratings arose where follow-up for negatives was abbreviated (interim analyses). Flow and timing contributed the largest high-risk share, reflecting non-uniform verification windows, post-index exclusions and operational changes (eg, viewer/model updates). Applicability was generally strong for index tests (all studies used screening-relevant AI and modalities) and for patients in programme cohorts; concerns concentrated in opportunistic/diagnostic settings.
APPRAISE-AI mirrored this pattern, with an overall mean (SD) score across all domains and studies of 22.69 (2.22), range 18–28, out of a possible 35.
While descriptive accuracy and operational gains are now well-documented in real screening contexts, selection/flow risks (when enrichment or interim follow-up is used) and limited technical transparency limit certainty about generalisability without local calibration and monitoring. Lack of model transparency chiefly impacts clinician trust and governance, not measured accuracy per se; notably, most included studies did not report formal explainability/user trust metrics, reinforcing the need for implementation science endpoints in future trials.
AI performance varied meaningfully according to both study design and integration role, underscoring that ‘AI in mammography’ is not a single intervention but a spectrum of workflows with distinct implications for safety and efficiency. Evidence from prospective RCTs and implementation studies demonstrated that AI-enabled stratification can maintain or increase cancer detection while reducing reading volume by approximately 40–45%, though interval-cancer follow-up remains incomplete in several major trials. In prospective paired-reader and cohort designs, AI support improved PPV and often increased CDR without inflating recall, while preserving sensitivity and specificity close to standard double reading. Retrospective programme-scale analyses and enriched MRMC studies showed consistent improvements in ROC/AUC at selected thresholds, but these datasets, particularly enriched ones, cannot be interpreted as programme-level effects without careful calibration. The primary inferences are drawn from RCT/implementation and prospective cohorts; retrospective/enriched findings are contextual and not used to infer clinical safety.
Across AI role subgroups, patterns were similarly coherent. When AI functioned as a concurrent second reader, enhancements in discrimination, specificity and PPV were reproducible, though workload remained essentially unchanged because routing rules were unaffected. In contrast, triage and decision-referral applications delivered the largest operational gains, reducing human reads by 40–90% while maintaining or improving CDR and PPV when conservative safety thresholds were applied; only overly aggressive triage in simulations produced meaningful safety trade-offs. Stand-alone AI readers achieved AUC values comparable to radiologists and, in real-world non-enriched cohorts, matched CDR with either lower or higher recall depending on threshold choice. However, the absence of complete interval-cancer follow-up and persistent caution in clinical practice means that AI has not yet been adopted as an autonomous final reader in population screening. Together, these findings highlight the importance of specifying both the design tier and the AI’s functional role when interpreting performance and implementation potential.
Our findings both reinforce and extend those of the most recent systematic review by Freeman et al,13 which drew cautious optimism from 12 largely retrospective studies but warned of sparse prospective data and small reader cohorts. By incorporating two RCT publications and multiple large implementation cohorts, we show that prospective evidence is emerging and that AI can safely support or replace one reader in organised programmes without compromising cancer detection, often improving PPV and workload. Compared with prior wide AUC ranges, the present set shows consistently high AUCs for stand-alone AI while underscoring that non-inferiority rather than superiority is the realistic expectation in routine practice once enrichment and thresholding biases are controlled; this reflects improvements in algorithm maturity. Nevertheless, our risk-of-bias appraisal corroborates their concern13 that cancer-enriched sampling inflates accuracy.
In summary, the most consistent signals from real-world and randomised evidence are workload reduction and non-inferior to improved cancer detection, whereas safety measured by interval cancer rates is not yet conclusive. Until full interval cancer analyses are reported, AI should be adopted only with robust governance, staged rollout and prospective monitoring of safety endpoints.
Implications for practice
The weight of evidence supports deploying AI as a work-saving second-reader/triage tool in double-reading programmes: at conservative operating points, services can remove a large fraction of human reads while maintaining sensitivity and often improving PPV. In single-reader settings, stand-alone AI can function as a safety-net/triage, but requires local calibration, governance and health-economic evaluation before scale-up. No included study prospectively allowed AI to over-rule a radiologist; regulatory/medicolegal frameworks must evolve in parallel with technical performance.
Strengths and limitations
Strengths include an exhaustive, no language restriction search; duplicate quality appraisal with QUADAS2 and APPRAISEAI; and broad coverage across geographies, vendors, modalities (DM and DBT), and integration strategies, supporting generalisability of our narrative synthesis. Limitations mirror the primary literature. First, design and reporting heterogeneity, including differences in study design (randomised/implementation vs retrospective simulations/MRMC), workflow and modality (DM vs DBT), precluded a meaningful meta-analysis, so we synthesised within design tiers and AI clinical role. Second, interval cancer outcomes remain pending in several major prospective and implementation studies, limiting certainty about long-term safety. Third, frequent case enrichment in retrospective and MRMC datasets inflates accuracy estimates and prevents direct interpretation of programme-level prevalence metrics (CDR/recall/PPV). Fourth, as captured by our APPRAISEAI extraction, many studies lacked calibration reporting, systematic fairness/bias analyses across subgroups and health economic (cost-effectiveness) evaluations or decision curve analyses, gaps that temper implementation guidance. Finally, possible publication bias and limited technical transparency (commercial sponsorship, sparse algorithmic detail) persist across the field.
Across 31 primary studies, including two publications from a large randomised trial and several large real-world implementations spanning millions of screening examinations, deep-learning systems (predominantly CNN-based) delivered diagnostic accuracy that was non-inferior to radiologist performance in defined roles, and often matched or exceeded it. Stand-alone AI reported ROC AUCs generally between 0.74 and 0.97 (median around 0.89), and in many direct, per-case comparisons achieved similar sensitivity to individual readers. When integrated as AI+radiologists (triage/decision-referral, reader-replacement or band-pass pathways), programmes observed clinically meaningful gains: CDR increased in the prospective RCT and several implementations, recall/false-positive rates were stable or lower, PPV typically improved, and reading workload fell substantially (about 40–50% in prospective trials; 37–95% in simulations, depending on threshold and pathway). Effects were often larger in dense breasts and when supporting less-experienced readers, consistent with the proposed mechanism of AI support. Most prospective/implementation papers prioritised programme metrics (CDR, recall/FPR, PPV, workload) over ROC, whereas reader studies and retrospective validations also reported AUC.
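To make the programme metrics concrete, the minimal sketch below computes CDR, recall rate and PPV of recall from raw screening counts. All counts are hypothetical, chosen only so that the resulting CDR matches the 6.4 per 1000 figure quoted in this review; they are not taken from any included study.

```python
def programme_metrics(n_screens, n_recalled, n_screen_detected_cancers):
    """Return CDR per 1000 screens, recall rate (%) and PPV of recall (%)."""
    cdr = 1000 * n_screen_detected_cancers / n_screens
    recall_rate = 100 * n_recalled / n_screens
    ppv_recall = 100 * n_screen_detected_cancers / n_recalled
    return cdr, recall_rate, ppv_recall

# Hypothetical arm: 50,000 screens, 1,100 recalls, 320 screen-detected cancers.
cdr, recall, ppv = programme_metrics(50_000, 1_100, 320)
print(f"CDR {cdr:.1f}/1000, recall rate {recall:.1f}%, PPV {ppv:.1f}%")
# -> CDR 6.4/1000, recall rate 2.2%, PPV 29.1%
```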
Yet the evidence base remains heterogeneous. A sizeable fraction of studies is retrospective (some cancer-enriched), and several implementation/RCT reports are interim (interval-cancer capture pending). Commercial systems dominate; operating points and versioning are usually described, but internal architecture/training detail is sparse. A minority of research papers provided deeper technical specifications; only a few truly multi-modal models (image+clinical/prior) are represented, and no Vision-Transformer-based mammography models were identified among the included studies.
QUADAS-2 showed a consistent pattern: risk decreases as designs approach routine screening. Patient selection drew the most high-risk judgments, chiefly from enriched/case-control or non-consecutive cohorts. Index test bias was mostly low, with high-risk judgments driven by same-sample thresholding or mid-study software/threshold changes. The reference standard was usually appropriate, but high or unclear ratings arose where follow-up for negatives was abbreviated (interim analyses). Flow and timing contributed the largest high-risk share, reflecting non-uniform verification windows, post-index exclusions and operational changes (eg, viewer/model updates). Applicability was generally strong for index tests (all studies used screening-relevant AI and modalities) and for patients in programme cohorts; concerns concentrated in opportunistic/diagnostic settings.
APPRAISE-AI mirrored this pattern: the overall mean (SD) score across all domains in the included studies was 22.69 (2.22), with a range of 18–28 out of a possible total of 35.
While descriptive accuracy and operational gains are now well documented in real screening contexts, selection and flow risks (where enrichment or interim follow-up was used) and limited technical transparency reduce certainty about generalisability in the absence of local calibration and monitoring. Lack of model transparency chiefly affects clinician trust and governance rather than measured accuracy per se; notably, most included studies did not report formal explainability or user-trust metrics, reinforcing the need for implementation-science endpoints in future trials.
AI performance varied meaningfully according to both study design and integration role, underscoring that ‘AI in mammography’ is not a single intervention but a spectrum of workflows with distinct implications for safety and efficiency. Evidence from prospective RCTs and implementation studies demonstrated that AI-enabled stratification can maintain or increase cancer detection while reducing reading volume by approximately 40–45%, though interval-cancer follow-up remains incomplete in several major trials. In prospective paired-reader and cohort designs, AI support improved PPV and often increased CDR without inflating recall, while preserving sensitivity and specificity close to standard double reading. Retrospective programme-scale analyses and enriched MRMC studies showed consistent improvements in ROC/AUC at selected thresholds, but these datasets, particularly enriched ones, cannot be interpreted as programme-level effects without careful calibration. The primary inferences are drawn from RCT/implementation and prospective cohorts; retrospective/enriched findings are contextual and not used to infer clinical safety.
Across AI role subgroups, patterns were similarly coherent. When AI functioned as a concurrent second reader, enhancements in discrimination, specificity and PPV were reproducible, though workload remained essentially unchanged because routing rules were unaffected. In contrast, triage and decision-referral applications delivered the largest operational gains, reducing human reads by 40–90% while maintaining or improving CDR and PPV when conservative safety thresholds were applied; only overly aggressive triage in simulations produced meaningful safety trade-offs. Stand-alone AI readers achieved AUC values comparable to radiologists and, in real-world non-enriched cohorts, matched CDR with either lower or higher recall depending on threshold choice. However, the absence of complete interval-cancer follow-up and persistent caution in clinical practice means that AI has not yet been adopted as an autonomous final reader in population screening. Together, these findings highlight the importance of specifying both the design tier and the AI’s functional role when interpreting performance and implementation potential.
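To illustrate why threshold choice drives the safety trade-off in triage, the following minimal sketch simulates a band-pass (decision-referral) pathway on synthetic AI scores. The score distributions, prevalence and thresholds are invented for demonstration and do not correspond to any included study or vendor system.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
cancer = rng.random(n) < 0.006                 # ~6 cancers per 1000 screens
scores = np.where(cancer,
                  rng.beta(5, 2, n),           # cancers tend to score high
                  rng.beta(2, 8, n))           # normals tend to score low

t_low, t_high = 0.15, 0.90                     # hypothetical band-pass thresholds
auto_normal = scores < t_low                   # ruled out without a human read
auto_recall = scores >= t_high                 # flagged directly for assessment
human_read = ~(auto_normal | auto_recall)      # middle band goes to radiologists

workload_reduction = 100 * (1 - human_read.mean())
triaged_out_cancers = 100 * (cancer & auto_normal).sum() / cancer.sum()
print(f"Human reads avoided: {workload_reduction:.0f}%")
print(f"Cancers sent to auto-normal: {triaged_out_cancers:.1f}%")
```

Raising t_low saves more human reads but pushes more cancers into the auto-normal stream; bounding that second quantity is exactly what conservative calibration of the thresholds is meant to achieve.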
Our findings both reinforce and extend those of the most recent systematic review by Freeman et al,13 which drew cautious optimism from 12 largely retrospective studies but warned of sparse prospective data and small reader cohorts. By incorporating two RCTs and multiple large implementation cohorts, we show that prospective evidence is emerging and that AI can safely support or replace one reader in organised programmes without compromising cancer detection, often improving PPV and workload. Compared with the wide AUC ranges reported previously, the present set shows consistently high AUCs for stand-alone AI, while underscoring that non-inferiority rather than superiority is the realistic expectation in routine practice once enrichment and thresholding biases are controlled; this likely reflects improvements in algorithm maturity. Nevertheless, our risk-of-bias appraisal corroborates their concern13 that cancer-enriched sampling inflates accuracy estimates.
In summary, the most consistent signals from real-world and randomised evidence are workload reduction and non-inferior or improved cancer detection, whereas safety as measured by interval cancer rates is not yet conclusively established. Until full interval-cancer analyses are reported, AI should be adopted only with robust governance, staged rollout and prospective monitoring of safety endpoints.
Implications for practice
The weight of evidence supports deploying AI as a work-saving second-reader/triage tool in double-reading programmes: at conservative operating points, services can remove a large fraction of human reads while maintaining sensitivity and often improving PPV. In single-reader settings, stand-alone AI can function as a safety net or triage tool, but requires local calibration, governance and health-economic evaluation before scale-up. No included study prospectively allowed AI to overrule a radiologist; regulatory and medicolegal frameworks must evolve in parallel with technical performance.
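One way to operationalise "conservative operating points" is to calibrate the triage threshold on a local validation set so that sensitivity stays at or above a preset floor. The sketch below is a minimal illustration under that assumption; the function, variable names and target value are hypothetical, not a vendor API or a method from any included study.

```python
import numpy as np

def calibrate_threshold(scores, cancer, target_sensitivity=0.98):
    """Highest auto-normal threshold keeping local sensitivity >= target.

    scores: AI malignancy scores on a local validation set.
    cancer: boolean outcome labels (confirmed cancer vs not).
    Cases scoring below the returned threshold would be triaged to
    auto-normal; at most (1 - target) of known cancers fall below it.
    """
    cancer_scores = np.sort(np.asarray(scores)[np.asarray(cancer, bool)])
    k = int(np.floor((1 - target_sensitivity) * cancer_scores.size))
    return cancer_scores[k]   # the k-th lowest cancer score bounds the band

# Hypothetical usage on a local validation set:
# t_low = calibrate_threshold(local_scores, local_labels, target_sensitivity=0.99)
```

In practice, a service would re-verify such a threshold prospectively, since a sensitivity guarantee on the validation set does not automatically transfer to the deployed population.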
Strengths and limitations
Strengths include an exhaustive search with no language restrictions; duplicate quality appraisal with QUADAS-2 and APPRAISE-AI; and broad coverage across geographies, vendors, modalities (DM and DBT) and integration strategies, supporting the generalisability of our narrative synthesis. Limitations mirror the primary literature. First, design and reporting heterogeneity, including differences in study design (randomised/implementation vs retrospective simulations/MRMC), workflow and modality (DM vs DBT), precluded a meaningful meta-analysis, so we synthesised within design tiers and AI clinical role. Second, interval-cancer outcomes remain pending in several major prospective and implementation studies, limiting certainty about long-term safety. Third, frequent case enrichment in retrospective and MRMC datasets inflates accuracy estimates and prevents direct interpretation of programme-level prevalence metrics (CDR/recall/PPV). Fourth, as captured by our APPRAISE-AI extraction, many studies lacked calibration reporting, systematic fairness/bias analyses across subgroups, and health-economic (cost-effectiveness) or decision-curve analyses, gaps that temper implementation guidance. Finally, possible publication bias and limited technical transparency (commercial sponsorship, sparse algorithmic detail) persist across the field.
Conclusions
Modern deep-learning systems can match organised-programme performance while substantially reducing workload when integrated as AI+radiologists. Yet methodological weaknesses, chiefly cancer-enriched sampling, incomplete follow-up and limited algorithmic transparency, temper certainty. In the short term, AI is ready for cautious integration as a second reader in organised programmes with appropriate governance. Realising its full potential as a stand-alone triage or replacement tool will require the next generation of rigorously designed, externally validated and transparently reported prospective trials. While efficiency gains are promising, clinical safety remains unconfirmed until interval cancer rates are fully reported.
Supplementary material
Online supplemental file 1 (doi: 10.1136/bmjopen-2025-111360)