
Evaluating AI-driven precision oncology for breast cancer in low- and middle-income countries: a review of machine learning performance, genomic data use, and clinical feasibility.

Frontiers in Digital Health, 2025, Vol. 7, 1702339

Salazar-Garcés LF, Morales-Urrutia E, Cashabamba F, Proaño Alulema RX, Leiva Suero LE


DOI: 10.3389/fdgth.2025.1702339 · PMID: 41550348

Abstract

[BACKGROUND] Artificial intelligence (AI) systems are increasingly used to support treatment decision-making in breast cancer, yet their performance and feasibility in low- and middle-income countries (LMICs) remain incompletely defined. Many high-performing models, particularly genomic and multimodal systems trained on The Cancer Genome Atlas (TCGA), raise questions about cross-domain generalizability and equity.

[METHODS] We conducted an AI-assisted scoping review combining Boolean database searches with semantic retrieval tools (Elicit, Semantic Scholar, Connected Papers). From 497 unique records, 43 studies met inclusion criteria and 34 reported quantitative metrics. Data extraction included study design, AI model type (treatment-recommendation, prognostic, or diagnostic/subtyping), input modalities, and validation strategies. Risk of bias was assessed using a hybrid PROBAST-AI/QUADAS-AI framework.

[RESULTS] Treatment-recommendation systems (e.g., WFO, Navya) showed concordance ranges of 67%-97% in early-stage settings but markedly lower performance in metastatic disease. Prognostic and multimodal models frequently achieved AUCs of 0.90-0.99. HIC-trained genomic models demonstrated consistent declines during external LMIC validation (e.g., CDK4/6 response model: AUC 0.9956 → 0.9795). LMIC implementations reported reduced time-to-treatment and improved adherence to guidelines, but these gains were constrained by gaps in electronic health records, limited digital pathology, and insufficient local genomic testing capacity.

[CONCLUSIONS] AI-enabled systems show promise for improving breast cancer treatment planning, especially in early-stage disease and resource-limited settings. However, the evidence base remains dominated by HIC-derived datasets and retrospective analyses, with persistent challenges related to domain shift, data representativeness, and genomic governance. Advancing equitable AI-driven oncology will require prospective multicenter validation, expanded LMIC-based data generation, and context-specific implementation strategies.


Introduction

Breast cancer is the most frequently diagnosed malignancy and the leading cause of cancer-related death among women worldwide, with a rising burden in low- and middle-income countries (LMICs) where access to advanced diagnostics and specialized oncology care remains limited (1). Conventional treatment decision-making relies on tumor stage, histopathological markers, and guideline-based algorithms. However, these frameworks often fall short of capturing the molecular and clinical heterogeneity of breast cancer, especially in resource-constrained settings where variability in diagnostic quality and therapeutic availability is common (2).
Artificial intelligence (AI) has emerged as a promising tool to enhance precision oncology. Different categories of AI models now contribute to treatment planning:
(a) treatment-recommendation systems [e.g., Watson for Oncology (WFO), Navya], which generate guideline-based suggestions using structured clinical inputs (3, 4).

(b) prognostic and risk-stratification models, often based on genomic, transcriptomic, or radiomic signatures (5, 6).

(c) diagnostic or subtyping models that refine molecular classification or infer actionable tumor biology.

These systems integrate multimodal data—including clinical variables, imaging, pathology, and genomics—to support complex therapeutic decisions. They may also reduce the workload of multidisciplinary tumor boards.
Despite this promise, several gaps remain. First, the alignment between AI-generated recommendations and local clinical realities varies substantially across settings. This gap is especially evident in LMIC environments, where diagnostic infrastructure, therapeutic availability, and resource constraints differ markedly from high-income countries (HICs) (7, 8). Second, many high-performing models are trained on HIC datasets, especially The Cancer Genome Atlas (TCGA), raising concerns regarding external validity, domain shift, and the equitable applicability of these tools in emerging economies. Third, the rapid proliferation of algorithmic tools—with heterogeneous designs, inputs, and endpoints—has made it increasingly important to systematically assess their performance, reproducibility, and real-world feasibility.
This scoping review addresses these gaps by synthesizing evidence on the concordance, predictive accuracy, and implementation feasibility of AI and machine learning (ML) systems for breast cancer treatment decision-making. The use of semantic AI tools for literature retrieval, combined with a structured extraction framework, enabled a comprehensive characterization of expert systems, machine-learning models, and multimodal pipelines across LMIC and HIC settings. The goal is to inform clinicians, researchers, and policymakers about the current capabilities, limitations, and translational potential of AI-driven decision-support systems in diverse healthcare environments.

Materials and methods


Study design
This review was conducted as an AI-assisted scoping study designed to evaluate the performance, concordance, and implementation feasibility of artificial intelligence (AI) and machine learning (ML) tools used to support treatment decision-making in breast cancer.
The methodological approach combined established evidence-synthesis standards with advanced semantic retrieval techniques to ensure comprehensive coverage of emerging literature. The protocol adhered to key elements of PRISMA 2020, EQUATOR Network guidance for digital health research, and methodological principles from PROBAST-AI and QUADAS-AI. Because the review did not involve primary data collection, ethics approval was not required.
The overarching objective was to map the breadth of available AI systems, characterize their technical performance, and examine contextual feasibility in low- and middle-income countries (LMICs), where digital and clinical infrastructures differ significantly from high-income settings.

Eligibility criteria
Studies were eligible if they met all the following criteria:
Population: Adults diagnosed with breast cancer (any stage), or retrospective datasets derived from clinical, radiologic, pathologic, or genomic records of breast cancer patients.

Intervention/Exposure: AI- or ML-based systems designed to generate treatment recommendations, guide therapeutic decision-making, or predict therapeutic response, recurrence risk, or other clinically actionable outcomes.

Outcomes: Quantitative performance metrics such as concordance with multidisciplinary tumor boards (MTBs) or guideline-based decisions, AUC, C-index, sensitivity, specificity, accuracy, or positive/negative predictive values.

Study type: Retrospective or prospective evaluations, validation studies, multimodal modeling pipelines, or comparative decision-support analyses.

Setting: No restriction by country income category; however, the World Bank income classification for each study was extracted to enable LMIC-HIC subgroup comparisons.

Language: English.

Publication type: Peer-reviewed manuscripts.

Exclusion criteria comprised:
(a) diagnostic-only models without treatment relevance;

(b) image-segmentation or detection studies without therapeutic outputs;

(c) narrative reviews, protocols, or editorials;

(d) conference abstracts lacking methodological detail;

(e) studies without extractable performance metrics;

(f) datasets composed exclusively of synthetic cases without external validation (described narratively but excluded from quantitative synthesis).

Search strategy
A hybrid search strategy—combining traditional Boolean database queries with AI-assisted semantic retrieval—was designed to maximize both sensitivity and conceptual breadth. Searches were conducted from January to June 2025 across PubMed, Scopus, Web of Science, Semantic Scholar, Elicit.org, and Connected Papers.

Traditional database searches
Boolean queries were adapted to each platform to capture studies on AI-driven treatment decision support in breast cancer, with explicit inclusion of LMIC-related terms. Complete reproducible queries were:

PubMed
("breast neoplasms"[MeSH Terms] OR "breast cancer"[Title/Abstract]) AND
("artificial intelligence"[MeSH Terms] OR "machine learning"[MeSH Terms]
OR "deep learning"[Title/Abstract] OR "clinical decision support systems"[MeSH Terms]) AND
("treatment"[Title/Abstract] OR "therapy"[Title/Abstract]
OR "treatment recommendation"[Title/Abstract]) AND
("emerging economy"[Title/Abstract] OR "developing country"[Title/Abstract]
OR "low- and middle-income country"[Title/Abstract] OR LMIC[Title/Abstract])

Scopus
TITLE-ABS-KEY (("breast cancer") AND
("artificial intelligence" OR "machine learning" OR "deep learning") AND
("treatment decision*" OR "therapy recommendation") AND
("LMIC" OR "low-income" OR "middle-income" OR "developing countr*"))

Web of Science
TS = (("breast cancer") AND
("artificial intelligence" OR "machine learning" OR "deep learning") AND
("clinical decision support" OR "treatment recommendation") AND
("LMIC" OR "developing country" OR "emerging economy"))

AI-assisted semantic retrieval
Three platforms—Elicit.org, Semantic Scholar, and Connected Papers—were used to identify conceptually related publications beyond keyword matches.
Natural language query (applied consistently):
“What degree of accuracy and clinical applicability can an artificial intelligence model, based on local clinical and international genomic data, achieve in recommending personalized treatments for patients with breast cancer in emerging economies?”
Seed papers (n = 5) were selected a priori for graph-based expansion due to high relevance and citation impact:
Somashekhar et al., 2017; Arriaga et al., 2020; Jacobs et al., 2020; Yang et al., 2023; Shamai et al., 2025.
Retrieval procedures:
Elicit.org: top 500 results ranked by semantic similarity.

Semantic Scholar: top 200 semantically associated articles using its proprietary embedding model.

Connected Papers: up to 30 prior and derivative works per seed, filtered at a relevance threshold ≥0.65.

Aggregation and deduplication
All records (Boolean + semantic outputs) were exported in RIS/CSV format, merged, and deduplicated in Rayyan with subsequent manual verification. The final dataset consisted of 497 unique records.
Two reviewers independently screened titles and abstracts; full texts were assessed according to eligibility criteria. Disagreements were resolved by consensus or adjudication by a third reviewer.
A full list of all 497 pre-screened DOIs is provided in Supplementary Table S1, ensuring transparency and reproducibility of the search. The full selection process is illustrated in Figure 1 (PRISMA 2020).
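The aggregation step described above (merge Boolean and semantic outputs, then deduplicate by DOI or normalized title) can be sketched in a few lines. The review performed this in Rayyan with manual verification, so the field names and matching rule below are illustrative assumptions rather than the authors' actual pipeline:

```python
def dedup_records(records):
    """Deduplicate merged search exports: match on lowercased DOI when
    present, otherwise on a whitespace-normalized lowercase title.
    (Field names 'doi'/'title' are hypothetical.)"""
    seen, unique = set(), []
    for rec in records:
        doi = (rec.get("doi") or "").strip().lower()
        title = " ".join((rec.get("title") or "").lower().split())
        key = ("doi", doi) if doi else ("title", title)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# The same article exported twice (DOI case differs) plus one distinct record:
records = [
    {"doi": "10.3389/fdgth.2025.1702339", "title": "AI-driven precision oncology"},
    {"doi": "10.3389/FDGTH.2025.1702339", "title": "AI-driven  precision oncology"},
    {"doi": "", "title": "Concordance of WFO with tumor boards"},
]
print(len(dedup_records(records)))  # 2: the two DOI variants collapse into one
```

In practice DOI normalization should precede title matching, since titles vary across databases far more than DOIs do.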

Data extraction
A structured matrix was developed to capture:
study characteristics (country, setting, income classification)

sample size and patient demographics

AI system type (expert system, ML, deep learning, multimodal, genomic)

input modalities (clinical, imaging, pathology, genomic, multi-omics)

validation strategy (internal, external, cross-validation, split-sample)

quantitative performance metrics (e.g., concordance, AUC, C-index, accuracy, sensitivity/specificity)

implementation considerations

reporting of funding sources and conflicts of interest

Extraction was performed independently by two reviewers with arbitration by a third where required.
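A minimal sketch of one row of the extraction matrix described above, assuming hypothetical field names (the authors' actual codebook is not reproduced here). The AUC values in the example are the CRM figures reported in this review; the other field values are placeholders:

```python
from dataclasses import dataclass, field

@dataclass
class ExtractionRecord:
    """One row of the structured extraction matrix (illustrative fields)."""
    study_id: str
    country: str
    income_class: str                 # World Bank category, e.g. "LMIC" / "HIC"
    ai_type: str                      # expert system / ML / DL / multimodal / genomic
    modalities: list = field(default_factory=list)
    validation: str = "internal"      # internal / external / cross-validation / split-sample
    metrics: dict = field(default_factory=dict)

# CRM example: AUCs as reported in this review; remaining fields hypothetical.
row = ExtractionRecord(
    study_id="Yang2023a", country="multicenter", income_class="HIC",
    ai_type="genomic", modalities=["transcriptomic"], validation="external",
    metrics={"AUC_internal": 0.9956, "AUC_external": 0.9795},
)
print(round(row.metrics["AUC_internal"] - row.metrics["AUC_external"], 4))  # 0.0161
```

Keeping validation strategy and metrics in structured fields is what makes the later LMIC-HIC subgroup comparisons straightforward to compute.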

Quality assessment and risk of bias
Because included studies spanned diverse methodologies—from multimodal prediction models to concordance assessments against tumor boards—an integrated, domain-based framework was developed using principles from PROBAST-AI and QUADAS-2/AI. Five cross-cutting domains were evaluated:
Participant selection: representativeness, case-mix, inclusion/exclusion criteria, and risk of spectrum bias.

Predictor measurement: completeness and standardization of input data; handling of missingness and preprocessing.

Outcome assessment: appropriateness and independence of reference standards; clarity and objectivity of outcome definitions.

Model overfitting and validation: risks of data leakage, adequacy of internal validation, presence/absence of external validation (particularly HIC → LMIC generalizability).

Funding and conflicts of interest: transparency regarding commercial involvement and potential promotional bias.

Each domain was rated as low, high, or unclear risk by two reviewers, with disagreements resolved via consensus.
Domain-level judgments are summarized in Table 1 (Risk-of-Bias Summary), and a visual traffic-light representation is provided in Figure 3.
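As a sketch of how the five domain-level judgments might roll up into an overall study rating, assuming the aggregation convention common to PROBAST-style tools (any high-risk domain makes the study high risk; otherwise any unclear domain makes it unclear). The review does not state its exact rule, so this is an assumption:

```python
DOMAINS = ["participant_selection", "predictor_measurement",
           "outcome_assessment", "overfitting_validation", "funding_coi"]

def overall_risk(judgments):
    """Aggregate per-domain ratings ('low'/'high'/'unclear') into an
    overall rating using the common worst-domain convention."""
    ratings = [judgments[d] for d in DOMAINS]
    if "high" in ratings:
        return "high"
    if "unclear" in ratings:
        return "unclear"
    return "low"

study = {"participant_selection": "low", "predictor_measurement": "unclear",
         "outcome_assessment": "low", "overfitting_validation": "high",
         "funding_coi": "low"}
print(overall_risk(study))  # high
```

A table of such per-study dictionaries is also the natural input for a traffic-light plot like the one referenced in Figure 3.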

Data synthesis
Given the substantial heterogeneity in study design, outcome definitions, model architectures, and performance metrics, a narrative synthesis was performed. Results were organized around five analytic dimensions:
Concordance-based systems (e.g., expert-system platforms such as WFO, Navya).

Multimodal ML/DL models integrating clinical, histopathology, imaging, or multi-omics data.

Genomic and transcriptomic models, including CDK4/6 inhibitor response predictors and RNA-based recurrence signatures.

Comparative evaluation across income settings (LMIC vs. HIC), with specific attention to cross-domain performance drops.

Implementation feasibility, including infrastructure requirements, interoperability challenges, and workflow integration in resource-constrained settings.

Subgroup analyses were conducted descriptively by disease stage, model type, and income classification.

Results


Study selection and dataset overview
The AI-assisted search yielded 497 unique records, catalogued with complete metadata in Supplementary Table S1. After screening and full-text review, 43 studies met inclusion criteria, and 34 provided extractable quantitative data, forming the basis of the technical synthesis. The full selection process is summarized in Figure 1 (PRISMA 2020).

Overview and characteristics of included studies
The 43 eligible studies reflect three distinct categories of AI systems used for treatment-related decision support in breast cancer:
Treatment-recommendation systems (e.g., Watson for Oncology, Navya):
Systems that generate guideline-aligned therapeutic recommendations using structured clinical inputs.

Prognostic and risk-prediction models:
Algorithms using genomic, transcriptomic, radiomic, or multi-omics signatures to estimate outcomes such as response, recurrence, or survival.

Diagnostic/subtyping models with therapeutic implications:
Tools that refine molecular classification or infer tumor biology relevant for treatment selection.

This explicit categorization avoids conceptual mixing across model types.
Geographically, LMIC studies represented ∼60% of included quantitative analyses (e.g., China, India, Ghana, Palestine) (3, 7, 9–11). Conversely, most genomic, radiomics, and multimodal fusion models originated from high-income or multicenter settings, frequently using TCGA or similarly high-quality datasets (5, 6, 12).
Study designs were predominantly retrospective (≈28 studies), aligning with the early translational stage of AI-enabled oncology tools. Only two studies incorporated prospective components (13, 14).
Data modalities varied substantially:
Clinical-only inputs: Most common among expert-system concordance evaluations (3, 9, 15).

Genomic or transcriptomic signatures: Associated with the highest AUC estimates but more sensitive to domain shift (12, 16, 17).

Digital histopathology and radiomics: Primarily used in high-income cohorts (5, 18).

Multimodal fusion architectures: Combining clinical, imaging, histopathology, and multi-omics inputs (6, 18).

All study characteristics—including AI category, validation type, and primary performance metrics—are summarized in Table 2.

Clinical concordance and discriminative performance

Concordance with MDTs and guideline-based decisions
Across concordance studies, treatment-recommendation systems—principally WFO and the Navya AI platform—demonstrated strong alignment with expert clinical decision-making in structured or early-stage settings, but performance varied considerably in complex or metastatic disease. Overall concordance across studies ranged from 40.4% to 100%, reflecting differences in disease stage, availability of systemic therapies, and the rigidity of local treatment algorithms.
High concordance was reported in early-stage disease and settings with well-codified therapeutic pathways:
Printz (2017): concordance ≈90%.

Nair et al. (2015): 100% for major recommendations (98.6% for minor decisions).

Badwe et al. (2024): 97% concordance with academic MDTs.

Somashekhar et al. (2018): 93% concordance overall.

These findings correspond to clinical contexts where treatment sequencing is relatively standardized, reducing ambiguity and facilitating alignment between rule-based systems and clinician judgment.
In contrast, concordance declined markedly in metastatic disease, where treatment decisions require balancing competing objectives (disease control, toxicity, quality of life), patient-specific factors, and evolving therapeutic lines:
Zhao et al. (2020): 27.5% concordance.

Somashekhar et al. (2017): ∼45% concordance.

Xu et al. (2019b): 70% concordance.

This performance deterioration underscores the inherent challenge of encoding nuanced, preference-sensitive, and context-dependent decisions into predominantly rule-based frameworks.
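Concordance in these studies is, at its simplest, the percentage of paired cases in which the AI recommendation matched the tumor-board decision. A minimal sketch (exact-match agreement; the invented treatment labels ignore the major/minor recommendation distinction some studies used):

```python
def concordance_rate(ai_recs, mdt_recs):
    """Percent of cases where the AI recommendation equals the
    multidisciplinary tumor board (MDT) decision."""
    if len(ai_recs) != len(mdt_recs):
        raise ValueError("paired decisions required")
    agree = sum(a == m for a, m in zip(ai_recs, mdt_recs))
    return 100.0 * agree / len(ai_recs)

ai  = ["chemo", "endocrine", "chemo",     "surgery"]
mdt = ["chemo", "endocrine", "endocrine", "surgery"]
print(concordance_rate(ai, mdt))  # 75.0
```

The wide 40.4%-100% range reported above reflects exactly this kind of case-mix sensitivity: the same system scores very differently depending on how often its cases fall on standardized versus preference-sensitive branches.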

AUC, C-index, sensitivity, and other discriminative metrics
In contrast to expert-system concordance studies, machine learning (ML) and deep learning (DL) models were evaluated using discrimination metrics such as AUC, C-index, sensitivity, and specificity. These approaches generally demonstrated high discriminative performance, particularly when leveraging genomic or multi-omics data.

Genomic-driven models
Genomic and transcriptomic models yielded some of the highest discrimination values:
CDK4/6 inhibitor response model (CRM; Yang et al., 2023a):
○ TCGA internal validation: AUC 0.9956

○ External LMIC cohort: AUC 0.9795
This illustrates a modest but consistent performance drop during cross-domain validation.

AIRS recurrence model (Zhao et al., 2025):
○ AUC >0.95 consistently across genomic validation sets.

These findings reflect the high signal-to-noise ratio inherent in transcriptomic signatures for drug-response and recurrence prediction, while also highlighting vulnerability to domain shift when transferring models from HIC-derived datasets to LMIC contexts.

Multimodal ML/DL systems
Models integrating clinical variables with digital pathology or imaging features demonstrated excellent robustness across training, validation, and test cohorts:
Yu et al. (2023): neoadjuvant chemotherapy response prediction with
○ AUC 0.999 (train), 0.995 (validation), 0.981 (test).

Witowski et al. (2024): multimodal prognostic vision-transformer model with
○ C-index 0.71 (TNBC),

○ 0.67 (HER2+),

○ 0.655 (HR+ premenopausal).

Shamai et al. (2025):
○ Sensitivity 94.8%,

○ Specificity 95.9%,

○ Negative predictive value 0.982.

These results suggest that multimodal architectures may offer greater stability across domains than genome-only models, potentially due to complementary contributions of histopathology and clinical variables.

Radiomics–genomics fusion models
Hybrid models combining imaging-derived features with genomic characteristics showed moderate yet clinically meaningful discrimination:
Jacobs et al. (2020):
○ AUC 0.89,

○ Sensitivity 95%,

○ Specificity 83%.

These findings highlight the prognostic value of tumor microenvironment–derived radiomic signatures as adjuncts to genomic predictors.

Interpretation in the context of risk of bias
The performance estimates above should be contextualized using the risk-of-bias evaluations summarized in Supplementary Table S2, which identify recurring methodological issues—particularly limited external validation, variability in predictor measurement, inconsistent reporting of preprocessing steps, and potential spectrum bias in single-center or tertiary-care cohorts. These limitations underscore the need for careful interpretation of high-discrimination values and support calls for more rigorous, prospective, multicenter evaluations in both LMIC and HIC environments.

Determinants of performance variation across disease stage and model architecture

Stage-dependent performance patterns
A clear and recurrent gradient in AI performance was observed across the disease continuum, with early-stage cases yielding substantially higher concordance and discrimination metrics than metastatic disease. Importantly, these patterns varied not only by disease stage but also by model architecture, underscoring the need to differentiate performance trajectories across expert systems, machine learning (ML), and deep learning (DL).

Early-stage disease
Across all AI categories—expert systems, ML models, and DL architectures—performance in early-stage breast cancer (stages I–II) remained consistently high, typically within the 80%–100% range for both concordance (expert systems) and discrimination metrics (AUC, C-index for ML/DL models). This stability reflects the more deterministic nature of therapeutic decision-making in early-stage disease, where guideline-based treatment pathways are highly standardized and present fewer branching options. In such contexts, rule-based expert systems align closely with clinical standards, and ML/DL models benefit from lower biological and clinical heterogeneity.

Metastatic disease
Across models, performance declined markedly in metastatic disease, where therapeutic decisions involve greater uncertainty, wider clinical variability, and the integration of multiple competing priorities (disease control, symptom burden, toxicity, prior therapeutic exposures, and patient preference). The decline was most pronounced in expert systems, with concordance estimates occasionally falling below 30%—as reported in Zhao (2020) and Somashekhar (2017)—reflecting the difficulty of encoding complex and individualized decision-making within rigid rule-based structures.
ML and DL models demonstrated smaller but still notable declines, suggesting that data-driven architectures partially—but not completely—capture the underlying biological and clinical heterogeneity characteristic of advanced disease. While these models performed better than expert systems in metastatic settings, none fully overcame the challenges posed by evolving lines of therapy, resistance mechanisms, and incomplete availability of biomarkers in real-world LMIC and HIC contexts.

Influence of model type and data modality
AI performance varied substantially according to model type (expert systems, ML, DL, multimodal architectures) and data modality (clinical, genomic, radiomic, digital pathology, or multi-omics).

Expert systems (rule-based models)
Expert systems such as WFO and Navya depend primarily on structured clinical inputs—tumor stage, hormone receptor status, HER2 status, menopausal status, and comorbidities. Their strongest performance occurred in early-stage settings where therapeutic pathways are well defined and algorithmic branching is relatively narrow. However, their reliance on predefined rules limited adaptability in contexts with greater case complexity or rapidly evolving therapy options, contributing to the sharp performance decline observed in metastatic disease (e.g., <30% concordance in Zhao 2020; Somashekhar 2017).
Key determinants for expert-system performance:
Rigid algorithmic structure;

High dependence on completeness and quality of structured clinical data;

Limited capacity to infer latent patterns or compensate for missing biomarkers;

Constrained flexibility in settings where guidelines diverge between LMIC and HIC contexts.

Machine learning (ML) and deep learning (DL) models
In contrast, ML and DL architectures displayed greater robustness in complex cases due to their ability to model nonlinear relationships and incorporate diverse input features. These models frequently leveraged genomic, transcriptomic, radiomic, or digital pathology data, contributing to their high discriminative performance across studies (e.g., AUCs >0.95 for genomic models; Yu 2023, Witowski 2024, Shamai 2025).
Determinants influencing ML/DL model performance:
Capacity to integrate high-dimensional inputs;

Improved flexibility in modeling heterogeneous tumor biology;

Sensitivity to domain shift when trained on HIC-derived multi-omics data (e.g., Yang 2023a: CRM AUC drop from 0.9956 → 0.9795 in LMIC external validation);

Dependence on preprocessing, feature engineering, and consistent biomarker availability—factors that vary sharply between LMIC and HIC contexts.

Multimodal architectures (fusion models)
Multimodal approaches combining clinical, imaging, digital pathology, and genomic features demonstrated the most stable overall performance across disease stages and geographic settings. For instance, Yu et al. (2023) achieved high AUCs across training, validation, and testing cohorts (0.999, 0.995, 0.981), while Witowski et al. (2024) reported C-indices between 0.655 and 0.71 across molecular subtypes.
Drivers of stability in multimodal models:
Complementary strengths of diverse data sources;

Reduced reliance on any single modality;

Mitigation of missingness in one modality through stronger signals in another;

Potential for generalization in settings with variable infrastructure (especially relevant for LMIC contexts lacking universal access to genomics or digital pathology).

Radiomics–genomics fusion models
Hybrid models integrating imaging-derived tumor microenvironment signatures with genomic data (e.g., Jacobs 2020, AUC 0.89, sensitivity 95%, specificity 83%) provided clinically meaningful discrimination, suggesting that radiomic features can supplement or partially replace genomic inputs when the latter are unavailable—an important consideration for LMIC feasibility.

Performance attrition in cross-domain validation
A consistent pattern across genomic and multimodal ML studies was the decline in performance when models trained on high-income country (HIC) datasets were externally validated in low- and middle-income country (LMIC) cohorts. This cross-domain attenuation, visualized in Figure 2, highlights the impact of population differences, diagnostic workflows, and data-generation heterogeneity on model transportability.
Genomic-driven models showed the clearest performance drops. For example, the CDK4/6 inhibitor response model (CRM; Yang et al., 2023a) decreased from AUC 0.9956 in TCGA training to 0.9795 in an external LMIC cohort. Similarly, recurrence-risk signatures (e.g., Zhao 2025) exhibited AUCs >0.95 in HIC-derived datasets but lower stability when applied to LMIC samples characterized by greater variability in sequencing platform, preprocessing, and biomarker availability.
Several factors contributed to this attrition:
Population and tumor biology differences, including subtype distribution and age structure.

Diagnostic infrastructure gaps in LMICs, particularly limited access to genomics or digital pathology.

Platform and preprocessing heterogeneity, producing batch effects that impair compatibility across datasets.

Contextual differences in guideline implementation, which influence downstream clinical endpoints.

In contrast, multimodal fusion models demonstrated smaller performance losses due to the compensatory value of imaging, pathology, and clinical data, while radiomics–genomics models (e.g., Jacobs 2020, AUC 0.89) appeared less sensitive to domain shift than genomic-only architectures.
These findings support the need for expanded LMIC data generation, standardized preprocessing pipelines, and systematic external validation to ensure equitable deployment of AI-driven oncology tools across diverse health-system environments.
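The attrition pattern can be made concrete with a toy Mann-Whitney AUC computed on hypothetical internal and external score sets. The scores below are invented purely for illustration; only the CRM figures cited above come from the included studies:

```python
def auc(y_true, scores):
    """Mann-Whitney AUC: the probability that a randomly chosen positive
    case scores above a randomly chosen negative case (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical held-out scores: the model separates the internal cohort
# perfectly but mis-ranks one case pair in the external (shifted) cohort.
internal = auc([0, 0, 1, 1], [0.10, 0.20, 0.80, 0.90])
external = auc([0, 0, 1, 1], [0.10, 0.60, 0.40, 0.90])
print(internal, external)  # 1.0 0.75 — the gap is the domain-shift penalty
```

Real cross-domain evaluations (e.g., the CRM's 0.9956 → 0.9795 drop) measure exactly this quantity on HIC-trained scores applied to LMIC cases, which is why standardized preprocessing and external validation matter.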

Implementation feasibility and real-world integration
Several studies conducted in LMICs (13, 26) provided empirical evidence of operational impact, including reductions in:
Time-to-treatment (hours → minutes),

Frequency of in-person MTB meetings,

Patient travel burden.

However, major infrastructural barriers were consistent across contexts:
Fragmented or incomplete EHR systems (25)

Limited availability of genomic assays (16, 17)

Absence of digital pathology or advanced imaging (5)

Limited interpretability for clinicians (34)

Case-based reasoning and explainable-AI enhancements markedly improved clinician trust and uptake, particularly in LMIC cohorts where algorithmic transparency was prioritized (10).
Table 3 synthesizes implementation barriers, proposed solutions, and reported outcomes across studies.

Quality assessment and risk of bias
Assessment using the integrated PROBAST-AI and QUADAS-AI framework revealed substantial methodological variability across studies. As detailed in Supplementary Table S2, most investigations were rated as unclear or high risk of bias in at least one of the five evaluated domains.
Model overfitting and validation constituted the most recurrent concern. Many ML/DL and genomic models were trained on retrospective, single-center cohorts with limited case diversity and only internal validation, increasing susceptibility to inflated performance estimates. True external validation—particularly cross-domain HIC→LMIC evaluation—was available in only a minority of studies.
Predictor measurement was another frequently affected domain. Multimodal and multi-omics studies often lacked transparent reporting of preprocessing pipelines, feature engineering strategies, batch correction methods, or handling of missing data, limiting reproducibility and comparability across settings.
Participant selection was commonly at risk due to reliance on convenience samples from tertiary referral centers or specialized oncology services. These cohorts may not reflect real-world patient distributions, especially in LMICs, where diagnostic pathways and biomarker availability differ markedly.
In contrast, outcome assessment tended to be more robust in concordance-based evaluations, where multidisciplinary tumor board (MTB) decisions served as explicit reference standards. However, ML and radiomics studies demonstrated considerable heterogeneity in endpoint definitions (e.g., event-free survival vs. progression vs. risk stratification), introducing additional uncertainty when comparing discrimination metrics.
A consolidated visualization of domain-level judgments is presented in Figure 3 (risk-of-bias traffic-light plot), which complements the quantitative evidence summarized in Table 2 and underscores systematic vulnerabilities across the AI oncology evidence base.

Discussion
This AI-assisted scoping review synthesized evidence from 43 studies evaluating artificial intelligence models designed to support therapeutic decision-making in breast cancer, with particular attention to performance and feasibility in low- and middle-income countries (LMICs). Across model categories—ranging from rule-based expert systems to multimodal deep-learning pipelines—AI frameworks consistently demonstrated strong discriminative performance and high concordance with multidisciplinary tumor boards (MTBs) in early-stage disease. However, the evidence also highlighted persistent limitations related to generalizability, transparency, data governance, and system-level inequities that must be addressed before these tools can be responsibly implemented at scale.

Interpretation of main findings
The most consistent pattern identified across studies was the superior performance of AI models in early-stage breast cancer, where treatment pathways are highly standardized and strongly aligned with guideline-based recommendations. Expert-system platforms such as WFO and Navya reported concordance rates frequently between 85% and 95% for localized disease, reflecting the narrower therapeutic repertoire and more deterministic nature of clinical algorithms in stages I–II (4, 9, 26). In contrast, performance in metastatic settings declined substantially, sometimes falling below 30% in rule-based systems (3, 22), underscoring the difficulty of encoding therapeutic nuance when decisions involve prior treatment exposure, competing clinical goals, and rapidly evolving sequencing strategies.
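To make the concordance metric concrete: studies of this type pair each AI recommendation with the MTB decision for the same patient and report the fraction of matches, sometimes alongside Cohen's kappa to correct for chance agreement. The sketch below uses invented recommendation labels, not data from any cited study.

```python
# Hypothetical per-patient treatment recommendations from an AI system
# and the corresponding MTB decisions (the "ground truth" comparator).
ai  = ["chemo", "endocrine", "chemo", "surgery", "endocrine", "chemo"]
mtb = ["chemo", "endocrine", "chemo", "surgery", "chemo",     "chemo"]

n = len(ai)
agreement = sum(a == m for a, m in zip(ai, mtb)) / n  # raw concordance rate

# Cohen's kappa corrects raw agreement for chance, using the marginal
# frequency of each treatment label in both recommendation sets.
labels = set(ai) | set(mtb)
p_e = sum((ai.count(c) / n) * (mtb.count(c) / n) for c in labels)
kappa = (agreement - p_e) / (1 - p_e)

print(f"concordance = {agreement:.2f}, kappa = {kappa:.2f}")
```

Note that a high raw concordance can coexist with a modest kappa when one treatment dominates the decision space, which is one reason early-stage (highly standardized) cohorts flatter rule-based systems.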
Machine learning and deep learning architectures demonstrated smaller performance declines across disease stages, suggesting that data-driven models may better capture the biological and clinical heterogeneity of advanced cancer (36, 37). Nonetheless, even advanced multimodal pipelines showed attenuated discrimination in metastatic subsets, indicating that therapeutic complexity imposes constraints not easily resolved by algorithmic inference alone (38, 39).
The review also identified a clear hierarchy across model architectures. Multimodal systems integrating imaging, histopathology, genomics, and clinical variables produced the most consistent performance across diverse settings, highlighting the value of complementary data sources (6, 18). Genomic-driven predictors—particularly those developed using TCGA—achieved the highest AUC values, often approaching 0.99 (17, 23), but were also the most susceptible to cross-domain performance declines when applied to LMIC cohorts. Expert-system platforms remained competitive in early-stage disease but showed limited adaptability to clinical nuance, incomplete diagnostic inputs, and variation in local drug availability.
Taken together, these findings indicate that while AI has substantial potential to augment oncologic decision-making, its performance is strongly shaped by disease complexity, data provenance, and the structural characteristics of the health systems in which these tools are deployed.

Cross-domain generalizability: LMIC vs. HIC performance
A central aim of this review was to evaluate whether AI systems developed in high-income country (HIC) environments maintain performance when applied to low- and middle-income country (LMIC) populations. Across studies, we observed a consistent—though often numerically modest—decline in predictive accuracy, most evident in genomic-driven models. For example, the CDK4/6 inhibitor response model trained on TCGA demonstrated an AUC reduction from 0.9956 to 0.9795 when externally validated in a Guangdong cohort (23). While the absolute difference appears small, it reflects structural distinctions between high-resource datasets and the clinical, biological, and infrastructural realities of LMIC settings.
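The internal-versus-external comparison above rests on the empirical AUC, which is simply a rank statistic: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below uses invented risk scores (not data from the cited study) to show how the same model's AUC drops when its scores are evaluated on a shifted external cohort.

```python
def auc(scores, labels):
    """Empirical AUC: probability that a random positive outranks a random
    negative, counting ties as 0.5 — the discrimination metric reported
    across the reviewed studies."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical risk scores from one model applied to two cohorts: an
# internal (development-domain) test set, and an external cohort where
# the score distribution has shifted relative to the outcome labels.
internal = auc([0.9, 0.8, 0.7, 0.4, 0.3, 0.2], [1, 1, 1, 0, 0, 0])
external = auc([0.9, 0.5, 0.4, 0.6, 0.3, 0.2], [1, 1, 1, 0, 0, 0])
print(f"internal AUC = {internal:.2f}, external AUC = {external:.2f}")
```

Even small absolute AUC differences, as in the TCGA-to-Guangdong example, correspond to a fixed number of additional ranking inversions among patient pairs and therefore to real misordered treatment-priority decisions.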
Multiple, intersecting sources of domain shift likely contribute to this attenuation:
Genetic and epigenetic variability: TCGA disproportionately represents European and North American populations, whereas LMIC cohorts often display distinct genomic architectures and mutational patterns.

Divergent diagnostic pathways: Variability in imaging protocols, pathology processing, and laboratory infrastructure affects data completeness and feature stability across settings.

Therapeutic availability and guideline divergence: Differences in drug formularies, insurance coverage, and national guidelines influence the “ground truth” decisions used for concordance evaluations.

Data quality and digitalization gaps: Higher rates of missingness, incomplete documentation, and limited EHR integration disproportionately affect LMIC datasets.

In contrast, evaluations of expert systems conducted directly within LMIC settings frequently reported higher concordance than those in HICs (4, 9). This pattern may reflect the tighter alignment between algorithmic outputs and simplified, resource-adapted national treatment guidelines, which reduce ambiguity and narrow the therapeutic decision space.
Despite these encouraging findings, the predominance of HIC-derived datasets across the evidence base raises concerns regarding the extrapolation of performance claims to populations with markedly different clinical and biological profiles. These results underscore the importance of region-specific calibration, expansion of LMIC-centered data generation, and prospective multicenter validation to ensure equitable and context-appropriate deployment of AI-driven oncology tools.

Study quality, methodological rigor, and risk of bias
The methodological quality of the included studies introduces important constraints on the interpretation of reported performance metrics. As shown in Supplementary Table S2, most investigations were retrospective and single-center, with limited external validation and incomplete reporting of predictor preprocessing, management of missing data, and input standardization.
Several recurrent vulnerabilities were identified:
Spectrum bias, resulting from tertiary-center populations with more complete diagnostic workups than those typically encountered in LMIC clinical environments.

Inadequate validation strategies, with heavy reliance on split-sample or internal cross-validation—approaches known to inflate performance, particularly in high-dimensional genomic and radiomics models.

Heterogeneous or opaque outcome definitions, where MTB concordance serves as a pragmatic but indirect proxy for clinical utility and does not capture patient-centered outcomes such as survival, quality of life, or treatment-related toxicity.

Insufficient conflict-of-interest transparency, especially in evaluations involving proprietary systems, limiting the ability to assess potential commercial or sponsorship bias.

Collectively, these limitations indicate that many of the high accuracy, AUC, and concordance values reported in the literature likely represent upper-bound estimates rather than reliable indicators of real-world effectiveness, underscoring the need for prospective, multicenter, and transparently reported evaluations.
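The validation-inflation concern is easy to demonstrate. In the illustrative sketch below (synthetic data, not from the reviewed studies), selecting features on the full dataset before cross-validation leaks test-fold information and yields an inflated AUC on pure noise, whereas performing selection inside each training fold, as a proper pipeline does, returns near-chance performance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))    # 60 "patients", 2000 pure-noise features
y = rng.integers(0, 2, size=60)    # labels carry no real signal

# Leaky protocol: screen the 20 most class-associated features on ALL data,
# then cross-validate — the selection step has already seen every test fold.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_auc = cross_val_score(LogisticRegression(max_iter=1000),
                            X_leaky, y, cv=5, scoring="roc_auc").mean()

# Proper protocol: feature selection happens inside each training fold only.
pipe = make_pipeline(SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
proper_auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

print(f"leaky CV AUC  = {leaky_auc:.2f}")   # optimistically inflated
print(f"proper CV AUC = {proper_auc:.2f}")  # near chance (~0.5)
```

High-dimensional genomic and radiomics models are especially exposed to this failure mode, since the feature-to-sample ratio makes chance associations abundant.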

Ethical, governance, and equity considerations in LMIC deployment
Ethical and equity considerations are inseparable from the technical performance of AI-driven decision-support systems, particularly in low- and middle-income countries (LMICs), where structural vulnerabilities—including limited health literacy, fragmented infrastructure, and weak data-governance frameworks—heighten the risks associated with algorithmic deployment.

Genomic data governance and privacy
Models leveraging genomic signatures (17, 23) pose distinctive privacy risks, as re-identification remains possible even after standard anonymization procedures. Many LMICs lack regulatory safeguards governing cross-border genomic data transfer, cloud-based storage, or secondary reuse of biospecimens, creating tension between the benefits of precision oncology and the need to preserve sovereignty over local health data.

Informed consent and secondary data use
Most retrospective datasets used in LMIC-based studies were not originally collected with AI development or international model training in mind (4, 9). Participants may be unaware that their data are now used in algorithmic development, commercial applications, or multinational collaborations. Strengthening consent pathways—especially for genomic data—remains essential to avoid extractive data practices and ensure participant autonomy.

Algorithmic bias and underrepresentation
Underrepresentation of LMIC populations in genomic and multimodal datasets increases the risk of biased predictions and miscalibrated models. The performance declines observed in LMIC validations (12, 13, 18, 23) illustrate the consequences of this imbalance. Systematic evaluation of subgroup performance—by ethnicity, socioeconomic status, and geographic region—remains largely absent across studies, limiting our understanding of differential model impact.

Scalability, infrastructure, and equity
Even when models demonstrate excellent technical performance, their real-world feasibility depends on infrastructure—digital pathology, standardized imaging, robust EHRs, cloud computing—that is often inaccessible in public-sector LMIC oncology centers. Without strategies to address these gaps, AI tools may preferentially benefit patients in urban tertiary hospitals, reinforcing existing disparities in cancer care.
Open-source architectures (18, 20–22) offer partial remedies by enabling local adaptation, algorithmic auditing, and collaborative improvement, potentially supporting more equitable deployment across diverse resource settings.

Ethical considerations in EHR-integrated AI systems
EHR integration, frequently described as a facilitator of AI adoption, simultaneously introduces heightened privacy and governance risks in LMIC contexts. Fragmented or partially digitized EHRs increase vulnerability to data leakage, unauthorized access, insecure transmission, and uncontrolled secondary use of clinical or genomic information. These risks disproportionately burden populations already facing systemic inequities.
Responsible implementation requires context-specific safeguards—including data minimization, explicit consent pathways for data reuse, on-premises or sovereign-cloud storage options, and transparent audit mechanisms—to ensure that the benefits of AI integration do not compromise patient autonomy or privacy in resource-constrained environments.

Implications for clinical adoption and future research
The findings of this review highlight both the promise and the limitations of AI-driven decision-support systems for breast cancer care across diverse income settings. Moving from proof-of-concept toward responsible clinical deployment will require targeted efforts across methodological, infrastructural, and governance domains.
Priority areas include:
Prospective, multicenter validation across geographically and demographically diverse LMIC populations to generate context-specific evidence.

Adoption of transparent reporting standards, including CONSORT-AI, SPIRIT-AI, and TRIPOD-AI, to improve reproducibility and comparability across studies.

Robust external validation, with explicit cross-income calibration to assess real-world generalizability and identify differential performance across subgroups.

Integration with public-sector workflows, emphasizing low-compute architectures, mobile or hybrid deployment options, and on-premises data processing to enhance feasibility in resource-limited environments.

Strengthened ethical and governance frameworks, addressing genomic data sovereignty, algorithmic auditing, and equitable access to ensure that AI implementation reduces rather than reinforces existing disparities.

The successful application of semantic AI tools for literature retrieval in this scoping review illustrates that AI can not only support clinical decision-making but also accelerate evidence synthesis in rapidly evolving domains such as digital oncology.

Conclusion
This scoping review shows that AI-driven decision-support systems for breast cancer consistently demonstrate strong technical performance across diverse settings, with particularly high concordance and discriminative accuracy in early-stage disease. Multimodal and genomic models—especially those incorporating transcriptomic features—achieved the most robust predictive capacity, while expert-system platforms proved useful for streamlining decision-making in resource-limited environments. However, these advances must be interpreted in light of substantial methodological constraints, the predominance of high-income country datasets, and the limited availability of external validation across heterogeneous populations.
The performance attenuation observed when HIC-trained models are applied to LMIC cohorts underscores persistent challenges related to domain shift and the need for locally calibrated datasets, context-sensitive model adaptation, and prospective evaluation. Ethical considerations—including genomic data governance, informed consent, and risks of algorithmic bias—are especially salient in LMIC settings, where structural vulnerabilities intersect with emerging digital infrastructures. Without proactive efforts to ensure transparency, fairness, and equitable access, AI-enabled oncology tools may inadvertently reinforce existing disparities in cancer outcomes.
Overall, the current generation of AI systems should be viewed as promising but still in early translational stages. Their clinical impact will depend on rigorous methodological refinement, sustained investment in LMIC data ecosystems, and governance frameworks that protect patient autonomy while enabling responsible innovation. Under these conditions, AI-supported decision systems have the potential to meaningfully augment clinical judgment, improve guideline-concordant care, and contribute to more equitable global outcomes in breast cancer management.

Limitations
This review has several limitations that should be considered when interpreting its findings. First, despite using a combined search strategy that integrated traditional databases with AI-assisted semantic retrieval, the evidence base remained dominated by retrospective analyses. Only two studies incorporated prospective elements, limiting assessment of how AI-generated recommendations influence real-world outcomes such as treatment adherence, toxicity management, progression-free survival (PFS), or overall survival (OS). Concordance with multidisciplinary tumor boards provides a pragmatic surrogate but remains an indirect proxy for clinical utility.
Second, substantial heterogeneity across model architectures, data modalities, validation frameworks, and endpoint definitions precluded meta-analysis and necessitated narrative synthesis. Variability in how performance was quantified (AUC, C-index, concordance), the nature of “ground truth” comparators, and inconsistent use of external validation complicate cross-study comparisons and may inflate performance estimates—particularly in high-dimensional genomic and radiomics pipelines that relied heavily on cross-validation or split-sample designs.
Third, although this review focused explicitly on LMIC contexts, the underlying literature remained heavily influenced by datasets and models originating in high-income settings. High-performing genomic predictors—often trained on TCGA—were externally validated in LMIC cohorts only in a minority of studies, and when such evaluations were conducted, performance declines were consistently observed. This imbalance reflects broader global inequities in access to high-quality multimodal data and limits the generalizability and fairness of current AI systems. Accordingly, the findings of this review should be interpreted as reflective of the published literature rather than representative of the full diversity of LMIC clinical environments.
Fourth, reporting inconsistencies hindered comprehensive quality appraisal. Key methodological details—including predictor preprocessing, handling of missing data, imaging or assay standardization, and approaches to algorithmic explainability—were frequently underreported. Likewise, disclosures related to funding and conflicts of interest were inconsistent, particularly for proprietary platforms, raising the possibility of unrecognized bias in commercial evaluations.
Finally, although AI-assisted tools broadened the scope of literature retrieval, they introduce their own sources of selection bias related to semantic ranking and algorithmic weighting. While Supplementary Table S1 improves reproducibility, relevant studies employing atypical terminology or nonstandard indexing may still have been missed. As semantic-search technologies evolve, future reviews should continue refining and standardizing AI-enabled retrieval protocols to reduce these risks.
Taken together, these limitations underscore the need for prospectively designed, transparently reported, and contextually grounded evaluations of AI-based decision-support systems—particularly in LMIC settings where implementation feasibility, data quality, and equitable access remain central concerns.

Source: PubMed Central (JATS). Licensing follows the original publisher's policy — please cite the original article when reusing this content.
