
Domain adaptation, self-supervision, and generative augmentation enhance GNNs for breast cancer prediction.

Scientific Reports, 2026, Vol. 16(1), 3011

Qiu S, Zhao Y, Li X


Citation: Qiu S, Zhao Y, Li X (2026). Domain adaptation, self-supervision, and generative augmentation enhance GNNs for breast cancer prediction. Scientific Reports, 16(1), 3011. https://doi.org/10.1038/s41598-025-32924-9. PMID: 41559366.

Abstract

Breast cancer presents substantial molecular heterogeneity, requiring accurate subtype classification, receptor-status prediction, and survival estimation for precision care. Existing machine-learning models often fail to generalize across cohorts or adapt to rare subtypes. We propose a unified graph neural network (GNN) framework that integrates multi-task learning, domain-adversarial adaptation, contrastive self-supervision, few-shot meta-learning, and generative augmentation. Gene-expression data from TCGA-BRCA (1084 samples) and METABRIC (1980 samples) were mapped onto gene-centric PPI graphs and patient-similarity graphs. A shared encoder (including Graph Transformer variants) jointly predicts intrinsic subtypes (Luminal A, Luminal B, HER2-enriched, Basal-like), ER/PR/HER2 biomarkers, and overall survival (OS) using a Cox proportional hazards head. Validation included fivefold cross-validation and strict TCGA → METABRIC transfer testing. The multi-task Graph Transformer achieved subtype F1 = 0.872, ER/PR/HER2 AUCs of 0.960/0.943/0.918, and C-index = 0.721. Domain adaptation improved external subtype F1 from 0.738 to 0.801. For the HER2-enriched subtype, MAML enabled few-shot prediction with F1 = 0.782, while MolGAN augmentation increased HER2 AUC to 0.935. GNNExplainer highlighted biologically consistent drivers, including ESR1, ERBB2, and PGR, aligning with known hormonal and HER2 signaling mechanisms. This study introduces a comprehensive, interpretable GNN framework that unifies subtyping, biomarker prediction, and survival modeling while improving cross-cohort robustness and rare-subtype adaptation. The combination of multi-task learning, domain adaptation, self-supervision, and generative augmentation demonstrates strong potential for clinically actionable decision support.


Introduction

Breast cancer (BC) is the most common malignancy among women and remains a leading cause of global cancer mortality despite major advances in screening, targeted therapies, and precision medicine1. Its clinical complexity largely stems from extensive biological heterogeneity, which complicates prognostication, treatment selection, and long-term management. These challenges are even more pronounced for rare subtypes and underrepresented patient groups, highlighting the need for computational models that can leverage biological structure, generalize across diverse datasets, and deliver clinically interpretable predictions for subtyping, receptor status estimation, and survival modeling1.
Over the past two decades, genomic research has established that BC comprises multiple biologically distinct disease entities. Early gene-expression profiling defined reproducible “molecular portraits,” which were subsequently formalized into intrinsic subtypes—Luminal A, Luminal B, HER2-enriched, and Basal-like2. These subtypes provide considerably stronger prognostic and therapeutic insight than traditional clinicopathologic features. The PAM50 assay standardized subtype determination in both clinical and research settings3. Large-scale datasets such as TCGA and METABRIC further mapped BC heterogeneity, linking subtype patterns to outcomes but also revealing substantial intra- and inter-subtype variation4,5. Together, intrinsic subtyping and ER/PR/HER2 biomarker status now form foundational components of BC management3–5.
Despite these advances, translating multi-omics data into clinically deployable predictive models remains hindered by several limitations. First, conventional machine-learning (ML) methods treat transcriptomic features as independent variables, neglecting the structured nature of tumor biology, which arises from coordinated interactions within pathways and protein–protein interaction (PPI) networks6. Second, domain shifts caused by assay platform differences (RNA-seq vs. microarray), population heterogeneity, and batch effects significantly reduce model performance across datasets. Even after batch correction using methods such as ComBat, cross-cohort degradation typically persists, with standard ML pipelines showing a 10–15% absolute drop in subtype F1-score when transferring models between TCGA and METABRIC4,5,7. Third, real-world clinical decision-making requires not only accurate predictions but also reliable calibration and transparent interpretability, as miscalibrated risk estimates can lead to suboptimal clinical choices despite high discrimination8. Tools such as decision curve analysis (DCA) help quantify clinical utility across risk thresholds9. These limitations emphasize the need for structure-aware, domain-robust, and interpretable modeling frameworks7–9.
Network medicine (NM) proposes a conceptual shift by framing disease as a perturbation of interconnected biological systems rather than as isolated gene abnormalities6. Graph-based representations naturally encode such relationships. PPI networks at the gene level and similarity networks at the patient level enable structured relational learning. Integrative methods like Similarity Network Fusion (SNF) combine multi-modal features into a consensus graph, enhancing stability and predictive utility10. Together, gene-level and patient-level graphs offer a biologically grounded basis for tasks such as subtype classification, receptor prediction, and survival estimation6,10.
Graph neural networks (GNNs) provide a principled approach to learning from graph-structured data. Message-passing architectures such as graph convolutional networks (GCNs) and graph attention networks (GATs) propagate and aggregate neighborhood information, capturing both local and higher-order biochemical interactions11,12. More recently, Graph Transformers (GTs) incorporate global self-attention and structural encodings—such as shortest-path distances and node centrality—to model long-range regulatory dependencies that commonly arise in oncogenic pathways13.
However, even GNN-based approaches face challenges in generalizing across datasets. Domain-adversarial neural networks (DANNs) mitigate domain shift by explicitly enforcing domain-invariant feature representations through adversarial training, often improving TCGA → METABRIC transferability by 6–8% compared with naïve transfer14. Alternatives such as CORAL and maximum mean discrepancy (MMD) reduce distributional discrepancies by aligning covariance and kernel-based feature statistics, respectively, typically providing 3–5% improvements, though they may struggle to fully remove dataset-specific signatures in high-dimensional graph spaces. In contrast, adversarial alignment directly optimizes representation invariance, albeit with modest increases in computational overhead.
Self-supervised and generative techniques provide additional complementary benefits. Graph contrastive learning (GraphCL) enhances robustness under limited annotations by aligning biologically constrained augmented graph views15. Generative augmentation methods such as MolGAN can alleviate class imbalance by synthesizing graph-structured omics profiles. Unlike variational autoencoders (VAEs), which generate latent embeddings but do not explicitly enforce graph topology, MolGAN jointly models adjacency structure and node attributes. Prior studies show that GAN-based augmentation can yield ≈5% improvements in rare-class AUROC when biologically implausible samples are filtered appropriately.
Few-shot meta-learning further addresses the challenge of rare subtypes—especially HER2-enriched tumors, which comprise only 5–10% of many cohorts. Model-agnostic meta-learning (MAML) enables rapid adaptation to new tasks with minimal labeled data and often outperforms zero-shot or conventional transfer-learning approaches, which lack explicit mechanisms to handle subtype-specific transcriptional differences16. Existing genomic and histopathology studies report 5–10% gains in F1-score on rare classes using few-shot methods.
To situate our contributions within the literature, we provide a comparative summary of representative prior models, contrasting classical multi-omics pipelines, single-task and multi-view GNNs, transformer-based architectures, and hybrid deep-learning systems across datasets, tasks, performance metrics, and methodological limitations. Recent works from 2024 to 2025—including multi-omics GNNs for subtype classification and transformer-enhanced frameworks for BC imaging—reinforce the need for unified models that jointly address multi-task prediction, domain adaptation, data imbalance, and interpretability.
Recent graph-based and multi-omics models for breast cancer typically optimize for a single primary endpoint—such as molecular subtyping or survival prediction—and are usually evaluated within a single cohort, without explicit assessment of cross-cohort generalization or rare-subtype behavior. Existing multi-omics GNN frameworks and transformer-based architectures have reported subtype F1-scores in the low 0.80 range and survival C-indices around 0.67–0.70 on TCGA-like datasets, but they seldom incorporate domain-adversarial alignment, few-shot meta-learning, and generative augmentation within a unified pipeline or quantify their combined effect on external transferability and rare-class performance. Moreover, few studies systematically analyze how multi-task modeling of subtypes, receptors, and survival jointly shapes latent representations and downstream calibration. These gaps motivate the need for an integrated, adaptation-enhanced Graph Transformer framework that simultaneously addresses multi-task clinical prediction, domain shift, label scarcity, and interpretability in a reproducible manner12–16.
Building on these insights and recent advances in graph representation learning, we propose a comprehensive graph-based framework that addresses several gaps in current GNN applications to breast cancer. In contrast to prior work that typically focuses on a single primary endpoint (e.g., molecular subtyping or survival) and is evaluated within a single cohort, our approach: (i) unifies molecular subtype prediction, ER/PR/HER2 receptor classification, and survival estimation within a multi-task Graph Transformer architecture; (ii) integrates gene-level PPI graphs and patient-level similarity networks to capture complementary biological structure; and (iii) combines domain-adversarial adaptation, contrastive self-supervision, few-shot meta-learning, and MolGAN-based generative augmentation to improve cross-cohort robustness, representation quality, and rare-subtype adaptation. Finally, by coupling the model with GNNExplainer-based attribution analyses, we explicitly link predictions to known signaling modules and patient-specific pathway patterns, positioning the framework as an integrated, clinically oriented graph learning approach for breast cancer.

Materials and methods


Data acquisition and preprocessing
We used two well-established public breast cancer datasets to support both internal modeling and external validation. For both datasets, clinical and molecular annotations were standardized to include the main intrinsic subtypes (Luminal A, Luminal B, HER2-enriched, and Basal-like) and receptor status (ER, PR, HER2). OS information, including follow-up time and event status, was also included to enable prognostic analysis. To ensure high data quality, we only kept primary tumor samples with complete subtype, receptor, and survival information. Samples from recurrent disease or those affected by neoadjuvant treatment were excluded to avoid confounding effects.
Quality control was conducted in three stages to ensure data integrity. First, samples with incomplete subtype or receptor labels, implausible survival follow-up (< 30 days), or assay-related anomalies such as low library size, strong 3′ bias, or extensive missing values were excluded. Second, all gene identifiers were standardized to HUGO Gene Nomenclature Committee (HGNC) symbols; in cases of duplicate mappings, the probe or transcript with the highest variance across samples was retained to preserve a one-to-one mapping between symbols and expression features. Third, low-information genes were removed, including those with minimal expression (e.g., counts-per-million < 1 in > 80% of RNA-seq samples) or near-zero variance, thereby reducing noise and enhancing statistical power.
Because TCGA-BRCA and METABRIC were generated using different assay platforms, preprocessing steps were tailored to each technology prior to cross-cohort harmonization. For RNA-seq data, fragments per kilobase per million were converted to transcripts per million (TPM) and log₂-transformed as log₂(TPM + 1). For microarray data, the platform-provided log₂ quantile-normalized values were used directly. To reduce platform- and study-specific effects, gene-wise z-score standardization was performed within each cohort prior to feature-space alignment for downstream analyses. Empirical Bayes batch correction was then applied using ComBat on the log-scale expression values, fitted exclusively on the training split and applied unchanged to the validation and test splits, ensuring that no batch parameters were learned from held-out data. When additional latent confounders were suspected (e.g., center-specific effects), surrogate variable analysis (SVA) was performed on the training data only, and the estimated surrogate factors were regressed out prior to model fitting.
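As a concrete illustration of the leakage-free protocol above, the log transform and gene-wise z-scoring can be sketched in NumPy (function names and the dense array layout are ours, and the ComBat/SVA steps are omitted for brevity):

```python
import numpy as np

def log_tpm(tpm):
    """log2(TPM + 1) transform for RNA-seq expression."""
    return np.log2(tpm + 1.0)

def fit_normalizer(train_expr):
    """Fit gene-wise z-score parameters on the training split only."""
    mu = train_expr.mean(axis=0)
    sd = train_expr.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)  # guard against constant genes
    return mu, sd

def apply_normalizer(expr, mu, sd):
    """Apply frozen training statistics to any split (no leakage)."""
    return (expr - mu) / sd

rng = np.random.default_rng(0)
train = log_tpm(rng.random((100, 50)) * 1000)      # mock TPM matrix (samples x genes)
held_out = log_tpm(rng.random((20, 50)) * 1000)
mu, sd = fit_normalizer(train)
train_z = apply_normalizer(train, mu, sd)
test_z = apply_normalizer(held_out, mu, sd)        # held-out statistics never enter the fit
```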
For transfer learning experiments, a common feature space was created by intersecting genes shared between the two cohorts and ranking them by dispersion within each dataset. From this, the top 5000 genes were selected using median absolute deviation, providing a balance between biological signal and cross-platform stability. After quality control, missing values were rare; when present, they were imputed within each cohort using k-nearest neighbors (KNN) with k = 5 on standardized features. Genes with more than 20% missingness after alignment were excluded rather than imputed to avoid bias. All preprocessing steps, including normalization, harmonization, feature selection, and imputation, were performed exclusively on the training split and then applied unchanged to the validation and test sets.
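The shared-panel selection step might be sketched as follows (an illustrative simplification: `common_panel` and its list-based gene handling are ours, and the per-cohort dispersion pre-ranking is folded into a single MAD ranking):

```python
import numpy as np

def top_genes_by_mad(expr, gene_names, k=5000):
    """Rank genes by median absolute deviation (MAD) and keep the top k."""
    med = np.median(expr, axis=0)
    mad = np.median(np.abs(expr - med), axis=0)
    order = np.argsort(mad)[::-1][:k]
    return [gene_names[i] for i in order]

def common_panel(expr_a, genes_a, genes_b, k=5000):
    """Restrict cohort A to genes shared with cohort B, then select by MAD."""
    shared = sorted(set(genes_a) & set(genes_b))
    idx = [genes_a.index(g) for g in shared]
    return top_genes_by_mad(expr_a[:, idx], shared, k)
```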
Cohorts were divided using stratified sampling to preserve the distribution of intrinsic subtypes across splits. Unless otherwise stated, the proportions for training, validation, and testing were 70%, 15%, and 15%, respectively. To evaluate cross-cohort generalizability, train-on-source/test-on-target protocols were used, such as training on TCGA-BRCA and testing on METABRIC, and the reverse. Random seeds were fixed for each partition to ensure reproducibility across experiments. All data used were publicly available and fully de-identified. Because no protected health information was accessed, institutional review board approval was not required. Data access and use complied with the terms of both GDC and cBioPortal. An overview of the end-to-end research workflow is provided in Supplementary Fig. S1.
To enhance reproducibility, we explicitly detail all preprocessing steps. TCGA RNA-seq data were converted to TPM and normalized using log2(TPM + 1). METABRIC microarray values were used in log2-quantile-normalized form. Gene identifiers were mapped to HGNC symbols and the top 5000 genes were selected based on median absolute deviation. The PPI graph was constructed using STRING edges with confidence ≥ 0.7. Patient similarity graphs used cosine similarity with a threshold of 0.8. All normalization, batch correction, and imputation steps were fitted only on training data and applied unchanged to validation/test sets.

Dataset summary and characteristics
To enhance transparency and reproducibility, we provide a unified summary of all datasets, data modalities, supervised tasks, class distributions, and data splits used in this study. The transcriptomic and clinical data were obtained from two publicly available breast cancer cohorts: TCGA-BRCA, retrieved via cBioPortal (https://www.cbioportal.org/), and METABRIC, accessed through the European Genome–Phenome Archive (EGA; accession EGAS00000000083). TCGA provides RNA-seq TPM-normalized expression profiles (log₂(TPM+1)) and METABRIC provides log₂-quantile-normalized microarray intensities, together with curated clinical labels required for molecular subtyping, receptor-status prediction, and survival modeling.
A total of 1084 TCGA samples and 1980 METABRIC samples were included after preprocessing and quality filtering. The supervised prediction tasks consist of: (i) molecular subtypes (four classes: Luminal A, Luminal B, HER2-enriched, Basal-like), (ii) biomarker prediction for ER, PR, and HER2 (binary labels), and (iii) overall survival (OS) with time-to-event information and censoring indicators. The datasets exhibit notable imbalance, with HER2-enriched tumors comprising approximately 5% of all samples, whereas OS labels show approximately 70% censoring.
All experiments followed a 70/15/15 stratified train–validation–test split, ensuring proportional representation of subtypes and receptor classes. For graph-based modeling, two complementary graph types were constructed. In the gene-level graph, nodes represent genes and edges correspond to high-confidence protein–protein interactions from STRING (confidence ≥ 0.7). In the patient-level graph, each patient is represented as a node, and edges encode cosine similarity between standardized expression profiles, retaining only similarities above 0.8 with k-nearest-neighbor symmetrization. To mitigate severe subtype imbalance, especially for HER2-enriched tumors, MolGAN-based generative augmentation was applied to synthesize additional patient graphs corresponding to approximately 20% of the minority-class samples, with synthetic instances included only in the training set.

Graph construction
To align with the principles of network medicine (NM) and preserve clinically relevant cohort structure, we constructed two complementary graph representations: a gene-centric graph encoding molecular interactions and a patient-centric graph encoding similarity between individuals. In the gene-centric graph, each node corresponded to a gene, and edges represented protein–protein interactions (PPIs) obtained from the STRING database. Only high-confidence edges (score ≥ 0.7) were retained, and edge weights were min–max scaled to the range [0,1]. For each patient, the same PPI topology was used, but node attributes were defined by the patient’s expression profile restricted to the 5000-gene panel. This produced one graph per patient, all sharing topology but differing in node features. To stabilize message passing under degree heterogeneity, self-loops were added to every node and adjacency matrices were symmetrically normalized. During pretraining, low-probability edge dropout was applied only to low-confidence edges, thereby preserving the backbone of reliable interactions.
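The self-loop and normalization step is the standard symmetric operator D⁻¹ᐟ²(A + I)D⁻¹ᐟ²; a dense NumPy sketch follows (the actual pipeline stores graphs sparsely in PyG):

```python
import numpy as np

def normalize_adjacency(A):
    """Add self-loops and symmetrically normalize: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
```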
In the patient-centric graph, each node represented a patient, and edges encoded pairwise cosine similarity between standardized gene-expression profiles. To ensure graph connectivity while limiting spurious associations, we first constructed a k-nearest-neighbor (kNN) graph with k = 10, followed by symmetrization through mutual kNN. Edges with cosine similarity below 0.85 were subsequently pruned. Edge weights were retained as cosine similarity values to preserve neighborhood strength in downstream message passing.
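A minimal sketch of this mutual-kNN construction, assuming a dense (patients × genes) matrix of standardized profiles (the function name and dense output are illustrative):

```python
import numpy as np

def mutual_knn_graph(X, k=10, min_sim=0.85):
    """Patient-similarity graph: mutual kNN on cosine similarity, pruned below min_sim."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)            # exclude self-edges
    n = X.shape[0]
    knn = np.zeros((n, n), dtype=bool)
    for i in range(n):
        knn[i, np.argsort(S[i])[::-1][:k]] = True
    mutual = knn & knn.T                    # symmetrize via mutual kNN
    return np.where(mutual & (S >= min_sim), S, 0.0)
```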
We used the STRING v11.5 PPI database to construct the gene-level PPI graph. The high-confidence edge cutoff of 0.7 was selected based on prior benchmarking studies indicating that thresholds between 0.7 and 0.85 provide an optimal trade-off between interaction reliability and network connectivity for transcriptomic GNN applications. Thresholds below 0.6 introduce substantial noise, whereas thresholds above 0.85 yield overly sparse graphs that degrade message propagation and reduce representation diversity. STRING identifiers were mapped to HGNC symbols, and only experimentally supported or curated interactions were retained.
All graphs were implemented and stored in PyTorch Geometric (PyG). Sparse adjacency formats and memory-efficient neighborhood sampling were used to support mini-batch training on per-patient gene graphs. Graph schemas and preprocessing transforms were version-controlled to maintain consistency across experiments.

Graph encoders
We assessed four major families of GNN encoders to capture different inductive biases relevant to biological graphs: the Graph Isomorphism Network (GIN), the Graph Convolutional Network (GCN), the Graph Attention Network (GAT), and a Graph Transformer (GT) based on the Graphormer architecture. In addition, we implemented a Directed Message Passing Neural Network (D-MPNN) to account for potentially directional regulatory interactions.
Among these, the GIN served as a strong baseline due to its high expressive power, which is theoretically comparable to the Weisfeiler–Lehman (WL) graph isomorphism test.
Each layer updated node features via an additive neighborhood aggregation modulated by a learnable scalar ε, followed by a multilayer perceptron, i.e.,

h_v^(k) = MLP^(k)((1 + ε^(k)) h_v^(k−1) + Σ_{u∈N(v)} h_u^(k−1)).
We used five layers, each followed by batch normalization and rectified linear unit (ReLU) activation, with dropout of 0.3 between layers. Graph-level embeddings were obtained by global mean pooling unless otherwise specified.
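The GIN update can be sketched with dense NumPy operations (a toy two-layer MLP per layer; real layers add the batch normalization and dropout described above):

```python
import numpy as np

def gin_layer(H, A, W1, W2, eps=0.0):
    """One GIN update: MLP((1 + eps) * h_v + sum over neighbors of h_u).
    H: node features (n, d); A: adjacency (n, n); W1/W2: two-layer MLP weights."""
    agg = (1.0 + eps) * H + A @ H       # additive neighborhood aggregation
    hidden = np.maximum(agg @ W1, 0.0)  # ReLU
    return hidden @ W2

def mean_pool(H):
    """Graph-level embedding via global mean pooling."""
    return H.mean(axis=0)
```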
For GAT, attention coefficients were computed as:

α_ij = softmax_j(LeakyReLU(aᵀ[W h_i ∥ W h_j])),

where a is a learnable attention vector and W a shared linear transform.
For the Graph Transformer, self-attention used:

A_ij = (h_i W_Q)(h_j W_K)ᵀ / √d + b_ij,

where b_ij encodes structural biases (shortest-path distance, node degree, and PPI edge attributes).
The GCN captured localized spectral smoothing by propagating signals through a normalized adjacency operator, a bias that is advantageous when signals are spatially smooth on the graph (e.g., within pathways). The GAT replaced uniform neighborhood weighting with learnable attention coefficients αij that adapt to node-pair compatibility, conferring robustness to noisy or spurious edges. We used 4–8 attention heads per layer, concatenating heads in hidden layers to increase expressivity and averaging heads in the final layer to stabilize outputs.
To capture long-range dependencies and explicitly incorporate structural priors, we employed a GT that applies global self-attention across nodes while integrating structural encodings such as shortest-path distance, node degree or centrality, and available edge attributes into the attention mechanism. This design combines the flexible receptive field of transformer models with graph-specific inductive biases, making it particularly effective in gene networks where distant regulatory interactions, such as trans-regulatory effects, can play a significant role in shaping phenotype.
The D-MPNN associated messages with directed edges, initializing each edge state as h⁰_{u→v} = ReLU(W₀ [x_u ∥ e_{uv}]) and updating it as

h^(t+1)_{u→v} = ReLU(h⁰_{u→v} + W Σ_{w∈N(u)\{v}} h^(t)_{w→u})

for edge u → v, where ∥ denotes concatenation, and then aggregated edge states to node and graph representations. This allowed us to test whether directional modeling improves capture of putative regulatory flow compared with undirected message passing.
All encoders were implemented in the same codebase to ensure identical data handling, logging, and evaluation. Unless otherwise noted, hidden dimensions were 128–256, ReLU activations were used throughout, and final embeddings were ℓ₂-normalized before passing to task heads.

Self-supervised pretraining and generative augmentation
To leverage the abundant but unlabeled structure within patient graphs and enhance data efficiency, we employed GraphCL as a self-supervised pretraining strategy. For each patient-specific gene graph, two augmented “views” were generated using biologically guided transformations. Edge dropout was applied only to low-confidence STRING interactions to preserve core biological structure, while node-feature masking targeted low-variance genes to minimize disruption of strong signals. In addition, mild Gaussian noise was introduced to node features with variance-proportional scaling. Encoder outputs were passed through a two-layer projection head into a contrastive space, where the InfoNCE loss was used to maximize agreement between embeddings of augmented views of the same graph (positives) while pushing apart embeddings of different graphs within the minibatch (negatives). The temperature parameter (τ) was set to 0.2 unless optimized on the validation set, and batch sizes were selected to balance negative sample diversity with GPU memory limitations. The contrastive objective was

ℓ_i = −log [ exp(sim(z_i, z_i′)/τ) / Σ_{j≠i} exp(sim(z_i, z_j)/τ) ],

where z_i and z_i′ denote projected embeddings of the two augmented views of graph i.
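The InfoNCE computation reduces to a softmax over in-batch similarities with positives on the diagonal; a NumPy sketch under that convention (embedding normalization as described above):

```python
import numpy as np

def info_nce(Z1, Z2, tau=0.2):
    """InfoNCE over a minibatch: Z1[i] and Z2[i] are embeddings of two
    augmented views of graph i; other rows serve as in-batch negatives."""
    Z1 = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2 = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    logits = Z1 @ Z2.T / tau                        # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))              # positives on the diagonal
```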
To address severe class imbalance, particularly in rare subtypes or underrepresented clinicopathologic groups, we applied Molecular Generative Adversarial Networks (MolGAN) to generate synthetic but biologically plausible patient graphs. The generator produced adjacency matrices and node attributes conditioned on the target class, while the discriminator evaluated their realism at the graph level. MolGAN generated synthetic graphs G(z); the discriminator minimized WGAN-GP loss, allowing augmentation of rare HER2-enriched samples. Training was stabilized using a Wasserstein GAN with gradient penalty (WGAN-GP), and additional topology-preserving regularizers were introduced to penalize deviations in degree distribution and pathway membership relative to real graphs. Synthetic samples were generated only from training data after early stopping, filtered to remove near-duplicates based on cosine similarity of graph embeddings, and incorporated into the training set at low proportions (≤ 25% of the rare class) to avoid distorting the empirical distribution. No synthetic data were included in validation or test sets. Ablation experiments were conducted to verify that augmentation improved external generalization rather than artificially inflating internal performance metrics.

Domain adaptation and meta-learning
Because the primary objective was cross-cohort deployment, such as transferring models from TCGA-BRCA to METABRIC, we incorporated DANNs to learn representations that remained predictive of clinical labels while being invariant to dataset origin. A gradient reversal layer (GRL) connected the shared encoder to a domain classifier trained to distinguish source from target cohorts. During backpropagation, the GRL multiplied gradients by − λ, forcing the encoder to generate embeddings that confused the domain classifier while still optimizing task-specific objectives. To stabilize training, λ was gradually increased from 0 to 1 over the first 30% of epochs. The domain classifier was implemented as a two-layer multilayer perceptron with dropout, and class imbalance between source and target samples was addressed using inverse-frequency weighting.
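The GRL behavior, identity on the forward pass and sign-flipped, λ-scaled gradients on the backward pass, together with the λ ramp, can be sketched as follows (a manual illustration; in practice this is implemented as a custom autograd function):

```python
import numpy as np

def grl_forward(x):
    """Gradient reversal layer: identity on the forward pass."""
    return x

def grl_backward(grad_output, lam):
    """Backward pass: multiply incoming gradients by -lambda, pushing the
    encoder to confuse the domain classifier."""
    return -lam * grad_output

def grl_lambda(epoch, total_epochs, ramp_frac=0.3):
    """Ramp lambda linearly from 0 to 1 over the first 30% of training."""
    return min(1.0, epoch / (ramp_frac * total_epochs))
```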
For scenarios with limited labeled data in the target cohort or rare subtypes, we further employed MAML. Meta-tasks were defined by sampling small labeled support and query sets across subtypes and sites. In the inner loop, the encoder and task heads were adapted to the support set with 1–5 gradient steps at a relatively high learning rate, while in the outer loop, initialization parameters were updated to improve query performance:

θ′_i = θ − α ∇_θ L_i^support(θ),   θ ← θ − β ∇_θ Σ_i L_i^query(θ′_i),

enabling few-shot adaptation for rare subtypes.
This procedure enabled the model to rapidly adapt to new data regimes, which is particularly valuable for deployment in smaller institutions or among niche patient populations. When combined with DANN, MAML supported robust cross-cohort transfer followed by fast personalization with minimal labeled data. The domain-adversarial loss was

L_domain = −E[ d log D(f(x)) + (1 − d) log(1 − D(f(x))) ],

where f is the shared encoder, D the domain classifier, and d the cohort label, applied through a gradient reversal layer to enforce domain-invariant embeddings.
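The inner/outer structure of MAML can be illustrated on a toy 1-D quadratic loss (a first-order sketch with made-up scalar targets, not the paper's GNN setting):

```python
def maml_step(theta, tasks, inner_lr=0.1, outer_lr=0.01, inner_steps=3):
    """One first-order MAML meta-update on 1-D toy tasks.
    Each task is (support_target, query_target); loss(theta, t) = (theta - t)^2."""
    grad = lambda th, t: 2.0 * (th - t)
    meta_grad = 0.0
    for support_t, query_t in tasks:
        th = theta
        for _ in range(inner_steps):        # inner loop: adapt to the support set
            th -= inner_lr * grad(th, support_t)
        meta_grad += grad(th, query_t)      # outer-loop gradient on the query set
    return theta - outer_lr * meta_grad / len(tasks)
```

With symmetric tasks the meta-gradients cancel, while a single task pulls the initialization toward that task's optimum.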

Multi-task learning and optimization
A single shared encoder (GIN, GCN, GAT, Graph Transformer, or D-MPNN) was used to generate graph-level embeddings, which were subsequently passed to three task-specific heads corresponding to the clinical endpoints of interest. For molecular subtyping, a softmax classifier predicted one of four intrinsic subtypes (Luminal A, Luminal B, HER2-enriched, and Basal-like). For biomarker prediction, three independent sigmoid classifiers were employed to predict ER, PR, and HER2 receptor status. For survival analysis, a dedicated head implemented a differentiable Cox proportional hazards objective based on the DeepSurv formulation, optimizing the negative partial log-likelihood with Efron’s method to handle tied event times.
The overall training objective combined cross-entropy loss for molecular subtyping, binary cross-entropy loss for receptor prediction, Cox loss for overall survival, as well as domain-adversarial loss from DANN (when enabled) and contrastive loss from GraphCL (when joint training was performed). Class imbalance was addressed through class-balanced focal loss for receptor prediction, mild oversampling of rare subtypes during training, and label smoothing (ε = 0.05) for subtype classification to improve probability calibration.
Model optimization was performed using the AdamW optimizer. Learning rates and weight decay coefficients were tuned on the validation split using a hybrid grid and random search strategy, as detailed in Sect. “Experimental design and statistical analysis”. Separate learning-rate ranges were considered for message-passing GNNs and transformer-based encoders to account for their different optimization dynamics. A cosine annealing schedule with a warm-up phase was applied to the selected learning rate. Dropout rates ranged between 0.2 and 0.4 depending on model capacity, and gradient clipping with a maximum ℓ₂-norm of 5 was applied to stabilize training. Early stopping was based on validation performance (validation loss for classification tasks or C-index for survival-focused ablations) with a patience of 30 epochs.
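The warm-up plus cosine-annealing schedule can be written as a small function (parameter defaults are illustrative, not the tuned values):

```python
import math

def lr_schedule(step, total_steps, base_lr=3e-4, warmup_steps=500):
    """Linear warm-up followed by cosine annealing to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```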
Mini-batch training was conducted on per-patient gene-level PPI graphs with batch sizes between 16 and 32, depending on GPU memory constraints. Automatic mixed precision (AMP) was used when available to reduce memory consumption and accelerate training without compromising numerical stability.
The multi-task objective was defined as:

L_total = λ₁ L_subtype + λ₂ L_receptor + λ₃ L_Cox + λ₄ L_DANN + λ₅ L_GraphCL,

with the DANN and GraphCL terms included only when the corresponding components were enabled.
Ablation experiments demonstrated that removing the domain-adversarial component (DANN) resulted in an approximately 6% absolute decrease in TCGA → METABRIC subtype F1-score, underscoring the importance of explicit domain-invariant representation learning for cross-cohort generalization.

Experimental design and statistical analysis
Unless otherwise specified, experiments followed a 70/15/15 stratified train–validation–test split, consistent with Sect. “Dataset summary and characteristics”. External validation was conducted under strict train-on-source/test-on-target protocols, with preprocessing and harmonization performed only on the source training data and then applied unchanged to the target cohort. Hyperparameters were tuned through grid or random search on the source validation split, and no target labels were used for model selection, thresholding, or calibration. For classification tasks (subtypes and receptors), performance metrics included accuracy, macro-averaged F1-score, and area under the receiver operating characteristic curve (AUC). Differences in correlated ROC curves were assessed using the DeLong test, while paired discrete outcomes (e.g., error counts for two models on the same dataset) were compared using McNemar’s test.
Ninety-five percent confidence intervals (95% CIs) were estimated via nonparametric bootstrap with 1000 replicates unless otherwise indicated. For survival modeling, discrimination was evaluated using Harrell’s concordance index (C-index), and overall calibration over time was measured using the Integrated Brier Score (IBS) at predefined horizons (e.g., 36 and 60 months). Risk stratification was visualized through Kaplan–Meier survival curves defined by model-derived thresholds and compared using the log-rank test. Calibration of classification probabilities was assessed with reliability curves, expected calibration error (ECE), and Brier scores.
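Harrell’s C-index counts, over all pairs comparable under censoring, how often the patient who fails earlier received the higher predicted risk. A minimal sketch (ties in predicted risk scored as 0.5; tied event times omitted for simplicity, unlike full library implementations such as lifelines):

```python
def harrell_c_index(risk, time, event):
    """Harrell's concordance index under right censoring.

    A pair (i, j) is comparable when patient i has an observed event
    strictly before patient j's follow-up ends.
    """
    concordant, comparable = 0.0, 0
    n = len(risk)
    for i in range(n):
        for j in range(n):
            if event[i] and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied risks count half
    return concordant / comparable if comparable else float("nan")
```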
Clinical utility was further quantified using decision curve analysis (DCA) across clinically relevant threshold ranges. To examine learned representations and detect residual cohort structure after alignment, graph-level embeddings were projected into two dimensions using Uniform Manifold Approximation and Projection (UMAP) with cosine distance (n_neighbors = 15, min_dist = 0.1), and, as a sensitivity analysis, t-distributed stochastic neighbor embedding (t-SNE). In both cases, only held-out test data were plotted, with labels used exclusively for visualization to avoid bias. When multiple hypotheses were tested, the false discovery rate (FDR) was controlled at 5% using the Benjamini–Hochberg procedure. All statistical analyses were conducted in Python using scikit-learn, lifelines, and statsmodels, with exact random seeds recorded for reproducibility.
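The Benjamini–Hochberg step-up procedure used for FDR control can be sketched in a few stdlib lines (illustrative; statsmodels provides an equivalent via `multipletests`):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean rejection list controlling FDR at alpha (BH step-up)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha ...
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # ... then reject every hypothesis ranked at or below k.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            reject[idx] = True
    return reject
```

Note the step-up behavior: a p-value above its own threshold can still be rejected if a larger p-value passes at a higher rank.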
Hyperparameters were tuned using a hybrid grid and random search strategy on the source-domain validation split. Learning rates were explored in the range {1 × 10⁻⁴, 3 × 10⁻⁴, 5 × 10⁻⁴, 1 × 10⁻³}, dropout rates in {0.2, 0.3, 0.4}, hidden dimensions in {128, 192, 256}, and mini-batch sizes in {16, 24, 32}. For contrastive pretraining with GraphCL, the temperature parameter τ was searched in {0.1, 0.2, 0.3}, and warm-up durations were varied between 3 and 10 epochs before applying cosine-annealed learning-rate decay. Early stopping was applied with a patience of 15–30 epochs based on validation F1 (for classification tasks) or C-index (for survival), and the final models were selected according to the best validation performance.
For all reported metrics, we computed 95% confidence intervals (95% CIs) using nonparametric bootstrap resampling (1000 replicates) on the held-out test set(s). For paired model comparisons on the same test instances, we used McNemar’s test for subtype classification error differences (paired discrete outcomes), and DeLong’s test to compare correlated ROC curves for receptor AUROCs (ER/PR/HER2). For F1-score comparisons across repeated runs, we used a paired t-test when normality was not violated; otherwise we used the Wilcoxon signed-rank test. Kaplan–Meier survival curves were compared using the log-rank test, and C-index uncertainty was also bootstrapped. All hypothesis tests were two-sided, with statistical significance defined at α = 0.05 (and FDR control at 5% where multiple comparisons were conducted).
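The percentile bootstrap used for the 95% CIs can be sketched as follows (stdlib only; `stat` defaults to the mean but accepts any metric function, e.g. an F1 or C-index computed on the resampled indices):

```python
import random

def bootstrap_ci(values, stat=lambda xs: sum(xs) / len(xs),
                 n_boot=1000, alpha=0.05, seed=42):
    """Nonparametric percentile bootstrap CI for an arbitrary statistic."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(values)
    boots = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])  # resample with replacement
        for _ in range(n_boot)
    )
    lo = boots[int(alpha / 2 * n_boot)]
    hi = boots[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```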

Implementation details and reproducibility
All models were implemented in PyTorch with PyTorch Geometric (PyG) used for graph operations. Supporting libraries included scikit-learn for metrics and utilities, lifelines for survival analysis, and matplotlib-based packages for visualization. Experiments were executed primarily on a single NVIDIA T4 graphics processing unit with 16 GB of memory in Google Colab Pro, with selected results replicated on an NVIDIA V100 to verify consistency across hardware.
Reproducibility was ensured through multiple safeguards. Random seeds were fixed for Python, NumPy, PyTorch (CPU and CUDA), and PyG neighborhood sampling. All hyperparameters, data partitions, and preprocessing steps were stored in version-controlled configuration files, and model checkpoints were linked to configuration hashes and code commit identifiers. To prevent data leakage, normalization, harmonization, feature selection, and imputation were fit exclusively on training data and subsequently applied without refitting. Trained encoders and task heads were exported together with explicit dependency versions to enable exact replication. A lightweight command-line interface (CLI) coordinated the entire pipeline, from data acquisition and graph construction to training, evaluation, and figure generation, allowing every figure in the manuscript to be reproduced with a single command.
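Linking checkpoints to configuration hashes can be done by hashing a canonical JSON serialization of the run configuration; the function and the checkpoint-naming scheme below are illustrative sketches, not the project's actual code:

```python
import hashlib
import json

def config_hash(config: dict) -> str:
    """Short, deterministic hash tying a checkpoint to its exact config."""
    # Canonical JSON (sorted keys, no whitespace) so that logically
    # identical configs always produce the same digest.
    blob = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]
```

A checkpoint could then be saved as, e.g., `f"gt_multitask_{config_hash(cfg)}.pt"` (hypothetical naming), so that any saved model can be traced back to the exact hyperparameters and commit that produced it.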
Selected configurations were additionally replicated on an NVIDIA V100 GPU (32 GB) to verify hardware robustness. For the full multi-task Graph Transformer with domain adaptation and self-supervision enabled, typical training time was approximately 20–25 s per epoch, resulting in a total wall-clock time of about 2–2.5 h for 200 epochs on the T4 GPU. Baseline GIN and GAT models trained faster, requiring approximately 45–70 min for the same number of epochs. Peak GPU memory usage for the multi-task Graph Transformer ranged between 11 and 13 GB, depending on batch size.

Ethical and data-use considerations
All analyses were conducted using publicly available, fully de-identified datasets. No protected health information was accessed, and no human subjects were directly involved. Therefore, formal ethics committee approval and informed consent were not required.

Planned ablation and sensitivity analyses
To evaluate the contribution of individual design choices and to assess robustness, we conducted a series of prespecified ablation experiments. First, the multi-task formulation was compared with task-specific single-head models to quantify the benefit of shared representation learning across subtype classification, receptor prediction, and survival estimation. Second, DANN was removed to measure the extent of cross-cohort degradation under covariate shift, and experiments were repeated with maximum mean discrepancy (MMD)–based alignment to confirm that observed effects were attributable to invariance learning rather than the adversarial classifier itself. Third, GraphCL was replaced with a feature-masking denoising autoencoder to test whether contrastive invariances were essential for label efficiency. Fourth, MolGAN augmentation was excluded, and synthetic-to-real ratios were systematically increased to identify ranges that improved rare-class performance without compromising calibration or external validity. Fifth, MAML was disabled to examine its role in enabling few-shot adaptation for rare subtypes and small target cohorts. Finally, the STRING confidence threshold was varied between 0.6 and 0.9 to assess how graph sparsity and edge reliability influenced accuracy and interpretability.
Sensitivity analyses included: repeating the pipeline with alternative common-gene panel sizes (e.g., 3000 and 8000 genes) to test the impact of feature-space selection; assessing time-dependent ROC curves for survival at fixed horizons (36 and 60 months) to complement the C-index; retraining with grouped cross-validation by TCGA tissue source site to simulate multi-center deployment; and evaluating probability calibration after temperature scaling fitted only on source validation data, to determine whether post-hoc calibration improved clinical decision value without reducing discrimination.
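Temperature scaling fits a single scalar T > 0 that rescales logits before the softmax, chosen to minimize negative log-likelihood on held-out validation data. A minimal grid-search sketch (the grid and the toy logits are illustrative; gradient-based fitting is equally common):

```python
import math

def softmax(logits, T=1.0):
    """Softmax over temperature-scaled logits."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fit_temperature(val_logits, val_labels, grid=None):
    """Pick the temperature minimizing validation NLL (simple grid search)."""
    grid = grid or [0.5 + 0.05 * k for k in range(71)]  # 0.5 .. 4.0
    def nll(T):
        return -sum(math.log(softmax(z, T)[y])
                    for z, y in zip(val_logits, val_labels))
    return min(grid, key=nll)
```

Because T only rescales logits, accuracy and AUC are unchanged; only the confidence of the probabilities moves, which is why discrimination is preserved.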

Computational complexity and resource footprint
To provide a practical reference for model deployment, we estimated the computational complexity and resource usage of the major model variants evaluated in this study. All models operate on a fixed 5000-gene feature space and share the same PPI graph topology. The GIN encoder serves as a lightweight baseline with a relatively small number of trainable parameters and low per-epoch training time, whereas the Graph Transformer (GT) and auxiliary modules (multi-task heads, GraphCL, and DANN) introduce additional capacity and computational load.
Overall, the total number of trainable parameters for our models lies on the order of 10⁶–10⁷. The single-task GT model is approximately 1.2–1.4 × larger than the GIN baseline, and the multi-task GT (with three task-specific heads for subtypes, receptors, and survival) adds an additional ≈10–20% parameters relative to the single-task GT. Incorporating the DANN domain classifier and the projection head used during GraphCL pretraining increases the parameter count by a further ≈10–15%; however, these modules are shallow multilayer perceptrons and do not dominate the overall computational burden.
Regarding computational cost, forward and backward passes of the multi-task GT with DANN and GraphCL require approximately 1.5–2.0 × the FLOPs of the GIN baseline for the same batch size, owing to both the heavier self-attention layers and the auxiliary objectives. Empirically, on a single NVIDIA T4 GPU, training epochs for the full multi-task GT configuration were observed to be 1.5–1.7 × slower than those of the GIN baseline under identical training settings, while still completing within practical time for the cohort sizes used in this study. Peak GPU memory usage increased moderately, primarily due to transformer attention blocks and additional heads, yet remained well within the 16 GB available on the hardware.
For deployment scenarios, inference-time complexity is governed mainly by the choice of encoder (GIN vs. GT) and the number of graph layers. Since GraphCL and DANN are used only during training, they introduce no latency overhead during inference. The multi-task formulation adds negligible extra inference cost relative to training separate single-task models, as the shared encoder is executed only once and the task-specific heads are computationally lightweight. In resource-constrained environments, lighter variants can be adopted—for example, using GIN or a shallower GT encoder, disabling the survival or receptor heads, or freezing a pretrained encoder and fine-tuning only the task-specific layers. These findings collectively indicate that although the full GT + GraphCL + DANN configuration is more computationally demanding than the GIN baseline, it remains tractable on commodity GPUs and can be flexibly adapted to diverse deployment constraints.

Results

Graph-based models achieve superior subtype classification
We benchmarked eight graph-based and hybrid architectures on the TCGA-BRCA dataset for molecular subtype classification and report macro-F1 with 95% confidence intervals (95% CI) estimated via nonparametric bootstrap resampling (1000 replicates) on the held-out test set. As summarized in the bar plot matrix (Fig. 1a), the Graph Transformer (GT) achieved the highest single-task performance (F1 = 0.842, 95% CI [0.833–0.851]), followed by GAT (F1 = 0.828, 95% CI [0.819–0.837]) and GIN (F1 = 0.816, 95% CI [0.806–0.826]). Incorporating multi-task learning into the GT further improved performance to a peak F1 = 0.872 (95% CI [0.863–0.881]), indicating a statistically meaningful gain over all single-task baselines. Additional enhancements were observed with GraphCL-based contrastive pretraining (F1 = 0.858, 95% CI [0.848–0.868]) and MolGAN-driven synthetic augmentation (F1 = 0.865, 95% CI [0.855–0.875]).
The violin plots in Fig. 2 illustrate the bootstrap distributions (1000 resamples) of subtype F1 for each model, demonstrating tight uncertainty ranges and stable performance for GT-based approaches. The GT confusion matrix (Fig. 3) confirms strong classification accuracy across all molecular subtypes, with most misclassifications occurring between Luminal A and Luminal B. Furthermore, a UMAP projection of GT-derived embeddings (Fig. 4) shows distinct subtype-specific clusters, underscoring the model’s capacity to learn biologically meaningful, discriminative representations in the latent space. In addition to the UMAP projection, a t-SNE visualization (Fig. 5) was generated to further assess the structure of the learned embeddings. A clear separation between the four molecular subtypes was observed: Luminal A and Luminal B formed partially overlapping clusters, while HER2-enriched and Basal-like tumors occupied distinct and compact regions, confirming that the model captures consistent biological structure across dimensionality-reduction methods.

Multi-task learning enhances biomarker and survival prediction
Integrating receptor prediction tasks (ER, PR, and HER2) into the shared multi-task learning framework substantially improved both classification accuracy and performance on auxiliary tasks. As shown in Fig. 1b–d and the violin plots in Fig. 2b–d, the multi-task GT achieved AUCs of 0.960 for ER, 0.943 for PR, and 0.918 for HER2, consistently outperforming all other evaluated models.
Survival prediction also benefited from multi-task learning. As shown in Fig. 1e, the C-index improved from 0.709 (GT single-task) to 0.721 (multi-task GT). Violin plots (Fig. 2e) confirm this performance gain is consistent across trials. These results support the hypothesis that auxiliary tasks provide inductive bias for richer, generalizable representations.

Domain adaptation boosts cross-cohort transferability
To evaluate cross-cohort generalization, models trained on TCGA were tested directly on the METABRIC cohort. Without adaptation, the baseline GIN exhibited a marked decline in performance, with F1-score reduced to 0.738. Incorporating domain-adversarial training improved transferability, with DANN-GNN achieving an F1-score of 0.801, surpassing even the Graph Transformer, which reached 0.763 on the external dataset (Fig. 1f). For HER2 prediction, DANN-GNN further increased AUC from 0.882 to 0.904, demonstrating its effectiveness in mitigating distributional shifts between datasets. These findings underscore the necessity of domain-invariant representations for deployment in real-world clinical settings.
Mechanistically, DANN-GNN produced cohort-agnostic embeddings by explicitly penalizing domain-specific signals during training, thereby improving alignment across latent spaces. UMAP visualizations revealed tighter clustering of subtypes across cohorts after adaptation, reflecting improved harmonization of data distributions. The largest gains were observed in subtypes with substantial cross-platform variability, such as Luminal B, where conventional models frequently underperformed. Calibration analyses further showed that DANN-GNN maintained reliable probability estimates across cohorts, a prerequisite for clinical decision support.
Robustness was confirmed through leave-one-center-out validation within TCGA using hospital site metadata. DANN-GNN consistently generalized to unseen sites, supporting its utility beyond cross-dataset transfer. Collectively, these results demonstrate that adversarial domain adaptation not only enhances predictive accuracy but also improves the reliability and portability of graph-based models in heterogeneous clinical environments.

Contrastive pretraining improves robustness
Self-supervised contrastive pretraining with GraphCL substantially improved subtype classification compared with randomly initialized GIN. The pretrained GraphCL + GIN model achieved a subtype F1-score of 0.858, which was more than 4% higher than the baseline GIN (Fig. 1a). This improvement was accompanied by reduced variance across experimental runs (Fig. 2a), indicating greater training stability. Pairwise metric correlation plots (Fig. 6) further showed that gains in subtype classification were positively associated with improvements in biomarker AUCs and survival metrics, particularly HER2 AUC and the C-index.
The contrastive objective promoted domain-invariant and semantically consistent embeddings by maximizing agreement between augmented views of the same graph. Augmentations were biologically constrained, with perturbations limited to low-confidence edges or low-variance features, which ensured that important biological signals were retained. GraphCL improved training efficiency by accelerating convergence. Consequently, the number of epochs required to reach peak validation performance was reduced by approximately 30%. This gain in efficiency is especially valuable for computationally intensive GNN frameworks.
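The biologically constrained edge-dropping augmentation can be sketched as follows; the confidence threshold and drop rate are illustrative, and `confidence` stands in for STRING-style combined scores normalized to [0, 1]:

```python
import random

def constrained_edge_drop(edges, confidence, drop_rate=0.1,
                          conf_threshold=0.9, seed=0):
    """GraphCL-style edge dropping restricted to low-confidence PPI edges.

    edges: list of (u, v) pairs; confidence: parallel scores in [0, 1].
    High-confidence edges are always kept so core biology is preserved.
    """
    rng = random.Random(seed)
    kept = []
    for edge, conf in zip(edges, confidence):
        if conf >= conf_threshold or rng.random() >= drop_rate:
            kept.append(edge)
    return kept
```

Two such stochastic views of the same patient graph form the positive pair in the contrastive objective, while perturbations never touch edges above the confidence threshold.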
Latent space analysis revealed that GraphCL embeddings preserved subtype clusters more clearly than the baseline, with improved separation of HER2-enriched and Basal-like tumors. Label-efficiency experiments further demonstrated that pretraining yielded significant performance gains even when only 20% of labeled data were available. Overall, these results indicate that contrastive pretraining acts as a strong regularizer that enhances generalization and data efficiency, two properties that are essential for the translational deployment of GNNs in oncology.

Meta-learning and generative augmentation enable generalization for rare and underrepresented subtypes
To address data scarcity in rare molecular subtypes, we combined MAML with MolGAN-based generative augmentation. In 5-shot experiments on the HER2-enriched subtype, MAML-GNN substantially outperformed conventional models, achieving a few-shot F1-score of 0.782 compared with 0.639 for GIN and 0.667 for GAT (Fig. 1g; violin plot, Fig. 2g). These results indicate that meta-learning provides an effective mechanism for rapid adaptation under limited supervision, which is essential in low-prevalence clinical scenarios.
To improve sensitivity in underrepresented classes, we further introduced synthetic HER2-enriched samples generated with MolGAN. These synthetic graphs preserved biologically plausible topology and increased HER2 AUC to 0.935 while reducing inter-run variance (Figs. 1d, 2d). The MolGAN-augmented GT improved HER2 classification performance and approached state-of-the-art results across multiple endpoints, demonstrating the value of generative augmentation in addressing class imbalance.
The integration of MAML and MolGAN proved orthogonal and complementary. MAML enabled fast personalization in few-shot settings, whereas MolGAN enriched the training distribution without compromising biological fidelity. Together, these strategies mitigated both label scarcity and imbalance, challenges that are common in real-world clinical datasets. Importantly, no synthetic data were included in validation or testing, ensuring that the observed improvements reflected genuine generalization rather than overfitting to augmented samples. These findings highlight the translational potential of combining meta-learning with synthetic augmentation to support equitable performance across both common and rare breast cancer subtypes.

Comparative trends and summary
A comprehensive performance summary is provided in Table 1, with Fig. 7 illustrating the comparative effectiveness of all evaluated models across subtype classification, receptor prediction, survival analysis, cross-cohort transfer, and few-shot learning. The multi-task GT consistently achieved the highest scores, with Subtype F1 = 0.872, ER AUC = 0.960, PR AUC = 0.943, HER2 AUC = 0.918, and C-index = 0.721, establishing it as the top-performing architecture across tasks.
In transfer learning experiments, the DANN-GNN demonstrated superior generalizability, raising TCGA → METABRIC F1 from 0.738 (GIN) to 0.801, thereby highlighting the impact of domain-invariant representation learning. In few-shot settings, MAML-GNN achieved a HER2 5-shot F1 of 0.782, substantially outperforming conventional baselines. Line plots in Fig. 8 further corroborated these findings, showing consistent upward trends in average metric performance, particularly for ER, PR, and HER2 AUCs, as well as subtype F1. Together, these results underscore the synergistic contributions of multi-task learning, self-supervised pretraining (GraphCL), adversarial domain adaptation (DANN), and generative augmentation (MolGAN) in enhancing model robustness and clinical utility.
To quantify the effect of architectural components, we performed an ablation analysis based on Subtype F1, ER AUC, PR AUC, and HER2 AUC (Fig. 9). Compared with the GIN baseline, the multi-task GT substantially improved performance across all metrics (Subtype F1: 0.816 → 0.872; ER AUC: 0.928 → 0.960). MolGAN-augmented models provided further improvements for HER2 (AUC: 0.918 → 0.935), while Subtype F1 and ER/PR AUCs remained comparable to the multi-task GT. These findings suggest that generative augmentation primarily benefits receptor-specific subclasses, with less additional impact on global subtype classification.
To visualize receptor-specific discrimination across models, ROC curves were generated for ER, PR, and HER2 (Fig. 10). In line with the quantitative results in Table 1, the multi-task GT achieved the highest AUCs for ER (0.960) and PR (0.943), while the MolGAN-augmented model performed best on HER2 (AUC = 0.935). Baseline and meta-learning variants, including GIN, GAT, and MAML, showed lower ROC trajectories, reinforcing the value of multi-task learning and generative augmentation for receptor-specific prediction.
To provide a more structured and transparent overview of the empirical findings, we consolidated the key performance metrics into Supplementary Table S1, which reports overall subtype classification performance (F1 = 0.872), per-class metrics (e.g., Luminal A F1 = 0.89), biomarker prediction accuracy (ER AUC = 0.95, with PR and HER2 AUCs detailed in the main text), survival discrimination (C-index = 0.721), and computational efficiency (approximately 2 h of end-to-end training for the multi-task Graph Transformer).
To further characterize model behavior on subtype classification, Supplementary Fig. S2 shows the confusion matrix of the multi-task Graph Transformer on the TCGA-BRCA test set. The matrix highlights subtype-specific misclassification patterns, including an approximately 4% confusion between Luminal B and HER2-enriched tumors, which is consistent with ERBB2-driven transcriptional overlap between these groups. For the survival modeling task, Supplementary Fig. S3 presents Kaplan–Meier survival curves for model-derived low-risk and high-risk patient groups, demonstrating clear separation between risk strata and supporting the prognostic value of the survival head.
Interpretability analyses have been expanded by incorporating GNNExplainer-derived visualizations. Supplementary Fig. S4 displays feature importance scores for key regulatory genes, highlighting ESR1- and PGR-centered modules in hormone receptor–positive disease, as well as ERBB2-associated features relevant to HER2-enriched tumors. These results illustrate that the model’s decisions are grounded in biologically meaningful pathways.
The contribution of individual architectural components is quantified in Supplementary Table S2, which summarizes an ablation study. Removing MAML leads to an approximately 8% reduction in performance on rare subtypes, whereas excluding MolGAN results in an approximate 5% decrease in overall performance, underscoring the complementary roles of meta-learning and generative augmentation in addressing data scarcity and class imbalance.
Finally, Supplementary Table S3 compares the proposed multi-task Graph Transformer with representative state-of-the-art graph-based and multi-omics models (e.g., MGNN with F1 ≈ 0.85), reporting subtype F1, biomarker AUCs, C-index, and—where available—training time under comparable evaluation settings. This comparison highlights a 5–10% improvement over typical single-task GNNs and demonstrates superior cross-cohort robustness of our unified, adaptation-enhanced framework.

Explainability analysis using GNNExplainer
To interpret model predictions, we applied the GNNExplainer module from PyTorch Geometric, which identifies influential node features and subgraph structures contributing to specific outputs. For subtype classification, the explainer consistently highlighted ERBB2, ESR1, and PGR as dominant contributors to predictions of HER2-enriched and Luminal A subtypes. Beyond individual gene contributions, the explainer also uncovered recurrent subgraph motifs enriched for established signaling pathways, including PI3K/AKT and MAPK cascades, particularly within HER2-enriched tumors. These motifs frequently corresponded to densely connected regions in the STRING-derived PPI graph, indicating that the GNN exploited both co-expression and interaction topology to support its classifications.
Feature attribution maps further revealed patient-specific heterogeneity in pathway usage, enabling subtype-level interpretability. For example, although ERBB2 was the dominant node in most HER2-enriched cases, some patients demonstrated additional reliance on GRB7 or CDK12, suggesting the presence of plausible biological subgroups within the HER2 subtype. To evaluate consistency across the cohort, we aggregated frequency-weighted heatmaps of top-explained nodes, which confirmed stable reliance on a core set of biologically validated markers. These results strengthen confidence in the model’s predictions and demonstrate the ability of GNNs to produce interpretable, biologically grounded insights in complex transcriptomic analyses.

Discussion
The advent of GNNs has transformed computational biology by enabling relational learning that respects the structured nature of omics data. Unlike conventional machine learning methods that treat gene expression as independent features, GNNs capture dependencies between genes, pathways, and patients within a network topology, thereby modeling emergent biological behavior that cannot be inferred from flat vectors alone. This study contributes to this paradigm by systematically benchmarking eight GNN-based approaches, including GIN, GAT, GT, and D-MPNN, along with GraphCL for contrastive pretraining, MAML for meta-learning, MolGAN for generative augmentation, and DANN for domain adaptation. By integrating subtype classification, receptor prediction, and survival modeling into a unified framework, we demonstrate that graph-based models achieve improvements in both discrimination and calibration across diverse clinical endpoints.
Dimensionality reduction analyses using t-SNE and UMAP showed that embeddings derived from GNNs preserved clinically relevant biological structure. The four intrinsic molecular subtypes of breast cancer (Luminal A, Luminal B, HER2-enriched, and Basal-like) were clearly separated in the latent space, in alignment with receptor status groupings. This indicates that the learned embeddings captured tumor similarities that reflect underlying phenotypes. Similar findings have been reported in analyses of TCGA pan-cancer cohorts, where graph embeddings revealed latent clusters corresponding to known mutational subtypes. The concordance between unsupervised graph structure and clinical labels highlights the ability of relational learning to disentangle high-dimensional omics complexity.

Complementary contributions of architectural refinements and augmentation
Our ablation analyses showed that both the multi-task GT and MolGAN-based augmentation independently improved performance, although through different mechanisms. The multi-task GT enhanced overall subtype discrimination by leveraging attention across multiple relational scales, whereas MolGAN augmentation primarily benefited receptor-specific subclasses, particularly HER2. Comparative ROC analyses demonstrated that the multi-task GT achieved the highest AUCs for ER and PR, while MolGAN-augmented training most strongly improved HER2 prediction. These findings are consistent with recent studies in generative biology, where augmentation improved performance for underrepresented drug-response phenotypes⁶, and with reports that attention-based GNNs enhance subtle subtype discrimination⁷. Collectively, these results indicate that model architecture and generative augmentation capture complementary signals and achieve the greatest impact when combined.

Superiority of the graph transformer framework
The multi-task GT consistently outperformed baseline models across all endpoints, surpassing GIN and GAT in subtype F1-score, receptor AUCs, and OS C-index. These findings extend the work of Padma et al.⁸ and Ren et al.⁹, who demonstrated that multi-task objectives enhance generalization in GNNs for drug response prediction and subtype discovery. In contrast to prior studies that focused on single endpoints, our unified GT framework improved multiple clinical outputs simultaneously, underscoring the scalability of hierarchical attention. Importantly, the GT also outperformed established multi-omics models such as LASSO-MOGAT¹⁰ and MoGCN¹¹, highlighting the advantage of attention-driven graph reasoning in distinguishing subtle expression differences, particularly between Luminal A and Luminal B tumors.

Domain adaptation for cross-cohort translation
A major barrier to clinical translation is the limited generalization of models across cohorts. Differences between TCGA and METABRIC arise from assay technologies (RNA-seq versus microarray), population demographics, and clinical annotation, all of which contribute to distributional heterogeneity. Conventional machine learning pipelines often show marked degradation when evaluated across such settings¹². In our study, DANN-GNN mitigated this gap, raising cross-cohort subtype F1 from 0.738 with the GIN baseline to 0.801 and improving HER2 AUC from 0.882 to 0.904. These improvements align with findings from adversarial domain adaptation frameworks applied in other biomedical contexts, including cross-cancer transfer¹³ and pathology domain shifts¹⁴. By promoting domain-invariant embeddings, DANN facilitates the transfer of graph models across institutions, representing a critical step toward robust deployment in heterogeneous clinical environments.

Meta-learning for rare subtypes
Rare subtypes remain a significant challenge due to limited data availability and severe class imbalance. In our study, MAML-GNN achieved a five-shot HER2 F1 of 0.782, outperforming GAT (0.667) and GIN (0.639). This provides proof-of-principle that meta-learning, originally developed for few-shot image classification¹⁵, can enable GNNs to generalize in oncology. Previous studies have demonstrated the utility of MAML in genomics¹⁶ and histopathology graphs¹⁷, but to our knowledge, this is the first application to breast cancer subtype prediction. The capacity to adapt to rare subtypes with only a few labeled examples presents a practical approach for clinical contexts where prospective data collection is slow or highly imbalanced.

Self-supervision and generative robustness
Self-supervised pretraining with GraphCL improved mean F1 by 5.2% compared with randomly initialized GIN and reduced inter-run variance by approximately 35%. These findings are consistent with prior work by Ci et al.¹⁸ and Ray et al.¹⁹, who showed that contrastive objectives capture global graph semantics and stabilize training. Generative augmentation with MolGAN further increased HER2 AUC to 0.935 and reduced variance, paralleling the results of Patel et al.²⁰ in molecular property prediction. Collectively, these results support a hybrid strategy in which pretraining provides global invariances while generative augmentation enriches underrepresented distributions. Such a combined approach is rarely applied in oncology but may be essential to address variance and imbalance in real-world clinical datasets.

Correlation between predictive endpoints
Correlation analyses revealed strong positive associations between subtype F1, receptor AUCs, and survival C-index. These relationships suggest that improvements in subtype prediction directly enhance receptor status classification and survival estimation, supporting the hypothesis that molecular subtypes serve as latent determinants of clinical outcomes²¹. Comparable integrative observations have been reported in large-scale prognostic studies, where subtypes have been shown to mediate treatment response and survival disparities²². Visualization of embeddings using UMAP (Fig. 2) further confirmed that GNN-derived latent spaces encode this biological hierarchy, providing a mechanistic link between classification and prognosis.

Clinical relevance of model interpretability
The explanatory patterns identified by GNNExplainer showed strong concordance with established biological mechanisms of breast cancer. For HER2-enriched tumors, the explainer consistently prioritized ERBB2 together with co-amplified genes such as GRB7 and CDK12, reflecting the well-characterized 17q12–q21 amplicon that drives HER2-mediated PI3K/MAPK signaling and confers sensitivity to anti-HER2 therapies. In Luminal A and Luminal B tumors, ESR1 and PGR emerged as dominant attribution nodes, aligning with the endocrine-regulated transcriptional programs that define hormone-receptor-positive disease and underpin clinical responsiveness to aromatase inhibitors or selective estrogen-receptor modulators. These observations confirm that the model is capturing biologically meaningful regulatory structure rather than relying on spurious correlations.
Beyond confirming known biology, the explainer outputs revealed patient-specific pathway dependencies. For example, a subset of HER2-enriched cases exhibited higher attribution weights for CDK12 or FGFR4 instead of ERBB2, suggesting alternative signaling reliance that may have therapeutic implications or reflect intrinsic resistance states. Likewise, Luminal B tumors with elevated attribution of proliferation-related subnetworks (e.g., CCND1, FOXM1 modules) may represent clinically more aggressive phenotypes consistent with their higher Ki-67 levels.
These mechanistic explanations can support clinical translation in three ways.

(i) Diagnostic refinement: subtype predictions can be accompanied by explainer-derived gene subnetworks that confirm or challenge the model's reasoning, enabling quality control and identification of atypical biological signatures.

(ii) Therapeutic decision support: attribution maps highlighting HER2-signal amplification, ER-dependence, or alternative pathways may complement receptor-based treatment selection, particularly in borderline, low-expression, or genomically heterogeneous cases.

(iii) Risk stratification: survival-associated explainer features allow clinicians to visualize the molecular drivers influencing individual risk scores, improving transparency of prognostic estimates and facilitating shared decision-making.

Together, these analyses demonstrate that the interpretability module not only aligns with known molecular drivers of breast cancer but also provides actionable, patient-specific insights that enhance the clinical usefulness of the proposed GNN framework.
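As a rough intuition for edge-level attribution, one can score each edge by how much occluding it perturbs the prediction for a target node. Note this occlusion proxy is simpler than GNNExplainer, which learns a soft edge mask by gradient descent; the toy graph, feature values, and one-layer sum aggregator below are all illustrative, not the trained model.

```python
# Occlusion-style edge attribution on a toy gene graph: score each edge
# by the change in the model output for a target node when that edge is
# removed. A simplified proxy for GNNExplainer's learned edge masks.

def node_score(features, edges, target):
    """One round of sum-neighbor aggregation plus the self feature —
    a stand-in for a trained GNN's scalar output at `target`."""
    neigh = [features[v] for u, v in edges if u == target]
    neigh += [features[u] for u, v in edges if v == target]
    return features[target] + sum(neigh)

def edge_importance(features, edges, target):
    """Importance of each edge = |score with edge - score without it|."""
    base = node_score(features, edges, target)
    return {e: abs(base - node_score(features,
                                     [d for d in edges if d != e],
                                     target))
            for e in edges}

# Illustrative feature values: ERBB2 dominates the HER2-status node.
feats = {"HER2": 0.2, "ERBB2": 3.0, "GRB7": 1.0, "ESR1": 0.1}
edges = [("ERBB2", "HER2"), ("GRB7", "HER2"), ("ESR1", "GRB7")]
imp = edge_importance(feats, edges, "HER2")
# The ERBB2 -> HER2 edge receives the largest importance score, while
# the ESR1 -> GRB7 edge, which never reaches the target, scores zero.
```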

Comparison with recent graph-based and multi-omics models
A growing body of recent work has applied deep learning and graph-based methods to omics integration and breast cancer prediction. For example, Yang et al.1 and Mahmoud et al.2 developed deep genomic models for breast cancer subtyping and survival prediction based on transcriptomic and multi-omics data, while Waqas et al.3, Ballard et al.5 and Wekesa and Kimwele9 reviewed or proposed multi-omics integration strategies using deep neural networks and autoencoder-style architectures. More recent graph-oriented and multimodal frameworks have been introduced by Bu et al.12, Padma et al.18, Ren et al.19, Li and Nabavi21, Cai et al.23, Ray et al.24, and Patel et al.25, who explored hierarchical GNNs, multi-view graph learning, multimodal GNNs, and explainable graph architectures for cancer and, in particular, breast cancer modeling.
Across these recent studies, subtype classification performance is typically reported in the low 0.80 range for F1-score or accuracy on TCGA-like cohorts12,18,19,21,25, and survival modeling generally yields concordance indices in the 0.67–0.70 range in deep genomic and multi-omics prognostic frameworks1,2,5,8. However, most of these methods are optimized for a single primary endpoint (either molecular subtyping or prognosis), focus on within-cohort evaluation, and do not systematically assess train-on-source/test-on-target generalization or few-shot behavior in rare subtypes1–3,5,8,9,12,18,19,21,23–25.
In contrast, our multi-task Graph Transformer achieves a subtype F1-score of 0.872 on TCGA-BRCA, representing an absolute gain of approximately 5–8 percentage points over the typical performance reported by recent graph-based and multi-omics models12,18,19,21,25. For receptor prediction, our framework attains ER, PR and HER2 AUCs of 0.960, 0.943 and 0.918, respectively, exceeding the ≈0.90 range commonly achieved by multimodal and HER2-focused GNN frameworks19,21,25. For survival, our C-index of 0.721 improves upon the 0.67–0.70 range reported in recent deep survival models based on genomic or multi-omics data1,2,5,8.
Moreover, we explicitly evaluate cross-cohort generalization under a strict train-on-TCGA, test-on-METABRIC protocol, showing that the DANN-GNN variant improves external subtype F1 from 0.738 (baseline GIN) to 0.801. To our knowledge, comparable cross-cohort metrics are rarely reported in these recent GNN and multi-omics studies on breast cancer12,18,19,21,25. For rare HER2-enriched subtypes in a 5-shot setting, our MAML-GNN achieves an F1-score of 0.782, markedly higher than conventional GNN baselines (0.639–0.667), while MolGAN-based augmentation further increases HER2 AUC to 0.935 without degrading subtype F1 or calibration. Comparable few-shot or generative-augmentation benchmarks for rare breast cancer subtypes are not reported in the aforementioned recent studies1–3,5,8,9,12,18,19,21,23–25.
Overall, this quantitative comparison indicates that, relative to recent graph-based and multi-omics models1–3,5,8,9,12,18,19,21,23–25, our framework provides consistent gains in subtype F1, receptor AUCs, and survival C-index, while uniquely combining multi-task learning, cross-cohort robustness, and explicit rare-subtype adaptation within a single, reproducible GNN/Graph Transformer pipeline. Crucially, cross-cohort robustness and rare-subtype adaptation are dimensions that previous work rarely evaluates at all.
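For reference, the C-index figures compared above are Harrell's concordance: over all usable patient pairs (those where the earlier time is an observed event), the fraction in which the higher predicted risk corresponds to the earlier failure. A self-contained sketch with hypothetical data:

```python
from itertools import combinations

def concordance_index(times, events, risks):
    """Harrell's C-index. A pair is usable when the patient with the
    earlier time had an observed event; it is concordant when that
    patient also has the higher predicted risk. Risk ties count 0.5."""
    conc, usable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[j] < times[i]:
            i, j = j, i                    # ensure i has the earlier time
        if times[i] == times[j] or not events[i]:
            continue                       # pair not comparable
        usable += 1
        if risks[i] > risks[j]:
            conc += 1.0
        elif risks[i] == risks[j]:
            conc += 0.5
    return conc / usable

# Hypothetical cohort: predicted risk ranks patients exactly as they
# fail, so every usable pair is concordant.
times  = [2.0, 5.0, 7.0, 9.0]
events = [1, 1, 0, 1]          # 0 = censored
risks  = [0.9, 0.6, 0.4, 0.2]
c = concordance_index(times, events, risks)  # 1.0: perfectly concordant
```

A C-index of 0.5 corresponds to random ranking, which is why gains from roughly 0.67–0.70 to 0.721 represent a meaningful improvement in discrimination.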

Conclusion, limitations, and future directions

Summary of contributions
In this study, we introduced a unified and interpretable graph-based framework that integrates molecular subtyping, biomarker prediction, and survival analysis within a multi-task Graph Transformer architecture. By incorporating domain-adversarial adaptation, contrastive self-supervision, few-shot meta-learning, and generative augmentation, the proposed pipeline achieved superior performance across all major clinical endpoints. The model demonstrated strong subtype classification (F1 = 0.872), high biomarker prediction accuracy (ER AUC = 0.960), improved survival discrimination (C-index = 0.721), enhanced cross-cohort robustness, and biologically consistent interpretability through GNNExplainer-derived pathway attributions.

Limitations
Despite promising results, several limitations remain.

(i) Dataset bias: Both TCGA-BRCA and METABRIC are predominantly Western-centric cohorts with limited demographic diversity. As a result, generalizability to global populations remains uncertain.

(ii) Limited omics coverage: The current framework relies solely on transcriptomic profiles and PPI-derived graph structure. Additional omics layers such as proteomics, methylation, copy number variation, and spatial transcriptomics were not incorporated but could provide deeper biological context.

(iii) Computational demands: Graph Transformers, especially with multi-task objectives and domain adaptation modules, require substantial GPU resources and long training times, which may limit deployment in low-resource clinical settings.

(iv) Survival modeling constraints: The Cox-based survival head assumes proportional hazards, which may not fully capture dynamic risk trajectories in heterogeneous tumors.
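Concretely, the proportional-hazards restriction enters through the Cox partial likelihood, where each patient's hazard is a fixed multiple exp(score) of a shared baseline, so relative risks between patients cannot change over follow-up time. A minimal sketch of the negative partial log-likelihood for precomputed risk scores (illustrative data, no tie handling):

```python
import math

def cox_partial_negloglik(times, events, scores):
    """Negative Cox partial log-likelihood for scalar risk scores.
    The proportional-hazards assumption is baked in: the baseline
    hazard cancels, leaving only time-constant relative risks."""
    nll = 0.0
    for i, (t, e) in enumerate(zip(times, events)):
        if not e:
            continue  # censored patients contribute only via risk sets
        # Risk set: everyone still under observation at time t.
        risk_set = [scores[j] for j in range(len(times)) if times[j] >= t]
        nll -= scores[i] - math.log(sum(math.exp(s) for s in risk_set))
    return nll

# Illustrative toy data: two events, one censored patient.
times  = [1.0, 2.0, 3.0]
events = [1, 1, 0]
scores = [1.0, 0.5, 0.0]
nll = cox_partial_negloglik(times, events, scores)
```

Because the loss depends only on score differences within each risk set, models that violate proportionality over time (crossing hazards, late-diverging subgroups) are exactly the cases this head cannot represent, motivating the dynamic survival extensions discussed below.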

Future directions
Future extensions of this work may address these limitations in several ways:

Multi-omics integration: Incorporating scRNA-seq, spatial proteomics, methylation, and copy-number profiles would allow more comprehensive modeling of tumor ecosystems and microenvironmental interactions.

Population diversity: Expanding evaluation to additional international datasets (e.g., ICGC, MET500, Asian and African cohorts) is essential to ensure unbiased, globally robust performance.

Real-time clinical decision-support tools: Embedding the model into rapid-response clinical pipelines—such as cloud-hosted APIs or point-of-care diagnostic systems—could enable real-world application in treatment planning and risk stratification.

Dynamic and causal survival models: Extending beyond proportional hazards toward dynamic GNN-based survival estimators and causal graph frameworks would support richer modeling of treatment effects and longitudinal risk.

Lightweight and deployable GNN variants: Developing pruning, quantization, or distilled versions of the Graph Transformer could improve feasibility in low-resource settings and accelerate clinical translation.

Supplementary Information

Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original article when reusing this content.
