본문으로 건너뛰기
← 뒤로

A highly resolved integrated single-cell atlas of human breast cancers.

1/5 보강
NAR genomics and bioinformatics 2026 Vol.8(1) p. lqaf217
Retraction 확인
출처

PICO 자동 추출 (휴리스틱, conf 2/4)

유사 논문
P · Population 대상 환자/모집단
138 patients.
I · Intervention 중재 / 시술
추출되지 않음
C · Comparison 대조 / 비교
추출되지 않음
O · Outcome 결과 / 결론
We also identified robust associations between TME composition and clinical phenotypes such as tumor subtype and grade that were not discernible when the analysis was limited to individual datasets, highlighting the need for atlas-based analyses. This atlas represents a valuable resource for further high-resolution analyses of TME heterogeneity within BC.

Chen A, Kroehling L, Ennis CS, Denis GV, Monti S

📝 환자 설명용 한 줄

In this study, we developed an integrated single-cell transcriptomic (scRNAseq) atlas of human breast cancer (BC), the largest resource of its kind, totaling >600 000 cells across 138 patients.

이 논문을 인용하기

↓ .bib ↓ .ris
APA Chen A, Kroehling L, et al. (2026). A highly resolved integrated single-cell atlas of human breast cancers.. NAR genomics and bioinformatics, 8(1), lqaf217. https://doi.org/10.1093/nargab/lqaf217
MLA Chen A, et al.. "A highly resolved integrated single-cell atlas of human breast cancers.." NAR genomics and bioinformatics, vol. 8, no. 1, 2026, pp. lqaf217.
PMID 41640877 ↗

Abstract

In this study, we developed an integrated single-cell transcriptomic (scRNAseq) atlas of human breast cancer (BC), the largest resource of its kind, totaling >600 000 cells across 138 patients. Rigorous integration and annotation of publicly available scRNAseq data enabled a highly resolved characterization of epithelial, immune, and stromal heterogeneity within the tumor microenvironment (TME). Within the immune compartment, we were able to characterize heterogeneity of CD4, CD8 T cells, and macrophage subpopulations. Within the stromal compartment, subpopulations of endothelial cells (ECs) and cancer-associated fibroblasts (CAFs) were resolved. Within the cancer epithelial compartment, we characterized the functional heterogeneity of cells across the axes of stemness, epithelial-mesenchymal plasticity, and canonical cancer pathways. Across all subpopulations observed in the TME, we performed a multi-resolution survival analysis to identify epithelial cell states and immune and stromal cell types, which conferred a survival advantage in both The Cancer Genome Atlas (TCGA), METABRIC, and SCANB. We also identified robust associations between TME composition and clinical phenotypes such as tumor subtype and grade that were not discernible when the analysis was limited to individual datasets, highlighting the need for atlas-based analyses. This atlas represents a valuable resource for further high-resolution analyses of TME heterogeneity within BC.

🏷️ 키워드 / MeSH 📖 같은 키워드 OA만

같은 제1저자의 인용 많은 논문 (5)

📖 전문 본문 읽기 PMC JATS · ~70 KB · 영문

Introduction

Introduction
Breast cancer (BC) is the most prevalent cancer and the second most common cause of cancer death in women [1]. It is defined as a malignancy of the epithelial duct in breast tissue, and is a highly heterogeneous disease [2, 3] for which clinical outcomes and responsiveness to treatment depend greatly on intrinsic factors related to the cancer cell, as well as extrinsic factors related to altered states of immune and stromal cells in the tumor microenvironment (TME) [4, 5, 6]. To improve our understanding of breast cancer progression and treatment, an extensive characterization of the heterogeneity in BC cells and the surrounding tumor microenvironment is required.
Single-cell transcriptomics (scRNAseq) remains the method of choice to characterize heterogeneity in tissue composition in terms of cell types – categories of cells that perform consistent functions, and in terms of cell states – categories of transient functions that can be shared across different cell types [7, 8]. The technology has been extensively applied to profile healthy breast tissue, as part of the larger human cell atlas effort. There are now highly-resolved scRNAseq atlases that have annotated 700 000–800 000 cells across 55–126 samples, yielding new insights into the diversity of immune, stromal, and epithelial populations in healthy breast tissue and how these are linked to clinical metadata [9, 10]. However, the application of scRNAseq to profile breast cancer (BC) samples has been limited by cohort sizes, with most studies having fewer than 30 samples [11–19, 20]. While these individual studies have laid substantial groundwork in characterizing tumor heterogeneity within BC, their limited sample size poses a challenge with reproducibility and identifying statistically significant associations between tissue composition and clinical metadata.
To address this challenge, we integrated eight publicly available scRNAseq datasets of untreated and unsorted BC samples [11–19, 20] to create an atlas of transcriptional heterogeneity in BC. The analysis of this combined atlas enabled (1) the reconciliation of mislabeled cell types, (2) the identification of novel cell types, and (3) the association of changes in TME heterogeneity with clinical phenotypes such as tumor subtype and grade with increased statistical power (Fig. 1A). Another study has attempted to integrate publicly-available scRNAseq BC data to create an atlas but compared to Xu et al. our study only contains unsorted BC samples, allowing for an unbiased characterization of inter-patient heterogeneity, and our atlas has a much a larger sample size (> 600 000 cells), enabling more robust identification of cell types and differentially abundant populations across phenotypes [21].
This integrated scRNAseq atlas provides an unbiased reference landscape of transcriptional heterogeneity in BC, comprised of more than 600 000 cells across 138 patients, the largest of its kind, and represents a valuable resource for hypothesis-driven analysis of TME heterogeneity. The data and methods described herein provide a robust framework for updating this BC atlas as new datasets are made publicly available [22].

Materials and methods

Materials and methods

Data acquisition
To create an unbiased scRNAseq atlas of human BC, we integrated publicly available datasets that contained untreated samples (samples from patients who had not undergone chemotherapy or other treatment before sequencing) and samples that were not sorted prior to sequencing (not biased towards a particular cell type). Given these criteria, we identified eight datasets to include in this atlas, totaling 138 patients and 621 200 cells (Table 1).

Data preprocessing
Each scRNAseq dataset was first pre-processed using Seurat [23] to remove low-quality cells, doublets, and normalize data before integration. We adopted current scRNAseq pre-processing best-practices by using adaptive thresholding based on median absolute deviation, instead of manual cutoffs, to filter out outlier cells based on three criteria: mitochondrial gene expression, number of unique genes detected per cell, and total number of genes detected per cell [24, 25]. For studies that only provided pre-filtered data, we retained all cells for downstream analysis. Following filtering, doublet scores were obtained using scDblFinder [26] and expression profiles were normalized by total expression and log-transformed using Seurat’s LogNormalize function, which has been demonstrated to work well in prior benchmarking [27].

Integration of single-cell data
Integration of filtered scRNAseq data was performed using various methods: Seurat’s V5 reciprocal principal component analysis (RPCA) [23], fastMNN [28], Harmony [29], scVI [30], and scANVI [31]. To facilitate fair comparison between the integration methods, the same number of variable features (5000) and PC dimensions (200) were used for integration when possible. For scVI and scANVI, integration is performed at the level of raw counts instead of embeddings, so only the variable feature selection process was included. To avoid biasing the integration towards individual sources of annotations e.g. author annotations or SingleR/CellTypist annotations, only the batch correction metrics: silhouette score, k-nearest neighbor batch effect test (kBET), and integration local inverse Simpson’s index (iLISI), were used to assess integration performance.

Inference of PAM50 subtype
To infer molecular subtypes of each sample using the PAM50 method, we first calculated a ‘pseudo-bulk’ profile for each tumor using Seurat’s AggregateExpression function, and then applied molecular.subtyping from the genefu R package to each pseudo-bulk profile with default parameters [32]. Each pseudobulk profile receives a score normalized between 0 and 1 for each subtype, indicating the likelihood of classification, and the molecular subtype with the highest score leads to the final classification (Supplementary Table S1A).

Discrete annotation of cell types
Annotation of cell types was performed using a combination of reference-based annotation methods: singleR [33], CellTypist [34, 35], unsupervised recursive partitioning: K2Taxonomer [36], and unsupervised clustering followed by differential expression analysis: MAST [37].

Reference-based annotation of cell type
Both singleR and CellTypist require a labelled scRNAseq reference dataset to compare query cells with. Since all of the samples in this study originate from breast tissue, we leveraged labeled scRNAseq data from the healthy breast cell atlas (HBCA) as a reference [10]. For singleR, the Wilcox method de.method = “Wilcox” was used, as this is more appropriate than the default methods when applying singleR to single-cell data. For CellTypist, we used a pre-trained breast tissue model that used labeled data from HBCA [35], and used majority voting to determine the final cell type label majority_voting = True, mode = ‘best match’.

Unsupervised clustering-based annotation
Both K2Taxonomer and clustering-based annotation rely on unsupervised clustering, which was performed using the Louvain algorithm for three subsets of the atlas separately: epithelial, immune, and stromal, based on expression of canonical markers (Epithelial: EPCAM, KRT8, KRT 18; Immune: PTPRC, CD3D, CD3E, CD68; Stromal: COL1A1, COL1A2, ENG). Performing clustering and annotation within each compartment separately enables the detailed characterization of heterogeneity within each compartment.
To identify differentially expressed genes (DEGs) between clusters, we used Seurat’s FindAllMarkers with the parameters test.use = ‘MAST’, and latent.vars = ‘batch’ to account for the negative-binomial distribution of single-cell data and to control for dataset-specific effects, respectively.
Whilst clustering-based annotation can quantify differences at the cluster level, the number of clusters, and the resulting DEGs obtained from downstream analysis, relies heavily on arbitrary values of the ‘resolution’ parameter in typical unsupervised clustering algorithms. To mitigate these limitations, we leveraged K2Taxonomer to first learn hierarchical relationships between the clusters identified in each compartment, enabling the annotation of clades of clusters instead of just individual clusters. K2Taxonomer was applied to the integrated embeddings obtained from RPCA, with the following parameters (featMetric=“F”, nBoots = 400, clustFunc = cKmeansDownsampleSqrt), which are recommended for large scRNAseq datasets [36]. To estimate the differentially expressed genes (DEGs) between nodes in the dendrogram, we used the normalized gene expression matrix and K2Taxonomer’s built-in function runDGEmods, which uses MAST to model differential expression and the batch variable as a covariate to control for dataset-specific effects.

Random forest-based annotation of discriminative factors
The randomForest package was used to train random forests to classify cluster membership, or meta-cluster membership (as determined by K2 Taxonomer), of single-cell profiles based on expression of continuous scores related to cell stemness, EMP, HBCA cell type, malignancy, and clinical metadata. The model was run with the parameter importance = TRUE to extract the most important features (as determined by gini-based importances), that differentiated epithelial clusters from each other.

Continuous annotation of cell states
Classifying discrete cell types is just one facet of heterogeneity, and there are many transient functional programs or cell states that require a more continuous scoring [38]. To characterize heterogeneity of functional states within the cancer epithelial compartment of our BC atlas, we scored each cell in terms of its malignancy with inferCNV [39], stemness with Cytotrace2 [40], Epithelial-Mesenchymal plasticity [41], expression of hallmark pathways [42, 43], as well as expression of epithelial subtypes identified in HBCA [10].

Inference of copy number variations
InferCNV was used to estimate copy number alterations for tumor epithelial cells from each dataset independently, with the following parameters: cutoff = 0.1, cluster_by_groups = TRUE, denoise = TRUE and HMM = FALSE, which are recommended for 10X scRNAseq data. Immune and stromal cells from each dataset were used as normal references, and from these normal reference sets, 100 immune and stromal cells were sampled to act as negative controls to validate inferCNV output.
For each cell, inferCNV estimates a score for each gene that is above one if there is inferred copy number gain, and below one if there is inferred copy number loss. We summarized these scores to estimate a malignancy score per cell that represents the mean absolute CNV score.
To classify each cell as malignant or non-malignant, for each dataset, we identified the 95% quantile of malignancy scores in the non-epithelial cells and labeled all epithelial cells with malignancy scores greater than this cutoff as malignant (Supplementary Table S1B).

Scoring of gene set activity
To characterize the heterogeneity of functional states present in the atlas, we leveraged publicly available transcriptional signatures (Supplementary Table S1C) and used a rank-based method (AUCell) [44], with default parameters, to score enrichment of signatures in each cell. For the immune compartment, we scored cells using signatures from pan-cancer studies of myeloid and lymphoid cell states [45, 46], as well as signatures of immune cell types from HBCA [10]. For the stromal compartment, we scored cells using signatures from a pan-cancer study of stromal cell heterogeneity [47]. For the epithelial compartment, we scored cells using canonical hallmark signatures [43], as well as signatures of epithelial cell types from HBCA [10].

Differential diversity and abundance analysis

Cell type diversity analysis
We used an entropy-based cell type diversity metric (CTDS) [48] to assess how the overall diversity in a patient’s TME changes across subtype, age, and tumor grade. The CTDS metric accounts for the number of cells per sample and the compositional nature of proportional data, which enables comparison across samples.

Differential abundance analysis
To test differences in cell type proportions across different phenotype groups e.g. tumor grade and subtype, we used the R package sccomp with bimodal_mean_variability_association = TRUE, which is recommended for scRNAseq data [49]. To isolate the effects of tumor grade on TME composition, we controlled for tumor subtype, age, and batch with the following formula:
Similarly, to isolate the effects of tumor subtype on TME composition, we controlled for grade, age, and batch.

Multi-resolution survival analysis of cell type populations
To identify the clinical relevance of cell type annotations identified in the atlas, we extracted differentially expressed genes defining each subpopulation found in our taxonomic analysis of each compartment (K2Taxonomer) [36], and used the R package brcasurv [50] to model the association between the expression of subpopulation specific genes and patient survival in both TCGA [3], METABRIC [51], and SCANB [52].
To address possible confounders in the survival analysis, we included patients' age, as well as patient-level gene set projection scores of inflammation and proliferation that have previously been associated with poor prognosis and demonstrated to be necessary for correction in survival modelling [36, 53, 54]. The patient-level gene set scores were obtained by using the R package GSVA [55] to project inflammation [54], proliferation [53], as well as the cell type-specific gene sets onto transcriptional profiles from TCGA and METABRIC. The Cox proportional hazards model underlying the survival analysis controlled for these confounders with the following formula:
Since the Cox proportional hazards model assumes that the hazard ratio is constant over time (proportional hazards assumption), we filtered out results that did not satisfy this criterion (Supplementary Table S1D). Finally, to control for multiple testing, we corrected P-values in each Cox model using the FDR method [56] in the p.adjust R package, and combined p-values across analyses performed in TCGA, METABRIC, and SCANB using Fisher’s method [57].

Results

Results

Constructing the integrated core atlas
The BC atlas was constructed by integrating eight publicly available scRNAseq datasets of untreated primary BC tumors. Across the eight datasets, samples were sequenced at varying throughput levels and had different distributions of clinical and patient metadata, including subtype, grade, and age (Fig. 1A–C). Prior to integration, batch effects were evident as cells separated by the dataset of origin, rather than by biological factors (Supplementary Fig. S1A). Six different integration approaches were applied, and the best-performing method, as determined by atlas integration batch correction metrics [58], was Seurat’s reciprocal PCA method (RPCA) (Fig. 1D).

Relabeling of broad cell types
To annotate cells in the integrated atlas, we first defined three compartments: epithelial, immune, and stromal, by cross-referencing annotations from SingleR, CellTypist, and author annotations with broad clustering (Louvain clustering with resolution 0.1) (Supplementary Fig. S1B). The resulting compartments showed distinct expression of canonical marker genes (Epithelial: EPCAM, KRT8, KRT18; Immune: PTPRC, CD3D, CD3E, CD68; Stromal: COL1A1, COL1A2, ENG) and diverse representation across all eight source datasets (Fig. 1E–G). More than 300 000 cells that previously had no available cell type annotation were annotated using the cluster-based annotations from the atlas (Fig. 1H).
These annotations of broad cell types based on integrated data also resulted in a relabeling of broad cell types for over 9172 cells (around 1% of the atlas) across 5 datasets (e.g. relabeling a cell as stromal that was originally labeled immune).

Cancer epithelial diversity
Intra-tumor heterogeneity within cancer cells stems from multiple factors, including but not limited to genomic instability, epigenetic alterations, and clonal evolution, which are all implicated in the treatment of cancer [59]. In this BC atlas, we sought to characterize intra and inter-tumor transcriptional heterogeneity by scoring each cancer cell with respect to potency/stemness [40], epithelial-mesenchymal plasticity (EMP) [41], malignancy [39], ‘activity’ of hallmark pathways [43], expression of epithelial cell type markers from HBCA [10], as well as clinical metadata related to age, tumor grade, and subtype (Fig. 2A). Further, we sought to find the driving factors of variation by leveraging a recursive partitioning tree-based method, K2 Taxonomer [36], to identify meta-clusters and hierarchical relationships between them (Fig. 2B), and also trained random forests to extract the most discriminative features distinguishing each cluster and meta-cluster (Fig. 2C).
To identify functional groups of cancer cells shared across patients in the atlas, we performed unsupervised clustering (Louvain clustering with resolution 0.1) of cancer epithelial cells and initially identified 23 clusters that largely stratified by cell typist annotation based on the clinical subtype, and PAM50 inferred subtype (Supplementary Fig. S2A–C). The PAM50 ‘intrinsic’ molecular subtype of a tumor sample is related to, but not equivalent to, clinical annotations of tumor subtype based on immunohistochemistry or in-situ hybridization (Supplementary Fig. S2D). It provides valuable tumor subtype information, especially when clinical subtype data is unavailable.
Though most clusters captured epithelial variation across tumor subtypes and individual donors (Supplementary Fig. S2E and F), across the range of resolution parameters evaluated (0.1–1.0), even the coarsest resolution of 0.1 yielded multiple clusters (7, 9, 10, 11, 12, 19, 21) mostly consisting of cells from individual patients, as also revealed by their lowest entropy of donor proportions (Supplementary Fig. S2G and H). This suggests that the variation captured within these clusters was mostly driven by tumor-specific clonal expansion, and given our goal of defining shared programs of cancer epithelial heterogeneity, they were excluded from further downstream characterization, leaving sixteen epithelial clusters in the atlas. The differentially expressed genes for each of these sixteen epithelial clusters were identified using K2 Taxonomer [36] and MAST [37] (Supplementary Table S2) and characterized using our taxonomy-guided random forest approach described below.

Multi-resolution characterization of cancer epithelial cells
Among the remaining 16 epithelial clusters, we identified the driving factors of variation within our cancer epithelial compartment by training random forest-based classifiers at three separate levels of taxonomic resolution (Fig. 2B and C). In previous integrated single-cell studies of breast cancer, cancer epithelial heterogeneity was characterized with respect to PAM50 and clinical subtypes[11, 12], as well as functional annotations by, e.g. Hallmark signatures [21], but has not yet been done in a hierarchical manner to identify the driving factors of variation.
This multi-level characterization revealed segregation of cancer epithelial cells at the highest level (Level 1) by stemness, EMP, and basal phenotypes, with clusters 13 through 4 (13∼4) on the right side of the dendrogram exhibiting more stemness, more EMP (more mesenchymal), and being more basal-like compared to clusters on the left side (clusters 0∼16). At the next level of the dendrogram (Level 2), cancer cells on both sides segregated by age, with the left branch of the tree (clusters 0∼16) segregating by older donors (clusters 0∼20, 8, and 15∼16) and younger donors (clusters 1∼5), and the right branch of the tree (clusters 13∼4) segregating by older (cluster 4) and younger (clusters 13∼2) patients. At the next level of characterization (Level 3), finer groups of cancer epithelial heterogeneity emerge with clusters exhibiting decreased mitotic spindle activity (cluster 0), decreased TNFA signaling (cluster 20), increased EMT activity (cluster 13), as well as subtype-specific clustering (Luminal A: 1∼17, 18; Luminal B: 6, 5, 15, 16; HER2: 3, 14), and grade-specific clustering (high-grade tumors: 8; low-grade tumors: 6, 5, 18).

Immune and stromal cell diversity in the TME
Across the immune and stromal compartments of the atlas, 31 immune and 14 stromal subpopulations were characterized (Figs. 3A and 4A). These subpopulations were identified by integrating data from automatic annotation methods: SingleR [33] and CellTypist [34] (Supplementary Figs. S3, S4), hierarchical tree-based methods (K2 Taxonomer) [36] (Figs. 3B and 4B), as well as marker genes obtained from differential expression analysis (MAST) [37] (Supplementary Tables S3, S4).

Reconciliation of immune cell subtypes to pan-cancer subtypes
Macrophages and monocytes formed eight subpopulations, which were found to have analogs in previous pan-cancer and breast cancer-specific characterizations of myeloid cells [10, 45]. Using nomenclature from Guimaraes et al. [45], we distinguish between tissue-resident macrophages with Mac RTM and circulating monocyte-derived macrophages as Mac. In our atlas, we identified two macrophage populations associated with lipid metabolism, Mac RTM LA and Mac LA, defined by expression of EGR1, FOSB, and SPP1, BNIP3 respectively; two macrophage populations associated with interferon signaling, Mac RTM IFN and Mac IFN, defined by expression of CXCL9, CXCL10 and AREG, FCN1; a resident tissue macrophage, RTM MT, also expressing monocyte markers defined by GPNMB, LGMN, and LIPA; an early-stage macrophage, Mac ES, defined by TIMP3, CHI3L1, ACP5; and an antigen presenting macrophage, Mac Anti., defined by STMN1, HIST1H4C, HMGB2. The remaining macrophages were not significantly associated with known functional annotations and were thus labeled as general macrophage/monocyte cells Mac/Mono (Fig. 3A–C). While previous studies have characterized heterogeneity of macrophages in terms of FOLR2 + and TREM2 + subtypes [60, 61], our analysis leveraged pan-cancer macrophage signatures to identify eight subpopulations that greatly expand beyond these two known subtypes.
T cells formed nine clusters (Fig. 3A), which were also found to have analogs in previous pan-cancer studies [13, 46, 62]. These included CD4 and CD8 effector memory T cells (CD4/CD8 Tem), which are characterized by expression of LMNA, FOS, and GZMK, EOMES respectively; CD4 Tregs defined by expression of FOXP3, IL2RA, and TNFRSF4; an interferon stimulated group of CD8 T cells (CD8 ISG) defined by strong expression of interferon signaling genes IFIT1, IFIT2, IFIT3; naïve CD4 cells characterized by expression of CCR7, PASK, LEF1; exhausted CD4 T cells (CD4 Ex) defined by expression of canonical exhaustion markers such as LAG3, HAVCR2, PDCD1 and TIGIT; CD4 T follicular helper cells (CD4 Tfh) expressing transcription factors TOX, TOX2 and chemokine CXCL13; CD4 T helper cells (CD4 Th) which in comparison to CD4 Tfh cells expressed high levels of interleukin genes IL7R, as well as the chemokine receptor CCR6. Finally, a proliferative T cell cluster (T. Prolif) was identified exhibiting strong expression of MIK67 and STMN1 (Fig. 3A, B, and D).
NK cells formed six clusters (Fig. 3A and D), which were collectively defined by expression of Killer Cell Lectin receptors (KLRD1, KLRF1) and growth factor FGFBP2. Comparison of these clusters to the NK clusters identified in a previous single-cell breast cancer atlas [21] identified two overlapping populations: NK-2 or (NK Cyto), a subpopulation expressing high levels of cytotoxicity-related genes GZMA, GZMB, PRF1; and NK-0, or reprogrammed NK cell (rNK), which expressed FOS, JUN, NR4A1, and DUSP1. The reproducibility of these two sub-populations in our larger mega-analysis of single-cell breast samples suggests that the cytotoxic and reprogrammed NK cell states are indeed robust sub-populations of NK cells in the breast TME. The remaining four NK subtypes identified in the previous study did not overlap significantly with the markers identified in our atlas (Supplementary Table S3). We therefore employed a conservative approach to label these remaining clusters as general NK cells.
Mast cells formed one cluster (Fig. 3A and C), defined by expression of the enzymes TPSB2, CPA3, and TPSAB1.
B Cells formed five subpopulations collectively defined by canonical B cell markers such as CD19, MS4A1, CD74, as well as expression of human leukocyte antigen genes such as HLA-DRA and HLA-DPB1. We used the signatures from a previous study characterizing pan-cancer B Cell heterogeneity across 270 patients to identify an activated B Cell population (B ACB) defined by expression of NR4A2, EGR, LY9; a naïve B cell population (B Naïve), defined by expression of TCL1A, FCER2, IL4R; a subpopulation expressing interferon-induced genes (B IFIT) defined by expression of IFIT3, ISG15, IFITM1; and germinal center B cells, defined by expression of LMO2, SUGCT, and ITGB1. B Cells that were not significantly enriched for functional signatures identified in Ma et al. were labeled as general B cells (Fig. 3A and E).

Identification of dendritic subtypes
Dendritic cells formed four clusters (Fig. 3A and C), which include two conventional dendritic cell populations, one characterized by expression of CLEC9A (cDC1), and another characterized by expression of CD1C and CLEC10A (cDC2); a mature dendritic cell population (mDC) defined by CCL22, CCR7, and CCL19 expression; and a plasmacytoid dendritic cell population (pDC) defined by expression of GZMB, JCHAIN, and PTGDS.

Reconciliation of stromal cell subtypes to pan-cancer subtypes
Fibroblasts formed six clusters (Fig. 4A–C), which were also reconciled to previously characterized pan-cancer stromal subtypes [47]. This includes five clusters of cancer-associated fibroblasts (CAFs) consisting of a population of COL11A1 + CAFs (COL11A1, COL8A2) implicated in collagen metabolism [47], a population of LAMP5 CAFs defined by expression of LAMP5 and cystatins CST1 and CST2 implicated in promoting EMT [47], a population of PI16 + CAFs defined by expression of PI16 and PCOLCE2, a population of DPT + CAFs defined by expression of DPT and CAPN6, as well as population of CA12 + CAFs defined by expression of CA12 and SLC2A1 which has been implicated in glycolysis metabolism and hypoxia [47]. Apart from these CAF subtypes, myofibroblasts also formed one cluster by expression of RGS5 and CDH6.
Endothelial cells (ECs) surround blood vessels in the TME, which are responsible for angiogenesis and tumor growth, and form a total of five clusters in the atlas (Fig. 4A). Four clusters emerged based on the function of these endothelial cells in supporting vasculature: endothelial vein, arterial, capillary, and lymphatic cells segregated based on expression of different marker genes (Fig. 4A–C). A further cluster of immature endothelial cells (Endo Imm), defined by expression of PLVAP, VWA1, and CA4, was identified and has been implicated in poor tumor prognosis [47]. Recent pan-cancer studies of stromal cell heterogeneity have implicated SELE + and CXCR4 + tip cells as the prominent angiogenic and pro-inflammatory endothelial subpopulations in the TME [63]. While our endothelial vein cells (Endo Ven.) differentially express SELE, we did not identify CXCR4 + endothelial cells in our atlas, perhaps due to the differences between pan-cancer subtypes and breast cancer-specific subtypes.
Other stromal subtypes captured in the atlas include vascular smooth muscle cells (VSMC), defined by expression of RERGL, and MYH11; extracellular matrix related pericytes (ECM PCs), defined by expression of collagen genes COL4A1, and COL4A2; as well as a set of proliferating stromal cells (Prolif. Stromal) characterized by high expression of proliferative markers, including TOP2A, MKI67.

Taxonomic characterization of immune and stromal subtypes
We utilized K2Taxonomer to organize the clusters in the immune and stromal compartments into hierarchical taxonomies. Annotating groups of clusters, or clades, mitigates the uncertainty in annotation due to over-clustering. Taxonomic characterization also captures groups of cells sharing similar functions and cell states rather than just cell types.
In the immune compartment, cells are largely segregated by lymphoid and myeloid lineages. The exceptions were a population of dendritic cells, which clustered with B and plasma cells in a clade due to their shared expression of antigen-presenting genes HLA-DQB1, HLA-DQA1, as well as activation markers CD83 [64], and Mast cells, which clustered within the lymphoid clade due to the lack of expression of macrophage and dendritic markers (CD68, CD1C). These exceptions suggest that immune cells in the TME may share similar transcriptional states despite having different lineages (Fig. 3B).
In the stromal compartment, cells are partitioned by endothelial and non-endothelial cells at the first level, and within the non-endothelial clade, further separated between CAFs and mural cells, and lastly between smooth muscle cells, pericytes, and myofibroblasts, indicating strong transcriptome-based partitioning of these stromal subtypes (Fig. 4B).

Cell type diversity of tumors changes across phenotypes
Using these improved annotations within the atlas, we used an entropy-based cell type diversity score (CTDS) [48] to assess how the overall diversity of the TME changes across patient phenotypes (Fig. 5A). The CTDS score yields the highest value when all cell types (or states) have equal representation in a sample, and the lowest value when a single-cell type (or state) is represented in a sample.
Across clinical subtypes, the CTDS of observed epithelial cell states and stromal cell types observed was lowest in TNBC compared to ER + and HER2 +, whereas the average CTDS of immune cell types was highest in TNBC. Using the median age as a cutoff, we compared CTDS across old and young donors and observed a consistent decrease in CTDS across all three compartments in older tumors compared to younger tumors. This expands upon previous studies that identified decreased immune cell-type diversity in old subjects compared to young subjects, but only within PBMCs [48]. As tumor grade increased, we also observed a consistent decrease in cell-type diversity within the epithelial and stromal compartments. The cell-type diversity scores provide a global summary of changes in tumor heterogeneity that can complement the more granular differential abundance analysis of individual cell types within the tumor.

Improved estimates of differential cell type abundance across phenotypes
Using these improved annotations within the atlas, we used sccomp [49] to estimate differential cell type abundance across phenotypes such as tumor grade and subtype (Methods).
HER2 + tumors are typically considered to be ‘hot’ tumors with increased immune infiltration compared to other subtypes of tumors [65], but the details of their cell type composition vis-à-vis other tumor subtypes is unclear. In our analysis, we observed that one specific subset of reprogrammed NK cells, rNK, increased in abundance in HER2 + tumors relative to other tumor types, whereas pDCs and Mac/Mono decreased in relative abundance (Fig. 5B). Previous studies have found that pDCs were depleted in HER2 + tumors compared to HER2- tumors [66], and that NK cells were associated with HER2 + tumors [67]. However, this is to our knowledge the first report of a specific subpopulation of NK cells found to be associated with HER2 + tumors. With regards to non-immune subpopulations, HER2 + tumors were enriched in immature endothelial cells, and depleted in CA12 + and COL11A1 + CAFs (Supplementary Fig. S5).
The immune microenvironment of ER + tumors has been characterized as macrophage-driven [68]. In our analysis, we recapitulated this finding and observed that Mac/Mono were enriched in ER + tumors compared to other tumors (Fig. 5B). With regards to non-immune subpopulations, ER + tumors were enriched in COL11A1 + CAFs and myofibroblasts and were depleted in proliferative stromal cells (Supplementary Fig. S5).
TNBC tumors are known to exhibit a heterogeneous immune microenvironment, characterized by both increased tumor-infiltrating lymphocytes (TILS) [69], and increased tumor-associated macrophages [70]. However, specific immune cell types enriched within TNBC have yet to be fully characterized in studies with sufficient statistical power. In our analysis, we identified significant enrichment of Mac RTM IFN, Mac IFN, Mac LA, cDC2, and CD8 Tex cells in TNBC tumors. Conversely, activated B cells and rNK cells were notably depleted. Previous studies have reported expansion of pDCs [71] and depletion of mast cells in TNBC tumors [72]. While these trends were also observed in our atlas, their changes were not statistically significant (fdr > 0.05) (Fig. 5B). Regarding non-immune cell subpopulations, TNBC tumors were enriched in CA12 + CAFs and in Prolif. Stromal but were depleted in Endo Lymph. (Supplementary Fig. S5).
Comparison of the atlas-based estimates of differential cell type abundances across tumor grades with analogous estimates derived from the individual datasets further highlighted the utility of the integrated atlas in uncovering unique associations (Fig. 5C). Using our atlas, we not only recapitulated known immune TME associations with tumor grade, such as the enrichment of CD8 Tex in higher grade tumors [73], but also identified novel associations, including enrichment of CD4 Tfh and Mac IFN, alongside depletion of Mac ES and Mast cells. Beyond confirming previously reported associations, we sought to validate our novel findings using publicly available bulk breast cancer datasets via regression analysis. Of the three bulk breast cancer datasets analyzed in this study (TCGA [3], METABRIC [51], SCANB [52], only METABRIC had tumor grade data. Within METABRIC, we confirmed a positive association between higher tumor grade (grade 3 versus grade 1) and the abundance of CD8 Tex, CD4 Tfh, and Mac IFN, with both CD8 Tex and Mac IFN passing the significance threshold (q < 0.05) (Supplementary Table S1E). For the negative associations identified in our atlas’ grade analysis, we validated the depletion of Mast cells with increasing tumor grade. Although early-stage macrophages (Mac ES) were negatively associated with tumor grade in the atlas, they exhibited a positive but non-significant association in METABRIC (P-value > 0.05). Taken together, four out of five of the immune TME associations discovered in our atlas were independently supported by METABRIC data (Supplementary Table S1E), underscoring the robustness and biological relevance of our integrated approach.
Finally, we found that significant estimates of differential cell type abundance in the atlas were not significant or in discordant directions when tested in individual datasets. For instance, CD4 Treg cells were observed to increase with grade in two studies (Wang 2024 and Wu Natgen 2021), but after including all the datasets in the atlas, they did not reach atlas-wide significance. Similarly, Mast cells yielded no significant changes in one of the datasets tested (Wang 2024), although when all datasets were combined in the atlas, it was the cell type that most significantly decreased with grade. Only three datasets were included in this comparison since datasets need to have both high-grade (grade 3) and low-grade tumors (grade 1 & 2) to be included in this differential abundance analysis, which further highlights the need for integration so that diverse patient and clinical metadata can be pooled in mega-analysis-based approaches.

Multi-resolution cell type associations with patient survival
To assess the clinical relevance of the subpopulations identified in the atlas, we performed survival analyses by projecting bulk RNA expression data from TCGA, METABRIC, and SCANB onto the gene signatures that defined each subgroup within our taxonomic analyses (Fig. 6). To correct for the confounding effects of patient age, proliferation, and inflammation with respect to patient survival [36, 53], the Cox proportional hazard model also included these factors as covariates (Methods).
Two cell populations were significantly associated with survival in at least two of the datasets, and both had a positive survival association (hazard ratio less than 1). These include a subpopulation of epithelial cells (Cluster 8), which is characterized by low stemness (more differentiated) and low grade (Fig. 2A and C), and a cluster of Mast immune cells (Cluster 1). Both Mast cell infiltration as well as low-grade phenotype within tumors have been previously associated with a favorable prognosis in BC [74, 75]. Though no single-cell type/state was associated with decreased survival in at least two datasets separately, when combining the results by Fisher’s method, increased CAFs (COL11A1+) activity was collectively associated with decreased survival. Finally, we repeated this analysis within each tumor molecular subtype and identified subtype-specific subpopulations differentially associated with survival (Supplementary Table S1D).

Discussion

Discussion
In this study, we integrated scRNAseq data from 138 primary BC tumors collected across eight studies to create the largest transcriptomic atlas of human BC to date. The large number of patients in our integrated atlas, along with the rigorous annotation of both discrete cell types and continuous cell states, will provide a valuable resource for researchers seeking to explore associations of gene expression and or cell type composition with clinical and molecular phenotypes with increased statistical power.
Our characterization of the cancer epithelial compartment of the atlas revealed substantial heterogeneity between cancer cells along the axes of potency/stemness (differentiation continuum), EMP (epithelial-mesenchymal continuum), expression of hallmark and HBCA pathways, as well as clinical phenotypes. By leveraging K2Taxonomer and random forest models, we performed a multi-resolution characterization of the cancer epithelial compartment and identified the most discriminative factors differentiating epithelial clusters at different levels of the taxonomy of BC epithelial cells. At the first level, we found that BC tumor cells primarily segregated due to EMP, stemness, and basal phenotypes. At the second level, tumor cells are primarily segregated by age, suggesting that aging processes can be a major driver of tumor heterogeneity [76]. At the third level, tumor cells separate through more granular hallmark pathways, tumor subtype, and grade. While previous studies have employed NMF to identify novel gene programs defining tumor heterogeneity [77], in our study, we sought to characterize cancer epithelial heterogeneity using a combination of known epithelial phenotypes i.e. stemness and EMP, clinical metadata i.e. patient age, tumor subtype, and grade, and gene programs i.e. hallmarks and HBCA. As such, more flexible methods are needed, such as our K2 taxonomy-based random-forest analysis. Further, our multi-resolution characterization of cancer cells has several advantages over typical unsupervised clustering approaches, including mitigating the risks of over-clustering and thus over-annotation, as well as being able to identify the major factors driving heterogeneity in a dataset.
In the immune and stromal compartments, our analysis substantially expanded the repertoire of cell type annotations previously reported in individual datasets by integrating insights from pan-cancer studies of myeloid cells, B-lymphocytes, and T cells. Notably, we identified nine distinct T Cell subpopulations, thereby expanding prior characterizations of T cell heterogeneity within the breast TME [78]. Additionally, while we reproduced NK cell subpopulations described in an existing breast cancer atlas, our substantially larger sample size enabled us to distinguish which NK subsets are robustly reproducible across cohorts and which may represent dataset-specific observations. Regarding stromal heterogeneity, recent work by Liu et al. utilized spatial transcriptomics to incorporate information from neighboring cells, resolving four spatial CAF subtypes common across cancers [79]. In contrast, since our atlas is based solely on transcriptomics data, our annotation strategy relied on pan-cancer stromal signatures, through which we identified six novel CAF subpopulations. This highlights complementary strengths of transcriptomics and spatial approaches for elucidating stromal complexity.
Using the derived annotations of immune cell types and states, we performed cell type diversity and differential abundance analyses to capture global and more granular changes in TME cell type composition with respect to clinical phenotypes such as tumor subtype (ER+, HER2+, TNBC). The cell type diversity analysis revealed an association between increasing tumor grade and decreasing overall TME diversity, as well as age-associated shifts across epithelial, stromal, and immune cells. Differential abundance analysis further resolved immune cell composition differences unique to each subtype, offering insights relevant for therapeutic strategies. The enrichment of and interferon signaling macrophages in TNBC suggests promising therapeutic targets, given their respective roles in antigen presentation driving anti-tumor T activation [80] and in shaping a proinflammatory yet immunosuppressive TME [45]. The observed expansion of Mac LA in TNBC is of interest and may help explain recent findings that TNBC is uniquely responsive to lipid-targeting drugs [81]. In HER2 + tumors, the expansion of rNK cells may underlie the efficacy of anti-HER2 therapies known to recruit NK cells [82], underscoring the potential to develop additional NK-targeting treatments, especially since NK cell activity serves as a prognostic biomarker in this subtype. Collectively, these subtype-specific differences in the immune TME provide mechanistic insight into the efficacy of existing cancer therapies and may guide prioritization of future therapeutic targets. Importantly, repeating the differential abundance analysis using only individual datasets from the atlas revealed considerable variability across cohorts. This underscores the necessity of data integration to achieve robust and statistically significant associations between cell type composition and clinical phenotypes.
Our atlas-wide survival analysis of cell type subpopulations, utilizing bulk data from TCGA, METABRIC, and SCANB (Supplementary Table S1D) recapitulated well-established associations, such as decreased mortality risk linked to increased CD4 memory T cell activity and higher proportions of low-grade cancer epithelial cells. Importantly, we also identified novel associations, including increased mortality risk associated with specific endothelial and CAF subpopulations, which may serve as valuable prognostic markers in breast cancer.
A key limitation of this atlas is the absence of adipocytes, despite their recognized importance in cancer progression and aggressiveness [83]. This omission stems from technical challenges inherent to scRNAseq, which inadequately capture adipocyte populations. Complementary technologies such as single-nucleus RNA sequencing and spatial transcriptomics will be essential to characterize adipocyte heterogeneity across patients. Additionally, the atlas represents a static snapshot of the transcriptome and does not capture dynamic changes in response to treatments like radiation or chemotherapy, which could significantly alter the cellular composition and gene expression profiles. Future studies incorporating temporal analyses of cellular responses to various therapies could enhance the atlas's utility in guiding personalized treatment strategies and understanding mechanisms of treatment resistance.
In summary, this integrated scRNAseq BC atlas, the largest of its kind to date, addresses a critical gap in our understanding of tumor heterogeneity by providing a comprehensive ‘reference’ landscape of cell populations which not only enables statistically powered analyses of cell type diversity, composition, and survival, but also provides a framework for the construction of further atlases to interrogate tumor heterogeneity.

Supplementary Material

Supplementary Material
lqaf217_Supplemental_Files

출처: PubMed Central (JATS). 라이선스는 원 publisher 정책을 따릅니다 — 인용 시 원문을 표기해 주세요.

🏷️ 같은 키워드 · 무료전문 — 이 논문 MeSH/keyword 기반

🟢 PMC 전문 열기