본문으로 건너뛰기
← 뒤로

Canopy2: Tumor Phylogeny Inference by Bulk DNA and Single-Cell RNA Sequencing.

1/5 보강
Statistics in biosciences 2026 Vol.18(1) p. 68-110
Retraction 확인
출처

PICO 자동 추출 (휴리스틱, conf 2/4)

유사 논문
P · Population 대상 환자/모집단
Canopy2 is an open-source R package available at https://github.com/annweideman/canopy2.
I · Intervention 중재 / 시술
추출되지 않음
C · Comparison 대조 / 비교
추출되지 않음
O · Outcome 결과 / 결론
We further assess the performance of Canopy2 through application to breast cancer and glioblastoma data, benchmarking against existing methods. Canopy2 is an open-source R package available at https://github.com/annweideman/canopy2.

Weideman AMK, Wang R, Ibrahim JG, Jiang Y

📝 환자 설명용 한 줄

Tumors are comprised of a mixture of distinct cell populations that differ in terms of genetic makeup and function.

이 논문을 인용하기

↓ .bib ↓ .ris
APA Weideman AMK, Wang R, et al. (2026). Canopy2: Tumor Phylogeny Inference by Bulk DNA and Single-Cell RNA Sequencing.. Statistics in biosciences, 18(1), 68-110. https://doi.org/10.1007/s12561-024-09466-1
MLA Weideman AMK, et al.. "Canopy2: Tumor Phylogeny Inference by Bulk DNA and Single-Cell RNA Sequencing.." Statistics in biosciences, vol. 18, no. 1, 2026, pp. 68-110.
PMID 41694548 ↗

Abstract

Tumors are comprised of a mixture of distinct cell populations that differ in terms of genetic makeup and function. Such heterogeneity plays a role in the development of drug resistance and the ineffectiveness of targeted cancer therapies. Insight into this complexity can be obtained through the construction of a phylogenetic tree, which illustrates the evolutionary lineage of tumor cells as they acquire mutations over time. We propose Canopy2, a Bayesian framework that uses single nucleotide variants derived from bulk DNA and single-cell RNA sequencing to infer tumor phylogeny and conduct mutational profiling of tumor subpopulations. Canopy2 uses Markov chain Monte Carlo methods to sample from a joint probability distribution involving a mixture of binomial and beta-binomial distributions, specifically chosen to account for the sparsity and stochasticity of the single-cell data. Canopy2 demystifies the sources of zeros in the single-cell data and separates zeros categorized as non-cancerous (cells without mutations), stochastic (mutations not expressed due to bursting), and technical (expressed mutations not picked up by sequencing). Simulations demonstrate that Canopy2 consistently outperforms competing methods and reconstructs the clonal tree with high fidelity, even in situations involving low sequencing depth, poor single-cell yield, and highly-advanced and polyclonal tumors. We further assess the performance of Canopy2 through application to breast cancer and glioblastoma data, benchmarking against existing methods. Canopy2 is an open-source R package available at https://github.com/annweideman/canopy2.

🏷️ 키워드 / MeSH 📖 같은 키워드 OA만

📖 전문 본문 읽기 PMC JATS · ~213 KB · 영문

Introduction

Introduction
Cancer progression is marked by a cascade of genetic and epigenetic mutations subjected to natural selection. A patient’s tumor often contains various cell populations that differ in their genetic and phenotypic characteristics [1]. Variability within tumors is a key factor in the failure of targeted treatments and the development of therapeutic resistance, underscoring the importance of investigating tumor heterogeneity [2]. To address this, there has been an increase in efforts to sequence tumors at multiple time points and/or from spatially distinct resections to better understand clonal evolution [3]. Despite much progress, bulk DNA sequencing does not allow the characterization of small clones and often returns multiple cancer evolutionary histories that are equally supported by the data input. Single-cell sequencing circumvents the averaging of artifacts associated with bulk population data and offers unprecedented opportunities to study intra-tumor cellular heterogeneity and identify rare cell subpopulations [4].
Although single-cell DNA sequencing (scDNA-seq) technologies are now available for profiling tumor tissues at high resolution [5–7], they are less accessible and have lower throughput than single-cell RNA sequencing (scRNA-seq). Multiple cancer studies have adopted scRNA-seq to interrogate intra-tumor heterogeneity, yet they either ignore any genotypic diversity or group cells into large clonal subpopulations based on chromosome-level copy number aberrations (CNAs) [8–10]. Tumors consist of genotypically mixed subpopulations; numerous subclonal mutations may accumulate within the cellular architecture inferred by scRNA-seq. Without a comprehensive phylogenetic reconstruction, it is impossible to definitively rule out genetic inference and disentangle the confounded DNA- and RNA-level variations.
While methods have been developed to profile CNAs by scRNA-seq in cancer [8, 11], they are typically applicable to cancers with genome-wide ploidy changes, suffer from low resolution, and also make a stringent assumption on the gene-dosage effect between the copy number and gene expression. On the other hand, multiple studies [12, 13] have demonstrated the applicability of profiling single nucleotide variants (SNVs) by full-transcript scRNA-seq technologies, such as Smart-seq2 [14] and Smart-seq3 [15]. In this paper, we also primarily focus on SNVs, while additionally outlining adaptations to incorporate CNAs in the framework.
Assessing intra-tumor heterogeneity and reconstructing cancer phylogeny by scRNA-seq is challenging due to the high proportion of zeros [16], and this issue is further exacerbated in the cancer genomics setting. The ground truth of whether a cell harvests a mutation is not directly observed but rather mixed with technical and biological noise by scRNA-seq. As a result, observed zero mutational read counts can arise from different sources. First, the subclone that the cell belongs to may not have the mutation, and this is directly related to cancer phylogeny and its mutational structure. Second, because we seek to capture somatic mutations by scRNA-seq, a zero mutational read count can simply be due to the gene harvesting the mutation not being expressed – a phenomenon known as transcriptional bursting. This is a crucial aspect, and there is both theoretical [17, 18] and experimental [19] work that shows RNA transcription to be a bursty process with periods of activation separated by periods of inactivation. Finally, current scRNA-seq protocols introduce substantial biases and artifacts during the library preparation and sequencing procedure, such as gene dropouts and amplification bias [20]. Therefore, a zero read count can also be from sequencing errors or low sequencing depths, and these have been generally categorized as the technical zeros.
To address the aforementioned challenges, bulk DNA sequencing (DNA-seq), including whole-exome sequencing (WES) and whole-genome sequencing (WGS), continues to prove its utility as an efficient and effective approach for characterizing cancer mutations at various time points and/or across different tumor dissections. Developing a unified model that integrates both bulk DNA-seq and scRNA-seq data, while accounting for the various sources of error, offers a distinct advantage. As an illustrative example, Fig. 1A outlines a cancer phylogenetic tree comprised of bifurcations – at which a new mutation or set of mutations is introduced – and branch tips. These branch tips correspond to subclones with their respective cellular compositions, which sum to 100%. The leftmost, non-bifurcating branch denotes the population of normal cells, whose proportion reflects tumor purity (Fig. 1A). For data input, the number of mutational reads and variant allele frequencies (VAFs) are shown in Fig. 1B and Fig. 1C for scRNA-seq and bulk DNA-seq, respectively. When considering only bulk data, multiple phylogenetic structures can arise that are consistent with the VAFs (Fig. 1D); when considering single-cell data alone, there is typically preservation of the branching structure but not the temporal order of the variants due to noise in the mutational profiles (Fig. 1E). Collectively, joint analysis of the bulk and single-cell data should improve phylogenetic inference. Where one data type fails to recover the correct configuration, whether it be the branching structure or temporal order, the other data type is expected to fill the void.
Here, we propose Canopy2, which extends our previous work, Canopy [21], to conduct joint inference utilizing SNVs obtained from bulk DNA-seq and scRNA-seq data. By default, Canopy2 takes somatic point mutations inferred computationally from the transcribed regions as input. It then assigns cells to their respective clonal groups and reconstructs the tumor’s evolutionary history using a Bayesian approach. This method provides a confidence assessment based on the posterior distribution within the phylogenetic tree space. Specifically, Canopy2 takes advantage of both data types of the same samples collected from temporally/spatially separated dissections of the same patient to simultaneously (i) identify the cell of origin (cancer initiating cell), (ii) infer the temporal order of the point mutations along the phylogenetic tree, (iii) determine the mutational profiles of the subclones, (iv) identify the single-cells comprising these subclones, and (v) estimate the proportions of subclones within each bulk sample. To the best of our best knowledge, other existing methods (summarized in Table 1) that closely resemble Canopy2 with the same input data types are rather limited. Specifically, ddclone [22] and DENDRO [12] both focus on inferring the subclones of cells, while not returning or restricting their inference to a specific mutational hierarchy. DENDRO [12] also employs bulk DNA-seq data solely for the identification of SNVs, without utilizing it to infer subclonal heterogeneity. Cardelino [13] takes scRNA-seq as input and indirectly utilizes bulk DNA-seq data through a guide clonal tree produced from a first-step run of Canopy [21]; if no guide is provided, then Cardelino defaults to a single-cell only methodology.
Another major difference between Canopy2 and the other published methods is that Canopy2 specifically models and accounts for the stochasticity of RNA transcription. To estimate gene-specific bursting kinetics, Canopy2 utilizes germline heterozygous loci closest to the somatic SNVs to infer the gene-specific hyperparameters for the bursting kinetics. By doing so, we separate the DNA-level somatic point mutations from the RNA-level gene expression stochasticities, thus demystifying the different sources of observed zero mutational read counts. DENDRO [12] is the first method to specifically account for transcriptional bursting, yet it uses the SNVs to infer both the cancer phylogeny and the burstiness of its expression, which are confounded (Fig. 2), thus leading to a chicken-and-egg problem. Cardelino [13] also adopts a beta distribution that models a gene-specific success rate. However, the beta distribution shares the same hyperparameters across all genes. Altogether, Canopy2’s hierarchical model for scRNA-seq data adjusts for technical noise, systematically investigates the burstiness of gene transcription, and integrates the latest empirical results on the reliance of bursting parameters on cell size and ploidy [18].
We assessed the performance of Canopy2 through simulation studies and demonstrated that it consistently outperformed competing approaches in inference of clonal tree configuration. We systematically investigated the influence of different factors including sequencing depth, number of cells and bulk samples, number of mutations, and clonal structures (i.e., bifurcations and number of subclones). In real data analyses of two patients with glioblastoma and two patients with breast cancer, Canopy2 outperformed existing methods with higher cell-to-clone assignment accuracy, outputting a three-tier configuration that can be used to infer temporal order of the mutations, mutational profiles and clonal assignments of the single cells, and composition of the bulk samples.
Intra-tumor heterogeneity contributes to drug resistance and failures of targeted therapies. To gain a comprehensive understanding of the evolutionary dynamics of tumors, it’s crucial to identify not just the alterations driving tumor progression, but also to comprehend their relative timing and spatial arrangement during tumor evolution. Canopy2’s analyses help uncover potentially useful prognostic and diagnostic biomarkers while effectively tracing the tumor’s evolutionary history. While sequencing DNA and RNA from the same cell is far from mature technologically, through Canopy2’s framework, we will be able to profile DNA-level clonal diversity and RNA-level transcriptional heterogeneity, while accounting for the pervasive technical variability at the single-cell level. Canopy2 can be readily adopted and implemented by any lab that has access to bulk DNA-seq and scRNA-seq resources.

Methods and Materials

Methods and Materials

Model Formulation

Observed Data
As outlined in Table 2, let K, N, M, and T denote the numbers of subclones, cells, mutations, and bulk samples, respectively. Let and (superscript s for single-cell) denote the matrices of alternative (i.e., mutational) and total read counts from the scRNA-seq data, where the total read counts are defined as a sum of the reference and alternative read counts. Likewise, let and (superscript b for bulk) denote the matrices of alternative and total read counts for the bulk DNA WES data. Lowercase letters with subscripts for row and column are used to denote an element within a matrix. For example, denotes the element of that occurs in row 1, column 2.
In addition to the read count matrices, Canopy2 also takes as input expression levels of the genes that the mutations reside in or the expression levels of the germline heterozygous loci that the mutations are closest to from the same genes. These gene expression levels are used to infer the bursting kinetics of the mutations to account for the gene expression stochasticity.

Bayesian Hierarchical Model
For mutation m and bulk sample t, let each of the alternative read counts for the bulk samples, , be distributed aswith probability mass functionwhere denotes the total bulk read count from which was derived, and indicates the fraction of cells in bulk sample t that contain somatic variant m (for transposed ). Here, for simplicity, we show the binomial distribution for mutations that are heterozygous within copy-number-neutral regions – we discuss later how this can be extended to incorporate CNAs.
For each single cell n, let the alternative read counts, , be distributed aswhere denotes the total single-cell read count. Marginalization over and leads to the piecewise beta-binomial and its corresponding probability mass functionwhere denotes the beta function.
Note that, in Eqns (2), (3), and (6), is a binary indicator that specifies whether cell n contains somatic variant m (for transposed ), and it is a product of the mutational hierarchy of SNV m specified by and the clonal assignment of single cell n specified by . It is important to correctly recover to better understand the mutational structure of the phylogeny and the carrier status and clonal assignment of the cells. The observed alternative read counts in , however, do not reflect the true underlying . This relationship is illustrated in Fig. 2 and described below. If cell n carries mutation m (i.e., ), then is generated from the beta-binomial distribution in Eqn (4) with mutation-specific hyperparameters for the bursting kinetics. If , will be non-zero because the gene is constantly expressed and it successfully recapitulates the cell’s mutational profile.

If , will be zero regardless of the cell’s mutational carrier status because the gene that harvests the mutation is silenced (confounds with 2a below).

If , can be zero or non-zero, in which case the observed zero can be due to either the gene not being expressed or the cell not carrying the mutation (also confounds with 2a below).

If cell n does not carry the mutation m (i.e., ), then is generated from the beta-binomial distribution in Eqn (5) with hyperparameter . When there is minimal sequencing error, will be zero (confounds with 1b-1c above).

When there is abundant sequencing error, will likely be small, but non-zero.

In order to separate 1b-1c from 2a, and thus correctly identify the cells’ underlying mutational profiles, it is critical to obtain accurate estimates of and . Here, we resort to the genes that harvest the mutations or, when available, the germline heterozygous loci that are in the same linkage disequilibrium (LD) blocks of the SNVs. Notably, Canopy2 does not use the observed mutational read counts to estimate the bursting kinetics, since it is confounded with the SNV carrier status. Instead, Canopy2 utilizes independent expression measurements of genes or germline variants to estimate the gene- and mutation-specific bursting kinetics. These are decoupled from the estimation of , allowing to be estimated without the aforementioned confounding issue.
Since there is no closed form for the joint posterior, a Markov chain Monte Carlo (MCMC) approach is used to repeatedly sample each parameter from its full conditional distribution. The parameters , , and are sampled by column to coincide with their respective proposal distributions (see Proposals for details). Specifically, all three full conditionals can be written as proportional to the log joint conditional density of and , which, under the assumption of conditional independence, can be written as a linear combination of the log-transformed densities in Eqn (2) and Eqn (6).where the first summation is the contribution from the bulk data, and the second summation is the contribution from the single-cell data.

Hyperparameters for Beta Distributions
The mutation-specific hyperparameters and , which represent the gene activation and deactivation rates, are estimated empirically using the gene in which the point mutation resides. Canopy utilizes BPSC [17] and SCALE [18] to estimate the bursting kinetics from a beta-Poisson mixture model. BPSC utilizes MCMC sampling to obtain parameter estimates, while SCALE uses moment estimators. SCALE is computationally efficient, especially for large scRNA-seq datasets, but its moment estimates may be negative (Fig. 7 in appendix). We therefore choose to include both options with added library size normalization in the R package.
The average error rate of next-generation sequencing is reported to be per nucleotide [45]. To reflect this, the hyperparameters and were chosen so that the sequencing error rate , assumed to follow a beta distribution, had a mean of .

Model Selection to Determine the Number of Subclones
The Bayesian information criterion (BIC) is computed by modifying the classical definition [46]. Since the prior is flat, the likelihood is proportional to the posterior, so the maximum a posteriori (MAP) estimate is equivalent to the maximum likelihood estimate (MLE). The MAP estimate is equal to the mode of the posterior distribution and defined as the value of the parameter that maximizes the posterior distribution, which is the value at which the distribution reaches its highest peak. Thus, we can replace the maximized log-likelihood with the maximized log-posterior. To find the posterior evaluated at the MAP estimate, one can apply kernel density estimation to the posterior estimates. Since there are multiple chains, the third quartile (75th percentile) is taken across all chains.where is the third quartile of the maximized posteriors across all J MCMC chains, denotes the MAP estimate, y denotes the observed data, p denotes the number of parameters, and n denotes the total sample size.
In this particular case, (the number of subclones) since we are attempting to determine the optimal number of subclones. The sample size reflects the aggregate dimensions of the observed data y where and for M mutations, N single cells, and T bulk samples.

Incorporating Copy Number Aberrations
By default, Canopy2 is tailored to take as input SNVs to reconstruct cancer evolutionary history and assign cell clonal memberships. Existing methods [8, 11] can only reliably detect chromosome and chromosome arm-level CNAs in cancer cells that undergo abrupt copy number changes. Consequently, the resolution and accuracy of CNA assignments to inferred SNVs are severely limited. Moreover, recent observations have highlighted that CNA estimates by scRNA-seq are influenced by the true underlying cell ploidy because they are inferred by comparing to average expression levels [7]. Nonetheless, Canopy2 does not ignore CNAs from scRNA-seq. Instead, when modeling SNVs based on single-cell gene expression data, it uses the bursting kinetics to implicitly capture the copy-number effects on expression levels – genes with higher copy numbers are expected to transcribe more RNAs. Importantly, this approach relies on observed gene expression and avoids making specific assumptions about gene-dosage effects. For bulk CNAs, Canopy2’s MCMC algorithm can be easily extended to include bulk allele-specific copy-number estimates [47–50] using the same framework that we previously developed in Canopy [21] and MARATHON [51], which jointly model SNVs and CNAs with overlapping SNV-CNA pairs phased and temporally ordered.

Bioinformatic Processing
The SNV calling pipeline used to pre-process the breast cancer data (Table 3 in appendix) and glioblastoma data (Table 4 in appendix) is illustrated in Fig. 6 in appendix. Master shell scripts that prompt for user input to run each step in the pipeline are available at our open-access Zenodo repository (see Data and code availability).

Sequence Alignment and Expression Quantification
FASTQ files were aligned to the reference genome using BWA [52] for bulk WES and STAR [53] for scRNA-seq. Picard was used for SAM to BAM conversion, and then to sort, add read groups, and deduplicate to produce the assembled BAM files. SAMtools [54] was used to filter alignment records in the scRNA-seq data based on BAM flags, mapping quality, or location. For subsequent estimation of bursting kinetics, featureCounts [55] was adopted to quantify gene expressions. Summary statistics output by featureCounts and heatmaps of the variant allele frequencies (VAFs) were generated for each sample (Figs. 8, 9, 10, 11 in appendix).

Somatic Variant Calling
To perform joint variant calling for the bulk WES data, Mutect2 [56] was run on the BAM files to generate per-sample VCF files, and FilterMutectCalls was utilized to apply filters to the raw output of Mutect2. To avoid false positives in identifying SNVs using scRNA-seq due to RNA editing, we restricted somatic SNVs to those identified in gene coding regions (e.g., by bulk WES), followed by stringent quality control (QC) procedures with functional annotations by ANNOVAR [57]. Specifically, we kept SNVs that (i) showed PASS from FilterMutectCalls, (ii) retained homozygous 0/0 genotypes from the normal samples, (iii) had only one alternative allele, (iv) had at least 20 total reads in the normal samples, (v) had at least five alternative reads in the bulk cancer samples, (vi) were not reported by the 1000 Genomes Project [45], (vii) did not reside in segmental duplications, and (viii) had non-NA scores from the LJB database. Positions of the SNVs identified in the bulk DNA-seq data were used to force call SNV coverage in scRNA-seq using SAMtools mpileup [54]. In the final QC, the reference and alternative read counts for both data types were extracted from the parsed output.

Results

Results

Overview of Methodology
The proposed Bayesian framework illustrated in Fig. 3 jointly analyzes bulk DNA-seq and scRNA-seq data to reconstruct tumor phylogeny using SNVs by default, which can be extended to incorporate CNAs. The inputs consist of estimates for the sequencing error and bursting kinetics (Step 0 of Fig. 3), as well as the alternative and total read counts corresponding to the SNVs in both bulk DNA-seq and scRNA-seq (Step 1 of Fig. 3). The Canopy2 algorithm adopts a Bayesian hierarchical model (Step 2 of Fig. 3) and employs a Gibbs sampler with nested Metropolis-Hastings algorithm to sample cancer phylogenetic trees with mutational hierarchies , clonal assignments of single cells , and clonal compositions of bulk samples (Step 3 of Fig. 3). A comprehensive description of the models can be found in Methods and materials, and a detailed algorithm for the Gibbs with nested Metropolis-Hastings can be found in Algorithms 1 and 2 in Functions for Metropolis-Hastings. Appropriate dimensions for data inputs and parameters are listed in Table 2.

Simulation Studies
Via simulations, we benchmarked Canopy2 against three other methods, including Canopy [21] (bulk data only), Cardelino with guide clonal tree [13], and Cardelino without guide clonal tree (single-cell data only) [13]. The guide clonal tree utilized by both the Cardelino paper [13] and in our simulations is the output of from Canopy [21]. Thus, all four methods could be compared with respect to error in . However, error in could only be assessed for the methods that take single-cell input (Canopy2 and both Cardelinos) and error in could only be assessed for methods that take bulk input (Canopy2 and Canopy). It is important to mention that, on several occasions, Cardelino with and without the guide clonal tree failed to converge despite significantly increasing the number of iterations and chains. In these cases, Cardelino output estimates of NA. We opted to discard these estimates, which, in turn, would bias the results more favorably towards Cardelino.
The reconstruction error for , , and was quantified by finding the minimum absolute distance between the estimation and the truth through modulo permutation of the clones. The algorithm for computing this error and the functions for the distance metrics are outlined in Algorithm 3 in Algorithmic Details. Boxplots of the reconstruction error were produced across 100 runs with different seeds to generate the ground truths and observed data input. We refer interested readers to the R package documentation in our GitHub repository at https://github.com/annweideman/canopy2, specifically the simulate_data() function, for further details regarding the density functions employed for simulating the read count data and gene expression data.

Number of Mutations
We first sought to investigate the effect of the number of mutations on performance. Notably, across all four methods, Canopy2 consistently demonstrated the best performance in reconstructing the mutation-to-clone assignment matrix , although the relative difference in the performance of all methods was stable as the number of mutations both doubled and tripled (Fig. 4, first column). This suggests that only a handful of mutations are needed for an accurate assessment of the tumor phylogeny. Additionally, for a fixed number of subclones, since three of the four methods utilize bulk data with more deeply sequenced mutations that also have higher signal-to-noise ratio, it is also expected that the error in will vary significantly with increasing mutational burden.
We also compared performance as a function of number of mutations for methods that utilize single-cell input in their inference of and methods that utilize bulk input in their inference of . In inference of , Canopy2 consistently outperformed the Cardelino methods and demonstrated improved performance for higher mutational burden (Fig. 4B, first column) under low levels of gene expression. In inference of , Canopy demonstrated superior performance compared to Canopy2 for lower mutation burden (Fig. 13A in appendix). However, the performance gap between the two methods decreased with increasing number of mutations. The performance of Canopy relative to Canopy2 was expected since Canopy is a bulk-only method and thus its likelihood function is not distorted by sparse nature of the single-cell components.

Number of Subclones
Next, we explored the effect of number of subclones on reconstruction accuracy. For (Fig. 4A, second column) and (Fig. 4B, second column), Canopy2 offered the most accurate inference of the clonal configurations and cell-to-clone assignment. An exception was noted in easier cases characterized by few subclones (3 or 4), in which Canopy2’s accuracy aligned with that of alternative approaches. For (Fig. 13B in appendix, first column), Canopy slightly outperformed Canopy2 in inference of the sample-to-clone assignment, which was expected since Canopy is a bulk-only method that is being used to infer assignment of clones to bulk samples. Its likelihood function is independent of , thereby remaining unaffected by the noisy characteristics of single-cell data.

Number of Single Cells
Even though poor single-cell yield would potentially impair the performance of Canopy2, for as few as 30 single cells across all levels of gene expression, Canopy2 inferred the correct clonal tree configuration in 254 of 300 simulations with mean error rate of (Fig. 4A, third column). For Canopy, Cardelino with guide, and Cardelino without guide, the correct was inferred in 129, 149, and 83 of the 300 simulations with mean error rates of , , and , respectively. Since Canopy does not utilize single-cell data, the estimated error in remained unchanged with increasing number of cells. Similar metrics were produced for inference of the cell-to-clone assignment matrix and sample-to-clone assignment matrix . In inference of , for only 30 single cells, the correct was inferred in 141, 0, and 0 of 300 simulations for Canopy2, Cardelino with guide, and Cardelino without guide, with mean error rates of , , and , respectively (Figure 4B, third column). When estimating , outcomes were consistent across number of cells and levels of gene expression; thus, any reported metrics span all parameters. Given the continuous nature of sample-to-clone assignment, perfect inference is unattainable. Therefore, error rates are expressed as means only, with Canopy2 and Canopy providing mean error rates of and , respectively (Fig. 13B in appendix, second column).

Number of Bulk Samples
We further varied the number of available bulk samples from two to six to assess its impact on reconstruction error for bulk coverage of 30 – 50x. Encouragingly, under hypothetical scenarios that favored bulk data, characterized by the availability of multiple bulk samples and relatively silent genes (), Canopy2 still outperformed its bulk-only predecessor, Canopy, in the inference of the clonal composition matrix (Fig. 4A, fourth column). In the meantime, error rates for the cell-to-clone assignment matrix remained consistent across subclones since is independent of bulk sample data (Fig. 4B, fourth column). In estimation of the bulk clonal composition matrix , Canopy2 and Canopy demonstrated similar levels of accuracy (Fig. 13B in appendix, third column).

Gene Expression Stochasticity
Gene expression has inherent stochasticity resulting from transcriptional bursting, a phenomenon that has been shown to be pervasive at the single-cell level without the averaging artifacts in bulk samples. Bursting kinetics of the stochastic gene expression can be inferred by scRNA-seq data [18, 58]. We carried out extensive simulation studies to investigate the effect of the bursting kinetics – characterized by a beta distribution with hyperparameters and – on model performance. Specifically, we examined three levels of gene expression: high (, ), bursty (, ), and low (, ) with identical expression levels for each mutation m. Canopy2 and Cardelino with guide demonstrated comparable performance in scenarios characterized by constitutive and bursty levels of gene expression, with an exception noted for a large number of single-cells (Fig. 4A, third column, ). In this specific scenario, Cardelino with guide capitalized on the fact that it’s a single-cell only method applied to a large dataset with adequate single-cell read counts, in addition to using output of Z from Canopy to guide the inference. However, in conditions of low gene expression, Canopy2 demonstrated superior performance relative to all methods. Since low single-cell gene expression results in sparse single-cell read counts and potentially biased estimates of bursting kinetics, these findings indicate Canopy2’s robust performance in practical applications.

Real Data Analysis

Case Studies in Glioblastoma and Breast Cancer
We demonstrated Canopy2’s performance on matched WES and scRNA-seq data from two studies of patients with breast cancer (BC) [59] and glioblastoma (GBM) [60], as summarized in Tables 3 and 4, respectively. The two individuals with breast cancer involved one ER/HER2-positive patient, BC03, and one triple negative patient, BC07. For each patient, a primary tumor specimen and two metastatic lymph nodes (LN) were sequenced [59]. The two individuals with glioblastoma involved one patient, GBM10, with three locally adjacent tumors (GBM10_1, GBM10_2, and GBM10_3) and one patient, GBM9, with a multifocal tumor that involved two initial tumors in the right and left frontal lobes (GBM9_1 and GBM9_2) and recurrent tumors (GBM9_R1 and GBM9_R2) that emerged in the left frontal lobe after treatment. Glioma specimens were collected at different times and spatially separated sites so as to sufficiently explore the intra-tumor heterogeneity and differential drug response [60].
Results were not generated for Cardelino without a guide clonal tree (de-novo mode), as this particular method of Cardelino frequently returned errors regarding conflicts in the computed number of internal nodes when attempting to plot the results. The optimal tree was selected based on modified BIC, as described in Model selection, to determine the number of subclones, considering a range of 3 to 10 subclones. Convergence was confirmed by examining the posterior densities, lag autocorrelation functions (lag ACFs), and trace plots. Canopy2 inference of the phylogenetic trees for the two patients with breast cancer [59] revealed that the optimal configuration for patient BC03 involves four subclones and for patient BC07 involves five subclones. Canopy2 inference of the phylogenetic trees for the two patients with glioblastoma [60] revealed that the optimal configuration for patient GBM9 involves eight subclones and for patient GBM10 involves six subclones. All data was processed according to the SNV calling pipeline illustrated in (Fig. 6 in appendix) which is described in detail in Bioinformatic processing. Links to the pipeline scripts and both the raw and processed data can be found in Data and code availability.

Assessing Cell-To-Clone Assignment
For each patient, the cell-to-clone assignment matrices were recorded. In the case of Canopy2, is a binary matrix, with each row indicating the assignment of a cell to a specific subclone. For Cardelino, however, is a probability matrix, where the sum of the row-wise probabilities is equal to 1, and cells were assigned to subclones with the highest probabilities. We validated the assignment of single cells to tumor subclones in using the inferred list of mutations belonging to each cluster and the corresponding tree. Discrepancies between the cell assignments and cell mutational profiles have been reported in phylogeny reconstruction by scDNA-seq [41, 61]. Cells remained as unassigned if they could not be assigned to a subclones without conflict, i.e., if the cells contained mutations from two or more diverging branches. In other words, using the inferred mutational profile for each cell, we asked whether that profile could definitively assign the cell to one, and only one, subclone on the tree. If not, and the profile contained mutations from two or more diverging branches, then the cell remained unassigned. The fewer the number of unassigned cells, denoted in white along the top bars, the more accurate the inference.
As an illustrative example, consider the single-cell mutational profiles in Fig. 1B paired with the resulting phylogeny in Fig. 1E. Although Cell3 was assigned to Clone3, it contains mutations SNV4 and SNV5 from diverging branches. Thus, we would label Cell3 as unassigned. A cell is considered a member of Clone1 (normal) if it doesn’t contain any mutations. Otherwise, a validation metric was developed to determine exclusive clonal membership as follows: (i) assign to Clone2 if SNV2 but not (SNV3, SNV4, SNV5, or SNV6); (ii) assign to Clone3 if SNV5 but not (SNV2 or SNV4); (iii) assign to Clone4 if SNV4 but not (SNV2 or SNV5); and (iv) leave as unassigned if none of the above.
Canopy2 consistently assigned more cells to clones than the other two methods in samples from patients with breast cancer [59] and glioblastoma [60] (Fig. 5). For malignant cells, the percentages of unassigned cells were as follows for (Canopy2, Canopy, and Cardelino with guide clonal tree): BC03 (15.6, 27.3, 26.0)%, BC07 (0, 29.4, 11.8)%, GBM9 (14.7, 36.4, 30.2)%, and GBM10 (0, 2.7, 16.2)%.

Discussion

Discussion
Here we propose a Bayesian framework for accurate reconstruction of the tumor phylogenetic tree involving a three-tier output (Step 3 of Fig. 3): (i) , a binary matrix which assigns mutations to clones, (ii) , a binary matrix which assigns single-cells to clones, and (iii) , a matrix of clonal proportions which make up each bulk sample. Additionally, users can opt to output a text file containing the mutations assigned to each cluster along the tree. This three-tier output offers a comprehensive look into the evolutionary history of the tumor and the intra-tumor heterogeneity. It can be used to infer the cell-of-origin (the cancer-initiating cell), the temporal order of the SNVs, the mutational profiles of the single-cells, and the composition of the bulk samples that comprise these single-cells.
The distinguishing features of Canopy2 when compared to available methods, including predecessor Canopy [21], include: (i) joint inference using SNVs derived from bulk DNA WES and scRNA-seq, (ii) separation of zeros categorized as non-cancerous (cells without mutations), stochastic (mutations not expressed due to bursting), and technical (expressed mutations not picked up by sequencing), and (iii) comprehensive output that involves all three components mentioned above. Canopy2 outputs convergence diagnostics for each evolutionary configuration and uses BIC to select the optimal evolutionary configuration. To elaborate on points (i) and (ii), the individual limitations of using only bulk or single-cell data (Fig. 1) can be overcome by conducting joint inference. Canopy2 accounts for the sparse and stochastic nature of the single-cell data by placing distributions on the sampling probabilities. This beta-binomial piecewise function, combined with the estimation of and from the independent gene expression data, is useful in distinguishing the sources of error mentioned in point (ii) (Fig. 2).
A limitation of our methodology is its reliance on sampling via Metropolis-within-Gibbs, where successive values in the chain depend on previous iterations, creating a bottleneck in convergence that is highly influenced by the proposal distributions used for the random walk. Proposals that do not allow for sufficient mixing can lead to dependencies on the initial state, delaying convergence or causing it to settle on a local rather than global optimum. Our options for proposal distributions were further limited by the requirement that certain columns sum to one, leading to relatively slow convergence for large datasets with many cells and mutations. Nevertheless, convergence diagnostics (available to users of the Canopy2 package) from the case studies indicated that all chains were mixing well and sampling effectively from the true posterior distribution, as confirmed in simulation studies. In addition to these constraints, Canopy2 estimates three components (Z, , ) rather than two, which contributes to its slower performance. Microbenchmarking on a Windows laptop with a 12th Gen Intel Core i9-12900HK processor (ncores=20) showed that Canopy2 is the slowest of the four methods evaluated, being approximately eight times slower than Canopy and 22.5 times slower than Cardelino (Fig. 14 in appendix). However, for datasets of average size, results can be obtained within a few minutes on a personal computer if the ncores argument is used effectively. For analyses involving a high number of chains or larger datasets with more subclones, mutations, single cells, or bulk samples, it is advisable to run Canopy2 on a high-performance computing cluster. Future optimizations could include translating for loops into C++ using Rcpp [62] or adopting STAN’s [63] no-U-turn sampler (NUTS) to eliminate the need for specific proposal distributions.
We suggest stringent QC of the bulk and single-cell read count data using our pre-processing pipeline (Fig. 6 in appendix) to obtain the finalized SNV callset. We have uploaded the pre-processing shell and R scripts for the glioblastoma and breast cancer case studies into an open-access, Zenodo repository (see Data and code availability); each folder contains a master script that prompts users to select the appropriate step of the pipeline. We have also included pre-processing scripts for calling germline heterozygous loci in the single-cell data, should the user choose to estimate the gene activation and deactivation rates, and , using the germline mutations that occur within the same gene body as the somatic mutations. There can be error associated with detecting somatic variants because the particular cell we are examining may not contain this variant. Unlike somatic mutations, germline mutations are present in virtually every cell in the body, so for every cell n (Fig. 2). While we make these pre-processing scripts and pipeline (Germline Variant Calling and Fig. 12) for calling germline heterozyous loci available to users, we chose to demonstrate our methodology using estimates of the bursting kinetics from the gene expression data, as this is typically more convenient for the researcher.

Data and Code Availability

Data and Code Availability

R Package
The Canopy2 package is an open source R package available at https://github.com/annweideman/canopy2.

Variant Calling Pipelines
Master shell scripts for the somatic and germline variant calling pipeline are available at the open-access Zenodo repository https://zenodo.org/record/7931384.

Pre- and Post-processed Data Deposition
The pre- and post-processed data used for the case studies are stored in the data directory of our GitHub repository at https://github.com/annweideman/canopy2/tree/main/data with accompanying R documentation files.

Raw Data Deposition
Glioblastoma data [60]: the RNA-seq (single-cell and bulk) and bulk DNA WES data were downloaded from the European Genome-phenome Archive (EGA) under accession code EGAS00001001880.
Breast cancer data [59]: the single-cell and bulk RNA-seq data were downloaded from the NCBI Gene Expression Omnibus (GEO) database under accession code GSE75688. The bulk DNA WES data was downloaded from the NCBI Sequence Read Archive (SRA) under accession code SRP067248.

출처: PubMed Central (JATS). 라이선스는 원 publisher 정책을 따릅니다 — 인용 시 원문을 표기해 주세요.

🏷️ 같은 키워드 · 무료전문 — 이 논문 MeSH/keyword 기반

🟢 PMC 전문 열기