Incorporating prior information in gene expression network-based cancer heterogeneity analysis.

Biostatistics (Oxford, England) 2025 Vol.26(1)

Li R, Xu S, Li Y, Tang Z, Feng D, Cai J

Cite this article

APA: Li R, Xu S, et al. (2025). Incorporating prior information in gene expression network-based cancer heterogeneity analysis. Biostatistics (Oxford, England), 26(1). https://doi.org/10.1093/biostatistics/kxae028
MLA: Li R, et al. "Incorporating prior information in gene expression network-based cancer heterogeneity analysis." Biostatistics (Oxford, England), vol. 26, no. 1, 2025.
PMID: 39074174

Abstract

Cancer is molecularly heterogeneous, with seemingly similar patients having different molecular landscapes and accordingly different clinical behaviors. In recent studies, gene expression networks have been shown as more effective/informative for cancer heterogeneity analysis than some simpler measures. Gene interconnections can be classified as "direct" and "indirect," where the latter can be caused by shared genomic regulators (such as transcription factors, microRNAs, and other regulatory molecules) and other mechanisms. It has been suggested that incorporating the regulators of gene expressions in network analysis and focusing on the direct interconnections can lead to a deeper understanding of the more essential gene interconnections. Such analysis can be seriously challenged by the large number of parameters (jointly caused by network analysis, incorporation of regulators, and heterogeneity) and often weak signals. To effectively tackle this problem, we propose incorporating prior information contained in the published literature. A key challenge is that such prior information can be partial or even wrong. We develop a two-step procedure that can flexibly accommodate different levels of prior information quality. Simulation demonstrates the effectiveness of the proposed approach and its superiority over relevant competitors. In the analysis of a breast cancer dataset, findings different from the alternatives are made, and the identified sample subgroups have important clinical differences.

1. INTRODUCTION
Heterogeneity analysis has been extensively conducted in the research and clinical treatment of cancer (and many other complex diseases). In such analysis, the goal is to separate seemingly similar patients into subgroups that may have different behaviors. Analysis has been conducted based on a wide range of factors such as demographic and clinical characteristics, pathological images, immunological features, and others. With the maturity of high-throughput profiling techniques, molecular measurements have been effectively integrated into heterogeneity analysis (Navin et al. 2010; Meeks et al. 2020). In particular, a series of studies have shown that gene expression-based heterogeneity analysis can be highly successful (Budinska et al. 2013; Church et al. 2019). Some early studies are based on simple measures, such as the means and variances of gene expressions. In recent studies, it has been shown that gene expression network analysis, which takes a systemic perspective, can generate more insights into patient heterogeneity patterns (Tang et al. 2018; Pio et al. 2022). Here, it is noted that measures such as mean and variance can be easily incorporated into network-based analysis.
As noted in the literature (Kang et al. 2015), the interconnection between two gene expressions can be roughly classified as indirect or direct. An indirect interconnection can occur when, for example, two gene expressions are mediated by shared genomic regulators, such as transcription factors, microRNAs, and other regulatory molecules. In contrast, a direct interconnection does not involve such shared regulators and may describe a more essential interconnection. A schematic presentation is provided in Fig. 1. The gene expression networks with only direct interconnections (left panel) are usually sparser. The middle panel describes the regulations of gene expressions by regulators, and such regulation relationships are usually sparse. When both direct and indirect interconnections are included, as shown in the right panel, the networks are denser, which may mask the more essential gene interconnections. For the small example in Fig. 1, the networks in the right panel for the two subgroups have 78.3% overlapping edges, while the networks in the left panel have 46.2% overlapping edges, which makes the subgroups easier to distinguish. Incorporating regulators in gene expression analysis has been made possible by the increasing popularity of multiomics studies, which collect gene expression and regulator data on the same subjects (Kagohara et al. 2018; Lee et al. 2021). It is especially worth noting that there have been a handful of gene expression network analyses incorporating regulators, including heterogeneity analysis (Li et al. 2023).
Gene expression data analysis is “traditionally” challenged by high data dimensionality and weak signals. Network analysis, heterogeneity analysis, and the incorporation of regulators each make these challenges more serious, and even more so in combination. To improve estimation, here we adopt the strategy of borrowing strength from prior information, in particular, that contained in published literature. The proposed analysis involves the interconnections among gene expressions and the regulations of gene expressions by regulators. Our preliminary exploration suggests that there are relatively fewer published findings on the regulations of gene expressions. As such, in this study, we focus on the prior information on gene expression interconnections. It is noted that, with relatively straightforward extensions, the proposed analysis can incorporate prior information on the regulations. To fix ideas, we mine prior information by searching PubMed. There are more than 20,000 published studies that simultaneously include “EGFR,” “ERBB2,” and “breast cancer.” In contrast, there are only 18 published studies that simultaneously include “WLPH,” “ARAF,” and “breast cancer.” Based on this observation, it is sensible to conjecture that for breast cancer, EGFR and ERBB2 have a higher likelihood of being interconnected than WLPH and ARAF. Here, it is noted that this search is coarse. There are more refined ways to mine published literature (Lee et al. 2020); however, they may involve complicated algorithms/coding. Additionally, it is noted that PubMed (or any other literature database) does not include all published findings, that published findings can be partial or even wrong, and that the cooccurrence of two genes in a publication does not necessarily suggest that they are interconnected. As such, the prior information can be useful but cannot be “fully trusted.”
Given the varying quality of the prior information, our proposal is to data-dependently “balance” between the prior information and the information contained in the observed data. The proposed strategy is similar in spirit to those in Jiang et al. (2016), Li et al. (2022), and Wang et al. (2023), which have considerably simpler settings. Broadly speaking, incorporating prior information is not a new strategy. The most natural may be Bayesian analysis (Zhao et al. 2019), which adopts significantly different techniques. Highly curated biological information has also been used (Fan et al. 2016). It is noted that such information can often be fully trusted and that a higher quality often means a smaller amount of information.
In this study, the goal is to develop a new technique that can further advance cancer heterogeneity analysis. Built on the existing literature, this study advances in multiple important aspects. First, compared to heterogeneity analysis that is based on simpler measures, it is based on gene expression networks as well as regulations of gene expressions by regulators. Second, different from most of the existing multiomics analyses, heterogeneity analysis is conducted, which is critical for cancer and many other diseases. Third, the type of prior information used is significantly different from that in Burrell et al. (2013), Fan et al. (2016), and many others. The data/model settings are much more complicated than in Jiang et al. (2016), Li et al. (2022), and Wang et al. (2023). The adopted analysis strategy significantly differs from Bayesian analysis. The proposed method incorporates prior information to conduct network-based unsupervised clustering, which demands more challenging computation. It does not specify prior information regarding the number of subgroups, which is very challenging to obtain in practice. Instead, it adopts a fusion technique to determine the number of subgroups along with model parameter estimation in a fully data-driven manner. Last but not least, as can be partly seen from our data analysis, this study can also deliver a practically useful tool that can lead to new insights into the heterogeneity of cancer (and some other diseases).

2. METHODS
The proposed analysis is unsupervised and takes measurements on gene expressions and their regulators as input. As argued in published multiomics studies, in particular those involving network analysis and heterogeneity analysis (Tarazona et al. 2021; Henao et al. 2023), the collection of regulators does not need to be “complete”: some types of regulators, or some components of a specific type of regulators, may not be available. The overall workflow is shown in Fig. 2. Prior information is obtained first. The analysis then proceeds in two steps. In the first step of prior information-guided analysis, the prior information is “fully trusted.” In the second step of prior information-incorporated analysis, we take into account the varying quality of the prior information and balance between the prior information and observed data.
2.1. Extracting prior information
There are many sources of prior information. In this study, we focus on that contained in published studies, which can be broad and of relatively high quality. In particular, we use PubMed, which is one of the most comprehensive publication databases. For a specific cancer (for example, breast cancer), we search PubMed for the cooccurrence of two genes (for example, EGFR and ERBB2). This can be realized using software such as PubMatrix and easyPubMed. Then a threshold is imposed to the counts of cooccurrence to retain the strongest evidence. Two genes are considered as having prior information of being interconnected if they have a nonzero count of cooccurrence (after the thresholding). It is noted that this coarse text mining can be potentially improved (for example, by normalizing using the occurrence counts of individual genes) and that the proposed analysis does not demand prior information to be fully accurate. Additional discussions are provided in the last section.
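The thresholding step described above can be illustrated with a minimal Python sketch. It assumes the cooccurrence counts have already been collected into a symmetric matrix. The function name `prior_edge_set`, the toy counts, and the convention that a count is retained when it is greater than or equal to the threshold are all assumptions made for illustration, not details taken from the study.

```python
import numpy as np

def prior_edge_set(counts, threshold):
    """Return gene pairs (j, l), j < l, whose cooccurrence count survives thresholding."""
    counts = np.asarray(counts)
    if counts.ndim != 2 or counts.shape[0] != counts.shape[1]:
        raise ValueError("counts must be a square matrix")
    # Assumption: counts below the threshold are zeroed, the rest are kept.
    kept = np.where(counts >= threshold, counts, 0)
    # Use the strict upper triangle so that each unordered pair appears once.
    rows, cols = np.nonzero(np.triu(kept, k=1))
    return {(int(j), int(l)) for j, l in zip(rows, cols)}

# Toy symmetric cooccurrence counts for four hypothetical genes.
toy_counts = np.array([
    [0, 25, 3, 0],
    [25, 0, 12, 1],
    [3, 12, 0, 0],
    [0, 1, 0, 0],
])
prior = prior_edge_set(toy_counts, threshold=10)
```

In this toy example only the pairs with counts 25 and 12 remain in the prior set. Lowering the threshold admits weaker evidence.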
Denote as the collection of gene expression relationships. In particular, if there exists prior information for the th gene and th gene, then (which corresponds to a network edge in the first step of analysis). We also set . Here, it is noted that in the literature there have been relatively limited studies on heterogeneity analysis and so the same prior information is shared by all sample subgroups.

2.2. Prior information-guided heterogeneity analysis
Denote as the number of subjects. For the th subject, denote as the gene expression measurements and as their regulators, which can be copy number variations, DNA methylation, microRNAs, and others. Here, if there are multiple types of regulators, we stack them together. Multiple published studies have shown that this simple strategy has satisfactory performance (Seal et al. 2020). Denote as the design matrix composed of ’s and an intercept. For heterogeneity analysis, denote as the number of sample subgroups. Different from some studies, it is not assumed that is known. To start with, we consider a mixture model with subgroups. In practice, although is unknown, specifying an "upper bound" is usually not difficult. Our numerical exploration suggests that the value of is not critical as long as it is large enough. Then, the probability density function of , is:

where represents the gene expression network in the th subgroup, represents the regulation relationships in the th subgroup, and is the mixing proportion satisfying and , . This belongs to the mixture modeling framework, which has been extensively adopted for heterogeneity analysis (Hao et al. 2018; Ren et al. 2022). In the proposed analysis, each component of the mixture model and hence the heterogeneity structure is defined by a conditional Gaussian graphical model (CGGM) (Yin and Li 2011) incorporating the regulation relationships in the gene expression networks.
Consider linear regulations and Gaussian distributions for gene expressions. That is, in the th subgroup, and . Then, . is the precision matrix (inverse of the covariance matrix), and its sparsity structure directly describes the gene expression network structure. is the coefficient matrix and describes the regulation relationships. It is noted that the Gaussian assumption and GGM model have been extensively adopted for gene expressions (Wang and Huang 2014), and it is possible to relax such assumptions. Additionally, although nonlinear regulations have been considered, considering the high dimensionality and satisfactory performance, we follow the literature (Bersanelli et al. 2016) and consider linear regulations.
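For orientation, a finite mixture with conditional Gaussian components of the kind described above is commonly written as follows. The notation is an assumption chosen for this sketch: $y_i \in \mathbb{R}^p$ denotes the gene expressions of subject $i$, $\tilde{x}_i$ the regulators with an intercept prepended, $K$ the number of mixture components, $\pi_k$ the mixing proportions, $\Theta_k$ the precision matrices, and $\Gamma_k$ the coefficient matrices. The mean parameterization $\Gamma_k^\top \tilde{x}_i$ is also an assumption, and other CGGM parameterizations exist.

```latex
\begin{align*}
f(y_i \mid x_i) &= \sum_{k=1}^{K} \pi_k \, f_k(y_i \mid x_i; \Theta_k, \Gamma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1, \\
f_k(y_i \mid x_i; \Theta_k, \Gamma_k)
&= (2\pi)^{-p/2} \det(\Theta_k)^{1/2}
\exp\!\left\{ -\tfrac{1}{2}
\left(y_i - \Gamma_k^\top \tilde{x}_i\right)^\top \Theta_k
\left(y_i - \Gamma_k^\top \tilde{x}_i\right) \right\}.
\end{align*}
```

Under this form, a zero off-diagonal entry of $\Theta_k$ corresponds to the absence of a direct interconnection between two gene expressions in subgroup $k$, after accounting for the regulators.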
In the first step, we propose estimation:

where , , and the penalty function is defined as:

where is a base penalty function with regularization parameter . Convenient choices include MCP and SCAD. Note that, as for , the diagonal elements in the precision matrices are not penalized. In practical data analysis, and are usually of the same order, which can be partly seen in our data analysis. Additionally, it is expected that a certain proportion of the genes have prior information. As such, the two terms in (2.3) are likely to have similar scales.
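As a concrete instance of a base penalty, the following minimal Python sketch evaluates the standard minimax concave penalty (MCP). The function name `mcp` and the default concavity value `gamma=3.0` are assumptions for illustration. The study's own tuning choices are not restated here.

```python
import numpy as np

def mcp(t, lam, gamma=3.0):
    """Minimax concave penalty evaluated elementwise at t.

    For |t| <= gamma * lam the penalty is lam * |t| - t**2 / (2 * gamma);
    beyond that point it is constant at gamma * lam**2 / 2.
    """
    a = np.abs(np.asarray(t, dtype=float))
    inner = lam * a - a ** 2 / (2.0 * gamma)
    outer = 0.5 * gamma * lam ** 2
    return np.where(a <= gamma * lam, inner, outer)
```

Because the penalty flattens out for large arguments, large coefficients are penalized less heavily than under the Lasso, which is one reason MCP is a common choice in sparse network estimation.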
Here, we adopt a finite mixture modeling strategy with a prefixed number of subgroups. The objective function has two terms. The first term is the log-likelihood and measures goodness-of-fit. The second term is the sparsity penalty, which conducts regularized estimation and identification of important network connections and regulations. Here, we “fully trust” the prior information—the network edges in the prior information set are not subject to selection. Then the sparsity penalty searches for additional signals. With this step of estimation, we obtain sample subgroups, the gene expression network and regulation relationships for each subgroup, and the mixture probabilities.
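How sample subgroups follow from a fitted mixture can be sketched in Python as below. This is only an illustration under stated assumptions: the component mean is taken to be the design matrix times a coefficient matrix, the names `log_gaussian_component` and `responsibilities` are hypothetical, and the toy parameters are invented. It is not the authors' implementation.

```python
import numpy as np

def log_gaussian_component(Y, X_design, Theta, Gamma):
    """Row-wise log density of N(X_design @ Gamma, inv(Theta)) evaluated at Y."""
    resid = Y - X_design @ Gamma                       # (n, p) residuals
    sign, logdet = np.linalg.slogdet(Theta)
    if sign <= 0:
        raise ValueError("Theta must be positive definite")
    quad = np.einsum("ij,jk,ik->i", resid, Theta, resid)  # quadratic form per row
    p = Y.shape[1]
    return 0.5 * (logdet - p * np.log(2.0 * np.pi) - quad)

def responsibilities(Y, X_design, pis, Thetas, Gammas):
    """Posterior subgroup probabilities via Bayes' rule, one row per subject."""
    log_w = np.column_stack([
        np.log(pi_k) + log_gaussian_component(Y, X_design, Th, Ga)
        for pi_k, Th, Ga in zip(pis, Thetas, Gammas)
    ])
    log_w -= log_w.max(axis=1, keepdims=True)          # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

# Toy example: two well-separated components, six subjects, two genes.
rng = np.random.default_rng(0)
X_design = np.column_stack([np.ones(6), rng.normal(size=6)])  # intercept + one regulator
Gammas = [np.array([[5.0, 5.0], [0.0, 0.0]]), np.array([[-5.0, -5.0], [0.0, 0.0]])]
Thetas = [np.eye(2), np.eye(2)]
Y = np.vstack([X_design[:3] @ Gammas[0], X_design[3:] @ Gammas[1]])
post = responsibilities(Y, X_design, [0.5, 0.5], Thetas, Gammas)
labels = post.argmax(axis=1)   # assign each subject to its most probable subgroup
```

Assigning each subject to the component with the largest posterior probability is one common convention for turning the posterior probabilities into hard memberships.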

2.3. Prior information-incorporated heterogeneity analysis
With estimation (2.2) and the Bayesian rule, we can also obtain the subgroup membership for each subject. Consider the membership matrix with each row corresponding to the subgroup identified for each subject. This matrix is denoted as , and . For and ,

We propose objective function:

where is a data-dependent weighting parameter and the penalty:

Consider the estimate:

It is noted that, with the fusion penalization, may contain identical values. Denote the number of unique values of as , which provides an estimate of the number of subgroups. From this estimation, we can also obtain the estimated gene expression network and regulation relationships for each subgroup, mixture probabilities, as well as subgroup membership of each subject.
In (2.5), the first two terms measure goodness-of-fit and balance between the observed data and the prior information. The balancing is achieved with . Intuitively, when the prior information has a low quality, , and the analysis will be heavily based on the observed data. On the other hand, when the prior information has a high quality, , and it puts more emphasis on borrowing strength from the prior information.
The proposed penalty has two major components. The first two terms achieve sparsity. Here, the edges in the prior information set are also subject to selection, which allows the proposed approach to screen out wrong prior information. The last term adopts a penalized fusion strategy for heterogeneity analysis. The intuition is that, if two of the subgroups are “close enough,” their parameters will be shrunk to be the same, and the two subgroups can be combined together. It is noted that both the gene expression networks and the regulation relationships are included in the fusion penalty and used for defining the subgroups, which significantly differs from the gene network-only heterogeneity analysis. and are treated as a group and simultaneously used to promote similarity, which can be more effective than being analyzed separately.
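The way fused components translate into an estimated number of subgroups can be sketched as follows. The sketch assumes that two components count as fused when the joint Frobenius distance between their network and regulation parameters falls below a small tolerance. The function name `count_fused_subgroups` and the tolerance value are hypothetical.

```python
import numpy as np

def count_fused_subgroups(Thetas, Gammas, tol=1e-6):
    """Group components whose (Theta, Gamma) pairs coincide up to tol.

    Returns the number of distinct groups and a group label per component.
    """
    K = len(Thetas)
    parent = list(range(K))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for k in range(K):
        for m in range(k + 1, K):
            # Network and regulation parameters are treated as one group.
            gap = np.sqrt(np.sum((Thetas[k] - Thetas[m]) ** 2)
                          + np.sum((Gammas[k] - Gammas[m]) ** 2))
            if gap <= tol:
                parent[find(m)] = find(k)
    labels = [find(k) for k in range(K)]
    return len(set(labels)), labels

# Toy example: three starting components, the first two have been fused.
toy_Thetas = [np.eye(2), np.eye(2), 2.0 * np.eye(2)]
toy_Gammas = [np.zeros((2, 2))] * 3
n_groups, group_labels = count_fused_subgroups(toy_Thetas, toy_Gammas)
```

Measuring the gap on both parameter blocks jointly mirrors the idea that the networks and the regulation relationships together define a subgroup.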

2.4. Computation
The two optimization problems (2.2) and (2.6) are solved sequentially, since the objective function in (2.6) requires the solution of (2.2). For each problem, the expectation-maximization (EM) technique is adopted: the expectation step computes the conditional expectation of the complete-data log-likelihood function, and the maximization step iteratively updates the parameter estimates. The alternating direction method of multipliers (ADMM) technique (Boyd et al. 2011) is adopted, and the algorithm is summarized in Algorithm 1. The details for the M-steps are provided in Supplementary Materials.
For selecting the optimal tunings and in the first step, we adopt the following BIC criterion and a grid search:

where is the total number of nonzero parameters in , . In the second step, we need to determine , , , and . We propose first fixing and, for each candidate value of , selecting the optimal () using the BIC criterion and a grid search. Then the optimal can be selected also using the BIC criterion. It is noted that, with a more complex analysis goal, the tunings needed are more than some of the existing studies. However, published studies and our own experience suggest that such tuning parameter selection is feasible and generates reliable results. The code for the proposed algorithm is publicly available at https://github.com/lirong95/prior-cggm.

3. SIMULATION
Simulation is conducted to assess the performance of the proposed method and compare it against relevant alternatives. We set the true number of subgroups , where different subgroups have distinct networks and regulation relationships. For the dimensions, we consider and . For the sample sizes, we consider a balanced case with all the subgroups having sample sizes of 500 and an imbalanced case with the three subgroups having sample sizes of 250, 300, and 350. For the prior information, we consider a correctly specified case (denoted as T, where is the intersection of the nonzero elements in the subgroups) and a partially mis-specified case (denoted as F, where the entries of are selected at random following true/false positive rates (TPR/FPR) being 0.6/0.1).
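One way to generate a partially mis-specified prior with prescribed true/false positive rates is sketched below. It assumes each true edge enters the prior independently with probability equal to the TPR and each non-edge with probability equal to the FPR. That independent Bernoulli mechanism and the function name `simulate_prior` are assumptions for illustration.

```python
import numpy as np

def simulate_prior(true_adj, tpr, fpr, rng):
    """Draw a symmetric boolean prior adjacency matrix from a true adjacency matrix."""
    true_adj = np.asarray(true_adj, dtype=bool)
    p = true_adj.shape[0]
    iu = np.triu_indices(p, k=1)                 # each unordered pair once
    is_edge = true_adj[iu]
    keep_prob = np.where(is_edge, tpr, fpr)      # per-pair inclusion probability
    chosen = rng.random(is_edge.size) < keep_prob
    prior = np.zeros((p, p), dtype=bool)
    prior[iu] = chosen
    return prior | prior.T                       # symmetrise, diagonal stays False

# Toy chain network on five genes.
chain = np.eye(5, k=1, dtype=bool)
chain = chain | chain.T
rng = np.random.default_rng(1)
noisy_prior = simulate_prior(chain, tpr=0.6, fpr=0.1, rng=rng)
```

Setting `tpr=1.0` and `fpr=0.0` reproduces the true edge set, while `tpr=0.0` and `fpr=1.0` yields a prior that is wrong for every pair.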
3.1. Settings
The following two network structures are considered. ST1: The first subgroup has an upper-tridiagonal precision matrix with the diagonal elements equal to 1 and the nonzero off-diagonal elements equal to 0.4. The second subgroup has a lower-tridiagonal precision matrix with the diagonal elements equal to 1 and the nonzero off-diagonal elements equal to 0.4. The third subgroup has a diagonal precision matrix with the nonzero elements equal to 1. ST2: The precision matrices are generated by the nearest-neighbor networks. Specifically, each network consists of 10 equally-sized disjoint subnetworks (modules), among which eight are shared by the three sample subgroups. Additionally, the first subgroup shares one module with the second subgroup and another one with the third subgroup. The second subgroup and the third subgroup also have a unique module of their own. The structure of each module is generated by a nearest-neighbor network. We first generate points randomly on a unit square, calculate all pairwise distances, and select nearest neighbors of each point besides itself. The nonzero off-diagonal elements of the precision matrices are located at which the corresponding two points are among the nearest neighbors of each other. The nonzero values are generated from Unif(−0.4, −0.1) ∪ Unif(0.1, 0.4). The diagonal elements are all set as 1. ST1 has a chain-like structure, and ST2 has a module structure. They are graphically presented in Supplementary Fig. S1 (Supplementary Materials).
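The nearest-neighbor construction for a single module can be sketched in Python as follows. The module size, the number of neighbors `m`, and the equal probability of positive and negative signs are assumptions for illustration. The sketch does not enforce positive definiteness, so that would need to be checked separately.

```python
import numpy as np

def nearest_neighbor_precision(size, m, rng):
    """Build one module's precision matrix from a mutual m-nearest-neighbor graph."""
    points = rng.random((size, 2))                       # random points on the unit square
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))              # all pairwise distances
    np.fill_diagonal(dist, np.inf)                       # a point is not its own neighbor
    nn = np.argsort(dist, axis=1)[:, :m]                 # indices of the m nearest neighbors
    is_nn = np.zeros((size, size), dtype=bool)
    is_nn[np.arange(size)[:, None], nn] = True
    mutual = is_nn & is_nn.T                             # edge only if neighbors of each other
    magnitude = rng.uniform(0.1, 0.4, size=(size, size))
    sign = rng.choice([-1.0, 1.0], size=(size, size))    # assumption: signs equally likely
    values = np.triu(magnitude * sign, k=1)
    values = values + values.T                           # symmetric off-diagonal values
    theta = np.where(mutual, values, 0.0)
    np.fill_diagonal(theta, 1.0)                         # diagonal elements set to 1
    return theta

rng = np.random.default_rng(2)
module = nearest_neighbor_precision(size=10, m=3, rng=rng)
min_eig = np.linalg.eigvalsh(module).min()               # inspect before using as a precision matrix
```

A full network in the module setting would place several such blocks along the diagonal of a larger matrix, sharing some blocks across subgroups.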
We simulate as having a normal distribution and a categorical distribution, where is generated randomly from {0, 1, 2} with equal probabilities. In terms of regulations, the positions of the nonzero entries are randomly selected, and each entry has a probability proportional to of being nonzero. The nonzero values are generated from the uniform distribution Unif(−1, −0.7) ∪ Unif(0.7, 1).
The simulation settings are comprehensive. In particular, two types of network structures are considered, both of which are popular in the literature. Both continuous and categorical regulators are considered to mimic the distributions of genomic regulators encountered in practice. Two levels of prior information quality are considered, which may test the “robustness” of the proposed approach. It is noted that although the data dimensions may not seem that high, with the networks, regulations, and heterogeneity, the number of unknown parameters is significantly larger than the sample sizes. To better gauge performance, we consider the following relevant alternatives.
(i) A two-step procedure. In the first step, a clustering method is used to generate subgroups. Here, we consider both K-means clustering and a nonparametric clustering method (Chauveau and Hoang 2016). Both clustering methods are applied to the gene expressions and regulators jointly and to the gene expressions only. The number of subgroups is prespecified. In the second step, we apply the CGGM approach with Lasso penalization (cglasso) (Yin and Li 2011). The four resulting combinations are labeled by the clustering method (K-means or nonparametric, np) and the clustering input.
(ii) HeteroGGM. The heterogeneous Gaussian graphical model via penalized fusion (HeteroGGM) approach (Ren et al. 2022) is applied. It can simultaneously achieve subgroup membership identification and precision matrix estimation. The number of subgroups is automatically determined by fusion regularization. It does not accommodate the regulations of gene expressions by regulators or the prior information.
(iii) RI-HeteroGGM. We conduct the regulation-incorporated network-based heterogeneity analysis (Li et al. 2023). This method extends HeteroGGM to incorporate heterogeneous regulation relationships and can simultaneously obtain subgroup memberships and determine the number of subgroups, precision matrices, and coefficient matrices. It does not accommodate prior information. This alternative may be the closest to the proposed approach.
(iv) PI-CGGM. This heterogeneity analysis approach accommodates prior information. It conducts the mixture modeling + CGGM analysis with a fixed number of subgroups. It is the first step of the proposed approach and solves objective function (2.2). To facilitate comparison, the number of subgroups is fixed in advance.

3.2. Results
When implementing the proposed approach, we set . Similar results are obtained under other values. We adopt the following measures to evaluate performance. For subgrouping accuracy, we consider and adjusted Rand index (RI), which measures the similarity between the estimated and true subgrouping structures. For estimation accuracy, we consider root mean squared error (RMSE). Specifically, for the precision matrices,

For variable selection accuracy, we consider true/false positive rates (TPR/FPR):

The above measures are defined accordingly for . Performance is evaluated for and separately.
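Selection accuracy for a single estimated precision matrix can be computed along the following lines. This sketch uses the usual definitions of true and false positive rates over off-diagonal entries. The function name `selection_rates`, the zero tolerance, and the restriction to one matrix (rather than an aggregate across matched subgroups) are assumptions for illustration.

```python
import numpy as np

def selection_rates(est, truth, tol=1e-8):
    """True/false positive rates for the off-diagonal support of a square matrix."""
    est = np.asarray(est, dtype=float)
    truth = np.asarray(truth, dtype=float)
    off = ~np.eye(truth.shape[0], dtype=bool)      # ignore the diagonal
    est_nz = np.abs(est[off]) > tol
    true_nz = np.abs(truth[off]) > tol
    tp = np.sum(est_nz & true_nz)                  # correctly identified nonzeros
    fp = np.sum(est_nz & ~true_nz)                 # zeros wrongly identified as nonzero
    n_pos = np.sum(true_nz)
    n_neg = np.sum(~true_nz)
    tpr = tp / n_pos if n_pos else float("nan")
    fpr = fp / n_neg if n_neg else float("nan")
    return float(tpr), float(fpr)

# Toy example: the truth has one edge, the estimate recovers it and adds a spurious one.
truth = np.array([[1.0, 0.4, 0.0], [0.4, 1.0, 0.0], [0.0, 0.0, 1.0]])
est = np.array([[1.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.0]])
tpr, fpr = selection_rates(est, truth)
```

For a coefficient matrix, which is generally not square, the diagonal mask would be dropped and all entries compared.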
The simulation results for ST1, the normal distribution, and are summarized in Table 1, and the other results are presented in Supplementary Tables S1–S7 (Supplementary Materials). The proposed method demonstrates competitive performance in subgrouping, selection, and estimation, across the whole spectrum of simulation scenarios. Specifically, when the prior information is correct, the proposed method can accurately identify the number of subgroups and achieve desirable estimation and selection accuracy. When the prior information is partially misspecified, the performance remains competitive compared to the approaches without incorporating prior information (i.e. RI-HeteroGGM). This suggests that the proposed approach has the “robustness” property—it can data-dependently adjust the impact of prior information. The alternative methods have inferior performance. HeteroGGM, which does not take the regulations into account, tends to over-estimate the number of subgroups and over-select the nonzero elements in the precision matrices. The estimation of the two-step procedure heavily depends on the subgrouping results, and it performs acceptably only when the number of subgroups is correctly specified—this is highly challenging in practice.
We also conduct a simulation experiment to assess the effects of prior information on the final estimation. We consider the setting with continuous regulators and ST1 precision matrices, at fixed dimensions and sample sizes. We consider varying prior information quality, as measured by the TPR/FPR values, with a larger TPR and a smaller FPR indicating a higher quality of prior information. The results are summarized in Supplementary Table S8 (Supplementary Materials). It is observed that the performance of subgrouping, estimation, and selection deteriorates as the degree of misspecification in prior information increases, which is as expected. It is also observed that, even when the prior information is completely wrong (with TPR=0 and FPR=1), the proposed method still performs comparably to those without accommodating prior information. This is due to the weighting strategy. Additional simulations are conducted to better understand the impact of the weighting parameter. The results in Supplementary Table S9 (Supplementary Materials) suggest that higher-quality prior information corresponds to a larger weight, which is highly sensible.

4. BREAST CANCER DATA ANALYSIS
Breast cancer is among the cancers most extensively studied using high-throughput profiling techniques, and there have been a handful of multiomics breast cancer studies. Here, we analyze the METABRIC data and refer to the original publications (Curtis et al. 2012; Pereira et al. 2016; Rueda et al. 2019) for details on the study and experimental designs. The dataset contains gene expression and copy number variation measurements on 1,898 subjects. Copy number variation has long been recognized as a critical regulator of gene expression. Although in principle the proposed analysis can be conducted using a large number of genes, to generate more reliable results, we focus on the “most interesting” genes. In particular, we consider genes in the PAM50 set (which has been manually curated and suggested as highly relevant for breast cancer subtyping) and in the KEGG “breast cancer” pathway. This leads to 154 genes that are highly likely to be relevant for breast cancer biology. We then identify the corresponding copy number variations. For prior information mining, we use the R package easyPubMed. The search involves breast cancer and any two genes out of the 154. The cooccurrence counts are presented in Fig. 3. Prior information is available for about 30% of the gene pairs. To focus on more reliable prior information, we impose a threshold of 10, which leads to a total of 195 gene pairs. More detailed information is available from the authors. Here, it is noted that the cutoff of 10 can be somewhat subjective. However, given the flexibility of the proposed method, this may not pose a serious concern.
When implementing the proposed method, we prespecify the upper bound on the number of subgroups. A total of four subgroups are identified, with sizes 782, 448, 439, and 229, respectively. The detailed membership information is available from the authors. The estimated gene expression networks and regulation relationships are shown in Supplementary Fig. S2 (Supplementary Materials). In Supplementary Table S11 (Supplementary Materials), we present the numbers of network edges and the numbers of overlapping edges. We further present the DeltaCon distances in parentheses to measure the similarity of the corresponding two networks (Tantardini et al. 2019). It is observed that the four subgroups have significantly different network structures. Subgroups 1 and 3 are the most similar in terms of gene expression networks. More comparisons between the networks are presented in Supplementary Fig. S3. It is observed that the network of subgroup 2 has more high-degree nodes, which indicates higher direct connectivity. The network of subgroup 3 has more high-betweenness nodes, which indicates a greater influence on information passing. Across the four subgroups, on average, about half of the edges in the prior information set are identified. Additionally, the likelihood for an edge to be identified is correlated with the cooccurrence count.
The proposed analysis is unsupervised, and there is a lack of an objective way to evaluate subgrouping accuracy. To get additional insights, we compare some key clinical features across the four subgroups. Supplementary Table S10 (Supplementary Materials) presents the results of the Chi-squared tests with FDR adjustment for the Nottingham prognostic index (NPI), number of lymph nodes examined positive (LNP), and age at diagnosis (Age). In Fig. 4, we further compare overall survival (OS) and relapse-free survival (RFS). Significant differences are observed across the four subgroups, which can provide “indirect support” to the sensibility of the analysis. Breast cancer has been subtyped based on molecular biomarkers. A commonly adopted scheme is the Claudin subtyping, under which breast cancer is classified as Basal-like, Her2, Luminal A, Luminal B, and Normal-like. We compare the obtained subgrouping with the Claudin subtyping, and the results are summarized in Supplementary Table S12 (Supplementary Materials). Consider Basal-like and Luminal A, both of which are divided into two of the identified subgroups. To get further insights, we compare the Basal-like subtype within subgroup 2 and subgroup 4, as well as the Luminal A subtype within subgroup 1 and subgroup 2. The results are presented in Supplementary Fig. S4 (Supplementary Materials). It is observed that the Luminal A subtype within subgroup 3 has a significantly better prognosis than that within subgroup 1. Additionally, the Basal-like subtype within subgroup 2 has a significantly poorer prognosis than that within subgroup 4. This suggests that the proposed analysis may lead to clinically meaningful findings that can complement the existing subtyping.
An interesting finding is that compared to the Basal-like subtype within subgroup 2, the Basal-like subtype within subgroup 4 exhibits enriched expression in growth factor signaling, particularly involving EGFR, MET, BRAF, and CTNNB1, as illustrated in Supplementary Fig. S5 (Supplementary Materials). This observation is consistent with the known characteristics of BL2 (a subtype of Basal-like breast cancer) documented in the literature (Hubalek et al. 2017).
This dataset is also analyzed using the alternative approaches. We fix the number of subgroups as for better comparability. The Rand index results are presented in Supplementary Table S13 (Supplementary Materials), which suggests that different approaches lead to significantly different subgrouping structures. For each approach, we compare survival across the identified subgroups and present the results in Supplementary Table S14 (Supplementary Materials). It is observed that the proposed approach can better separate the subjects into subgroups with more distinct survival, which can provide “indirect support” to the validity of the proposed approach.

5. DISCUSSION
In this study, we have developed a novel heterogeneity analysis approach that is based on both gene expression networks and gene expression-regulator relationships. Advancing from the existing literature, we have proposed a way to effectively and flexibly incorporate prior information contained in vast publications. Simulation and the analysis of a breast cancer dataset have demonstrated the practical utility of the proposed approach.
This study can be extended in multiple ways. The proposed analysis is not limited to “gene expressions + regulators.” For example, it is directly applicable to “protein expressions + gene expressions” with ligand-receptor pairs and protein-protein interaction (PPI) networks as the prior information. When there are multiple types of regulators, there have been a few developments in more subtly integrating them (as opposed to directly stacking them together). It is also possible to refine text mining and extract higher-quality prior information. For example, the convolutional neural network (CNN) technique developed in Wang et al. (2023) can be a viable choice. Some domain-specific language representation models, such as BioBERT pre-trained on large-scale biomedical corpora, can be further applied to enhance the extraction of prior information (Lee et al. 2020). Additionally, integrating some natural language processing techniques with web-based tools, such as GeneDive (Previde et al. 2018), can also facilitate the exploration of gene interactions. Theoretical exploration, such as the identifiability of the mixture model and consistency properties, may also be of interest (Ho and Nguyen 2016; Balakrishnan et al. 2017). The proposed strategy for incorporating prior information has been motivated by several recent successes. It is possible to develop other information-incorporating strategies. It will also be of interest to conduct additional data analyses.

Supplementary Material
Supplementary material is available at Biostatistics Journal online.

Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original article when quoting.