Genotype subtyping approach to identify unnoticed variants in diseases from GWAS data.
1/5 보강
[UNLABELLED] Genome-wide association studies (GWAS) are vital for investigating single-nucleotide polymorphisms (SNPs) in diseases.
APA
Garza-Hernandez D, Martinez-Ledesma E, Trevino V (2026). Genotype subtyping approach to identify unnoticed variants in diseases from GWAS data.. BioData mining, 19(1), 8. https://doi.org/10.1186/s13040-025-00512-2
MLA
Garza-Hernandez D, et al.. "Genotype subtyping approach to identify unnoticed variants in diseases from GWAS data.." BioData mining, vol. 19, no. 1, 2026, pp. 8.
PMID
41519843 ↗
Abstract 한글 요약
[UNLABELLED] Genome-wide association studies (GWAS) are vital for investigating single-nucleotide polymorphisms (SNPs) in diseases. Comparisons of detected SNPs for the same disease between populations often reveal gaps, which are attributed to factors such as population, sample size, and rare variants, among others. We propose a method referred to as ‘Genotype Subtyping’ that is based on a classical GWAS to identify seed SNPs; each SNP stratifies cases and controls by its genotype, followed by a sub-GWAS per genotype while keeping the additional number of statistical tests low. We evaluated four databases and populations, including Crohn’s disease, schizophrenia, breast cancer, and type 2 diabetes. Our findings and simulations demonstrate that Genotype Subtyping identifies additional significant SNPs or highlights SNPs that increase their significance importantly in all tested diseases. Some of these SNPs have already shown associations with the disorders in other populations. This result suggests that additional associations not observed before can be potentially novel. For instance, we confirmed associations near (rs7760611) with breast cancer and near (chr6:97339280) with schizophrenia, even though they initially fell short of genome-wide significance. Moreover, we noted a SNP near BUD13 (rs1263149) showing a potential link to type 2 diabetes. We demonstrate that Genotype Subtyping is highly sensitive and specific by positive and negative simulations. Our results underscore the reliability of Genotype Subtyping as a method for identifying and validating associations for specific phenotypes by reanalyzing GWAS data, ultimately facilitating target gene identification. The implementation is available in a GitHub repository (https://github.com/vtrevino/Genotype_Subtyping).
[SUPPLEMENTARY INFORMATION] The online version contains supplementary material available at 10.1186/s13040-025-00512-2.
[SUPPLEMENTARY INFORMATION] The online version contains supplementary material available at 10.1186/s13040-025-00512-2.
📖 전문 본문 읽기 PMC JATS · ~85 KB · 영문
Introduction
Introduction
Genome-wide association studies (GWAS) are indispensable for investigating the relationships between diseases and thousands of single-nucleotide polymorphisms (SNPs) [1]. With over 45,000 GWAS resulting in roughly 6,000 publications, totaling around 400,000 SNP-trait associations in the GWAS Catalog [2], our understanding of the biology underlying diseases has significantly progressed. These findings have informed genetic predictors, potential drug targets, and SNP heritability estimates [3].
However, GWAS results from diverse populations sometimes display discrepancies in identified SNPs due to factors like statistical chance, noise, and confounding variables [4]. Replication of GWAS findings can falter under specific conditions, such as rare variants [1], false positives [5], differences in linkage disequilibrium (LD) [6], population structure [7], and sample size. The choice of the GWAS model or parameters can also influence variant overlap across studies investigating the same phenotype [1]. For instance, the number of principal components used to correct for population structure ranges from 2 to 20 [8]. Furthermore, complex phenotypes, such as complex diseases, result from an intricate interplay between genetic and environmental factors, often involving numerous genes or SNPs making modest contributions [9], which further obscures the identification of similar SNPs among studies. For example, height correlates with 25,318 reported genetic associations, and obesity links to 294 genetic associations in the GWAS Catalog [2], and these associations come from different studies and populations, where not all results replicate. Thus, it is crucial to explore and develop methodologies capable of capturing features not accounted for by conventional or classical approaches to enhance predictive models and deepen our understanding of complex diseases.
One caveat in GWAS is the multiple testing problem. Larger sample sizes are necessary for GWAS to achieve significance when testing hundreds of thousands of SNPs due to an increase in the risk of false positives by small effect sizes [10]. In this context, an association is considered genome-wide significant with a p-value threshold of 5 × 10− 8 [11]. Nevertheless, this can lead to the discarding of some genuine associations within GWAS, which have insufficient power to reach significance [12]. For explorative purposes, it may be convenient to use other thresholds, which will be employed here to illustrate the potential use of the proposed strategy.
Utilizing novel methodologies on existing GWAS data presents a cost-effective opportunity to tackle complex diseases through data-sharing initiatives [13]. Various strategies have been proposed to enhance GWAS results, including the implementation of specific statistical models [14], integration of GWAS summary data from genetically correlated traits [15], estimation of the required sample size to achieve significant findings [16], utilization of extreme sampling techniques [17], and performing meta-analysis [18]. These approaches aim to improve GWAS outcomes’ accuracy, reliability, and interpretability, ultimately advancing our understanding of complex diseases. In this manuscript, we propose a generalization of a strategy found to be useful for variant discovery in two recent studies detailed below.
In a preliminary analysis for Crohn’s Disease (CD), we performed a classical GWAS analysis to generate a predictive CD model and selected an interesting variant in the gene MARCKS that appeared in the topmost important variants [19]. Then, we stratified the samples by the MARCKS genotypes, performing three exploratory secondary GWAS analyses, one per genotype [19]. Each analysis used only those samples carrying the same MARCKS genotype. In these analyses, we observed significant associations with known variants that were not found to be significant in the original classic GWAS. A similar approach was employed in a study investigating alcohol consumption, where sample stratification was performed using a well-known variant associated with the phenotype in the Asian population [20]. In this analysis, the authors found three additional SNPs associated with homozygous individuals and five different SNPs associated with heterozygous subjects.
Given the above-emerging evidence that conducting secondary GWAS analyses through variant stratification can be beneficial in identifying additional or novel variants [19, 20], we aimed to extend the applicability and generalize the use of this approach, termed Genotype Subtyping (GS). The extension of GS involves stratifying GWAS analyses based on each significant or potentially relevant variants, while the generalization entails applying this approach to analyze other phenotypes or GWAS databases. The increase of statistical tests is not a critical issue, if done prudently, because the number of additional tests is a minor fraction of the original GWAS. By exploring the potential of GS in diverse contexts, we sought to uncover its broader utility in elucidating the genetic underpinnings of various phenotypes. We aimed to assess the impact of stratifying four different phenotypes obtained from four distinct databases with diverse genetic backgrounds. By examining these phenotypes in different populations and considering their genetic heterogeneity, we aimed to uncover valuable information and contribute to a deeper understanding of these diseases. Our results provide evidence of associations near CASC15 for breast cancer, NDUFAF4 for schizophrenia, and a potentially novel SNP near BUD13 for type 2 diabetes, even though these SNPs initially fell short of genome-wide significance.
Genome-wide association studies (GWAS) are indispensable for investigating the relationships between diseases and thousands of single-nucleotide polymorphisms (SNPs) [1]. With over 45,000 GWAS resulting in roughly 6,000 publications, totaling around 400,000 SNP-trait associations in the GWAS Catalog [2], our understanding of the biology underlying diseases has significantly progressed. These findings have informed genetic predictors, potential drug targets, and SNP heritability estimates [3].
However, GWAS results from diverse populations sometimes display discrepancies in identified SNPs due to factors like statistical chance, noise, and confounding variables [4]. Replication of GWAS findings can falter under specific conditions, such as rare variants [1], false positives [5], differences in linkage disequilibrium (LD) [6], population structure [7], and sample size. The choice of the GWAS model or parameters can also influence variant overlap across studies investigating the same phenotype [1]. For instance, the number of principal components used to correct for population structure ranges from 2 to 20 [8]. Furthermore, complex phenotypes, such as complex diseases, result from an intricate interplay between genetic and environmental factors, often involving numerous genes or SNPs making modest contributions [9], which further obscures the identification of similar SNPs among studies. For example, height correlates with 25,318 reported genetic associations, and obesity links to 294 genetic associations in the GWAS Catalog [2], and these associations come from different studies and populations, where not all results replicate. Thus, it is crucial to explore and develop methodologies capable of capturing features not accounted for by conventional or classical approaches to enhance predictive models and deepen our understanding of complex diseases.
One caveat in GWAS is the multiple testing problem. Larger sample sizes are necessary for GWAS to achieve significance when testing hundreds of thousands of SNPs due to an increase in the risk of false positives by small effect sizes [10]. In this context, an association is considered genome-wide significant with a p-value threshold of 5 × 10− 8 [11]. Nevertheless, this can lead to the discarding of some genuine associations within GWAS, which have insufficient power to reach significance [12]. For explorative purposes, it may be convenient to use other thresholds, which will be employed here to illustrate the potential use of the proposed strategy.
Utilizing novel methodologies on existing GWAS data presents a cost-effective opportunity to tackle complex diseases through data-sharing initiatives [13]. Various strategies have been proposed to enhance GWAS results, including the implementation of specific statistical models [14], integration of GWAS summary data from genetically correlated traits [15], estimation of the required sample size to achieve significant findings [16], utilization of extreme sampling techniques [17], and performing meta-analysis [18]. These approaches aim to improve GWAS outcomes’ accuracy, reliability, and interpretability, ultimately advancing our understanding of complex diseases. In this manuscript, we propose a generalization of a strategy found to be useful for variant discovery in two recent studies detailed below.
In a preliminary analysis for Crohn’s Disease (CD), we performed a classical GWAS analysis to generate a predictive CD model and selected an interesting variant in the gene MARCKS that appeared in the topmost important variants [19]. Then, we stratified the samples by the MARCKS genotypes, performing three exploratory secondary GWAS analyses, one per genotype [19]. Each analysis used only those samples carrying the same MARCKS genotype. In these analyses, we observed significant associations with known variants that were not found to be significant in the original classic GWAS. A similar approach was employed in a study investigating alcohol consumption, where sample stratification was performed using a well-known variant associated with the phenotype in the Asian population [20]. In this analysis, the authors found three additional SNPs associated with homozygous individuals and five different SNPs associated with heterozygous subjects.
Given the above-emerging evidence that conducting secondary GWAS analyses through variant stratification can be beneficial in identifying additional or novel variants [19, 20], we aimed to extend the applicability and generalize the use of this approach, termed Genotype Subtyping (GS). The extension of GS involves stratifying GWAS analyses based on each significant or potentially relevant variants, while the generalization entails applying this approach to analyze other phenotypes or GWAS databases. The increase of statistical tests is not a critical issue, if done prudently, because the number of additional tests is a minor fraction of the original GWAS. By exploring the potential of GS in diverse contexts, we sought to uncover its broader utility in elucidating the genetic underpinnings of various phenotypes. We aimed to assess the impact of stratifying four different phenotypes obtained from four distinct databases with diverse genetic backgrounds. By examining these phenotypes in different populations and considering their genetic heterogeneity, we aimed to uncover valuable information and contribute to a deeper understanding of these diseases. Our results provide evidence of associations near CASC15 for breast cancer, NDUFAF4 for schizophrenia, and a potentially novel SNP near BUD13 for type 2 diabetes, even though these SNPs initially fell short of genome-wide significance.
Results
Results
Our methodology comprises a two-step approach, beginning with a conventional GWAS and followed by stratified GWAS analyses. We tested our approach on four distinct databases representing phenotypes with diverse genetic backgrounds: Crohn’s Disease (CD), Schizophrenia (SCZ), Breast Cancer (BC), and Type 2 Diabetes (T2D). The overall procedure is outlined in Fig. 1. The study begins with a conventional GWAS, where p-values (PGWAS) for each SNP are computed. Potential markers for subtyping are identified as SNPs with PGWAS < PST, where PST
= 10− 6 (generating k SNPs as shown in Fig. 1). Because we would like to identify novel signals, this threshold favors more exploration for the sub-GWAS process. These potential markers undergo a linkage disequilibrium-based filtering process, forming the basis for the subsequent subtyping GWAS analysis. Then, only SNPs having a PGWAS < PSG are chosen for the subsequent subtyping GWAS (sub-GWAS) where we used PSG
= 10− 3 (selecting s SNPs as shown in Fig. 1). This threshold serves to speed up further SNP analysis, favors exploration over exploitation, helps to keep the number of additional tests low, and, importantly, assumes that SNPs showing similar allele frequencies (those whose PGWAS
> PSG) would be unlikely to show associations in the next sub-GWAS. The study then stratifies cases and controls based on the genotype of each potential marker, conducting sub-GWAS within each stratum, and selects only SNPs that increased importantly their p-value association compared to the original GWAS. Through this approach, our study aims to pinpoint and scrutinize potential genetic markers associated with the selected phenotypes, considering genotype-based stratification and building upon the foundation laid by the classic GWAS results without inflating too much the number of additional statistical tests.
Identification of additional SNPs through genotype subtyping
In the context of Genotype Subtyping (GS), where we explore potential associations using a reduced number of SNPs, SNPs were deemed additional and potentially associated with the phenotype if they met two criteria: (i) a GS-association p-value (PGS) < 5 × 10− 7 (indicating an explorative suggestive association); (ii) an increase in significance by at least one order of magnitude compared to the classic PGWAS, defined as Δp > = 1, where Δp = log10(PGS) - log10(PGWAS). Additionally, we considered SNPs as “novel” (new predicted association) if (iii) they were located more than 500,000 base pairs away from any SNP used for GS stratification. Table 1 provides an overview of the signals identified through classic GWAS and GS for each analyzed phenotype, while Table 2 shows the top identifications. These results were achieved by keeping the additional tests low as shown in Table 3, either they are a small fraction of the original GWAS tests, or they do not reach millions, which may compromise common genome-wide significance thresholds. The collective findings emphasize that the GS approach facilitated the discovery of previously undetected signals compared to the classic GWAS methodology, as elaborated in the following sections.
As an illustrative example, Fig. 2 contrasts the results of the classic GWAS and the GS approach applied to breast cancer data. This analysis employed 14 independent SNPs as subtyping markers, enabling 42 sub-GWAS analyses based on each genotype. The GS analysis revealed 139 additional SNPs associated with 12 genes, as shown in Fig. 2. Remarkably, some of these additional SNPs corroborated our approach by confirming previously identified genes from the classic GWAS, albeit with different nearby SNPs. For instance, SNPs in DIRC3 exhibited genome-wide significant associations in the classic GWAS (PGWAS < 5 × 10− 8). However, certain SNPs proximate to DIRC3, albeit not reaching genome-wide significance in the classic GWAS (PGWAS > 10− 6), displayed increased association after the GS (PGS < 10− 8). Similarly, SNPs within or near genes like CASC16, FGFR2, FAM72B, EBF1, TNP1, and PTHLH were identified in the classic GWAS and GS, proving that GS signals are not contradictory. On the other hand, genes like CASC15, PPIAL4A, LGMN, MAP3K1, MIER3, and MYT1 were regarded as novel findings because the SNPs within these genes were not identified in the classic GWAS, even at the suggestive p-value (PGWAS < 10− 6). These identifications indicate that the GS approach highlighted new predicted associations SNPs previously undiscovered in the context of breast cancer. Most of these genes have established associations with cancer in PubMed or the BC phenotype in other databases such as GWAS Catalog and OpenTargets (refer to Supplementary Table 1). These associations suggest that the SNPs identified exclusively through the GS represent credible associations. To delve into a specific case, we will focus on CASC15 (rs7760611), which was detected by subtyping an SNP in CCND1 (at the heterozygous genotype AC). CCND1, recognized as an oncogene encoding cyclin D1, is overexpressed in more than 50% of breast cancer cases [21], with documented associations reported in OpenTargets and GWAS Catalog stemming from genetic studies, somatic mutations, and pathways and system biology [2, 22]. CASC15, cancer susceptibility 15, is a non-protein-coding RNA gene with a potential role in regulating cell proliferation [21]. It exhibits increased expression across various tumors, including breast cancer, but reduced expression in ovarian cancer, glioma, and neuroblastoma [23]. The biological interplay between these two genes has been established in laryngeal squamous cell carcinoma, where CASC15 upregulates cyclin D1 by downregulating miR-365, thereby promoting cell proliferation [24]. Therefore, we explored a possible statistical interaction between the SNPs, which yielded a significant interaction (p < 0.05). These concise analyses underscore the potential validity of the originally unnoticed SNP in CASC15 (rs7760611), unveiled through the GS. Moreover, CASC15, together with MAP3K1 and MIER3, has already been linked to breast cancer according to the GWAS Catalog. Remarkably, these associations would remain undiscovered in this database without applying the GS strategy.
In summary, the combination of validated and predicted findings concerning known genes highlights the potential of the GS approach to reveal additional variants and enhance our comprehension of the genetic landscape associated with various phenotypes. Table 2 summarizes the top additional findings and validation results for the four analyzed phenotypes.
For Crohn’s Disease, Schizophrenia, and Type 2 Diabetes, we conducted 217, 57, and 23 subtyping GWAS analyses, respectively, as described in the subsequent section. These figures offer a comprehensive overview of the GS results for each analyzed phenotype, providing further insights into the genetic associations uncovered through this stratification strategy.
The SNPs exhibiting significance solely in the GS analysis may suggest an interaction with the SNP from which the GS was derived. To explore this possibility, we conducted an interaction test for all GS results and their corresponding GS subtyping markers. Notably, many SNPs whose p-values improved during GS also displayed a potential interaction (p < 0.05) with the GS-derived polymorphism. This suggests that the interaction hypothesis, even if subtle, could explain why these SNPs gained importance in the subtyping analysis. However, it is crucial to acknowledge that not all identifications can be solely explained by the interaction hypothesis. For example, in the case of breast cancer, three of the five SNPs listed in Table 2 did not significantly interact with the corresponding SNP used in GS, indicating that the interaction hypothesis does not account for all the identifications made through the GS strategy.
Genotype subtyping reveals potentially associated SNPs for disease phenotypes
In our exploration of various phenotypes, we have conducted a brief analysis of results obtained from both classical GWAS and Genotype Subtyping (GS) approaches. The following paragraphs provide a concise summary of the key findings.
Type 2 diabetes (T2D) In our classical GWAS analysis for T2D, we identified noteworthy SNPs like TCF7L2 and RASSF8, along with other suggestive SNPs (Fig. 3). TCF7L2, a well-established signal in T2D [25], validates our classical GWAS approach. Additionally, RASSF8 is associated with height in the GWAS Catalog, which is subsequently associated with T2D in women [26]. Our GS analysis identified 8 additional candidate SNPs in genes like AHCTF1, BUD13, TMEM261 (DMAC1), CHD1, NCK2, and ATOH1. Remarkably, none of these signals were initially identified in the classical GWAS (PGWAS < 10− 6), but most have known links to T2D. For example, CHD1 has been statistically associated with T2D through a meta-analysis [26], while other genes have been connected to phenotypes related to T2D in the GWAS Catalog: ATOH1 for body mass index, BUD13 for triglyceride levels, and NCK2 and DMAC1 for height [27]. AHCTF1 is also associated with triglycerides (triacylglycerol 58:8) in the GWAS Catalog. Notably, three of these genes exhibited significant interaction with their corresponding GS SNP (see Table 2). The case of BUD13 stands out (rs1263149), as it increased its p-value by three orders of magnitude (Δp = 3.04) and was identified through subtyping SNAPC5 (PGWAS=8 × 10− 7), a gene scarcely studied but showing a significant interaction (Pint=0.04). BUD13, an RNA-binding protein involved in mRNA spliceosome [28], has been associated with lipodystrophy [29], plays a potential role in the pathogenesis of metabolic syndrome [30], and has been confirmed in both Caucasian and Asian populations by non-GWAS approaches [31–33]. Given that metabolic syndrome increases the risk of developing T2D [34, 35], BUD13 can be regarded as a new predicted association for T2D. In summary, our analysis suggests that the novel signals identified through GS are either known or likely linked to T2D.
Breast cancer BC was discussed as an example in the previous section. Here, we add an interesting observation. The 14 SNPs used for GS from the original GWAS analysis included SNPs from EBF1 and DIRC3. Markers in these genes are commonly positive associations in BC [36, 37]. Interestingly, the GS applied to the EBF1 marker enhances other SNPs in DIRC3, while the GS from the DIRC3 marker enhances others from EBF1. Both detections have significant interaction (Fig. 2 and Supplementary Table 1). These results suggest that EBF1 and DIRC3 may have a close functional relation in BC.
Schizophrenia Our classical GWAS analysis for schizophrenia revealed numerous significant SNPs (Fig. 4). Among these, the strongest signal on ZMYM2 had been identified in another genome sequencing study [38]. Additionally, in the context of schizophrenia, LY6G5C has been linked to aberrant epigenetic marks [39], PITPNM1 was found in de novo mutations [40], and COX19 expression is dysregulated in blood cells [41]; all these identifications support our GWAS analysis. In the subsequent GS analysis derived from 30 SNPs, only three new SNPs met our criteria, found in the genes NDUFAF4 and PKMYT1-PAQR4. The most significant variant (chr6:97339280) near NDUFAF4 (PGS=7 × 10− 9) is associated with neurological disorders [42, 43] and has been identified as a risk gene for schizophrenia by a network analysis [44], but not previously using GWAS.
Crohn’s Disease (CD) Our classical GWAS analysis identified 185 SNPs, as shown in Fig. 5. Among the most significant genes near these SNPs, we found IL23R, NOD2, IL12B, ZNF365, RPL3, and PDGFB, all of which are well-known in CD or inflammatory bowel disease and have been reported in the GWAS Catalog. This validation supports the robustness of our initial GWAS analysis. Subsequently, our GS analysis identified 59 additional SNPs covering 34 genes, as listed in Supplementary Table 1. In this analysis, we excluded SNPs in proximity (< 500kbp) to identify known significant SNPs in the GWAS to avoid confusion. As a result, we recognized SNPs within or close to 19 genes (ADAD1, C6orf10, CCR6, FGFR1OP, GCKR, HDAC2, HIC2, HLA-DQA2, IL2, IL21, JAK2, PTPN2, PTPRD, RNF144A, SCAF8, SLC45A1, SYPL1, TMEM44, TRIM31) associated with CD. Among these genes, GCKR, FGFR1OP (CEP43), IL2, IL21, JAK2, SYPL1, and PTPN2 have already been linked to CD in the GWAS Catalog and other databases. Some other genes have also been associated with CD in specific studies, such as CCR6 [45], HIC2 [46], HLA-DQA2 [47], SCAF8 [48], and TRIM31 [49]. Therefore, the GS analysis revealed additional SNPs that showed positive associations with CD within the same dataset that were not identified in a classical GWAS analysis. Furthermore, several of these positive SNPs have not been extensively studied in the context of CD (ADAD1, C6orf10, HDAC2, PTPN2, PTPRD, RNF144A, SLC45A1, SYPL1, and TMEM44), indicating research opportunities in this field. In addition to the above genes, we noted SNX20, which is highly associated with CD (2 × 10− 54 in GWAS Catalog), that was significant only after applying GS.
These findings collectively demonstrate the potential of the GS approach in uncovering additional variants and enhancing our understanding of genetic landscapes associated with various phenotypes.
Randomized analysis To determine the possible random effects of using GS, we performed two randomized analyses using the data of Crohn’s Disease, which shows the largest fraction of additional tests (Table 3). The first randomized analysis was performed as follows. We performed a similar pipeline, randomizing the samples between healthy controls (HC) and CD, that is, randomizing the phenotype information. As expected, the association p-values were poor (Fig. 6). The top SNP showed a p = 4 × 10− 7, followed by SNPs at 5 × 10− 6 and 1.6 × 10− 5 at very different locations, suggesting that these SNPs showed no clear associations for a GWAS. Then, for the GS in real databases, we used a threshold of 10− 6 to select SNPs for the stratification. Nevertheless, in this randomized analysis, only 1 SNP would pass this criterion. Therefore, we relaxed this cut-off and used 10− 4 to add a similar number of SNPs as those observed in a real database. The SNPs used for GS stratification reached 24. Then, we selected potential SNPs for sub-GWAS, which yielded 232 SNPs (at p = 10− 3), and we ran the GS analysis. In the analysis of the real datasets, we utilized two criteria to select potentially interesting SNPs. The most important criterion is the strength of association, that is, the p-value from the sub-GWAS was more significant than the original GWAS for at least one order of magnitude. After the GS, we did not observe any SNP whose p-value was stronger (Fig. 6, inset). Indeed, we observed that all tested SNPs decreased the p-value by around three orders of magnitude, overall, while in the real databases, some SNP p-values increased more than one order of magnitude, reaching inclusive two or three orders of magnitude. In summary, this randomized analysis shows that GS does not generate SNPs with increased significance randomly.
The second randomized analysis was performed by injecting positive SNPs into the previous randomized analysis. We added 10 GWAS-positive SNPs by generating random genotypes using the Hardy-Weinberg equilibrium equation (p2 + 2pq + q2 = 1, where p and q represent fractions of the common allele and minor allele, respectively, and p = 1-q), utilizing q as a parameter and the same number of controls and cases from Crohn’s Disease data. We generated the set A1 with 5 SNPs with q around 0.45 and the set B1 with 5 SNPs with q around 0.145 assigned to the odd chromosomes one by one at position 1 (see Methods for details). Then we added one SNP for each SNP in the sets A1 and B1 that were GWAS-negative (p-value > 5 × 10− 8) but that were GS-positive (p-value < 5 × 10− 8 within the GS algorithm) by adjusting the q values per genotype around their corresponding baseline (see Methods). These SNPs were assigned to even chromosomes one by one at position 1. Thus, this procedure also generated 10 SNPs, 5 SNPs for the set A2, and 5 SNPs for the set B2. Then, we performed the GS pipeline. The results are shown in Fig. 7. As expected, we observed only the 10 injected SNPs of sets A1 and B1 detected as genome-wide significant (squares in Fig. 7) and the 10 injected SNPs of sets A2 and B2 below 10− 6 and above 10− 4 (filled triangles in Fig. 7). These results show that our 20 simulated SNPs are correctly detected. After the GS pipeline, the 10 injected GWAS-negative and GS-positive SNPs were all detected as positives by our GS pipeline (open triangles in Fig. 7). Importantly, none of the other SNPs included in the GS procedure (more than 300) were detected as positive; that is, there were no false positives. The Δp values were, on average, 3.8 and 4.4 for sets A2 and B2, respectively, while the interaction p-values were correspondingly 10− 3.6 and 10− 6.2 (see Supplementary Table 1 for details). This analysis shows that the GS pipeline is able to detect true positive SNPs with high sensitivity and specificity.
Comparison against conditional analysis After a classical GWAS, significant SNPs can be used as covariates, performing additional GWAS analyses as an attempt to identify more variants [50]. This procedure is continued until no more variants are identified. We performed conditional analyses to compare the identified SNPs against those from Genotype Subtyping. As a proof of concept, we used the Crohn’s Disease database. We performed conditional analyses using the 185 SNPs identified in the classical GWAS either together or separated, one for each SNP. The result of the pulled conditional analysis is shown in Fig. 8A, while some examples of the conditional analysis on individual SNPs are shown in Fig. 8B. The pulled analysis conditioning all SNPs seems to contain the aggregated identifications of the individual analyses. This is expected because the conditional analyses use a linear combination of all SNPs independently. We then compared the SNPs identified by the pulled conditional analyses (red spots in Fig. 8A) with those identified by GS. From the 57 SNPs identified by GS in the Crohn’s Disease dataset, 24 were also present in the conditional analysis at 5 × 10− 7 or 17 SNPs at 5 × 10− 8 (marked in columns “present in conditional analysis” in Supplementary Table S1). That is, only between 30% and 42% of the results obtained by GS are identified by a conditional analysis. Therefore, GS can be seen as a complementary tool to identify interesting SNPs.
Analyses of synthetic datasets The analysis of synthetic data can be useful because they are generated under specific hypotheses and may overcome the limitation of using a few thousand samples. HAPNEST generated realistic large databases that include genotypes and phenotypes at diverse parameters [51]. The HAPNEST dataset BSST936 (https://www.ebi.ac.uk/biostudies/studies/S-BSST936) was generated under classical association parameters among cases and controls and generated up to 1 M simulated individuals across 6 ancestry groups and up to 6.8 M simulated SNPs. Therefore, it can be used as a blind case study. We used the Phenotype 9 (generated at 0.5 heritability and 0.0001 polygenicity) from simulated AFR ancestry and generated two datasets: the top 100 K individuals and the top 10 K individuals. We assume that true association is given by the GWAS analysis of the 100 K individuals, but we used the 10 K individuals to perform GS and compare the results. Indeed, the GWAS analysis in both datasets shows correlated p-values but a stronger association for the 100 K dataset (Fig. 9A). Note that the 10 K dataset is a subset of the 100 K; thus, the differences are attributed to sub-sampling. We then applied the GS to the 10 K dataset to simulate a common analysis limited by sample size. This analysis is important because it helps to identify false and true positives. For comparisons with other datasets, we first used the same threshold and criteria used for the disease phenotypes. The results of our GS pipeline are shown in Fig. 9B. We identified 342 SNPs (at GS p < 5 × 10− 7 & Δp > 1) from a total of ~ 1.2 M additional tests (adding 17% tests) that represent a positive rate of 0.03% of total tests. We compared the p-values of the identifications obtained by GS with those of the 100 K dataset under the assumption that the identifications in the 100 K are true (Fig. 9C). We observed that 201 of the 317 unique SNPs were actually significant in the 100 K GWAS. This result suggests an estimated 63.4% of true positives and 36.6% of false positives. Nevertheless, given that this dataset contains 6.4 M SNPs, the equivalent Bonferroni correction for the GWAS should be 7.3 × 10− 9 while the correction for the GS should be 6.2 × 10− 9 due to the increase in tests. At these thresholds, the GS would generate 30 positive SNPs (at GS p < 6.2 × 10− 9 & Δp > 1) of which only 2 are false, estimating a false positive rate of 6.6%. Thus, the false discovery rate of GS in this dataset lies in the range of 36.6% and 6.6% depending on the p-value threshold used, which in this case represents more exploration in the first and more precision in the last. Besides the false positive estimations, the analysis of the 10 K result compared with the 100 K also shows that GS is capable of finding genuine SNPs that could be obscured due to sub-sampling.
Our methodology comprises a two-step approach, beginning with a conventional GWAS and followed by stratified GWAS analyses. We tested our approach on four distinct databases representing phenotypes with diverse genetic backgrounds: Crohn’s Disease (CD), Schizophrenia (SCZ), Breast Cancer (BC), and Type 2 Diabetes (T2D). The overall procedure is outlined in Fig. 1. The study begins with a conventional GWAS, where p-values (PGWAS) for each SNP are computed. Potential markers for subtyping are identified as SNPs with PGWAS < PST, where PST
= 10− 6 (generating k SNPs as shown in Fig. 1). Because we would like to identify novel signals, this threshold favors more exploration for the sub-GWAS process. These potential markers undergo a linkage disequilibrium-based filtering process, forming the basis for the subsequent subtyping GWAS analysis. Then, only SNPs having a PGWAS < PSG are chosen for the subsequent subtyping GWAS (sub-GWAS) where we used PSG
= 10− 3 (selecting s SNPs as shown in Fig. 1). This threshold serves to speed up further SNP analysis, favors exploration over exploitation, helps to keep the number of additional tests low, and, importantly, assumes that SNPs showing similar allele frequencies (those whose PGWAS
> PSG) would be unlikely to show associations in the next sub-GWAS. The study then stratifies cases and controls based on the genotype of each potential marker, conducting sub-GWAS within each stratum, and selects only SNPs that increased importantly their p-value association compared to the original GWAS. Through this approach, our study aims to pinpoint and scrutinize potential genetic markers associated with the selected phenotypes, considering genotype-based stratification and building upon the foundation laid by the classic GWAS results without inflating too much the number of additional statistical tests.
Identification of additional SNPs through genotype subtyping
In the context of Genotype Subtyping (GS), where we explore potential associations using a reduced number of SNPs, SNPs were deemed additional and potentially associated with the phenotype if they met two criteria: (i) a GS-association p-value (PGS) < 5 × 10− 7 (indicating an explorative suggestive association); (ii) an increase in significance by at least one order of magnitude compared to the classic PGWAS, defined as Δp > = 1, where Δp = log10(PGS) - log10(PGWAS). Additionally, we considered SNPs as “novel” (new predicted association) if (iii) they were located more than 500,000 base pairs away from any SNP used for GS stratification. Table 1 provides an overview of the signals identified through classic GWAS and GS for each analyzed phenotype, while Table 2 shows the top identifications. These results were achieved by keeping the additional tests low as shown in Table 3, either they are a small fraction of the original GWAS tests, or they do not reach millions, which may compromise common genome-wide significance thresholds. The collective findings emphasize that the GS approach facilitated the discovery of previously undetected signals compared to the classic GWAS methodology, as elaborated in the following sections.
As an illustrative example, Fig. 2 contrasts the results of the classic GWAS and the GS approach applied to breast cancer data. This analysis employed 14 independent SNPs as subtyping markers, enabling 42 sub-GWAS analyses based on each genotype. The GS analysis revealed 139 additional SNPs associated with 12 genes, as shown in Fig. 2. Remarkably, some of these additional SNPs corroborated our approach by confirming previously identified genes from the classic GWAS, albeit with different nearby SNPs. For instance, SNPs in DIRC3 exhibited genome-wide significant associations in the classic GWAS (PGWAS < 5 × 10− 8). However, certain SNPs proximate to DIRC3, albeit not reaching genome-wide significance in the classic GWAS (PGWAS > 10− 6), displayed increased association after the GS (PGS < 10− 8). Similarly, SNPs within or near genes like CASC16, FGFR2, FAM72B, EBF1, TNP1, and PTHLH were identified in the classic GWAS and GS, proving that GS signals are not contradictory. On the other hand, genes like CASC15, PPIAL4A, LGMN, MAP3K1, MIER3, and MYT1 were regarded as novel findings because the SNPs within these genes were not identified in the classic GWAS, even at the suggestive p-value (PGWAS < 10− 6). These identifications indicate that the GS approach highlighted new predicted associations SNPs previously undiscovered in the context of breast cancer. Most of these genes have established associations with cancer in PubMed or the BC phenotype in other databases such as GWAS Catalog and OpenTargets (refer to Supplementary Table 1). These associations suggest that the SNPs identified exclusively through the GS represent credible associations. To delve into a specific case, we will focus on CASC15 (rs7760611), which was detected by subtyping an SNP in CCND1 (at the heterozygous genotype AC). CCND1, recognized as an oncogene encoding cyclin D1, is overexpressed in more than 50% of breast cancer cases [21], with documented associations reported in OpenTargets and GWAS Catalog stemming from genetic studies, somatic mutations, and pathways and system biology [2, 22]. CASC15, cancer susceptibility 15, is a non-protein-coding RNA gene with a potential role in regulating cell proliferation [21]. It exhibits increased expression across various tumors, including breast cancer, but reduced expression in ovarian cancer, glioma, and neuroblastoma [23]. The biological interplay between these two genes has been established in laryngeal squamous cell carcinoma, where CASC15 upregulates cyclin D1 by downregulating miR-365, thereby promoting cell proliferation [24]. Therefore, we explored a possible statistical interaction between the SNPs, which yielded a significant interaction (p < 0.05). These concise analyses underscore the potential validity of the originally unnoticed SNP in CASC15 (rs7760611), unveiled through the GS. Moreover, CASC15, together with MAP3K1 and MIER3, has already been linked to breast cancer according to the GWAS Catalog. Remarkably, these associations would remain undiscovered in this database without applying the GS strategy.
In summary, the combination of validated and predicted findings concerning known genes highlights the potential of the GS approach to reveal additional variants and enhance our comprehension of the genetic landscape associated with various phenotypes. Table 2 summarizes the top additional findings and validation results for the four analyzed phenotypes.
For Crohn’s Disease, Schizophrenia, and Type 2 Diabetes, we conducted 217, 57, and 23 subtyping GWAS analyses, respectively, as described in the subsequent section. These figures offer a comprehensive overview of the GS results for each analyzed phenotype, providing further insights into the genetic associations uncovered through this stratification strategy.
The SNPs exhibiting significance solely in the GS analysis may suggest an interaction with the SNP from which the GS was derived. To explore this possibility, we conducted an interaction test for all GS results and their corresponding GS subtyping markers. Notably, many SNPs whose p-values improved during GS also displayed a potential interaction (p < 0.05) with the GS-derived polymorphism. This suggests that the interaction hypothesis, even if subtle, could explain why these SNPs gained importance in the subtyping analysis. However, it is crucial to acknowledge that not all identifications can be solely explained by the interaction hypothesis. For example, in the case of breast cancer, three of the five SNPs listed in Table 2 did not significantly interact with the corresponding SNP used in GS, indicating that the interaction hypothesis does not account for all the identifications made through the GS strategy.
Genotype subtyping reveals potentially associated SNPs for disease phenotypes
In our exploration of various phenotypes, we have conducted a brief analysis of results obtained from both classical GWAS and Genotype Subtyping (GS) approaches. The following paragraphs provide a concise summary of the key findings.
Type 2 diabetes (T2D) In our classical GWAS analysis for T2D, we identified noteworthy SNPs like TCF7L2 and RASSF8, along with other suggestive SNPs (Fig. 3). TCF7L2, a well-established signal in T2D [25], validates our classical GWAS approach. Additionally, RASSF8 is associated with height in the GWAS Catalog, which is subsequently associated with T2D in women [26]. Our GS analysis identified 8 additional candidate SNPs in genes like AHCTF1, BUD13, TMEM261 (DMAC1), CHD1, NCK2, and ATOH1. Remarkably, none of these signals were initially identified in the classical GWAS (PGWAS < 10− 6), but most have known links to T2D. For example, CHD1 has been statistically associated with T2D through a meta-analysis [26], while other genes have been connected to phenotypes related to T2D in the GWAS Catalog: ATOH1 for body mass index, BUD13 for triglyceride levels, and NCK2 and DMAC1 for height [27]. AHCTF1 is also associated with triglycerides (triacylglycerol 58:8) in the GWAS Catalog. Notably, three of these genes exhibited significant interaction with their corresponding GS SNP (see Table 2). The case of BUD13 stands out (rs1263149), as it increased its p-value by three orders of magnitude (Δp = 3.04) and was identified through subtyping SNAPC5 (PGWAS=8 × 10− 7), a gene scarcely studied but showing a significant interaction (Pint=0.04). BUD13, an RNA-binding protein involved in mRNA spliceosome [28], has been associated with lipodystrophy [29], plays a potential role in the pathogenesis of metabolic syndrome [30], and has been confirmed in both Caucasian and Asian populations by non-GWAS approaches [31–33]. Given that metabolic syndrome increases the risk of developing T2D [34, 35], BUD13 can be regarded as a new predicted association for T2D. In summary, our analysis suggests that the novel signals identified through GS are either known or likely linked to T2D.
Breast cancer BC was discussed as an example in the previous section. Here, we add an interesting observation. The 14 SNPs used for GS from the original GWAS analysis included SNPs from EBF1 and DIRC3. Markers in these genes are commonly positive associations in BC [36, 37]. Interestingly, the GS applied to the EBF1 marker enhances other SNPs in DIRC3, while the GS from the DIRC3 marker enhances others from EBF1. Both detections have significant interaction (Fig. 2 and Supplementary Table 1). These results suggest that EBF1 and DIRC3 may have a close functional relation in BC.
Schizophrenia Our classical GWAS analysis for schizophrenia revealed numerous significant SNPs (Fig. 4). Among these, the strongest signal on ZMYM2 had been identified in another genome sequencing study [38]. Additionally, in the context of schizophrenia, LY6G5C has been linked to aberrant epigenetic marks [39], PITPNM1 was found in de novo mutations [40], and COX19 expression is dysregulated in blood cells [41]; all these identifications support our GWAS analysis. In the subsequent GS analysis derived from 30 SNPs, only three new SNPs met our criteria, found in the genes NDUFAF4 and PKMYT1-PAQR4. The most significant variant (chr6:97339280) near NDUFAF4 (PGS=7 × 10− 9) is associated with neurological disorders [42, 43] and has been identified as a risk gene for schizophrenia by a network analysis [44], but not previously using GWAS.
Crohn’s Disease (CD) Our classical GWAS analysis identified 185 SNPs, as shown in Fig. 5. Among the most significant genes near these SNPs, we found IL23R, NOD2, IL12B, ZNF365, RPL3, and PDGFB, all of which are well-known in CD or inflammatory bowel disease and have been reported in the GWAS Catalog. This validation supports the robustness of our initial GWAS analysis. Subsequently, our GS analysis identified 59 additional SNPs covering 34 genes, as listed in Supplementary Table 1. In this analysis, we excluded SNPs in proximity (< 500kbp) to identify known significant SNPs in the GWAS to avoid confusion. As a result, we recognized SNPs within or close to 19 genes (ADAD1, C6orf10, CCR6, FGFR1OP, GCKR, HDAC2, HIC2, HLA-DQA2, IL2, IL21, JAK2, PTPN2, PTPRD, RNF144A, SCAF8, SLC45A1, SYPL1, TMEM44, TRIM31) associated with CD. Among these genes, GCKR, FGFR1OP (CEP43), IL2, IL21, JAK2, SYPL1, and PTPN2 have already been linked to CD in the GWAS Catalog and other databases. Some other genes have also been associated with CD in specific studies, such as CCR6 [45], HIC2 [46], HLA-DQA2 [47], SCAF8 [48], and TRIM31 [49]. Therefore, the GS analysis revealed additional SNPs that showed positive associations with CD within the same dataset that were not identified in a classical GWAS analysis. Furthermore, several of these positive SNPs have not been extensively studied in the context of CD (ADAD1, C6orf10, HDAC2, PTPN2, PTPRD, RNF144A, SLC45A1, SYPL1, and TMEM44), indicating research opportunities in this field. In addition to the above genes, we noted SNX20, which is highly associated with CD (2 × 10− 54 in GWAS Catalog), that was significant only after applying GS.
These findings collectively demonstrate the potential of the GS approach in uncovering additional variants and enhancing our understanding of genetic landscapes associated with various phenotypes.
Randomized analysis To determine the possible random effects of using GS, we performed two randomized analyses using the data of Crohn’s Disease, which shows the largest fraction of additional tests (Table 3). The first randomized analysis was performed as follows. We performed a similar pipeline, randomizing the samples between healthy controls (HC) and CD, that is, randomizing the phenotype information. As expected, the association p-values were poor (Fig. 6). The top SNP showed a p = 4 × 10− 7, followed by SNPs at 5 × 10− 6 and 1.6 × 10− 5 at very different locations, suggesting that these SNPs showed no clear associations for a GWAS. Then, for the GS in real databases, we used a threshold of 10− 6 to select SNPs for the stratification. Nevertheless, in this randomized analysis, only 1 SNP would pass this criterion. Therefore, we relaxed this cut-off and used 10− 4 to add a similar number of SNPs as those observed in a real database. The SNPs used for GS stratification reached 24. Then, we selected potential SNPs for sub-GWAS, which yielded 232 SNPs (at p = 10− 3), and we ran the GS analysis. In the analysis of the real datasets, we utilized two criteria to select potentially interesting SNPs. The most important criterion is the strength of association, that is, the p-value from the sub-GWAS was more significant than the original GWAS for at least one order of magnitude. After the GS, we did not observe any SNP whose p-value was stronger (Fig. 6, inset). Indeed, we observed that all tested SNPs decreased the p-value by around three orders of magnitude, overall, while in the real databases, some SNP p-values increased more than one order of magnitude, reaching inclusive two or three orders of magnitude. In summary, this randomized analysis shows that GS does not generate SNPs with increased significance randomly.
The second randomized analysis was performed by injecting positive SNPs into the previous randomized analysis. We added 10 GWAS-positive SNPs by generating random genotypes using the Hardy-Weinberg equilibrium equation (p2 + 2pq + q2 = 1, where p and q represent fractions of the common allele and minor allele, respectively, and p = 1-q), utilizing q as a parameter and the same number of controls and cases from Crohn’s Disease data. We generated the set A1 with 5 SNPs with q around 0.45 and the set B1 with 5 SNPs with q around 0.145 assigned to the odd chromosomes one by one at position 1 (see Methods for details). Then we added one SNP for each SNP in the sets A1 and B1 that were GWAS-negative (p-value > 5 × 10− 8) but that were GS-positive (p-value < 5 × 10− 8 within the GS algorithm) by adjusting the q values per genotype around their corresponding baseline (see Methods). These SNPs were assigned to even chromosomes one by one at position 1. Thus, this procedure also generated 10 SNPs, 5 SNPs for the set A2, and 5 SNPs for the set B2. Then, we performed the GS pipeline. The results are shown in Fig. 7. As expected, we observed only the 10 injected SNPs of sets A1 and B1 detected as genome-wide significant (squares in Fig. 7) and the 10 injected SNPs of sets A2 and B2 below 10− 6 and above 10− 4 (filled triangles in Fig. 7). These results show that our 20 simulated SNPs are correctly detected. After the GS pipeline, the 10 injected GWAS-negative and GS-positive SNPs were all detected as positives by our GS pipeline (open triangles in Fig. 7). Importantly, none of the other SNPs included in the GS procedure (more than 300) were detected as positive; that is, there were no false positives. The Δp values were, on average, 3.8 and 4.4 for sets A2 and B2, respectively, while the interaction p-values were correspondingly 10− 3.6 and 10− 6.2 (see Supplementary Table 1 for details). This analysis shows that the GS pipeline is able to detect true positive SNPs with high sensitivity and specificity.
Comparison against conditional analysis After a classical GWAS, significant SNPs can be used as covariates, performing additional GWAS analyses as an attempt to identify more variants [50]. This procedure is continued until no more variants are identified. We performed conditional analyses to compare the identified SNPs against those from Genotype Subtyping. As a proof of concept, we used the Crohn’s Disease database. We performed conditional analyses using the 185 SNPs identified in the classical GWAS either together or separated, one for each SNP. The result of the pulled conditional analysis is shown in Fig. 8A, while some examples of the conditional analysis on individual SNPs are shown in Fig. 8B. The pulled analysis conditioning all SNPs seems to contain the aggregated identifications of the individual analyses. This is expected because the conditional analyses use a linear combination of all SNPs independently. We then compared the SNPs identified by the pulled conditional analyses (red spots in Fig. 8A) with those identified by GS. From the 57 SNPs identified by GS in the Crohn’s Disease dataset, 24 were also present in the conditional analysis at 5 × 10− 7 or 17 SNPs at 5 × 10− 8 (marked in columns “present in conditional analysis” in Supplementary Table S1). That is, only between 30% and 42% of the results obtained by GS are identified by a conditional analysis. Therefore, GS can be seen as a complementary tool to identify interesting SNPs.
Analyses of synthetic datasets The analysis of synthetic data can be useful because they are generated under specific hypotheses and may overcome the limitation of using a few thousand samples. HAPNEST generated realistic large databases that include genotypes and phenotypes at diverse parameters [51]. The HAPNEST dataset BSST936 (https://www.ebi.ac.uk/biostudies/studies/S-BSST936) was generated under classical association parameters among cases and controls and generated up to 1 M simulated individuals across 6 ancestry groups and up to 6.8 M simulated SNPs. Therefore, it can be used as a blind case study. We used the Phenotype 9 (generated at 0.5 heritability and 0.0001 polygenicity) from simulated AFR ancestry and generated two datasets: the top 100 K individuals and the top 10 K individuals. We assume that true association is given by the GWAS analysis of the 100 K individuals, but we used the 10 K individuals to perform GS and compare the results. Indeed, the GWAS analysis in both datasets shows correlated p-values but a stronger association for the 100 K dataset (Fig. 9A). Note that the 10 K dataset is a subset of the 100 K; thus, the differences are attributed to sub-sampling. We then applied the GS to the 10 K dataset to simulate a common analysis limited by sample size. This analysis is important because it helps to identify false and true positives. For comparisons with other datasets, we first used the same threshold and criteria used for the disease phenotypes. The results of our GS pipeline are shown in Fig. 9B. We identified 342 SNPs (at GS p < 5 × 10− 7 & Δp > 1) from a total of ~ 1.2 M additional tests (adding 17% tests) that represent a positive rate of 0.03% of total tests. We compared the p-values of the identifications obtained by GS with those of the 100 K dataset under the assumption that the identifications in the 100 K are true (Fig. 9C). We observed that 201 of the 317 unique SNPs were actually significant in the 100 K GWAS. This result suggests an estimated 63.4% of true positives and 36.6% of false positives. Nevertheless, given that this dataset contains 6.4 M SNPs, the equivalent Bonferroni correction for the GWAS should be 7.3 × 10− 9 while the correction for the GS should be 6.2 × 10− 9 due to the increase in tests. At these thresholds, the GS would generate 30 positive SNPs (at GS p < 6.2 × 10− 9 & Δp > 1) of which only 2 are false, estimating a false positive rate of 6.6%. Thus, the false discovery rate of GS in this dataset lies in the range of 36.6% and 6.6% depending on the p-value threshold used, which in this case represents more exploration in the first and more precision in the last. Besides the false positive estimations, the analysis of the 10 K result compared with the 100 K also shows that GS is capable of finding genuine SNPs that could be obscured due to sub-sampling.
Discussion
Discussion
The core of the proposed genotype subtyping strategy is to perform sub-GWAS analyses on case-control samples filtered by a specific genotype. The thresholds used here can be adapted for particular datasets. We showed that our proposed strategy yields additional interesting SNPs that were unnoticed in the results of a classical GWAS analysis. The fact that some of these SNPs have already been associated with studies using other datasets from different populations indicates that the other new predicted associations could be genuine. Consequently, other SNPs representing novel associations are supported by the analysis and should be considered biologically plausible, which could be validated by further studies or assays.
One might expect that the power of GS would be diminished due to the smaller sample sizes in each split, potentially resulting in less significant p-values compared to the analysis of the full dataset. Indeed, the simulation analysis (summarized in Fig. 6) shows less significant p-values when the phenotype is random, where we observed decreased p-values for 3 orders of magnitude or more. However, intriguingly, in the real phenotypes, GS has revealed several SNPs that exhibit an increase in significance by orders of magnitude (Figs. 2, 3, 4 and 5). This phenomenon challenges our expectations and leads us to explore three potential hypotheses. First, there may be interactions between the genotype used for splitting and the subsequent associations observed, modifying or enhancing the effect of the associated marker. Second, the split itself may set cases and controls within similar genetic contexts around the splitting SNP, leading to a clearer distinction between groups in other loci and potentially enhancing the detection of further associations. Third, the increase in significance might be random, which is less likely, considering the specific criteria used in the analysis (Dp > = 1) and the consistent results that novel signals include genes already associated with the phenotypes in other studies, and supported by the simulated analysis. Our findings provide evidence supporting the interaction hypothesis for three out of four studied phenotypes (type 2 diabetes, Crohn’s disease, and schizophrenia), suggesting that the presence of interactions between genetic markers and subsequent associations may explain the increased significance observed in GS analyses. This is also the case in recent research conducted on alcohol consumption, where 3 and 5 additional SNPs were found to be associated with homozygous and heterozygous, respectively, of a susceptible SNP [20]. Thus, the interaction hypothesis seems to be important. However, it should be noted that the level and number of interactions may vary across phenotypes. In breast cancer, the interactions were scarce; thus, the second hypothesis seems more reasonable.
GWAS studies often rely on strict p-value thresholds (e.g., 5 × 10− 8) to define genome-wide significance, ensuring robust signals and minimizing false positives [11]. However, such stringent cutoffs may lead to omitting potentially valid associations [12]. To compensate for the decreased sample size, the reduced number of tested SNPs in GS, and more importantly, to favor explorations, our analysis employed less stringent p-value thresholds, in particular for PST. In doing so, we sought to explore a broader range of discoveries while maintaining robustness in our findings by requiring that identified SNPs increase in significance by at least one order of magnitude, which is hard to achieve given the decrease in sample size. Notably, we observed specific SNPs that demonstrated substantial increases in significance, ranging from 2 to 4 orders of magnitude (10− 2 to 10− 4), as indicated in the Supplementary Material (column deltaP). However, if a p-value threshold of 5 × 10− 8 had been applied, we would have identified fewer significant signals in each phenotype, such as 18 signals in BC, 16 in CD, and 1 in SCZ. This decrease underscores the trade-off between stringency and the potential for detecting additional signals. Overall, this approach allowed us to capture a broader range of associations, uncover previously unnoticed SNPs already associated with the respective phenotypes in other studies, and accentuate the value of exploring beyond strict p-value cutoffs in genetic association studies. Nevertheless, with sufficiently large sample sizes or no explorative analyses, using relaxed thresholds may not be advisable. In these cases, we recommend keeping the standard genome-wide level (5 × 10− 8) if the total number of tests per dataset is less than 1 million; otherwise, a more stringent level must be applied to compensate for the multiplicity of tests problem, similar to what we showed in the simulated dataset study.
Our method depends mainly on two p-value thresholds, PST and PSG, used to select SNPs for subtyping and sub-GWAS, respectively. We used here PST =10− 6 to generate a seed SNP for subtyping. Depending on the dataset, this value could be more stringent if many signals are selected. We have analyzed here real datasets of modest associations, but other datasets may show many associations and could increase importantly the number of additional statistical tests that must be kept low, as done with the simulated analysis. Less stringent values of PST are discouraged because an additionally selected SNP may lack biological or statistical justification. Our analyses also used PSG =10− 3 to select SNPs for the sub-GWAS analysis. Less stringent values could be used, but the additional number of statistical tests must be observed and kept low. Additionally, SNPs with low differences between cases and controls are more likely to be random and therefore unlikely to be important in the sub-GWAS. We recommend counting both the number of variants used for sub-typing and the number of variants used for sub-GWAS to estimate and control the total number of additional tests.
A rational concern derived from our proposed GS procedure is the possible inflation of false positive results due to the increased number of statistical tests. However, if the number of additional tests remains low, this should not pose a significant problem. In our experiments, we ought to employ a fraction of additional tests. For example, for T2D, we only added 2.1% of tests relative to the total number of SNPs initially performed in the GWAS (see Table 3). For SCZ, we added 6.7% and 21% for BC. The largest number of tests was performed in Crohn’s Disease, where 57% of tests were added. Nevertheless, in this last case, we added ~ 153 K tests to the ~ 270 K of the GWAS, summing up to 423 K tests. This number is still lower than the “pseudo” limit of 1 M tests used to estimate the Bonferroni correction, resulting in the 5 × 10− 8 threshold known as genome-wide significance [52, 53]. Thus, even with the additional tests, the total tests were kept below 1 M in most databases or added just a tiny fraction of the original GWAS tests. Moreover, the simulation analysis in which we randomized the phenotype did not generate false associations after the GS; that is, no SNPs increased their statistical association even when relaxing the SNP selection. Moreover, in a synthetic dataset, we estimated between 6% and 36% of false positive rate. Consequently, if used prudently, the GS strategy does not seem to represent a possible and important false-positive inflation issue.
In the GS strategy, samples were divided into SNP genotypes (AA, Aa, aa), leading to varying detection levels across the three genotype combinations. For most disease phenotypes, no occurrences were detected in the less frequent genotype (homozygous G0, see Table 1), which is likely linked to the sample size of each genotype subgroup. This uneven distribution suggests that non-random data partition, driven by phenotype-associated variants, may result in differences compared to classic GWAS analysis.
The primary advantage of GS lies in its ability to enhance the significance of specific SNPs (or, as a reviewer called, ‘genotype-specific effects’), thereby revealing previously unnoticed associations within the same dataset, whether they are novel findings or were simply overlooked in the initial analysis. This approach would become even more powerful when applied to data-sharing initiatives that pool large datasets from multiple studies. This collaborative approach increases sample size and statistical power, facilitating the discovery and validation of novel genetic associations. As a result, when used alongside GWAS data and data-sharing initiatives, GS becomes a valuable strategy for uncovering hidden genetic associations and advancing our understanding of complex traits and diseases. For modest-sized datasets, GS may not be feasible since the sub-sample partitions will further decrease the sample size, resulting in less significant results, complicating the interpretation and selection of interesting SNPs. Classical estimations of minimal or effective samples should also apply in each GS partition [54].
In our implementation, the computational procedure performs sub-GWAS using generalized linear models in the R language using specific subsets of samples. For us, this facilitated the analyses. The implementation is available in a GitHub repository (https://github.com/vtrevino/Genotype_Subtyping). However, it can be thought that any other GWAS software that can estimate SNP associations correcting for co-factors could also be used. This could be achieved by encoding the genotypes of all samples into dummy co-factors and performing a run for every seed SNP, considering also the filtering of the SNP. Nevertheless, we did not try this procedure.
We anticipate that GS can be applied in nested analysis, that is, a second (or additional) GS analysis using the SNP identified in the GWAS + GS analysis. Unfortunately, one may require a larger number of individuals at each instance. For example, the application of GS in the UK Biobank, where around 500,000 individuals are available, can lead to more than two nested GS analyses. Perhaps until no more variants are identified, as is commonly done in conditional analyses. Moreover, for some SNPs in our analysis, it was not possible to perform the sub-sampled GWAS because the number of individuals was so scarce. Thus, using datasets of the size of the UK Biobank, the analysis can be effectively applied to the three genotypes of each SNP. In addition, such large datasets can be split into two sets, one for sub-SNP analysis and the other as a validation set.
The GS approach shown here splits samples by a genotype to perform sub-GWAS analyses in each genotype, assuming that the genotype is associated with the studied phenotype. In epidemiology, there is a similar strategy known as Mendelian Randomization (MR), which is used to infer the causal effects of exposures on outcomes by stratifying the analyses based on known genetic variations, assuming random allocation of alleles [55, 56]. Differences in outcome associations can be revealed by splitting the dataset into subsets defined by genetic variants. We would like to emphasize that MR and GS represent two distinct approaches. While both methods rely on the stratification of samples based on genetic markers, they differ in how they define cases and controls and subsequently test for differences in disease outcomes. MR stratifies samples into cases and controls using known genetic variants associated with the outcome of interest [55, 56]. That is, cases become those of genotype 1 while controls become those of genotype 2. In contrast, the genotype subtyping approach shown here keeps cases and controls unchanged but divides them based on the genotype of an associated marker, followed by independent GWAS analyses. Thus, MR and GS represent different strategies and should not be considered competing or useful for similar purposes.
Another method that can be similar to GS is the conditional analysis applied to GWAS [50]. In conditional analysis, a classical GWAS is performed, and then, significant SNPs are used as covariates, performing additional GWAS analyses to identify more variants until no more variants are identified. However, there are differences compared to GS. In GS, the selected SNPs are not used as covariates, and therefore, the phenotype variance is not absorbed by the SNP covariates. Moreover, in GS, the partition of samples into three sub-samples driven by the genotyped SNP allows the estimation of 3 independent β coefficients, one for each partitioning genotype. This can be related to an interaction component as discussed above. Additionally, we propose to use a small subset of potential SNPs to speed up calculations. Still, we compared GS with conditional analysis. The results show that more than 50% of the results observed in GS are not called by the conditional analysis. This experiment highlights the utility of GS and suggests that GS is a complementary tool for GWAS.
There are some limitations to using GS. The most important constraint is that it requires genotype data to properly perform sample subsets and re-estimate the association. This is an important drawback since summary statistics are readily available in most studies and platforms such as the GWAS Catalog. Additionally, it is harder to obtain genome-wide significant results corresponding to a reduction in power because the number of samples is decreased in the genotype stratification. Another drawback is the possibility of a multiplicity of tests that may increase the false positives by performing additional sub-GWAS for each SNP. The additional tests are reported in Table 3 for each dataset, showing that the increased tests are marginal or the total number does not reach a million, which could compromise common genome-wide thresholds. In addition, considering possible inflation, we added criteria to increase the strength of the estimated p-value respective to the original GWAS, which is harder to reach and confirmed by one of the simulations performed, and also proposed to use a subset of SNPs for the GS instead of all of them. We showed that among the predicted associations, many variants have already been associated with other studies. This strong result validates that among the further SNPs identified, there are genuine associations and therefore it is worth exploring. Other studies have performed a similar analysis with similar conclusions [20].
Our study using HAPNEST synthetic data, besides showing estimations of false positives (between 6% and 36% depending on p-value thresholds), showed that many detected SNPs are mainly due to sub-sampling. That is, when using a limited dataset, paradoxically, we were able to detect new SNPs using GS that will be called positive in the GWAS if the dataset is increased 10x. This important result also supports the use of GS in limited datasets.
In conclusion, genotype subtyping is a powerful tool for identifying and validating more associations for a phenotype by reanalyzing GWAS data and offering a means for target gene identification. Using simulations, we showed that genotype subtyping is highly sensitive and specific and shows low false discoveries if used properly. Further research should aim to apply genotype subtyping in larger datasets and diverse ancestries, and to validate the relevance of identified genes and associations through experimental analyses. We are actively working in this direction. This approach has the potential to significantly contribute to our understanding of the genetic basis of complex traits and diseases.
The core of the proposed genotype subtyping strategy is to perform sub-GWAS analyses on case-control samples filtered by a specific genotype. The thresholds used here can be adapted for particular datasets. We showed that our proposed strategy yields additional interesting SNPs that were unnoticed in the results of a classical GWAS analysis. The fact that some of these SNPs have already been associated with studies using other datasets from different populations indicates that the other new predicted associations could be genuine. Consequently, other SNPs representing novel associations are supported by the analysis and should be considered biologically plausible, which could be validated by further studies or assays.
One might expect that the power of GS would be diminished due to the smaller sample sizes in each split, potentially resulting in less significant p-values compared to the analysis of the full dataset. Indeed, the simulation analysis (summarized in Fig. 6) shows less significant p-values when the phenotype is random, where we observed decreased p-values for 3 orders of magnitude or more. However, intriguingly, in the real phenotypes, GS has revealed several SNPs that exhibit an increase in significance by orders of magnitude (Figs. 2, 3, 4 and 5). This phenomenon challenges our expectations and leads us to explore three potential hypotheses. First, there may be interactions between the genotype used for splitting and the subsequent associations observed, modifying or enhancing the effect of the associated marker. Second, the split itself may set cases and controls within similar genetic contexts around the splitting SNP, leading to a clearer distinction between groups in other loci and potentially enhancing the detection of further associations. Third, the increase in significance might be random, which is less likely, considering the specific criteria used in the analysis (Dp > = 1) and the consistent results that novel signals include genes already associated with the phenotypes in other studies, and supported by the simulated analysis. Our findings provide evidence supporting the interaction hypothesis for three out of four studied phenotypes (type 2 diabetes, Crohn’s disease, and schizophrenia), suggesting that the presence of interactions between genetic markers and subsequent associations may explain the increased significance observed in GS analyses. This is also the case in recent research conducted on alcohol consumption, where 3 and 5 additional SNPs were found to be associated with homozygous and heterozygous, respectively, of a susceptible SNP [20]. Thus, the interaction hypothesis seems to be important. However, it should be noted that the level and number of interactions may vary across phenotypes. In breast cancer, the interactions were scarce; thus, the second hypothesis seems more reasonable.
GWAS studies often rely on strict p-value thresholds (e.g., 5 × 10− 8) to define genome-wide significance, ensuring robust signals and minimizing false positives [11]. However, such stringent cutoffs may lead to omitting potentially valid associations [12]. To compensate for the decreased sample size, the reduced number of tested SNPs in GS, and more importantly, to favor explorations, our analysis employed less stringent p-value thresholds, in particular for PST. In doing so, we sought to explore a broader range of discoveries while maintaining robustness in our findings by requiring that identified SNPs increase in significance by at least one order of magnitude, which is hard to achieve given the decrease in sample size. Notably, we observed specific SNPs that demonstrated substantial increases in significance, ranging from 2 to 4 orders of magnitude (10− 2 to 10− 4), as indicated in the Supplementary Material (column deltaP). However, if a p-value threshold of 5 × 10− 8 had been applied, we would have identified fewer significant signals in each phenotype, such as 18 signals in BC, 16 in CD, and 1 in SCZ. This decrease underscores the trade-off between stringency and the potential for detecting additional signals. Overall, this approach allowed us to capture a broader range of associations, uncover previously unnoticed SNPs already associated with the respective phenotypes in other studies, and accentuate the value of exploring beyond strict p-value cutoffs in genetic association studies. Nevertheless, with sufficiently large sample sizes or no explorative analyses, using relaxed thresholds may not be advisable. In these cases, we recommend keeping the standard genome-wide level (5 × 10− 8) if the total number of tests per dataset is less than 1 million; otherwise, a more stringent level must be applied to compensate for the multiplicity of tests problem, similar to what we showed in the simulated dataset study.
Our method depends mainly on two p-value thresholds, PST and PSG, used to select SNPs for subtyping and sub-GWAS, respectively. We used here PST =10− 6 to generate a seed SNP for subtyping. Depending on the dataset, this value could be more stringent if many signals are selected. We have analyzed here real datasets of modest associations, but other datasets may show many associations and could increase importantly the number of additional statistical tests that must be kept low, as done with the simulated analysis. Less stringent values of PST are discouraged because an additionally selected SNP may lack biological or statistical justification. Our analyses also used PSG =10− 3 to select SNPs for the sub-GWAS analysis. Less stringent values could be used, but the additional number of statistical tests must be observed and kept low. Additionally, SNPs with low differences between cases and controls are more likely to be random and therefore unlikely to be important in the sub-GWAS. We recommend counting both the number of variants used for sub-typing and the number of variants used for sub-GWAS to estimate and control the total number of additional tests.
A rational concern derived from our proposed GS procedure is the possible inflation of false positive results due to the increased number of statistical tests. However, if the number of additional tests remains low, this should not pose a significant problem. In our experiments, we ought to employ a fraction of additional tests. For example, for T2D, we only added 2.1% of tests relative to the total number of SNPs initially performed in the GWAS (see Table 3). For SCZ, we added 6.7% and 21% for BC. The largest number of tests was performed in Crohn’s Disease, where 57% of tests were added. Nevertheless, in this last case, we added ~ 153 K tests to the ~ 270 K of the GWAS, summing up to 423 K tests. This number is still lower than the “pseudo” limit of 1 M tests used to estimate the Bonferroni correction, resulting in the 5 × 10− 8 threshold known as genome-wide significance [52, 53]. Thus, even with the additional tests, the total tests were kept below 1 M in most databases or added just a tiny fraction of the original GWAS tests. Moreover, the simulation analysis in which we randomized the phenotype did not generate false associations after the GS; that is, no SNPs increased their statistical association even when relaxing the SNP selection. Moreover, in a synthetic dataset, we estimated between 6% and 36% of false positive rate. Consequently, if used prudently, the GS strategy does not seem to represent a possible and important false-positive inflation issue.
In the GS strategy, samples were divided into SNP genotypes (AA, Aa, aa), leading to varying detection levels across the three genotype combinations. For most disease phenotypes, no occurrences were detected in the less frequent genotype (homozygous G0, see Table 1), which is likely linked to the sample size of each genotype subgroup. This uneven distribution suggests that non-random data partition, driven by phenotype-associated variants, may result in differences compared to classic GWAS analysis.
The primary advantage of GS lies in its ability to enhance the significance of specific SNPs (or, as a reviewer called, ‘genotype-specific effects’), thereby revealing previously unnoticed associations within the same dataset, whether they are novel findings or were simply overlooked in the initial analysis. This approach would become even more powerful when applied to data-sharing initiatives that pool large datasets from multiple studies. This collaborative approach increases sample size and statistical power, facilitating the discovery and validation of novel genetic associations. As a result, when used alongside GWAS data and data-sharing initiatives, GS becomes a valuable strategy for uncovering hidden genetic associations and advancing our understanding of complex traits and diseases. For modest-sized datasets, GS may not be feasible since the sub-sample partitions will further decrease the sample size, resulting in less significant results, complicating the interpretation and selection of interesting SNPs. Classical estimations of minimal or effective samples should also apply in each GS partition [54].
In our implementation, the computational procedure performs sub-GWAS using generalized linear models in the R language using specific subsets of samples. For us, this facilitated the analyses. The implementation is available in a GitHub repository (https://github.com/vtrevino/Genotype_Subtyping). However, it can be thought that any other GWAS software that can estimate SNP associations correcting for co-factors could also be used. This could be achieved by encoding the genotypes of all samples into dummy co-factors and performing a run for every seed SNP, considering also the filtering of the SNP. Nevertheless, we did not try this procedure.
We anticipate that GS can be applied in nested analysis, that is, a second (or additional) GS analysis using the SNP identified in the GWAS + GS analysis. Unfortunately, one may require a larger number of individuals at each instance. For example, the application of GS in the UK Biobank, where around 500,000 individuals are available, can lead to more than two nested GS analyses. Perhaps until no more variants are identified, as is commonly done in conditional analyses. Moreover, for some SNPs in our analysis, it was not possible to perform the sub-sampled GWAS because the number of individuals was so scarce. Thus, using datasets of the size of the UK Biobank, the analysis can be effectively applied to the three genotypes of each SNP. In addition, such large datasets can be split into two sets, one for sub-SNP analysis and the other as a validation set.
The GS approach shown here splits samples by a genotype to perform sub-GWAS analyses in each genotype, assuming that the genotype is associated with the studied phenotype. In epidemiology, there is a similar strategy known as Mendelian Randomization (MR), which is used to infer the causal effects of exposures on outcomes by stratifying the analyses based on known genetic variations, assuming random allocation of alleles [55, 56]. Differences in outcome associations can be revealed by splitting the dataset into subsets defined by genetic variants. We would like to emphasize that MR and GS represent two distinct approaches. While both methods rely on the stratification of samples based on genetic markers, they differ in how they define cases and controls and subsequently test for differences in disease outcomes. MR stratifies samples into cases and controls using known genetic variants associated with the outcome of interest [55, 56]. That is, cases become those of genotype 1 while controls become those of genotype 2. In contrast, the genotype subtyping approach shown here keeps cases and controls unchanged but divides them based on the genotype of an associated marker, followed by independent GWAS analyses. Thus, MR and GS represent different strategies and should not be considered competing or useful for similar purposes.
Another method that can be similar to GS is the conditional analysis applied to GWAS [50]. In conditional analysis, a classical GWAS is performed, and then, significant SNPs are used as covariates, performing additional GWAS analyses to identify more variants until no more variants are identified. However, there are differences compared to GS. In GS, the selected SNPs are not used as covariates, and therefore, the phenotype variance is not absorbed by the SNP covariates. Moreover, in GS, the partition of samples into three sub-samples driven by the genotyped SNP allows the estimation of 3 independent β coefficients, one for each partitioning genotype. This can be related to an interaction component as discussed above. Additionally, we propose to use a small subset of potential SNPs to speed up calculations. Still, we compared GS with conditional analysis. The results show that more than 50% of the results observed in GS are not called by the conditional analysis. This experiment highlights the utility of GS and suggests that GS is a complementary tool for GWAS.
There are some limitations to using GS. The most important constraint is that it requires genotype data to properly perform sample subsets and re-estimate the association. This is an important drawback since summary statistics are readily available in most studies and platforms such as the GWAS Catalog. Additionally, it is harder to obtain genome-wide significant results corresponding to a reduction in power because the number of samples is decreased in the genotype stratification. Another drawback is the possibility of a multiplicity of tests that may increase the false positives by performing additional sub-GWAS for each SNP. The additional tests are reported in Table 3 for each dataset, showing that the increased tests are marginal or the total number does not reach a million, which could compromise common genome-wide thresholds. In addition, considering possible inflation, we added criteria to increase the strength of the estimated p-value respective to the original GWAS, which is harder to reach and confirmed by one of the simulations performed, and also proposed to use a subset of SNPs for the GS instead of all of them. We showed that among the predicted associations, many variants have already been associated with other studies. This strong result validates that among the further SNPs identified, there are genuine associations and therefore it is worth exploring. Other studies have performed a similar analysis with similar conclusions [20].
Our study using HAPNEST synthetic data, besides showing estimations of false positives (between 6% and 36% depending on p-value thresholds), showed that many detected SNPs are mainly due to sub-sampling. That is, when using a limited dataset, paradoxically, we were able to detect new SNPs using GS that will be called positive in the GWAS if the dataset is increased 10x. This important result also supports the use of GS in limited datasets.
In conclusion, genotype subtyping is a powerful tool for identifying and validating more associations for a phenotype by reanalyzing GWAS data and offering a means for target gene identification. Using simulations, we showed that genotype subtyping is highly sensitive and specific and shows low false discoveries if used properly. Further research should aim to apply genotype subtyping in larger datasets and diverse ancestries, and to validate the relevance of identified genes and associations through experimental analyses. We are actively working in this direction. This approach has the potential to significantly contribute to our understanding of the genetic basis of complex traits and diseases.
Methods
Methods
Databases and phenotypes
Four databases were selected to include phenotypes of different genetic backgrounds. Crohn’s Disease (CD), Schizophrenia (SCZ), Breast Cancer (BC), and Type 2 diabetes (T2D). The corresponding accession numbers are EGAD00010001157, EGAD00010001158, and EGAC00001000205 for CD, phs000473 for SCZ, phs001265 for BC, and phs000888 for T2D. Table 3 shows the number of samples and the provided variables for each phenotype. Permission to access the data was granted by EGA (European Genome-Phenome Archive) and NIH Commons/eRA (dbGap). CD data for this manuscript were accessed on 01/07/2023, while SCZ, BC, and T2D were accessed on 28/03/2023.
Data pre-processing
Quality control for samples and SNPs was performed according to the standard criteria for GWAS [57]. For the samples, the criteria were to exclude the samples with a missing rate > 5%, deviate ± 3 SD from the samples’ heterozygosity rate mean, related individuals determined by Identity by Descent (IBD) estimations, and ethnic outliers when the database had different ancestries. Markers were filtered if they had missing values > 10%, deviated from the Hardy–Weinberg equilibrium (HWE) in the controls (p < e− 5), and had minor allelic frequencies (MAFs) < 1%. PLINK v1.90b7 was used in most processes [58].
GWAS analysis
The top 10 principal components (PCs) were calculated to correct for population structure. The PCs were obtained using common independent variants not located within LD regions. A frequentist additive test model for each phenotype was fitted using snptest v2 [59], correcting for the 10 PCs, sex, and age, depending on the phenotype assessed. The genomic inflation factor was estimated for all the GWAS, estimating the amount of inflation by comparing the test statistics observed in the analysis to those expected under a null hypothesis [60]. The genomic control divides the inflated test statistics by the genomic inflation factor (). Genomic controls were applied to GWAS results if the inflation factor was above 1.1. Table 3 summarizes the number of samples and markers used for the GWAS analysis after filtering. Snptest v2 was also used for conditional analysis using additive models.
Genotype subtyping
The scheme for genotype subtyping methodology is shown in Fig. 1. We sampled the GWAS data by splitting the datasets into three sets, G0, G1, and G2, for each genotype (AA, Aa, and aa) for each independent marker with a p-value < 10 − 6 in the GWAS (Table 2). Then, a new subtyping GWAS (sub-GWAS) for a preselected set of variants with a p-value < 10 − 3 (Table 2) was performed in each split. A frequentist additive test model for each phenotype was fitted using the R language [61], correcting for the 10 PCs from the initial GWAS, sex, and age, depending on the phenotype assessed. A logistic regression model, utilizing the generalized linear model glm() function, was fit for each subtyping marker. Only significant GS results (PGS < 5 × 10− 7 and Dp > 1) were tested for interactions (epistasis) by adding an interaction term (see Supplementary Table 1), in which case the Wald statistical test using t-distributions comparing the fitted coefficient against zero is used as proof of interaction as estimated by the glm() function in R. The main code and an example performed on Crohn’s disease data are shown in the GitHub repository https://github.com/vtrevino/Genotype_Subtyping.
Variant annotations
To identify possible novel signals, we used annotations for the nearest gene using biomaRt [62], OpenTargets genetics [24], and GWAS Catalog databases [2] for each specific disease phenotype. Briefly, variants were annotated using biomaRt [62] and were assigned to the closest protein-coding gene. Variants with a distance higher than 500Kb [63] from the subtyping marker were considered “novel” signals compared with the standard GWAS. Results with a change of the p-value higher than one order and a subtyping GWAS p-value < 5 × 10− 7 were considered genotype subtyping hits. These were evaluated in PubMed articles, assessing their relationship with the disease analyzed. Also, the GWAS catalog [2] and OpenTargets [22] were used to identify if the variants identified were considered validation or discovery hits.
Randomized simulation
For the first “negative” simulation, we performed the same pipeline as shown in Fig. 1 using the Crohn’s disease data but randomizing the sample labeling. That is, we assigned randomly the labels HC and CD to the samples. The pipeline was kept almost the same. In brief, we first changed the phenotype labels, then performed the GWAS analysis, selected the SNPs for subtyping (at 10− 4 because there was only 1 at 10− 6), then selected the SNPs for sub-GWAS (at 10− 3), and finally performed the sub-GWAS. For the second “positive” simulation, we added 20 SNPs to the previous simulation corresponding to 10 GWAS-positive SNPs and 10 GWAS-negative but GS-positive (referred to as sub-SNPs), and these are dependent on the genotype of the first 10 SNPs. The first 10 SNPs corresponded to GWAS-positives using minor allele frequencies of 0.45 and 0.145 (q parameter in the Hardy-Weinberg equation). Five SNPs were generated for each value of q. For the q = 0.45, the actual SNPs were generated using q = 0.43 for the 4508 subjects corresponding to the CD cases and using q = 0.47 for the 9194 subjects corresponding to CD controls. Similarly, for the q = 0.145, the actual SNPs were generated using q = 0.13 and q = 0.16 for cases and controls, respectively. For the 10 sub-SNPs, we adjusted the q values according to the genotypes of the simulated GWAS-positive SNPs. For the cases at q = 0.45, we actually used q={0.415, 0.445, 0.445} for the genotypes AA = 0, AB = 1, and BB = 2 respectively. For the controls at q = 0.45, we used q={0.485, 0.455, 0.455} correspondingly. Likewise, for the cases and controls at q = 0.145, we used q={0.12, 0.16, 0.18} and q={0.151, 0.14, 0.16} respectively. These SNPs were computationally added to the files, and finally, the same pipeline was run. The code for adding these positive SNPs is shown in the GitHub repository.
Databases and phenotypes
Four databases were selected to include phenotypes of different genetic backgrounds. Crohn’s Disease (CD), Schizophrenia (SCZ), Breast Cancer (BC), and Type 2 diabetes (T2D). The corresponding accession numbers are EGAD00010001157, EGAD00010001158, and EGAC00001000205 for CD, phs000473 for SCZ, phs001265 for BC, and phs000888 for T2D. Table 3 shows the number of samples and the provided variables for each phenotype. Permission to access the data was granted by EGA (European Genome-Phenome Archive) and NIH Commons/eRA (dbGap). CD data for this manuscript were accessed on 01/07/2023, while SCZ, BC, and T2D were accessed on 28/03/2023.
Data pre-processing
Quality control for samples and SNPs was performed according to the standard criteria for GWAS [57]. For the samples, the criteria were to exclude the samples with a missing rate > 5%, deviate ± 3 SD from the samples’ heterozygosity rate mean, related individuals determined by Identity by Descent (IBD) estimations, and ethnic outliers when the database had different ancestries. Markers were filtered if they had missing values > 10%, deviated from the Hardy–Weinberg equilibrium (HWE) in the controls (p < e− 5), and had minor allelic frequencies (MAFs) < 1%. PLINK v1.90b7 was used in most processes [58].
GWAS analysis
The top 10 principal components (PCs) were calculated to correct for population structure. The PCs were obtained using common independent variants not located within LD regions. A frequentist additive test model for each phenotype was fitted using snptest v2 [59], correcting for the 10 PCs, sex, and age, depending on the phenotype assessed. The genomic inflation factor was estimated for all the GWAS, estimating the amount of inflation by comparing the test statistics observed in the analysis to those expected under a null hypothesis [60]. The genomic control divides the inflated test statistics by the genomic inflation factor (). Genomic controls were applied to GWAS results if the inflation factor was above 1.1. Table 3 summarizes the number of samples and markers used for the GWAS analysis after filtering. Snptest v2 was also used for conditional analysis using additive models.
Genotype subtyping
The scheme for genotype subtyping methodology is shown in Fig. 1. We sampled the GWAS data by splitting the datasets into three sets, G0, G1, and G2, for each genotype (AA, Aa, and aa) for each independent marker with a p-value < 10 − 6 in the GWAS (Table 2). Then, a new subtyping GWAS (sub-GWAS) for a preselected set of variants with a p-value < 10 − 3 (Table 2) was performed in each split. A frequentist additive test model for each phenotype was fitted using the R language [61], correcting for the 10 PCs from the initial GWAS, sex, and age, depending on the phenotype assessed. A logistic regression model, utilizing the generalized linear model glm() function, was fit for each subtyping marker. Only significant GS results (PGS < 5 × 10− 7 and Dp > 1) were tested for interactions (epistasis) by adding an interaction term (see Supplementary Table 1), in which case the Wald statistical test using t-distributions comparing the fitted coefficient against zero is used as proof of interaction as estimated by the glm() function in R. The main code and an example performed on Crohn’s disease data are shown in the GitHub repository https://github.com/vtrevino/Genotype_Subtyping.
Variant annotations
To identify possible novel signals, we used annotations for the nearest gene using biomaRt [62], OpenTargets genetics [24], and GWAS Catalog databases [2] for each specific disease phenotype. Briefly, variants were annotated using biomaRt [62] and were assigned to the closest protein-coding gene. Variants with a distance higher than 500Kb [63] from the subtyping marker were considered “novel” signals compared with the standard GWAS. Results with a change of the p-value higher than one order and a subtyping GWAS p-value < 5 × 10− 7 were considered genotype subtyping hits. These were evaluated in PubMed articles, assessing their relationship with the disease analyzed. Also, the GWAS catalog [2] and OpenTargets [22] were used to identify if the variants identified were considered validation or discovery hits.
Randomized simulation
For the first “negative” simulation, we performed the same pipeline as shown in Fig. 1 using the Crohn’s disease data but randomizing the sample labeling. That is, we assigned randomly the labels HC and CD to the samples. The pipeline was kept almost the same. In brief, we first changed the phenotype labels, then performed the GWAS analysis, selected the SNPs for subtyping (at 10− 4 because there was only 1 at 10− 6), then selected the SNPs for sub-GWAS (at 10− 3), and finally performed the sub-GWAS. For the second “positive” simulation, we added 20 SNPs to the previous simulation corresponding to 10 GWAS-positive SNPs and 10 GWAS-negative but GS-positive (referred to as sub-SNPs), and these are dependent on the genotype of the first 10 SNPs. The first 10 SNPs corresponded to GWAS-positives using minor allele frequencies of 0.45 and 0.145 (q parameter in the Hardy-Weinberg equation). Five SNPs were generated for each value of q. For the q = 0.45, the actual SNPs were generated using q = 0.43 for the 4508 subjects corresponding to the CD cases and using q = 0.47 for the 9194 subjects corresponding to CD controls. Similarly, for the q = 0.145, the actual SNPs were generated using q = 0.13 and q = 0.16 for cases and controls, respectively. For the 10 sub-SNPs, we adjusted the q values according to the genotypes of the simulated GWAS-positive SNPs. For the cases at q = 0.45, we actually used q={0.415, 0.445, 0.445} for the genotypes AA = 0, AB = 1, and BB = 2 respectively. For the controls at q = 0.45, we used q={0.485, 0.455, 0.455} correspondingly. Likewise, for the cases and controls at q = 0.145, we used q={0.12, 0.16, 0.18} and q={0.151, 0.14, 0.16} respectively. These SNPs were computationally added to the files, and finally, the same pipeline was run. The code for adding these positive SNPs is shown in the GitHub repository.
Supplementary Information
Supplementary Information
Below is the link to the electronic supplementary material.
Below is the link to the electronic supplementary material.
출처: PubMed Central (JATS). 라이선스는 원 publisher 정책을 따릅니다 — 인용 시 원문을 표기해 주세요.