Large Impact of Genetic Data Processing Steps on Stability and Reproducibility of Set-Based Analyses in Genome-Wide Association Studies.

Kui N; Yu Y; Choi J; McCaw ZR; Li X; Huff C; Sun R

doi:10.1093/genetics/iyag079

Genetics 2026

Cite this paper

APA Kui N, Yu Y, et al. (2026). Large Impact of Genetic Data Processing Steps on Stability and Reproducibility of Set-Based Analyses in Genome-Wide Association Studies. Genetics. https://doi.org/10.1093/genetics/iyag079

MLA Kui N, et al. "Large Impact of Genetic Data Processing Steps on Stability and Reproducibility of Set-Based Analyses in Genome-Wide Association Studies." Genetics, 2026.

PMID 41870342

DOI 10.1093/genetics/iyag079

Abstract

Genome-wide association studies (GWAS) are a foundational tool in human genetics research; however, challenges in the stability and reproducibility of GWAS results are often noted. The main goals of this work are to describe, analyze, and provide tools for solving such reproducibility challenges in a popular component of the GWAS literature: set-based (a) hypothesis testing and (b) effect size estimation studies. Common forms of (a) include rare variant or gene-based association studies, while (b) frequently occurs in polygenic score construction and fine-mapping studies. Specifically, we focus on how the set-based nature of (a) and (b) often fuels non-reproducible results due to seemingly innocuous differences in data processing pipelines that are rarely discussed. Such obstacles present enormous challenges for the robustness and reliability of GWAS findings. First, we describe the processing challenges both qualitatively and quantitatively, casting the statistical models in a model misspecification framework. Second, we analytically calculate the differences in power and amounts of bias that can arise in (a) and (b), respectively, due to small, relatively under-appreciated choices in data cleaning. Third, we provide tools for quantifying and avoiding the data quality obstacles in GWAS. We validate our analytical calculations through a simulation study, and we demonstrate the aforementioned challenges empirically through analysis of a pancreatic cancer dataset. In our analysis, we demonstrate that top associations, such as between pancreatic cancer and ATM, can be entirely lost due to small differences in data preparation, underscoring the need to make data processing choices clear and explicit.
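To make the set-based sensitivity described in the abstract concrete, the following is a minimal toy sketch, not the authors' method or their tools. It assumes a simple burden-style statistic (a two-sample z-score on per-person allele counts summed over a variant set) and a hypothetical minor allele frequency (MAF) filter; the function names, cutoffs, and hand-built genotypes are all illustrative assumptions. It shows how a small change in one filtering threshold changes which variants enter the set and therefore the set-level test statistic.

```python
import math


def minor_allele_freq(variant):
    """MAF for one variant, given 0/1/2 allele counts across individuals."""
    freq = sum(variant) / (2 * len(variant))
    return min(freq, 1 - freq)


def select_variants(genotypes, maf_cutoff):
    """Hypothetical filter: keep polymorphic variants with MAF below the cutoff."""
    return [j for j, v in enumerate(genotypes)
            if 0 < minor_allele_freq(v) < maf_cutoff]


def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)


def burden_z(genotypes, phenotype, kept):
    """Toy burden statistic: z-score for the case/control difference in mean burden."""
    n = len(phenotype)
    burden = [sum(genotypes[j][i] for j in kept) for i in range(n)]
    cases = [b for b, y in zip(burden, phenotype) if y == 1]
    controls = [b for b, y in zip(burden, phenotype) if y == 0]
    diff = sum(cases) / len(cases) - sum(controls) / len(controls)
    se = math.sqrt(sample_variance(cases) / len(cases)
                   + sample_variance(controls) / len(controls))
    return diff / se if se > 0 else 0.0


# Hand-built toy data (an assumption for illustration): 10 cases, then 10 controls.
phenotype = [1] * 10 + [0] * 10

# genotypes[j][i] is the allele count of variant j in individual i.
variant_0 = [0] * 20  # carried only by cases 0, 1, 2 (MAF = 3/40)
for i in (0, 1, 2):
    variant_0[i] = 1
variant_1 = [0] * 20  # carried by case 3 and controls 10-13 (MAF = 5/40)
for i in (3, 10, 11, 12, 13):
    variant_1[i] = 1
genotypes = [variant_0, variant_1]

# Two pipelines that differ only in the MAF cutoff applied before testing.
kept_strict = select_variants(genotypes, maf_cutoff=0.10)
kept_loose = select_variants(genotypes, maf_cutoff=0.15)

z_strict = burden_z(genotypes, phenotype, kept_strict)
z_loose = burden_z(genotypes, phenotype, kept_loose)

print("strict filter keeps", kept_strict, "z =", round(z_strict, 3))
print("loose filter keeps", kept_loose, "z =", round(z_loose, 3))
```

In this contrived example the stricter cutoff keeps only the variant carried by cases, giving a z-score of about 1.96, while the looser cutoff admits a second variant that equalizes the mean burden in cases and controls and drives the statistic to zero. Real pipelines involve many more steps (call-rate filters, annotation choices, sample QC), so this should be read only as an illustration of the mechanism, not as a quantitative claim about the paper's results.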