Abstract

Motivation

Genome-Wide Association Studies (GWAS) are widely used to investigate the role of genetics in disease traits, but the resulting files are large, posing barriers to efficient storage, sharing, and querying. This issue is especially important for biobanks like the UK Biobank that publish GWAS for thousands of traits, increasing the volume of data that must be effectively managed. Current compression and query methods reduce file sizes and allow quick queries by genomic position, but they offer no way to quickly find loci based on their summary statistics. For example, finding all SNVs in a particular p-value range would require decompressing and scanning the whole file. We propose a new tool, STABIX, which introduces summary-statistic-based queries and improves upon the standard bgzip compression and Tabix query tool in both compression ratio and decompression speed.
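To make the cost of the p-value example concrete, the following minimal Python sketch shows the full-scan baseline: every row of a bgzip-compressed, tab-delimited file is decompressed and inspected, whatever the size of the result. The function name `scan_pvalue_range` and the column name `pval` are hypothetical and chosen for illustration only; this is not part of STABIX or Tabix, and real GWAS files may name their columns differently.

```python
import csv
import gzip


def scan_pvalue_range(path, pval_column, low, high):
    """Return rows whose p-value lies in [low, high] by scanning the whole file.

    Assumes a tab-delimited file with a header row. `pval_column` is whatever
    the file calls its p-value column (hypothetical here).
    """
    hits = []
    # bgzip output is a series of concatenated gzip members, so the standard
    # gzip module can stream it, but only from the start of the file.
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            try:
                pval = float(row[pval_column])
            except (ValueError, TypeError):
                continue  # skip missing or non-numeric p-values such as "NA"
            if low <= pval <= high:
                hits.append(row)
    return hits
```

A call such as `scan_pvalue_range("trait.tsv.bgz", "pval", 0.0, 5e-8)` (the file name is a placeholder) takes time proportional to the size of the whole file, because a position-based index cannot narrow the search when the query is on a statistic rather than a coordinate.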

Results

When applied to ten GWAS files from PanUKBB, STABIX produced smaller compressed data and indices than Tabix for every file: the bgzip and tbi files were, on average, 1.2 times the size of the STABIX compressed files and indices. On the same ten files, STABIX per-gene decompression was, on average, 7x faster than Tabix per-gene decompression, and it was faster for over 99% of nearly 20,000 genes.

Availability

The software is freely available for download on GitHub: https://github.com/kristen-schneider/stabix/.

Supplementary information

Supplementary data are available at Bioinformatics online.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Associate Editor: Macha Nikolski