Figure 1.
(A) LMEMs and suffix array queries. Formally, MEMs (Maximal Exact Matches) are matches between sequences that cannot be extended in either direction without encountering mismatches. SMEMs (Super Maximal Exact Matches) are a subset of MEMs that are not entirely contained within any other MEM. Similar to SMEMs, LMEMs (Longest Maximal Exact Matches) are not contained within any other MEM but additionally LMEMs do not overlap with other MEMs. When two SMEMs overlap, the overlapping region is assigned to the longer MEM, forming the LMEMs (black arrow). (B) This example demonstrates querying a reference walk, $w_1^r = [ {1,2,5} ]$ (orange) in a suffix array of two query walks, $w_1^q = [ {1,2,5,6} ]$ (purple) and $w_2^q = [ {3,2,5} ]$ (pink), where $C$ is the concatenation of both walks separated by a negative identifier (−1). The Longest Common Prefix array (LCP) stores the length of the shared prefix between two adjacent suffixes in the SA. The inverse SA (ISA) tell us, for each suffix in $C$, where it is located in the suffix array. For example, the first suffix (C, index = 1) is at index 2 in the suffix array according to the ISA, as $isa[ 1 ] = 2$. A binary search of the first suffix of $w_1^r$, that is [1,2,5] in $C$ locates a prefix match at index 2 with $w_1^q$, being [1,2,5] (insert point arrow). Matching the prefix [1,2,5] we know that all sub-suffixes in the prefix also exist as matches between C and $w_1^r$, that is [2,5] and [5] (not shown). Using a combination of the LCP and ISA we can directly identify other prefix matches between any query and suffixes of $w_1^r$. Following the ISA, according to ISA[SA[2]+1]=ISA[1 + 1]=ISA[2]=4, we first locate another prefix match [2,5] with $w_1^q$ at index 4. Using the LCP (red arrow) we also locate a match between $w_1^r$ and $w_2^q$. This is described in detail in the section ‘Reference Alignment’ and Algorithm 2.

(A) LMEMs and suffix array queries. Formally, MEMs (Maximal Exact Matches) are matches between sequences that cannot be extended in either direction without encountering mismatches. SMEMs (Super Maximal Exact Matches) are a subset of MEMs that are not entirely contained within any other MEM. Similar to SMEMs, LMEMs (Longest Maximal Exact Matches) are not contained within any other MEM but additionally LMEMs do not overlap with other MEMs. When two SMEMs overlap, the overlapping region is assigned to the longer MEM, forming the LMEMs (black arrow). (B) This example demonstrates querying a reference walk, |$w_1^r = [ {1,2,5} ]$| (orange) in a suffix array of two query walks, |$w_1^q = [ {1,2,5,6} ]$| (purple) and |$w_2^q = [ {3,2,5} ]$| (pink), where |$C$| is the concatenation of both walks separated by a negative identifier (−1). The Longest Common Prefix array (LCP) stores the length of the shared prefix between two adjacent suffixes in the SA. The inverse SA (ISA) tell us, for each suffix in |$C$|⁠, where it is located in the suffix array. For example, the first suffix (C, index = 1) is at index 2 in the suffix array according to the ISA, as |$isa[ 1 ] = 2$|⁠. A binary search of the first suffix of |$w_1^r$|⁠, that is [1,2,5] in |$C$| locates a prefix match at index 2 with |$w_1^q$|⁠, being [1,2,5] (insert point arrow). Matching the prefix [1,2,5] we know that all sub-suffixes in the prefix also exist as matches between C and |$w_1^r$|⁠, that is [2,5] and [5] (not shown). Using a combination of the LCP and ISA we can directly identify other prefix matches between any query and suffixes of |$w_1^r$|⁠. Following the ISA, according to ISA[SA[2]+1]=ISA[1 + 1]=ISA[2]=4, we first locate another prefix match [2,5] with |$w_1^q$| at index 4. Using the LCP (red arrow) we also locate a match between |$w_1^r$| and |$w_2^q$|⁠. This is described in detail in the section ‘Reference Alignment’ and Algorithm 2.

Close
This Feature Is Available To Subscribers Only

Sign In or Create an Account

Close

This PDF is available to Subscribers Only

View Article Abstract & Purchase Options

For full access to this pdf, sign in to an existing account, or purchase an annual subscription.

Close