Stochastic EM algorithm for partially observed stochastic epidemics with individual heterogeneity

Table 1

Explanation of notation.

Notation	Explanation
N	Total population size (assumed fixed)
$n_{I_{s}}, n_{I_{a}}, n_{E}, n_{I}, n_{R}$	Total number of I_s, I_a, exposed (E), infectious (I) and recovered (R) cases
$I_{i}^{a} (t), I_{i}^{s} (t)$	Total number of I_a and I_s neighbors of i at time t
$I^{a} (t), I^{s} (t), E (t)$	Total number of status I_a, I_s, and E individuals in the population at time t
$t_{i}^{(E)}, t_{i}^{(I)}$	Exposure time and manifestation time for individual i (set to T if never exposed/manifested)
$C_{ABk}, D_{ABk}$	Total number of link activation and termination events among type $A \sim B$ pairs in phase $T_{k}$
$M_{A B}^{c} (t), M_{A B}^{d} (t)$	Number of connected and disconnected type $A \sim B$ pairs at time t

Since the generative model is a CTMC composed of individual-level Poisson processes, the above likelihood can be decomposed into epidemic-related components (1st and 3rd lines above) and network-related components (2nd and 4th lines). Evaluating this seemingly lengthy likelihood function involves either bookkeeping of population-level quantities (such as $n_{E}$, the total number of exposed cases) or parallelizable computation of individual-level quantities (such as $I_{i}^{s} (t)$, the number of I_s neighbors of i at time t).

When complete data are available, we can obtain closed-form maximum likelihood estimates (MLEs) for most of the parameters, and find the remaining MLEs for parameters $β$, $η$ and $b_{S}$ through simple numerical procedures, which can be implemented by fitting conditional Poisson regression models (see Supplementary Section 2 for full derivations). This suggests that likelihood-based inference given completely observed data is easy to implement and can be reused as a module for inference in the missing data setting.

3.2 Inference with partial observations

We now discuss our inferential framework for partial observations, the Data-Augmented Network Contagion EM (DANCE) algorithm. Since real-world epidemic data rarely include measurements of the full event sequence, our goal is to utilize the simplicity of complete data inference (described above) through data augmentation, based on the stochastic EM approach (Celeux 1985).

The EM algorithm offers an approach to efficiently carry out maximum likelihood estimation for continuous-time Markov chain models in missing data settings (Doss et al. 2013; Xu et al. 2015; Guttorp 2018). Imputing the missing data in the E-step requires access to the conditional expectation of the complete-data log-likelihood, and stochastic EM (sEM) is a variant that approximates this conditional expectation using augmented data obtained via conditional simulation. To be more precise, let X denote the observed data and Z the missing data; a general outline of sEM for estimating parameter θ is as follows. For $s = 1, \ldots, \text{maxIter}$, do:

(E-step) draw one sample of the missing data, $Z^{(s)}$, from its conditional distribution $p (Z | X, θ^{(s - 1)})$, and then let
$Q (θ | θ^{(s - 1)}) = \log L (θ; X, Z^{(s)});$
(M-step) maximize the target function $Q (θ | θ^{(s - 1)})$ with respect to θ to obtain the update:
$θ^{(s)} = \arg \max_{θ} Q (θ | θ^{(s - 1)}) .$
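To make the outline concrete, the following is a minimal sketch of sEM on a toy problem that is deliberately not the epidemic model: exponential waiting times with right-censoring, where the censored values play the role of Z. The function name `stochastic_em_censored_exp`, the starting value, and the choice to average iterates after a burn-in are illustrative assumptions rather than part of the method described here.

```python
import numpy as np

def stochastic_em_censored_exp(times, censored, max_iter=200, burn_in=100, seed=0):
    """Toy sEM: estimate an Exponential rate from right-censored data.

    times    : observed values (event time, or censoring time if censored)
    censored : boolean array, True where only a lower bound was observed
    """
    rng = np.random.default_rng(seed)
    times = np.asarray(times, dtype=float)
    censored = np.asarray(censored, dtype=bool)
    theta = 1.0 / times.mean()              # crude starting value (assumption)
    trace = np.empty(max_iter)
    for s in range(max_iter):
        # E-step (stochastic): draw ONE completed dataset Z^(s) ~ p(Z | X, theta^(s-1)).
        # By memorylessness, a time censored at c completes to c + Exp(theta).
        completed = times.copy()
        completed[censored] += rng.exponential(1.0 / theta, size=int(censored.sum()))
        # M-step: complete-data MLE of the Exponential rate.
        theta = completed.size / completed.sum()
        trace[s] = theta
    # sEM iterates fluctuate around the MLE, so report a post-burn-in average.
    return trace[burn_in:].mean(), trace
```

In this toy case the observed-data MLE is available in closed form (number of uncensored events divided by total observed time), which gives a simple check that the averaged sEM iterates settle near it.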

There are two advantages of this approach. First, in the E-step, integrating to obtain an expected log-likelihood (as in the traditional EM algorithm) is replaced by sampling, which avoids the often intractable marginalization step in the case of complex models (Renshaw 2015; Xu and Minin 2015; Stutz et al. 2022). Second, the M-step simply requires solving for the MLEs given a version of the complete data, which is often straightforward, as discussed previously for the present setting.

These advantages come at the cost of a potential challenge: we have to conditionally sample the missing data given our observed data and current parameter estimates. In our framework, this is equivalent to sampling event times of a continuous-time Markov chain conditioned on end-points, a notably difficult problem (Hobolth and Stone 2009; Rao and Teh 2013).

In the case of our motivating study, eX-FLU, true exposure times are not available even though the data contain daily symptom reports, due to the incubation period. Exact recovery times are not available either, with recoveries discernible only at a weekly resolution from epidemic surveys. Therefore, we need to consider inference with partially observed epidemic data, in particular with exposure times and recovery times unknown. Data augmentation under sEM thus involves conditionally simulating these missing event times, while preserving consistency between epidemic events and the dynamic contact network.

Let $t^{(E)}$ and $t^{(R)}$ denote all missing exposure times and recovery times, respectively. We assume that (i) all manifestation times $\{t_{i}^{(I)}\}$ are observed, available through daily symptom monitoring or routine testing, and (ii) the contact network events are fully observed through high-resolution contact tracing. Our DANCE framework for partial observations is then outlined as follows. For $s = 1, \ldots, \text{maxIter}$, do:

1. sample missing exposure times ${t^{(E)}}^{(s)}$ from their joint conditional distribution $p (t^{(E)} \mid \text{observed events}, {t^{(R)}}^{(s - 1)}, Θ^{(s - 1)})$;
2. sample missing recovery times ${t^{(R)}}^{(s)}$ from their joint conditional distribution $p (t^{(R)} \mid \text{observed events}, {t^{(E)}}^{(s)}, Θ^{(s - 1)})$;
3. form an augmented dataset by combining the event times sampled in Steps 1 and 2 with the observed data, then solve for the complete data MLEs to obtain updated parameter estimates $Θ^{(s)}$.

Since Step 3 is already addressed in the previous section, we derive conditional sampling algorithms for Steps 1 and 2. One essential consideration is that the conditional samplers must respect the dynamic contact network constraints while leveraging dynamic contact information.

Step 1: conditional sampling of missing exposure times. We derive a rejection sampler for all individual exposure times, conditional on parameter values and recovery times. In fact, it is sufficient to separately sample exposure time $t_{i}^{(E)}$ for each individual i who has ever become infectious. This is because each person i’s exposure time is independent from other individuals’ exposure times conditional on all other event times, as implied by the form of the complete data likelihood in (3.5).

We consider sampling the missing exposure time $t_{i}^{(E)}$ within a plausible interval, $L_{i} = (t_{\min}^{i}, t_{\max}^{i})$, possibly informed by prior knowledge or computational capacity. For example, we may set $t_{\min}^{i} = \max (0, t_{i}^{(I)} - 14)$ and $t_{\max}^{i} = \max (0, t_{i}^{(I)} - 2)$ if we believe the incubation period should be longer than 2 days but shorter than 2 weeks. Here, to be permissive, we consider $L_{i} = (0, t_{i}^{(I)})$, meaning exposure could occur any time before the start of infectiousness.

The target we wish to sample from is the conditional density for i’s exposure time, which can be written as

$$p_{i} (t \mid t_{i}^{(I)}, β, δ_{i}, η, φ, \text{network events}) = \frac{λ_{i} (t) \exp \left( - \int_{t_{\min}^{i}}^{t} λ_{i} (u) \, d u \right) \times φ \exp \left( - φ (t_{i}^{(I)} - t) \right) I (t_{\min}^{i} < t < t_{\max}^{i})}{C_{i} (λ_{i} (t), φ; t_{\min}^{i}, t_{\max}^{i})} . \tag{3.6}$$

Here $λ_{i} (t)$ is i’s time-varying total exposure risk, which is a step function whose change points are fully determined once all other event times are known (see Supplementary Section 3 for full details). The normalizing constant $C_{i} (λ_{i} (t), φ; t_{\min}^{i}, t_{\max}^{i})$ can be evaluated explicitly because $λ_{i} (t)$ is a step function.

Consider the density

$$q_{i} (t) = \frac{λ_{i} (t) \exp \left( - \int_{0}^{t} λ_{i} (u) \, d u \right) I (0 < t < t_{i}^{(I)})}{1 - \exp \left( - \int_{0}^{t_{i}^{(I)}} λ_{i} (u) \, d u \right)}, \tag{3.7}$$

which is the density function of a truncated inhomogeneous Exponential distribution with rate $λ_{i} (t)$. It is straightforward to show that $p_{i} (t) / q_{i} (t) \leq M$ for a constant M > 1 (Supplementary Section 3), which suggests we can use $q_{i} (t)$ as a proposal in a rejection sampling scheme for the conditional density $p_{i} (t)$.

Therefore, we have the following rejection sampler for $t_{i}^{(E)}$ that runs in two steps:

1. Sample t from $q_{i} (t)$, an inhomogeneous Exponential with step-constant rate $λ_{i} (t)$ truncated to $L_{i}$ (sampling details are included in Supplementary Section 3).
2. Compute the acceptance probability for t (here M > 1 is a constant; see Supplementary Section 3),
$$\frac{p_{i} (t)}{M q_{i} (t)} = \exp (- φ (t_{i}^{(I)} - t)), \tag{3.8}$$
and draw $U \sim \text{Unif} (0, 1)$; accept t as a sample of $t_{i}^{(E)}$ if $U < \exp (- φ (t_{i}^{(I)} - t))$, and otherwise go back to Step 1 and repeat.
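The acceptance probability in (3.8) can be checked directly from (3.6) and (3.7). The short derivation below is our own sketch under the choice $L_{i} = (0, t_{i}^{(I)})$, writing $Λ_{i} (t) = \int_{0}^{t} λ_{i} (u) \, d u$ (a shorthand introduced here) and assuming $λ_{i}$ is not identically zero on $L_{i}$; the authoritative version is the one in Supplementary Section 3.

```latex
\begin{align*}
\frac{p_{i}(t)}{q_{i}(t)}
  &= \frac{\varphi \left(1 - e^{-\Lambda_{i}(t_{i}^{(I)})}\right)}{C_{i}}
     \, e^{-\varphi (t_{i}^{(I)} - t)}
  \;\le\; \frac{\varphi \left(1 - e^{-\Lambda_{i}(t_{i}^{(I)})}\right)}{C_{i}} =: M,
  \qquad 0 < t < t_{i}^{(I)}, \\
C_{i}
  &= \int_{0}^{t_{i}^{(I)}} \lambda_{i}(u)\, e^{-\Lambda_{i}(u)}\,
     \varphi\, e^{-\varphi (t_{i}^{(I)} - u)} \, du
  \;<\; \varphi \int_{0}^{t_{i}^{(I)}} \lambda_{i}(u)\, e^{-\Lambda_{i}(u)} \, du
  \;=\; \varphi \left(1 - e^{-\Lambda_{i}(t_{i}^{(I)})}\right).
\end{align*}
```

The second line gives M > 1, and dividing the first line by M leaves exactly $\exp (- φ (t_{i}^{(I)} - t))$. As with any rejection sampler, the overall acceptance probability is 1/M.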

A full derivation of the above (importantly showing that M > 1) and other technical details are provided in Section 3 of the Supplementary Material. This step is also fully parallelizable across individuals, as the conditional sampling is performed separately for each person i. Through simulation experiments (see Supplementary Section 6), we see that the rejection sampler is very efficient, with an average acceptance rate of approximately 45%.
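The two-step sampler can be sketched in a few lines. This is a minimal illustration, assuming the step function $λ_{i} (t)$ is supplied as change points `breaks` (from 0 to $t_{i}^{(I)}$) and per-interval `rates`; the function names and the inverse-CDF approach to the proposal draw are our assumptions and may differ from the authors' implementation in Supplementary Section 3.

```python
import numpy as np

def sample_truncated_inhom_exp(breaks, rates, rng):
    """Draw from q_i: density proportional to lambda(t) * exp(-Lambda(t)) on (0, t_I).

    breaks : increasing change points with breaks[0] = 0 and breaks[-1] = t_I
    rates  : rate on each [breaks[k], breaks[k+1]); len(rates) == len(breaks) - 1
    """
    breaks = np.asarray(breaks, dtype=float)
    rates = np.asarray(rates, dtype=float)
    # Cumulative hazard Lambda at each change point (piecewise linear in t).
    cum = np.concatenate(([0.0], np.cumsum(rates * np.diff(breaks))))
    total = cum[-1]
    if total <= 0.0:
        raise ValueError("rate is zero on the whole interval; q_i is undefined")
    # Inverse CDF: solve (1 - exp(-Lambda(t))) / (1 - exp(-total)) = v for Lambda(t).
    v = rng.uniform()
    target = -np.log1p(v * np.expm1(-total))
    target = min(target, np.nextafter(total, 0.0))   # guard against rounding up to total
    # Locate the segment with cum[k] <= target < cum[k+1]; its rate is positive.
    k = int(np.searchsorted(cum, target, side="right")) - 1
    return breaks[k] + (target - cum[k]) / rates[k]

def sample_exposure_time(breaks, rates, phi, rng, max_tries=10_000):
    """Rejection sampler for one exposure time, following Steps 1 and 2 above."""
    t_I = float(np.asarray(breaks, dtype=float)[-1])
    for _ in range(max_tries):
        t = sample_truncated_inhom_exp(breaks, rates, rng)    # Step 1: propose from q_i
        if rng.uniform() < np.exp(-phi * (t_I - t)):          # Step 2: accept w.p. (3.8)
            return t
    raise RuntimeError("no proposal accepted; check rates and phi")
```

Because each call touches only one individual's rate function, the draws for different individuals can be dispatched independently, which is what makes this step parallelizable.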

Step 2: conditional sampling of missing recovery times. The conditional samples of missing recovery times should satisfy two conditions: first, an individual i cannot recover while they are still known to be infectious; second, i cannot recover before the exposure time of any case for which i must serve as the infector. This amounts to conditionally sampling event times with endpoints restricted by low-resolution epidemic data and high-resolution contact data.

This challenge was previously addressed by the DARCI algorithm developed in Bu et al. (2020) (Proposition 4.2) for a simpler epidemic model with only one type of infectives. Here, we adapt and modify DARCI for our two-type infective setting, conditional on the value of p_s (the proportion of I_s among all I individuals) and the exposure times sampled in Step 1. For brevity, we leave technical details to the Supplementary Material (Section 4).

Uncertainty quantification. For estimates produced by DANCE, we can quantify uncertainty by leveraging expressions for their asymptotic variances, using results established in Nielsen (2000). We further implement a multiple-chain strategy to reduce variance, by (i) averaging the last m iterations in one chain, or (ii) averaging m independent chains. As derived in Nielsen (2000), for example, averaging m = 10 independent chains of DANCE would provide a conservative variance estimate of $1.05 {(I (\hat{Θ}))}^{- 1}$ where $\hat{Θ}$ are the parameter estimates and $I (\cdot)$ denotes the Fisher information matrix. This allows us to produce conservative Wald-type confidence intervals. See full details in Supplementary Section 5.

Validation via simulations. We perform comprehensive simulation studies to validate the DANCE inference framework, by first validating the complete data inference procedure (Supplementary Section 6.1), and then testing the data augmentation component of DANCE (Step 1 and Step 2) by first taking out all simulated exposure times followed by removing all exposure and recovery times from simulated datasets (Supplementary Section 6.2). Across 40 independent simulations for each scenario, our inference algorithm is able to accurately recover the parameter values and produce confidence intervals with good coverage rates. We present a detailed description and all results of the simulation studies in Supplementary Section 6.

Remarks on computing time. As a stochastic EM algorithm, DANCE is computationally fast. With a moderately parallelized implementation on a regular 4-core laptop, each iteration typically takes a few seconds and the algorithm usually converges in about 100 iterations for a population of 100 to 200 people (similar in size to our motivating dataset), for a total computing time on the order of minutes.

4 Case study: flu season on a university campus

To illustrate our model and inference framework, we present a case study on transmissions of influenza-like illnesses among students on a university campus, where high-resolution contact tracing was performed to track physical proximity between study subjects and individual-level baseline characteristics were collected.

This dataset was collected over a 10-week epidemiological study, eX-FLU (Aiello et al. 2016), where inter-personal physical contacts of study participants were surveyed to investigate the effect of social intervention on respiratory infection transmissions. A total of 590 university students enrolled in the study and were asked to respond to weekly surveys on influenza-like illness symptoms and social interactions; they also completed a comprehensive entry survey about demographic information, lifestyles, immunization history, health-related habits, and tendencies of behavioral changes during a flu season or a hypothetical pandemic. Among the study population, 103 individuals were further recruited to participate in a sub-study in which each study subject was provided a smartphone equipped with an application, iEpi. This application pairs smartphones with other nearby study devices via Bluetooth and thus can record individual-level contacts (i.e. physical proximity) at five-minute intervals. Bluetooth signals are pre-processed based on signal strength to identify sufficiently close pairwise physical proximity, which is treated as a contact link (Supplementary Section 7.2).

The iEpi sub-study took place from January 28, 2013 to April 15, 2013 (that is, from week 2 until after week 10 in the main study). Between weeks 6 and 7, there was a one-week spring break (March 1 to March 7), during which epidemic data collection was paused and volume of recorded contacts also dropped considerably. In our application case study, we use data obtained on the N = 103 sub-study population from January 28 to April 4 (week 2 to week 10), and treat the two periods before and after the spring break as two different social behavior phases. That is, we regard weeks 2-6 as $T_{0}$ and weeks 7-10 as $T_{1}$ in our analysis.

We consider two types of “infectious” (status I) members within the study population: (1) multi-symptomatic (I_s) – a case with a cough AND one of these three symptoms: fever or feverishness, chills, or body aches (definition of “influenza-like-illnesses”); (2) uni-symptomatic (I_a) – a case with a cough, a non-specific but important symptom for influenza.

For each infection case, we set the reported symptom onset time as the manifestation time (denoted by $t_{i}^{(I)}$ in previous sections), and treat the exposure time ($t_{i}^{(E)}$) and recovery time ($t_{i}^{(R)}$) as unobserved. Since $t_{i}^{(E)} < t_{i}^{(I)}$ (implied by the assumed SEIR mechanism), we set the plausible incubation interval as $L_{i} = (0, t_{i}^{(I)})$. Using weekly surveys (which asked each participant if they felt sick in the past week), we know that the missing recovery times must lie within a 7-day interval for each individual, where the lower and upper bounds are the start and end of a week. Moreover, we assume that all the contact network events are fully observed, as the high-resolution contact tracing can provide timepoints of activation and termination of all individual-level contacts. This suggests that our proposed DANCE algorithm is applicable to this dataset.

4.1 Inference with external infection sources

Since the 103 individuals in the dataset are sub-sampled from the 590 study participants, who are themselves sub-sampled from the entire university campus population, we have to treat the data as observed from an open population instead of a closed one. Therefore, some slight modifications should be made to the model. Specifically, individuals in our target population may get infected from outside infection sources, which we refer to as “external infectors.”

For simplicity, we represent the joint forces of all external infectors by a single infector that exists outside of the population and exhibits a constant level of transmissibility over time, and this external force of infection is exerted uniformly on all members of the target population.

For each susceptible individual j, let the rate of disease onset (i.e. manifestation) due to external infectors be ξ_j, and let this onset rate depend on individual characteristics x_j, similar to our treatment of the internal exposure rate β_ij: $\log ξ_{j} = \log ξ + x_{j}^{T} b_{E}$, where ξ denotes the population-average external onset rate, and the coefficients b_E describe how individual characteristics x_j are associated with subject j’s deviation in susceptibility from the average level.
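The log-linear form means each covariate acts multiplicatively on the external onset rate. The snippet below is a small sketch of that calculation, plugging in the point estimates reported in Tables 2 and 3 of the data analysis; the function name, the covariate ordering, and the use of an all-zero covariate vector as the baseline individual are assumptions made for illustration.

```python
import numpy as np

# Point estimates copied from Tables 2 and 3 (external onset); covariate order
# is assumed to be (flushot, wash_opt, change_behavior, prevention).
XI = 0.00445
B_E = np.array([-0.805, -0.139, 0.257, -0.0362])

def external_onset_rate(x, xi=XI, b_e=B_E):
    """Return xi_j = xi * exp(x_j^T b_E), i.e. log xi_j = log xi + x_j^T b_E."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    return xi * np.exp(x @ b_e)

# Two hypothetical individuals with zero scores: unvaccinated vs. vaccinated.
rates = external_onset_rate([[0, 0, 0, 0], [1, 0, 0, 0]])
rate_ratio = rates[1] / rates[0]   # equals exp(b_E[flushot])
```

Note that the flushot coefficient carries a large standard error in Table 3, so the implied rate ratio should be read as illustrative arithmetic rather than an established effect.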

Here ξ_j is the rate of moving from status S directly to either I_a or I_s, rather than from S to E, which is why we call it the “external onset rate” instead of the “external exposure/infection rate.” We do not introduce both an exposure rate (like β_ij) and a manifestation rate (like $φ$) for external infection cases because of identifiability concerns: since all susceptible people are exposed to the same external infector with time-invariant transmissibility, the exposure rate and manifestation rate would not be identifiable at the same time when the exposure times are not observed. Thus, to ensure identification, we include only one rate instead of two, and the “onset rate” can be thought of as the rate at which any susceptible individual develops contagiousness due to external infection forces.

Now the set of parameters is extended to $\tilde{Θ} = \{β, φ, γ, η, b_{S}, ξ, b_{E}, α, ω\}$, and we can write down a complete data likelihood by slightly modifying Eq. (3.5), where the terms related to the new parameters ξ and b_E are separate from the other terms (Supplementary Section 7.4). This means that introducing external cases does not affect estimation of the other parameters at all, and that we can still use the DANCE algorithm detailed in Section 3.2.

4.2 Data analysis

We first discuss how we identify internal and external infection cases and describe the individual characteristics used in the analysis. If an infected person had any infectious contact (within the 103-person population) up to 2 weeks prior to symptom onset, then we label this case as “internal,” and otherwise this case is labeled as “external.” This procedure gives us 18 internal cases and 16 external cases in total. Moreover, among all 34 cases, 13 are multi-symptomatic (I_s) and 21 are uni-symptomatic (I_a). We provide a summary of the breakdown of all infection cases in Supplementary Section 7.
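The labeling rule above is simple enough to state as code. The sketch below assumes a hypothetical data layout in which, for each case, the times (in days) of contacts with infectious members of the study population have already been extracted; the function name and arguments are illustrative and not taken from the study's software.

```python
def classify_case(onset_time, infectious_contact_times, window_days=14.0):
    """Label a case 'internal' if any contact with an infectious study member
    occurred within `window_days` before symptom onset, else 'external'."""
    for t in infectious_contact_times:
        if onset_time - window_days <= t <= onset_time:
            return "internal"
    return "external"

# Hypothetical examples: onset on day 30 with an infectious contact on day 20,
# and onset on day 30 whose only infectious contact was on day 10.
labels = [classify_case(30.0, [20.0]), classify_case(30.0, [10.0])]
```

Deciding which contacts count as "infectious" depends on the contact partner's own status at the time, which this sketch takes as given.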

We consider the following four individual-level characteristics collected from the entry survey that have previously been linked to disease transmission risk (the original survey questions used to calculate the derived covariates “change_behavior” and “prevention” are provided in Supplementary Section 7): (i) flushot—whether or not the study subject has taken a flu shot for this year; (ii) wash_opt—whether or not the study subject’s hand-washing habit is considered “optimal,” derived from survey questions about how long and how frequently one usually washes their hands; (iii) change_behavior—a derived numeric score measuring how willingly the study subject would change their lifestyle during a hypothetical pandemic, where a higher score represents more willingness in changing one’s lifestyle in response to a pandemic; (iv) prevention—a derived score measuring one’s belief in the effectiveness of different preventative practices in reducing the risk of catching the flu; a higher score represents stronger belief in the effectiveness of preventative practices.

We perform 20 independent runs of the stochastic EM inference procedure on the dataset, each time with a different random initialization and 60 burn-in steps. For each run, we take the average of the last 20 iterations (after burn-in) and then average over the 20 averages (across runs) to produce estimates of the parameters. Convergence is assessed by examining traceplots and Geweke diagnostics, and model fit is validated by simulation-based predictive checks (Supplementary Section 7.8). Conservative asymptotic standard errors are obtained using the method described in Section 3.2, setting m = 20 and upper-bounding the asymptotic variance matrix by $1.025 {(I (\hat{\tilde{Θ}}))}^{- 1}$, where $\hat{\tilde{Θ}}$ are the final parameter estimates produced by averaging.
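The pooling and the conservative standard errors can be sketched as follows. This assumes the sEM iterates are stored in an array of shape (runs, iterations, parameters) and that an estimate of the Fisher information matrix is available; both function names and the storage layout are hypothetical, and the inflation factor is simply the 1.025 quoted above.

```python
import numpy as np

def pooled_estimate(chains, burn_in=60, last=20):
    """Average the last `last` post-burn-in iterates within each run, then across runs.

    chains : array of shape (n_runs, n_iter, n_params) holding sEM iterates
    """
    chains = np.asarray(chains, dtype=float)
    post = chains[:, burn_in:, :]                 # discard burn-in iterations
    if post.shape[1] < last:
        raise ValueError("not enough post-burn-in iterations to average")
    per_run = post[:, -last:, :].mean(axis=1)     # one averaged estimate per run
    return per_run.mean(axis=0), per_run

def conservative_se(fisher_info, inflation=1.025):
    """Standard errors from the inflated inverse Fisher information."""
    cov = inflation * np.linalg.inv(np.asarray(fisher_info, dtype=float))
    return np.sqrt(np.diag(cov))
```

Keeping the per-run averages alongside the pooled estimate makes it easy to inspect between-run spread as an informal convergence check.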

Tables 2 and 3 present estimates of key epidemic parameters, taking one day as one unit of time. For this population, the baseline exposure rate is quite high, indicating fast disease exposure upon contact: it takes approximately 0.22 days on average for an $H \sim I$ contact to lead to infection if the susceptible individual is not vaccinated, does not wash hands properly, and has neutral attitudes about disease prevention. On average, the incubation period lasts slightly less than 5 days, while recovery takes about 6 days. The total external infection force experienced by the entire N = 103-person population is on the scale of $0.00445 \times 103 \approx 0.458$ onsets per day, indicating that on average there would be a disease onset due to external sources roughly every other day if nobody in the study population had a flu shot or washed their hands optimally. In terms of the coefficients for individual-level covariates, we note that the estimates are associated with relatively large standard errors (indicated in parentheses), potentially due to the small sample size reflected by the moderate number of infection cases. Nevertheless, hand-washing (“wash_opt”) seems to be a considerably influential mitigation measure, given that there is an 11-fold reduction ($1 / e^{- 2.42} \approx 11.2$) in the exposure risk if one washes their hands optimally compared to suboptimal hand-washing; this statistical association appears significant, with a 95% Wald confidence interval of $(- 4.054, - 0.786)$ that does not contain zero.
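The figures quoted in this paragraph follow from the point estimates in Tables 2 and 3 by simple arithmetic, reproduced here as a sanity check. Variable names are ours. One observation from the arithmetic, not stated in the text: the reported interval for wash_opt is matched by the estimate plus or minus 2 standard errors, whereas a multiplier of 1.96 would give roughly (-4.02, -0.82).

```python
import math

# Point estimates from Tables 2 and 3 (rates are per day).
beta, xi, phi, gamma = 4.497, 0.00445, 0.221, 0.161
N = 103

# Mean waiting times are reciprocals of the corresponding rates.
mean_days = {
    "contact_to_exposure": 1.0 / beta,
    "incubation": 1.0 / phi,
    "recovery": 1.0 / gamma,
}

external_total = xi * N                      # population-level external onset rate
days_between_external_onsets = 1.0 / external_total

b_wash, se_wash = -2.42, 0.817               # wash_opt coefficient for internal exposure
fold_reduction = math.exp(-b_wash)           # same quantity as 1 / exp(b_wash)
wald_ci = (b_wash - 2.0 * se_wash, b_wash + 2.0 * se_wash)
```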

Table 2

Estimates of key epidemic parameters, with conservative estimates of asymptotic standard errors.

Parameter	Estimate	Standard error
β (internal exposure)	4.497	2.005
ξ (external onset)	0.00445	0.00114
$φ$ (latency)	0.221	0.0591
γ (recovery)	0.161	0.0279
$e^{η}$ (I_s vs. I_a infectiousness)	0.0622	0.0526
p_s (proportion of I_s)	0.382	0.0854

Table 3

Estimates of epidemic coefficients on individual characteristics, with conservative asymptotic standard deviations in the parentheses.

	(flushot)	(wash_opt)	(change_behavior)	(prevention)
b_S (internal exposure)	–0.105 (0.671)	–2.42 (0.817)	–0.201 (0.326)	–0.0541 (0.273)
b_E (external onset)	–0.805 (0.597)	–0.139 (0.471)	0.257 (0.263)	–0.0362 (0.273)

In Table 4 we include estimates of key parameters related to the contact network process. Here we emphasize the difference between the change rates of $H \sim H$ (healthy-healthy) links and $H \sim I$ (healthy-ill) links, as well as the difference between the two social phases ($T_{0}$ before spring break and $T_{1}$ after). The link termination rates for $H \sim I$ links are higher than those of $H \sim H$ links in both phases, suggesting that contact between a healthy-infectious pair is on average shorter than contact between two healthy people; this might be because infected students avoided social activities as they felt unwell, or because susceptible individuals interacted less frequently with peers who seemed sick in order to avoid infection. Moreover, the level of network activity, in terms of both establishing and breaking contact, seems much higher in $T_{0}$ (weeks 2 to 6, before spring break) than in $T_{1}$ (weeks 7 to 10, after spring break), possibly due to increased outdoor activities (and thus fewer contacts via close physical proximity) after the spring break. Such findings are enabled by our model design, which allows for different levels of network activity by introducing different time phases.

Table 4

Estimates of link activation and termination rates for different link types in the two phases ($T_{0}$ spans from week 2 to week 6, and $T_{1}$ from week 7 to week 10), with estimates of standard deviations in parentheses.

Event type	Activation ( $α, \times 10^{- 4}$ )		Deletion ( $ω, \times 10^{0}$ )
Phase	$T_{0}$	$T_{1}$	$T_{0}$	$T_{1}$
$H \sim H$	181 (1.77)	8.68 (1.29)	11.6 (0.132)	5.27 (0.0783)
$H \sim I$	153 (6.67)	0.653 (0.0420)	16.6 (0.725)	8.71 (0.589)
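As a reading aid for the deletion rates in Table 4: if link durations are exponentially distributed with rate ω, as in a Markovian link process, the mean contact duration is 1/ω time units. The sketch below converts the point estimates to hours under the assumption that one time unit is one day for the network parameters as well; the dictionary layout and names are illustrative only.

```python
# Deletion-rate point estimates from Table 4 (assumed to be per day).
omega = {
    ("H~H", "T0"): 11.6,
    ("H~H", "T1"): 5.27,
    ("H~I", "T0"): 16.6,
    ("H~I", "T1"): 8.71,
}

# Mean duration of a contact link in hours: (1 / omega) days * 24 hours per day.
mean_contact_hours = {key: 24.0 / rate for key, rate in omega.items()}

for (link, phase), hours in mean_contact_hours.items():
    print(f"{link} links in {phase}: mean contact of about {hours:.1f} hours")
```

In both phases the implied mean duration is shorter for healthy-ill links than for healthy-healthy links, which is the comparison the surrounding discussion relies on.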

Through our data analysis, we have found quantitative evidence that proper hand-washing is significantly associated with reduced risks of flu infection, and that there is a considerable external force of infection for the study population. Moreover, study participants exhibit adaptive contact behavior to flu transmission with less frequent and shorter-lasting contacts between healthy and infectious individuals. These findings are consistent with intuition and are also reflected in the dataset where optimal hand-washers seem less prone to infections and infectious individuals tend to lose contact links (Supplementary Section 7.9).

5 Discussion

In this paper, we present a data-augmented stochastic EM inference algorithm for partially observed epidemics on a dynamic contact network while accounting for heterogeneous infection risks associated with individual characteristics. The design of a likelihood-based inferential framework is challenged by and benefits from the availability of high-resolution contact tracing data—the state space of latent variables is expanded to all unobserved individual epidemic event times, but at the same time largely reduced thanks to the knowledge of dynamic contact links.

It is important to note that the modeling framework we propose is flexible beyond our choice of underlying compartments. That is, our approach can be easily adapted to incorporate notions of reinfection (by allowing some individuals to reenter the susceptible population) or to distinguish between more than two types of infections. In pursuing generalizations of the methodology, introducing additional parameters requires careful consideration of the uncertainty quantification from the stochastic EM algorithm. Although in our setting, the estimated confidence intervals perform well empirically compared to their nominal coverage, they rely on variance approximation formulas, and it is crucial to conduct similar validations in more complex models.

There are, however, several limitations in our model assumptions that can motivate future research. First, for mathematical convenience we assume a Markovian model, but extensions could be made toward non-Markovian infection or recovery processes, using Gamma or Weibull distributions for inter-event wait times. Second, we assume all dynamic contact links are observed, but for a larger population where high resolution contact tracing is less feasible, one could use a social network model (stochastic block models or latent factor models) to account for unobserved contacts; our assumed binary contact links could be extended to categorical or continuous weighted links using the signal strengths of mobile device contact tracing. Lastly, we made a couple of pragmatic compromises in the real data case study: we identified external infections based on lack of internal contacts, but if more population-level data were to become available (e.g. contact surveys or viral sequencing data) we could estimate external infections in a joint statistical model; similarly, we used the asymptomatic or less symptomatic compartment (I_a) to identify non-specific symptomatic cases rather than truly asymptomatic cases, but one could introduce additional latent compartments to infer asymptomatic infections based on differential contact patterns or other data sources such as surveillance testing.

Our analysis of the iEpi data provides further evidence of the importance of personal hygiene and health habits in reducing the spread of influenza-like illness. Through a careful analysis of real observational epidemiological data, we found a considerable association between hand-washing and the transmission rate of a disease in an active population with dynamically changing contact patterns. We hope that this development encourages greater collection of high-frequency individual-level data in this area to gain a better understanding of other pharmaceutical and non-pharmaceutical interventions. For example, future studies will be able to estimate the effectiveness of vaccination in preventing transmission under different social interaction rates and population densities, and assess claims about the efficacy of mask-wearing and active social distancing. Importantly, such data can be collected discretely in closed populations and provide invaluable insight into the deployment of public health interventions (Motta et al. 2021).

Supplementary material

Supplementary material is available at Biostatistics Journal online.

Funding

A.V. and J.X. were partially supported by DMS-2230074. A.V. was partially supported by DMS-2046880.

Conflict of interest statement

None declared.

Data availability

Software in the form of R and Python code, together with a sample input data set and complete documentation, is available on GitHub at https://github.com/fanbu1995/EpiNetHetero.

References

Aiello AE, Simanek AM, Eisenberg MC, Walsh AR, Davis B, Volz E, Cheng C, Rainey JJ, Uzicanin A, Gao H, et al. Design and methods of a social network isolation study for reducing respiratory infection transmission: the eX-FLU cluster randomized trial. Epidemics. 2016:15:38–55.

Andrieu C, Doucet A, Holenstein R. Particle Markov chain Monte Carlo methods. J R Stat Soc Ser B (Stat Methodol). 2010:72(3):269–342.

Ball F, Britton T. Epidemics on networks with preventive rewiring. Random Struct Algorithms. 2022:61(2):250–297.

Bor A, Jørgensen F, Petersen MB. Discriminatory attitudes against unvaccinated people during the pandemic. Nature. 2023:613(7945):704–711.

Bu F, Aiello AE, Xu J, Volfovsky A. Likelihood-based inference for partially observed epidemics on dynamic networks. J Am Stat Assoc. 2022:117(537):510–526.

Celeux G. The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput Stat Q. 1985:2:73–82.

Doss CR, Suchard MA, Holmes I, Kato-Maeda M, Minin VN. Fitting birth-death processes to panel data with applications to bacterial DNA fingerprinting. Ann Appl Stat. 2013:8(4):2315.

Eames KTD, Keeling MJ. Contact tracing and disease control. Proc R Soc Lond Ser B Biol Sci. 2003:270(1533):2565–2571.

Ferguson NM, Laydon D, Nedjati-Gilani G, Imai N, Ainslie K, Baguelin M, Bhatia S, Boonyasiri A, Cucunubá Z, Cuomo-Dannenburg G, et al. Report 9: impact of non-pharmaceutical interventions (NPIs) to reduce COVID19 mortality and healthcare demand (Vol. 16). London: Imperial College London.

Fintzi J, Wakefield J, Minin VN. A linear noise approximation for stochastic epidemic models fit to partially observed incidence counts. Biometrics. 2022:78(4):1530–1541.

Guttorp P. Stochastic Modeling of Scientific Data (1st ed.). Chapman and Hall/CRC; 1995.

He D, Ionides EL, King AA. Plug-and-play inference for disease dynamics: measles in large and small populations as a case study. J R Soc Interface. 2010:7(43):271–283.

Ho LST, Crawford FW, Suchard MA. Direct likelihood-based inference for discretely observed stochastic compartmental models of infectious disease. Ann Appl Stat. 2018a:12(3):1993–2021.

Ho LST, Xu J, Crawford FW, Minin VN, Suchard MA. Birth/birth-death processes and their computable transition probabilities with biological applications. J Math Biol. 2018b:76(4):911–944.

Hobolth A, Stone EA. Simulation from endpoint-conditioned, continuous-time Markov chains on a finite state space, with applications to molecular evolution. Ann Appl Stat. 2009:3(3):1204.

Ju N, Heng J, Jacob PE. 2021. Sequential Monte Carlo algorithms for agent-based models of disease transmission, arXiv, arXiv:2101.12156, preprint: not peer reviewed.

Kenah E, Lipsitch M, Robins JM. Generation interval contraction and epidemic data analysis. Math Biosci. 2008:213(1):71–79.

Kermack WO, McKendrick AG. A contribution to the mathematical theory of epidemics. Proc R Soc Lond Ser A. 1927:115(772):700–721.

Kiss IZ, Green DM, Kao RR. Infectious disease control using contact tracing in random and scale-free networks. J R Soc Interface. 2006:3(6):55–62.

Lunz D, Batt G, Ruess J. To quarantine, or not to quarantine: A theoretical framework for disease control via contact tracing. Epidemics. 2021:34:100428.

Morsomme R, Xu J. 2022. Exact inference for stochastic epidemic models via uniformly ergodic block sampling, arXiv, arXiv:2201.09722, preprint: not peer reviewed.

Motta FC, McGoff KA, Deckard A, Wolfe CR, Bonsignori M, Moody MA, Cavanaugh K, Denny TN, Harer J, Haase SB. Assessment of simulated surveillance testing and quarantine in a SARS-CoV-2–vaccinated population of students on a university campus. JAMA Health Forum. 2021:2(10):e213035.

Nielsen SF. The stochastic EM algorithm: estimation and asymptotic results. Bernoulli. 2000:6(3):457–489.

Pooley CM, Bishop SC, Marion G. Using model-based proposals for fast parameter inference on discrete state space, continuous-time Markov processes. J R Soc Interface. 2015:12(107):20150225.

Rao V, Teh YW. Fast MCMC sampling for Markov jump processes and extensions. J Mach Learn Res. 2013:14(11):3295–3320.

Renshaw E. Stochastic population processes: analysis, approximations, simulations. Oxford, New York: Oxford University Press; 2015.

Rieger NS, Worley NB, Ng AJ, Christianson JP. Insular cortex modulates social avoidance of sick rats. Behav Brain Res. 2022:416:113541.

Rose EB, Roy JA, Castillo-Neyra R, Ross ME, Condori-Pino C, Peterson JK, Naquira-Velarde C, Levy MZ. A real-time search strategy for finding urban disease vector infestations. Epidemiol Methods. 2020:9(1):20200001.

Shukla P, Lee M, Whitman SA, Pine KH. Delay of routine health care during the COVID-19 pandemic: a theoretical model of individuals’ risk assessment and decision making. Soc Sci Med. 2022:307:115164.

Soriano-Arandes A, Gatell A, Serrano P, Biosca M, Campillo F, Capdevila R, Fàbrega A, Lobato Z, López N, Moreno AM, et al. Household SARS-CoV-2 transmission and children: a network prospective study. Clin Infect Dis Off Public Infect Dis Soc Am. 2021:73(6):e1261–e1269.

Stutz TC, Sinsheimer JS, Sehl M, Xu J. Computational tools for assessing gene therapy under branching process models of mutation. Bulletin of Mathematical Biology. 2022:84:1–17.

Wang S, Walker SG. Bayesian data augmentation for partially observed stochastic compartmental models. Bayesian Anal. 2023:1(1):1–24.

Xu J, Guttorp P, Kato-Maeda M, Minin VN. Likelihood-based inference for discretely observed birth–death-shift processes, with applications to evolution of mobile genetic elements. Biometrics. 2015:71(4):1009–1021.

Xu J, Minin VN. Efficient transition probability computation for continuous-time branching processes via compressed sensing. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (UAI’15). Arlington, Virginia, USA: AUAI Press. p. 952–961.
