A transformation perspective on marginal and conditional models

Interpretation of the linear predictor $x^{⊤} β$ under different inverse link functions F. In practice, many models are known with respect to the link function $F^{- 1}$ ⁠, which we report accordingly. We denote the baseline (⁠ $x^{⊤} β = 0$ ⁠) cumulative distribution function by $F (h (y))$ and the conditional cumulative distribution function by $F (h (y) | x)$

$F^{- 1} (z)$	$F (z)$	Interpretation of $x^{⊤} β$
probit	$Φ_{0, 1} (z)$	conditional mean
	Standard normal	$E (h (Y) | x) = x^{⊤} β$
logit	${logit}^{- 1} (z) = \frac{1}{1 + exp (- z)}$	log-odds ratio
	Standard logistic	$\frac{F (h (y) | x)}{1 - F (h (y) | x)} = exp (- x^{⊤} β) \frac{F (h (y))}{1 - F (h (y))}$
cloglog	${cloglog}^{- 1} (z) = 1 - exp (- exp (z))$	log-hazard ratio
	Gompertz/Min. Extreme Value	$1 - F (h (y) | x) = {(1 - F (h (y)))}^{exp (- x^{⊤} β)}$
loglog	${loglog}^{- 1} (z) = exp (- exp (- z))$	log-reverse time hazard ratio
	Gumbel/Max. Extreme Value	$F (h (y) | x) = F {(h (y))}^{exp (x^{⊤} β)}$

Table 1
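
The interpretations listed in Table 1 can be checked numerically. The following Python sketch (standard library only) evaluates $F (h (y) - x^{⊤} β)$ for the four inverse link functions and confirms the odds, hazard, and reverse time hazard relations. The values of $h (y)$ and $x^{⊤} β$ and all function names are illustrative assumptions, not taken from the paper or its software.

```python
import math
from statistics import NormalDist

# Inverse link functions F from Table 1, each mapping the real line to (0, 1).
INVERSE_LINKS = {
    "probit": NormalDist().cdf,
    "logit": lambda z: 1.0 / (1.0 + math.exp(-z)),
    "cloglog": lambda z: 1.0 - math.exp(-math.exp(z)),
    "loglog": lambda z: math.exp(-math.exp(-z)),
}


def conditional_cdf(link, h_y, lin_pred):
    """Evaluate F(h(y) - x'beta) for a named inverse link F."""
    return INVERSE_LINKS[link](h_y - lin_pred)


def odds(p):
    return p / (1.0 - p)


# Illustrative values (assumptions, not estimates from the paper).
h_y, lin_pred = 0.3, 0.8

# logit: the odds of Y <= y are multiplied by exp(-x'beta).
odds_ratio = odds(conditional_cdf("logit", h_y, lin_pred)) / odds(
    conditional_cdf("logit", h_y, 0.0)
)

# cloglog: the survivor function is raised to the power exp(-x'beta).
surv_cond = 1.0 - conditional_cdf("cloglog", h_y, lin_pred)
surv_base = 1.0 - conditional_cdf("cloglog", h_y, 0.0)

# loglog: the distribution function is raised to the power exp(x'beta).
cdf_cond = conditional_cdf("loglog", h_y, lin_pred)
cdf_base = conditional_cdf("loglog", h_y, 0.0)
```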

For clustered or longitudinal data, we observe multiple values of the response $Y$ for each observational unit (clusters or subjects) whose interdependencies are not reflected in model (2.1). Adding, in analogy to GLMMs, a random effects term $u^{⊤} r$ to the linear predictor in (2.1) defines mixed-effects transformation models

$$P_{Y} (Y \leq y | x, u, r) = F (h (y) - x^{⊤} β - u^{⊤} r) .$$

(tramME)

When the random effects follow a specific bridge distribution to F, that is, the normal distribution for $F = Φ$ ⁠, the stable distribution when $F = {cloglog}^{- 1}$ (Aalen and others, 2008), or the distribution derived by Wang and Louis (2003) for $F = {logit}^{- 1}$ ⁠, marginal distributions can be derived. Neither the likelihood nor marginal distributions, and thus a marginal interpretation of $β$ ⁠, are available in closed form when the model is formulated differently, especially when normal random effects are coupled with $F \neq Φ$ (Tamási and others, 2022).

2.1 Joint transformation models

To address these issues, we present a novel transformation model for the joint distribution that provides simple analytic expressions for marginal predictive distributions of the form (2.1). In this setup, $N$ independent observational units $i = 1, \dots, N$, each consisting of $N_{i}$ correlated observations of the response $Y_{i} = {(Y_{i 1}, \dots, Y_{i N_{i}})}^{⊤} \in Ξ^{N_{i}}$, are available for estimating the joint distribution. While refraining from specifying a particular parametric joint multivariate distribution for $Y_{i}$, we assume that probabilities on the scale of a suitable transformation of $Y_{i}$ can be evaluated using a multivariate normal distribution whose structured covariance matrix captures the correlations between the transformed elements of $Y_{i}$. The aim of this article is to simultaneously estimate the transformation, regression coefficients, and the structured covariance from data using models which emphasize predictive distributions and parameter interpretability.

The nondecreasing transformation function $h : Ξ \to R$ is applied element-wise to the response vector, $h_{N_{i}} (Y_{i}) = {(h (Y_{i 1}), \dots, h (Y_{i N_{i}}))}^{⊤}$, ensuring that the same transformation is applied to all $N_{i}$ observations. Together with $Y_{i}$, one observes a corresponding matrix $X_{i} = {(x_{i 1} | \dots | x_{i N_{i}})}^{⊤} \in R^{N_{i} \times Q}$ of full rank containing treatment assignment or covariates whose corresponding regression coefficients $β$ are of interest. In addition, the design of the experiment is described by a matrix $U_{i} = {(u_{i 1} | \dots | u_{i N_{i}})}^{⊤} \in R^{N_{i} \times R}$. We exclusively study setups with simple cluster assignment encoded in this matrix ($U_{i} = {(1)}_{N_{i}, 1}$) or longitudinal data ($u_{i j} = (1, t_{i j})$, indicating that $Y_{i j}$ for the $i$th subject was observed at time $t_{i j}$). We propose to study models for the joint distribution function of $Y_{i}$ given $X_{i}$ and $U_{i}$ of the form

$$P (Y_{i} \leq y | X_{i}, U_{i}) = Φ_{0_{N_{i}}, Σ_{i} (γ)} \left( D_{i} (γ) Φ_{N_{i}}^{- 1} \left( F_{N_{i}} \left\{ D_{i} {(γ)}^{- 1} [h_{N_{i}} (y) - X_{i} β] \right\} \right) \right) .$$

(2.2)

Here, $Φ_{0_{N_{i}}, Σ_{i} (γ)} (\cdot)$ is the distribution function of an $N_{i}$-dimensional normal random vector with mean vector zero and structured covariance matrix $Σ_{i} (γ) := U_{i} Λ (γ) Λ {(γ)}^{⊤} U_{i}^{⊤} + I_{N_{i}}$ as defined by the random effects design matrix and an unstructured Cholesky factor $Λ (γ) \in R^{R \times R}$ depending on unknown variance parameters $γ \in R^{R (R + 1) / 2}$; $I_{N_{i}}$ denotes the $N_{i} \times N_{i}$ identity matrix. We isolate the square roots of the diagonal elements of $Σ_{i} (γ)$ in the diagonal matrix $D_{i} (γ) = diag {(Σ_{i} (γ))}^{1 / 2} \cdot I_{N_{i}} = diag {(U_{i} Λ (γ) Λ {(γ)}^{⊤} U_{i}^{⊤} + I_{N_{i}})}^{1 / 2} \cdot I_{N_{i}}$. A positive-semidefinite covariance matrix $Σ_{i} (γ)$ is given under the constraint $diag (Λ (γ)) \geq 0_{R}$. For the simple model with $U_{i} = {(1, \dots, 1)}^{⊤}$, we have $Λ (γ) = γ_{1}$ and $Σ_{i} (γ) = {(γ_{1}^{2})}_{N_{i}, N_{i}} + I_{N_{i}}$, and $D_{i} {(γ)}^{- 1} = {(γ_{1}^{2} + 1)}^{- 1 / 2} \cdot I_{N_{i}}$ is a scaling factor applied to the transformation function $h$ and the regression coefficients $β$, which is instrumental for the derivation of marginal distributions. In the longitudinal setup, $Λ = \left( \begin{matrix} γ_{1} & 0 \\ γ_{2} & γ_{3} \end{matrix} \right)$ and the covariance $Σ_{i} {(γ)}_{j, ȷ}$ depends on the observation times $t_{i j}$ and $t_{i ȷ}$. The key component is the shifted transformation $h_{N_{i}} (y) - X_{i} β$ modeling the impact of the regression coefficients on the transformed scale.
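
To make the covariance structure concrete, the following Python sketch builds $Σ_{i} (γ) = U_{i} Λ (γ) Λ {(γ)}^{⊤} U_{i}^{⊤} + I_{N_{i}}$ and the diagonal of $D_{i} (γ)$ for a random intercept design and for a longitudinal design with $u_{i j} = (1, t_{i j})$. Matrices are plain nested lists; the helper names, parameter values, and observation times are assumptions chosen for illustration only.

```python
import math


def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def sigma(U, Lam):
    """Structured covariance U Lam Lam' U' + I."""
    UL = matmul(U, Lam)
    S = matmul(UL, transpose(UL))
    for j in range(len(S)):
        S[j][j] += 1.0
    return S


def d_diag(S):
    """Square roots of the diagonal of Sigma, i.e. the diagonal of D."""
    return [math.sqrt(S[j][j]) for j in range(len(S))]


# Random intercept model: U is a column of ones, Lam is the scalar gamma_1.
gamma1 = 1.5
U_ri = [[1.0], [1.0], [1.0]]
S_ri = sigma(U_ri, [[gamma1]])
D_ri = d_diag(S_ri)

# Longitudinal model: rows u_ij = (1, t_ij), lower-triangular Lam.
g1, g2, g3 = 1.0, 0.5, 0.8
times = [0.0, 1.0, 2.0]
U_long = [[1.0, t] for t in times]
S_long = sigma(U_long, [[g1, 0.0], [g2, g3]])
D_long = d_diag(S_long)
```

In the random intercept case every off-diagonal element equals $γ_{1}^{2}$, whereas in the longitudinal case both variances and covariances change with the observation times.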

The transformation function $h$, the regression coefficients $β$, and the variance parameters $γ$ are unknowns to be estimated from data. In (2.2), $Φ_{N_{i}}^{- 1} (p) = {(Φ^{- 1} (p_{1}), \dots, Φ^{- 1} (p_{N_{i}}))}^{⊤}$ applies the quantile function $Φ^{- 1}$ of the standard normal element-wise to some vector of probabilities $p = {(p_{1}, \dots, p_{N_{i}})}^{⊤} \in {(0, 1)}^{N_{i}}$. Furthermore, $F : R \to [0, 1]$ is an a priori defined cumulative distribution function of some absolutely continuous distribution with log-concave density $f$; $F_{N_{i}}$ and $f_{N_{i}}$ are the element-wise applications of $F$ and $f$, respectively.

For absolutely continuous responses $Y_{i} \in R^{N_{i}}$, model (2.2) implies that the latent variable

$$Z_{i} := D_{i} (γ) Φ_{N_{i}}^{- 1} \left( F_{N_{i}} \left\{ D_{i} {(γ)}^{- 1} [h_{N_{i}} (Y_{i}) - X_{i} β] \right\} \right) \in R^{N_{i}}$$

(2.3)

defined as an element-wise transformation of the observations $Y_{i}$ follows a multivariate normal distribution $Z_{i} \sim N_{N_{i}} (0_{N_{i}}, Σ_{i} (γ))$. The model is distribution-free in the sense that for a baseline configuration (with $X_{i} β = 0_{N_{i}}$), such a transformation into multivariate normality exists for all marginal distributions (Klein and others, 2022). The model does, however, impose a certain correlation structure through the choice of $U_{i}$. An example of the joint distributions induced by increasing correlations among bivariate repeated measurements with skewed marginal distributions is given in Figure 1.


Fig. 1

Illustration. Bivariate joint density of an unconditional logistic ($F = {logit}^{- 1}$) transformation model for repeated measures (cluster size $N_{i} \equiv 2$, $U_{i} = {(1, 1)}^{⊤}$, and $Σ_{i} = γ_{1}^{2} U_{i} U_{i}^{⊤} + I_{2}$) with transformation functions $h_{1} = h_{2} = \sqrt{1 + γ_{1}^{2}} \cdot logit \circ χ_{9}^{2}$ such that both marginal distributions follow the $χ_{9}^{2}$ law. For $γ_{1} = 0$, observations within a cluster are independent, and their correlation increases with increasing values of $γ_{1}$.

The key aspect of an implementation of model (2.2) is the parameterization of the transformation function as $h_{N_{i}} (y) = A (y) ϑ,$ where $A (y) = {(a (y_{1}) | \dots | a (y_{N_{i}}))}^{⊤} \in R^{N_{i} \times P}$ is the matrix of evaluated basis functions $a : Ξ \to R^{P}$ ⁠. Choices of basis functions a are problem-specific and several options are discussed in Section 3 and, in more detail, in Hothorn and others (2018) and Hothorn (2020).

2.2 Connection to normal LMMs

We first consider the special case $F = Φ$, where the transformation of $Y_{i}$ simplifies to $Z_{i} = h_{N_{i}} (Y_{i}) - X_{i} β = A (Y_{i}) ϑ - X_{i} β$. Model (2.2) contains the LMM as a special case. In its standard notation, the LMM reads

$$Y_{i} = α + X_{i} \tilde{β} + U_{i} R_{i} + σ ε_{i}$$

(LMM)

with random effects $R_{i} \sim N_{R} (0_{R}, G (γ))$, residuals $ε_{i} \sim N_{N_{i}} (0_{N_{i}}, I_{N_{i}})$ under the assumption $R_{i} ⊥ ε_{i}$, intercept $α \in R$, and residual standard deviation $σ \in R^{+}$. The matrices $X_{i}$ and $U_{i}$ are typically referred to as “fixed effects” and “random effects” design matrices in the literature. This model can be reformulated as a model for the joint multivariate distribution

$$Z_{i} = \frac{Y_{i} - α - X_{i} \tilde{β}}{σ} = U_{i} σ^{- 1} R_{i} + ε_{i} \sim N_{N_{i}} (0_{N_{i}}, U_{i} Λ (γ) Λ {(γ)}^{⊤} U_{i}^{⊤} + I_{N_{i}})$$

(2.4)

based on the relative covariance factorization $σ^{- 2} G (γ) = Λ (γ) Λ {(γ)}^{⊤} \in R^{R \times R}$. This is model (2.2) with $F = Φ$, linear transformation $h_{N_{i}} (Y_{i}) = {(σ^{- 1} (Y_{i 1} - α), \dots, σ^{- 1} (Y_{i N_{i}} - α))}^{⊤} = A (Y_{i}) ϑ$ with linear basis functions $a (y) = {(y, - 1)}^{⊤}$ and parameters $ϑ = {(σ^{- 1}, α σ^{- 1})}^{⊤}$, and finally fixed effects $β = σ^{- 1} \tilde{β}$.

Using this notation, the conditional distribution function of some element $Y \in Ξ$ of $Y_{i}$, conditional on $x$, $u$, and unobservable random effects $R = r$, is

$$P (Y \leq y | x, u, r) = Φ (a {(y)}^{⊤} ϑ - x^{⊤} β - σ^{- 1} u^{⊤} r) .$$

The marginal distribution of some element $Y \in Ξ$ of $Y_{i}$, which is still conditional on $x$ and $u$ but integrates over the random effects $R$, can be obtained from the joint multivariate normal (2.4) as

$$P (Y \leq y | x, u) = Φ \left( \frac{a {(y)}^{⊤} ϑ - x^{⊤} β}{\sqrt{u^{⊤} Λ (γ) Λ {(γ)}^{⊤} u + 1}} \right) .$$
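
The reparameterization of the LMM as a transformation model can be verified numerically for a random intercept model. In the Python sketch below, the LMM parameters (`alpha`, `beta_tilde`, `sigma`, and the random intercept standard deviation `tau`) are hypothetical values; the transformation-model parameters $ϑ$, $β$, and $γ_{1}$ are derived from them as described above, so that $σ^{- 2} G = γ_{1}^{2}$.

```python
import math
from statistics import NormalDist

# Hypothetical LMM parameters (assumptions for illustration only).
alpha, beta_tilde, sigma, tau = 250.0, 10.0, 25.0, 40.0

# Transformation-model parameterization derived from the LMM.
theta = (1.0 / sigma, alpha / sigma)  # coefficients of the basis a(y) = (y, -1)
beta = beta_tilde / sigma             # fixed effect on the transformed scale
gamma1 = tau / sigma                  # relative standard deviation of the random intercept


def a(y):
    """Linear basis a(y) = (y, -1)."""
    return (y, -1.0)


def h(y):
    """Linear transformation h(y) = a(y)' theta = (y - alpha) / sigma."""
    return sum(ai * ti for ai, ti in zip(a(y), theta))


def marginal_cdf_transformation(y, x):
    """Marginal CDF from the transformation model with u = 1."""
    scale = math.sqrt(gamma1 ** 2 + 1.0)
    return NormalDist().cdf((h(y) - x * beta) / scale)


def marginal_cdf_lmm(y, x):
    """Marginal CDF implied directly by the LMM: N(alpha + x beta_tilde, tau^2 + sigma^2)."""
    return NormalDist(alpha + x * beta_tilde, math.sqrt(tau ** 2 + sigma ** 2)).cdf(y)
```

Both functions return the same probability for any `y` and `x`, which illustrates that the shrinkage factor $1 / \sqrt{γ_{1}^{2} + 1}$ simply absorbs the random intercept variance into the marginal standard deviation.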

The shrunken marginal fixed effects $β / \sqrt{u^{⊤} Λ (γ) Λ {(γ)}^{⊤} u + 1}$ were also described by Wu and Wang (2019) in a Bayesian implementation of this model. Understanding the LMM as a special case of a transformation model makes it possible to relax the normality assumption for $Y_{i}$ by introducing nonlinear transformation functions $h (y) = a {(y)}^{⊤} ϑ$ defined by a nonlinear basis $a$ (Hothorn and others, 2018). Section 3.1 contains a comparison of the two models. Probit GLMMs for binary responses $Y \in Ξ = \{0, 1\}$ can also be understood as a special case of a transformation model with intercept $h (0) = α$ and $h (1) = \infty$. Several implementations of such GLMMs are compared empirically to an implementation motivated from a transformation model perspective in Section 3.2.

2.3 Distinction from generalized mixed-effects and frailty models

Two important extensions of the LMM are GLMMs and frailty models. For binary responses, the logistic GLMM has the conditional, given normal random effects $r$, interpretation

$$P (Y = 0 | x, u, r) = {logit}^{- 1} (α + x^{⊤} β + u^{⊤} r) .$$

In survival analysis with $Y \in Ξ = R^{+}$, a Weibull normal frailty model leads to the conditional interpretation

$$P (Y \leq y | x, u, r) = {cloglog}^{- 1} (α_{1} + α_{2} log (y) + x^{⊤} β + u^{⊤} r) .$$

A normal frailty Cox model

$$P (Y \leq y | x, u, r) = {cloglog}^{- 1} (h (y) + x^{⊤} β + u^{⊤} r)$$

replaces the log-linear transformation function of the Weibull model with a smooth log-cumulative hazard function $h (y)$. All three models are special cases of mixed-effects transformation models (tramME).

Assuming normal random effects $r$, none of these models can be understood in terms of model (2.2), and two main difficulties are associated with models that assume additivity of the fixed and random effects on the log-odds ratio or log-hazard ratio scales. First, unlike in (LMM), there is no analytic expression for the marginal distribution, and thus a marginal interpretation of the fixed effects $β$ is difficult. Second, evaluation of the likelihood typically relies on a Laplace approximation of the integral with respect to the random effects’ distribution, and problems with this approximation have been reported, for example by Ogden (2015). The novel multivariate transformation model for clustered observations based on (2.2) addresses both of these issues, as explained in the next subsections.
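
The first difficulty can be illustrated numerically. The Python sketch below integrates a conditional logistic model with a normal random intercept over the random effect using a simple midpoint rule (chosen here for transparency; it is not the approximation used by any of the packages discussed in this article). The parameter values are hypothetical. The resulting marginal log-odds ratio is attenuated relative to the conditional coefficient and has to be computed by numerical integration.

```python
import math
from statistics import NormalDist


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


def marginal_probability(lin_pred, gamma1, n_grid=4000, r_max=8.0):
    """Integrate expit(lin_pred + gamma1 * r) over r ~ N(0, 1) by the midpoint rule."""
    phi = NormalDist().pdf
    step = 2.0 * r_max / n_grid
    total = 0.0
    for k in range(n_grid):
        r = -r_max + (k + 0.5) * step
        total += expit(lin_pred + gamma1 * r) * phi(r) * step
    return total


# Hypothetical conditional model: intercept, coefficient of a binary covariate,
# and standard deviation of the random intercept.
alpha, beta, gamma1 = 0.0, 1.0, 2.0

conditional_lor = beta
marginal_lor = logit(marginal_probability(alpha + beta, gamma1)) - logit(
    marginal_probability(alpha, gamma1)
)
```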

2.4 Transformation models with marginal interpretation

Simple analytic expressions for the marginal distribution are available (also for $F \neq Φ$ ⁠), independent of the choice of the basis function a, noting that the variance of the jth element of $Z_{i}$ (2.3) is $u_{i j}^{⊤} Λ (γ) Λ {(γ)}^{⊤} u_{i j} + 1$ ⁠.

The Gaussian copula distribution of (2.2) directly implies the marginal distribution function in the form of a marginal transformation model (mtram):

$$\begin{aligned} P (Y \leq y | x, u) &= Φ \left( \frac{\sqrt{u^{⊤} Λ (γ) Λ {(γ)}^{⊤} u + 1} \; Φ^{- 1} \left( F \left( \frac{a {(y)}^{⊤} ϑ - x^{⊤} β}{\sqrt{u^{⊤} Λ (γ) Λ {(γ)}^{⊤} u + 1}} \right) \right)}{\sqrt{u^{⊤} Λ (γ) Λ {(γ)}^{⊤} u + 1}} \right) \\ &= F \left( \frac{a {(y)}^{⊤} ϑ - x^{⊤} β}{\sqrt{u^{⊤} Λ (γ) Λ {(γ)}^{⊤} u + 1}} \right) . \end{aligned}$$

(mtram)
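
The step from the latent normal variable to the marginal distribution can be traced in a few lines of Python. This sketch assumes a random intercept model with $F = {logit}^{- 1}$; the function names and numerical values are hypothetical. It computes one element of the latent variable (2.3), evaluates its normal distribution function with variance $γ_{1}^{2} + 1$, and compares the result to the closed-form expression (mtram).

```python
import math
from statistics import NormalDist


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def latent_z(h_y, lin_pred, scale, F):
    """One element of (2.3): scale * Phi^{-1}(F((h(y) - x'beta) / scale))."""
    return scale * NormalDist().inv_cdf(F((h_y - lin_pred) / scale))


def mtram_cdf(h_y, lin_pred, scale, F):
    """Closed-form marginal distribution function F((h(y) - x'beta) / scale)."""
    return F((h_y - lin_pred) / scale)


# Hypothetical values: gamma_1, transformed response h(y), and x'beta.
gamma1 = 1.2
scale = math.sqrt(gamma1 ** 2 + 1.0)  # sqrt(u' Lam Lam' u + 1) with u = 1
h_y, lin_pred = 0.4, -0.6

z = latent_z(h_y, lin_pred, scale, expit)
via_latent = NormalDist(0.0, scale).cdf(z)  # P(Z <= z) with Var(Z) = gamma_1^2 + 1
direct = mtram_cdf(h_y, lin_pred, scale, expit)
```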

In this model, the fixed effects $β$ divided by $\sqrt{u^{⊤} Λ (γ) Λ {(γ)}^{⊤} u + 1}$ are directly interpretable given $U = u$, for example as log-odds ratios ($F = {logit}^{- 1}$) or log-hazard ratios ($F = {cloglog}^{- 1}$). Because $Λ (γ) Λ {(γ)}^{⊤}$ is positive semidefinite, the denominator is at least one, so the marginal effects $β / \sqrt{u^{⊤} Λ (γ) Λ {(γ)}^{⊤} u + 1}$ from model (mtram) are never larger in absolute value than the fixed effects $β$ from formula (2.2). For repeated measurements with $u = 1$ we get a constant reduction by the factor $1 / \sqrt{γ_{1}^{2} + 1}$. In longitudinal models, the marginal effect at time $t$ is $β / \sqrt{γ_{1}^{2} + 2 γ_{1} γ_{2} t + (γ_{2}^{2} + γ_{3}^{2}) t^{2} + 1}$ because $u = {(1, t)}^{⊤}$. For positively correlated random intercepts and random slopes (i.e., $γ_{2} > 0$), the marginal effect shrinks in absolute value as $t \geq 0$ increases.
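
As a worked example of how the marginal effect changes over time, the following Python sketch evaluates $β / \sqrt{u^{⊤} Λ (γ) Λ {(γ)}^{⊤} u + 1}$ for $u = {(1, t)}^{⊤}$ at a few time points. The coefficient, the variance parameters, and the time grid are hypothetical values.

```python
import math


def marginal_scale(u, Lam):
    """sqrt(u' Lam Lam' u + 1), computed via the vector Lam' u."""
    lam_t_u = [
        sum(Lam[k][c] * u[k] for k in range(len(u))) for c in range(len(Lam[0]))
    ]
    return math.sqrt(sum(v * v for v in lam_t_u) + 1.0)


# Hypothetical fixed effect and variance parameters.
beta = 0.5
g1, g2, g3 = 1.0, 0.3, 0.4
Lam = [[g1, 0.0], [g2, g3]]

time_points = (0.0, 1.0, 2.0, 4.0)
marginal_effects = [beta / marginal_scale([1.0, t], Lam) for t in time_points]
```

With a positive $γ_{2}$, as assumed here, the list `marginal_effects` is decreasing in `t`.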

2.5 The likelihood function

For parameters $ϑ$, $β$, and $γ$, the log-likelihood contribution $ℓ_{i} (ϑ, β, γ)$ of the $i$th subject or cluster is based on the transformation

$$z (y | ϑ, β, γ) = D_{i} (γ) Φ_{N_{i}}^{- 1} \left( F_{N_{i}} \left\{ D_{i} {(γ)}^{- 1} [A (y) ϑ - X_{i} β] \right\} \right)$$

(2.5)

of some $y \in Ξ^{N_{i}}$.

For discrete or interval-censored observations $({\underline{y}}_{i}, {\bar{y}}_{i}] \subset R^{N_{i}}$, the log-likelihood contribution is

$$\begin{aligned} ℓ_{i} (ϑ, β, γ) &= log P ({\underline{y}}_{i} < Y_{i} \leq {\bar{y}}_{i}) = log P (z ({\underline{y}}_{i} | ϑ, β, γ) < Z_{i} \leq z ({\bar{y}}_{i} | ϑ, β, γ)) \\ &= log Φ_{0_{N_{i}}, Σ_{i} (γ)} (z ({\underline{y}}_{i} | ϑ, β, γ), z ({\bar{y}}_{i} | ϑ, β, γ)), \end{aligned}$$

(2.6)

where $Φ_{0_{N_{i}}, Σ_{i} (γ)} (\underline{z}, \bar{z}) = \int_{\underline{z}}^{\bar{z}} ϕ_{0_{N_{i}}, Σ_{i} (γ)} (z) \, d z$ is the integral over the $N_{i}$-dimensional multivariate normal density $ϕ_{0_{N_{i}}, Σ_{i} (γ)}$ with mean zero and covariance $Σ_{i} (γ)$. The structure of $Σ_{i} (γ)$ can be exploited to dramatically reduce the dimensionality of the integration problem. Applying the procedure by Marsaglia (1963), one can reduce this $N_{i}$-dimensional integral to an $R$-dimensional integral over the unit cube (see Appendix A).
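
The dimension reduction can be sketched for the random intercept case ($R = 1$). Because $Z_{i}$ has the same distribution as $γ_{1} r 1_{N_{i}} + ε_{i}$ with independent standard normal $r$ and $ε_{i}$, the $N_{i}$-dimensional interval probability equals a one-dimensional integral of a product of univariate normal probabilities. The Python sketch below integrates over the real line with a midpoint rule rather than over the unit cube as in Appendix A; this simplification, the function name, and the example values are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def interval_loglik_ri(lower_z, upper_z, gamma1, n_grid=2000, r_max=8.0):
    """log P(lower_z < Z <= upper_z) for Z ~ N(0, gamma1^2 * 11' + I).

    Conditional on the standard normal random intercept r, the elements of Z
    are independent N(gamma1 * r, 1), so only a 1-D integral over r remains.
    """
    std = NormalDist()
    step = 2.0 * r_max / n_grid
    total = 0.0
    for k in range(n_grid):
        r = -r_max + (k + 0.5) * step
        prod = 1.0
        for lo, up in zip(lower_z, upper_z):
            prod *= std.cdf(up - gamma1 * r) - std.cdf(lo - gamma1 * r)
        total += prod * std.pdf(r) * step
    return math.log(total)


# Binary probit example with a hypothetical intercept h(0) = alpha and x'beta = 0:
# Y = 0 corresponds to the interval (-inf, alpha], Y = 1 to (alpha, inf).
alpha = 0.5
y_cluster = [0, 0, 1]
lower = [-math.inf if y == 0 else alpha for y in y_cluster]
upper = [alpha if y == 0 else math.inf for y in y_cluster]
loglik_cluster = interval_loglik_ri(lower, upper, gamma1=1.3)
```

The cost of this evaluation grows linearly in the cluster size $N_{i}$, whereas a direct evaluation of the multivariate normal probability becomes harder as $N_{i}$ increases.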

For continuous observations $y \in R^{N_{i}}$, it is common practice (Lindsey, 1999, Section 5) to approximate this log-likelihood by a log-density evaluated at the observations $y_{i}$:

$$\begin{aligned} ℓ_{i} (ϑ, β, γ) \approx {} & - \frac{1}{2} log | Σ_{i} (γ) | \\ & - \frac{1}{2} z {(y_{i} | ϑ, β, γ)}^{⊤} \left( Σ_{i} {(γ)}^{- 1} - D_{i} {(γ)}^{- 2} \right) z (y_{i} | ϑ, β, γ) \\ & + {log}_{N_{i}} {\left( f_{N_{i}} \left\{ D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β] \right\} \right)}^{⊤} 1_{N_{i}} \\ & + {log}_{N_{i}} {(A' (y_{i}) ϑ)}^{⊤} 1_{N_{i}}, \end{aligned}$$

(2.7)

where the Cholesky factorization $L_{i} (γ) L_{i} {(γ)}^{⊤} = Σ_{i} (γ)$ is utilized. It should be noted that the exact log-likelihood function (2.6) does not require the precision matrix $Σ_{i} {(γ)}^{- 1}$ to be computed. In the above approximation, ${log}_{N_{i}}$ is the element-wise natural logarithm and $f_{N_{i}}$ the element-wise density of $F$. $A' (y_{i})$ denotes the matrix of evaluated derivatives $a'$ of the basis function $a$. The log-likelihood (2.7) is derived in Appendix B.
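
For a random intercept model, the log-density (2.7) can be evaluated without any matrix factorization, because $Σ_{i} (γ) = γ_{1}^{2} 1 1^{⊤} + I_{N_{i}}$ has determinant $1 + N_{i} γ_{1}^{2}$ and inverse $I_{N_{i}} - γ_{1}^{2} / (1 + N_{i} γ_{1}^{2}) \, 1 1^{⊤}$. The Python sketch below uses these closed forms instead of the Cholesky factorization; this shortcut and the function signature are assumptions made for illustration and are not taken from the authors' implementation.

```python
import math
from statistics import NormalDist


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_pdf(z):
    p = expit(z)
    return p * (1.0 - p)


def loglik_continuous_ri(h_y, h_prime_y, lin_pred, gamma1, F, f):
    """Log-density (2.7) for one cluster under a random intercept model.

    h_y, h_prime_y, lin_pred: lists with h(y_ij), h'(y_ij), and x_ij' beta.
    F, f: distribution function and density of the chosen inverse link.
    """
    n = len(h_y)
    d = math.sqrt(gamma1 ** 2 + 1.0)  # common diagonal element of D
    std = NormalDist()
    arg = [(h - lp) / d for h, lp in zip(h_y, lin_pred)]
    z = [d * std.inv_cdf(F(a)) for a in arg]  # transformation (2.5)

    det = 1.0 + n * gamma1 ** 2  # |Sigma|
    sum_z = sum(z)
    sum_z2 = sum(v * v for v in z)
    quad_sigma = sum_z2 - gamma1 ** 2 / det * sum_z ** 2  # z' Sigma^{-1} z
    quad_d = sum_z2 / d ** 2  # z' D^{-2} z

    return (
        -0.5 * math.log(det)
        - 0.5 * (quad_sigma - quad_d)
        + sum(math.log(f(a)) for a in arg)
        + sum(math.log(hp) for hp in h_prime_y)
    )
```

For $F = Φ$ and a linear transformation, the function reproduces the multivariate normal log-density, and for a single observation it reduces to the logarithm of the marginal density implied by (mtram).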

Using either log-likelihood, we obtain simultaneous maximum-likelihood estimates for all model parameters from

$$({\hat{ϑ}}_{N}, {\hat{β}}_{N}, {\hat{γ}}_{N}) = \underset{(ϑ, β, γ) \in R^{P + Q + M}}{argmax} \sum_{i = 1}^{N} ℓ_{i} (ϑ, β, γ) ,$$

where $M = R (R + 1) / 2$ is the number of variance parameters in $γ$.

Some models require additional constraints on $ϑ$ to be implemented (Hothorn and others, 2018). Analytic score functions of the continuous log-likelihood (2.7) for all model parameters $ϑ$, $β$, and $γ$ are available (see Appendix B). Score functions for the discrete or censored likelihood (2.6) and the observed Fisher information matrices for both likelihoods are obtained numerically. The full parameterization of $h$ allows application of standard results for likelihood asymptotics (van der Vaart, 1998) to independent observations (Hothorn and others, 2018). Model (2.2) is a special case of the multivariate transformation model of Klein and others (2022) where the transformation function $h$ and the fixed effects $β$ are constrained to be the same for all “coordinates” of the random vector $Y_{i}$ (i.e., observations in the same cluster). Therefore, model (2.2) benefits from the same asymptotic results reported by Klein and others (2022).

3 Applications

In this section, we discuss four potential applications of marginally interpretable transformation models. Data, numerical details, and code reproducing the results are available from the Online Appendix (Barbanti and Hothorn, 2022). We start with two head-to-head comparisons in which model (mtram) can also be estimated by existing software implementations of mixed-effects probit models. These comparisons serve to validate the implementation of model (mtram) in the add-on package tram (Hothorn and others, 2022) for the R system for statistical computing.

3.1 Non-normal mixed-effects models

The average reaction times to a specific task over several days of sleep deprivation are given for $i = 1, \dots, N = 18$ subjects (Belenky and others, 2003). The data are often used to illustrate LMMs with correlated random intercepts and slopes of the form (LMM)

$$P (\text{Reaction time} \leq y | \text{day}, i) = Φ \left( \frac{y - α - β \, \text{day} - α_{i} - β_{i} \, \text{day}}{σ} \right), \quad (α_{i}, β_{i}) \sim N_{2} (0, G (γ)) .$$

(3.8)

This conditional normal model can be estimated by maximizing the corresponding normal log-likelihood and distinct implementations of classical normal linear mixed models (LMM, package lme4, Bates and others, 2015), conditional mixed-effects transformation models (tramME, package tramME, Tamási and Hothorn, 2021), and marginal transformation models (mtram, package tram, Hothorn and others, 2022) provide identical results (in-sample log-likelihood –875.97).

Because the reaction times can hardly be expected to follow a symmetric distribution, we consider the non-normal conditional and marginal transformation model

$$P (\text{Reaction time} \leq y | \text{day}, i) = Φ (h (y) - β \, \text{day} - α_{i} - β_{i} \, \text{day}), \quad (α_{i}, β_{i}) \sim N_{2} (0, G (γ)),$$

(3.9)

where a monotonically increasing transformation function h(y) is allowed to deviate from linearity. Such probit-type mixed-effects models have been studied before, for example, by merging a Box–Cox power transformation h with a grid-search over REML estimates (Gurka and others, 2006), a conditional likelihood (Hutmacher and others, 2011), or a grid-search maximizing the profile likelihood (Maruo and others, 2017). Tang and others (2018) and Wu and Wang (2019) proposed a monotone spline parameterization of h in a Bayesian context.

We parameterize $h (y) = a {(y)}^{⊤} ϑ$ in terms of a monotonically increasing polynomial in Bernstein form of order six (Hothorn and others, 2018). The conditional transformation model (Tamási and others, 2022) can be estimated by maximizing a Laplace approximation to the log-likelihood (Tamási and Hothorn, 2021) simultaneously with respect to all parameters $ϑ$, $β$, and $γ$. Direct optimization of the log-likelihood (2.7) for (mtram) leads to identical results (log-likelihood –859.55), because the conditional and marginal models are identical for $F = Φ$ and the Laplace approximation is very accurate in this case. For $F \neq Φ$, conditional and marginal transformation models differ, and numerical integration with respect to the normal random effects is required when marginal distributions are to be obtained from a conditional model. In contrast, model (mtram) provides a closed-form expression for marginal distributions for all choices of $F$. With $F = {logit}^{- 1}$, the log-likelihood of the marginal model is slightly lower (–860.64).
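
A polynomial in Bernstein form of order six has seven basis functions, and nondecreasing coefficients are sufficient for a monotonically nondecreasing $h$. The Python sketch below evaluates such a basis on an interval. The interval bounds and the coefficient vector are hypothetical, and the parameterization and constraints used in package tram may differ in detail.

```python
from math import comb


def bernstein_basis(y, lower, upper, order=6):
    """Evaluate the order + 1 Bernstein basis functions a(y) on [lower, upper]."""
    t = (y - lower) / (upper - lower)
    return [comb(order, k) * t ** k * (1.0 - t) ** (order - k) for k in range(order + 1)]


def h(y, theta, lower, upper):
    """Transformation h(y) = a(y)' theta."""
    basis = bernstein_basis(y, lower, upper, order=len(theta) - 1)
    return sum(b * th for b, th in zip(basis, theta))


# Hypothetical support (in ms) and nondecreasing coefficients, which guarantee
# a monotonically nondecreasing transformation function.
lower, upper = 200.0, 500.0
theta = [-3.0, -2.0, -0.5, 0.0, 0.8, 2.0, 3.5]

grid = [lower + k * (upper - lower) / 20 for k in range(21)]
h_values = [h(y, theta, lower, upper) for y in grid]
```

At the boundaries the transformation equals the first and last coefficient, which makes the role of the coefficients as control points of $h$ visible.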

The daily marginal distribution functions of normal and non-normal models are compared to the daily marginal empirical cumulative distributions in Figure 2. Especially for short reaction times early in the experiment, the non-normal transformation models seem to fit the data better than the normal linear model. Between the probit transformation model and the logistic marginal transformation model, only minor discrepancies can be observed.

Fig. 2

Sleep deprivation. Marginal distribution of reaction times, separately for each day of study participation. The grey step-function corresponds to the empirical cumulative distribution function, the blue line to the marginal cumulative distribution of the normal LMM (3.8), estimated by the lmer function from package lme4 (Bates and others, 2015), the solid yellowish line to the probit transformation model (3.9), and the dotted yellowish line to the logistic marginal transformation model.

3.2 Binary marginal models

For a binary response $Y \in \{0, 1\}$, the transformation function reduces to a scalar intercept $h (0) = α$. Thus, maximization of the discrete log-likelihood (2.6) provides an alternative to commonly applied approximations, such as the Laplace approximation or adaptive Gauss–Hermite quadrature (AGQ), for fitting conditional probit mixed-effects models. In addition, the possibility of interpreting parameters marginally, also for $F \neq Φ$, calls for a comparison to generalized estimating equations (GEEs).

We first compared different implementations of binary probit mixed-effects models for the notoriously difficult to handle toe nail data (Backer and others, 1998) for which quasi-separation issues have been reported (Sauter and Held, 2016). The ordinal response measuring toe nail infection was categorized to two levels. We were interested in binary probit models featuring fixed main and interaction effects β₁, β₂, and β₃ of treatment (itraconazole vs. terbinafine) and time. Subject-specific random intercept models and models featuring correlated random intercepts and slopes were estimated by the glmer function from package lme4 (Bates and others, 2015), by the glmmTMB function from package glmmTMB (Brooks and others, 2017), and by direct maximization of the exact discrete log-likelihood (2.6) given in Appendix A.

The estimated model parameters, along with the discrete log-likelihood (2.6) evaluated at these parameters, are given in Table 2. For the random intercept models, AGQ, the Laplace approximation in glmmTMB, and the discrete log-likelihood gave very similar results, whereas the Laplace approximation implemented in package lme4 seemed to fail. It was not possible to apply the AGQ approach to the random intercept/random slope model. The two implementations of the Laplace approximation in packages lme4 and glmmTMB differed for the random intercept model but not for the random intercept/random slope model. Direct maximization of (2.6) resulted in the best-fitting model with the least extreme parameter estimates. Computing times for all procedures were comparable.

Table 2

Toe nail data. Binary probit models featuring fixed intercepts α, treatment effects β₁, time effects β₂, and time-treatment interactions β₃ are compared. Random intercept (RI) and random intercept/random slope (RI + RS) models were estimated by the Laplace (L) and Adaptive Gauss–Hermite Quadrature (AGQ) approximations to the likelihood (implemented in packages lme4 and glmmTMB). In addition, the exact discrete log-likelihood (2.6) was used for model fitting and evaluation (the in-sample log-likelihood (2.6) for all models and timings of all procedures are given in the last two lines)

	RI				RI + RS
	glmer	glmer	glmmTMB		glmer	glmmTMB
	L	AGQ	L	(2.6)	L	L	(2.6)
α	–3.39	–0.91	–1.10	0.91	–4.30	–4.30	1.58
β₁	–0.03	–0.11	–0.17	–0.11	0.05	0.05	0.27
β₂	–0.22	–0.19	–0.19	–0.19	–0.07	–0.07	–0.53
β₃	–0.07	–0.06	–0.06	–0.06	–0.23	–0.23	–0.18
γ₁	4.57	2.12	2.10	2.11	10.88	11.01	5.22
γ₂	0.00	0.00	0.00	0.00	–1.64	–1.68	–0.37
γ₃	0.00	0.00	0.00	0.00	0.79	0.83	0.53
LogLik	–675.22	–637.34	–638.54	–637.34	–628.12	–630.65	–545.12
Time (s)	3.83	2.40	2.04	2.20	7.53	3.44	8.08


In a second step, (mtram) with logit link was compared to marginal odds ratios obtained from a GEE. We refitted published GEE models for these data (SAS results in Chapter 10, Molenberghs and Verbeke, 2005) and noticed substantial differences indicating numerical instabilities for this data set (see Online Appendix, Barbanti and Hothorn, 2022). The monthly multiplicative treatment effect on the odds ratio scale was 0.91 (95% confidence interval 0.83–1.00) when a logistic GEE with unstructured working correlation was estimated. The logistic transformation model estimated the same parameter as 0.94 (95% confidence interval 0.89–0.99). Molenberghs and Verbeke (2005, p. 211) reported a GEE-based marginal odds ratio of 0.89 (95% confidence interval 0.81–0.98, with model-based standard errors and exp-transformed Wald intervals). The performance of GEEs and marginal transformation models is compared against ground truth in a simulation experiment in Section 4.

3.3 Models for bounded responses

Chow and others (2006) report on a randomized two-arm clinical trial comparing a novel neck pain treatment to placebo. Neck pain levels of 90 subjects were assessed on a visual analog scale at baseline, after 7 weeks, and after 12 weeks (complete trajectories are available for 84 subjects). Manuguerra and Heller (2010) proposed a mixed-effects model for such a bounded response. The fixed effects are interpretable as log-odds ratios, conditional on random effects. The data are presented in the top panel of Figure 3. A transformation model (mtram) with $F = {logit}^{- 1}$, featuring a transformation function $h (y) = a {(y)}^{⊤} ϑ$ defined by a polynomial in Bernstein form of order six on the unit interval and correlated random intercept and random slope terms ($u = (1, t)$ for times $t = 0, 7, 12$ weeks), is visualized by means of the corresponding marginal distribution functions in the bottom panel of Figure 3. Similar to the results reported earlier (Manuguerra and Heller, 2010), the model highlights more severe pain in the active treatment group at baseline. A positive treatment effect can be inferred after 7 weeks, which seemed to level off when subjects were examined after 12 weeks. It is important to note that these results have a marginal interpretation and that the model does not assume a specific distribution of the response, such as a Beta distribution.

Fig. 3

Neck pain. Pain trajectories of 90 subjects under active treatment or placebo evaluated at baseline, after 7 and 12 weeks (top) and marginal distribution functions of neck pain at the three different time points (bottom). These results were obtained from model (mtram) using $F = {logit}^{- 1}$ and a polynomial in Bernstein form h(y) on the unit interval.

From the marginally interpretable transformation models, relevant quantities, like the probabilistic index, can be derived (Online Appendix, Barbanti and Hothorn, 2022). In this application, the marginal probabilistic index is the probability that, for a randomly selected patient in the treatment group, the neck pain score at time t is higher than the score for a subject in the placebo group randomly selected at the same time point. We obtain a probability of 0.72 (95% confidence interval [0.58–0.83]) at baseline, 0.29 (95% confidence interval [0.17–0.43]) after 7 weeks, and 0.38 (95% confidence interval [0.24–0.54]) after 12 weeks.
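
For $F = {logit}^{- 1}$, a probabilistic index has a closed form in terms of a marginal log-odds ratio. Assuming that two groups share the same transformation function and marginal scale and differ only by a shift $δ$ on the logistic scale, the probability that an observation from the shifted group exceeds one from the reference group is $θ (θ - 1 - log θ) / {(θ - 1)}^{2}$ with $θ = exp (δ)$. The Python sketch below implements this formula and checks it against a numerical integral; it is a simplified illustration under these assumptions, and the computation in the Online Appendix may differ.

```python
import math


def probabilistic_index(delta):
    """P(L1 + delta > L2) for independent standard logistic L1 and L2."""
    if abs(delta) < 1e-4:
        # First-order expansion around delta = 0 avoids numerical cancellation.
        return 0.5 + delta / 6.0
    theta = math.exp(delta)
    return theta * (theta - 1.0 - delta) / (theta - 1.0) ** 2


def probabilistic_index_numeric(delta, n_grid=20000):
    """Midpoint-rule evaluation of the same probability on the uniform scale.

    With u = F(x), the integrand F(x + delta) becomes theta*u / (1 + (theta-1)*u).
    """
    theta = math.exp(delta)
    step = 1.0 / n_grid
    total = 0.0
    for k in range(n_grid):
        u = (k + 0.5) * step
        total += theta * u / (1.0 + (theta - 1.0) * u) * step
    return total


# Hypothetical marginal log-odds ratios.
pi_values = {delta: probabilistic_index(delta) for delta in (-1.0, 0.0, 0.5, 2.0)}
```

A shift of zero gives an index of 0.5, and the indices for $δ$ and $- δ$ add up to one.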

3.4 Marginally interpretable survival models

The CAO/ARO/AIO-04 randomized clinical trial (Rödel and others, 2015) compared Oxaliplatin added to fluorouracil-based preoperative chemoradiotherapy and postoperative chemotherapy for rectal cancer patients to the same therapy using fluorouracil only. Patients were randomized to the two treatment arms by block randomization taking the study center, the lymph node involvement (negative vs. positive), and tumor grading (T1–3 vs. T4) into account. The primary endpoint was disease-free survival, defined as the time between randomization and nonradical surgery of the primary tumor (R2 resection), locoregional recurrence after R0/1 resection, metastatic disease or progression, or death from any cause, whichever occurred first. The observed responses are a mix of exact dates (time to death or incomplete removal of the primary tumor), right-censoring (end of follow-up or drop-out), and interval-censoring (local or distant metastases). A conditional hazard ratio of 0.79 (0.64–0.98), obtained from a Cox mixed-effects model with normal random intercepts and without stratification fitted to right-censored survival times, was reported by Rödel and others (2015). This means that a rectal cancer patient treated with the novel combination therapy benefits from a 21% risk reduction compared to a patient from the same block treated with fluorouracil only.

We were interested in estimating a marginally interpretable treatment effect (acknowledging the fact that patients enrolled into the trial were not a random sample from all rectal cancer patients) based on a marginally interpretable stratified (with respect to lymph node involvement and tumor grading) Weibull model for clustered observations (blocks) in the presence of interval-censored survival times. This model can be formulated as (mtram) by choosing $F = {cloglog}^{- 1}$, $a (y) = {(1, log (y))}^{⊤}$, and $u = 1$ (the block indicator), with a variance parameter γ₁ (corresponding to the correlation structure of a random intercept only model) and a treatment parameter β (comparing the novel combination therapy to fluorouracil only). Stratification was implemented by strata-specific parameters $ϑ$ for each of the four strata. It should be noted that this model is not equivalent to a classical Weibull normal frailty model.

A confidence interval for the marginal hazard ratio $exp (β / \sqrt{γ_{1}^{2} + 1})$ was computed by simulating from the joint normal distribution of $(\hat{β}, {\hat{γ}}_{1})$. With a relatively small ${\hat{γ}}_{1} = 0.15$ (with standard error 0.13), this resulted in a marginal hazard ratio of 0.80 (95% confidence interval $[0.65; 0.98]$), meaning that rectal cancer patients treated with the combination therapy benefit from a 20% risk reduction on average.
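The simulation-based interval described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the percentile rule, and the numerical inputs for $\hat{β}$ and the covariance matrix are placeholders chosen for this example (only ${\hat{γ}}_{1} = 0.15$ and its standard error 0.13 come from the text).

```python
import math
import random


def marginal_hr_ci(beta_hat, gamma_hat, cov, n_draws=10_000, level=0.95, seed=1):
    """Percentile interval for exp(beta / sqrt(gamma^2 + 1)).

    cov is the 2x2 covariance matrix of (beta_hat, gamma_hat), given as
    ((var_beta, cov_bg), (cov_bg, var_gamma)).
    """
    rng = random.Random(seed)
    (v_b, c_bg), (_, v_g) = cov
    # Lower Cholesky factor of the 2x2 covariance matrix.
    l11 = math.sqrt(v_b)
    l21 = c_bg / l11 if l11 > 0 else 0.0
    l22 = math.sqrt(max(v_g - l21 ** 2, 0.0))
    draws = []
    for _ in range(n_draws):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        beta = beta_hat + l11 * e1
        gamma = gamma_hat + l21 * e1 + l22 * e2
        draws.append(math.exp(beta / math.sqrt(gamma ** 2 + 1.0)))
    draws.sort()
    alpha = 1.0 - level
    lo = draws[int(math.floor(alpha / 2 * (n_draws - 1)))]
    hi = draws[int(math.ceil((1 - alpha / 2) * (n_draws - 1)))]
    point = math.exp(beta_hat / math.sqrt(gamma_hat ** 2 + 1.0))
    return point, lo, hi


# Placeholder inputs, NOT the fitted values from the trial analysis:
# only gamma_hat = 0.15 and its standard error 0.13 are taken from the text.
point, lo, hi = marginal_hr_ci(
    beta_hat=-0.23, gamma_hat=0.15,
    cov=((0.105 ** 2, 0.0), (0.0, 0.13 ** 2)),
)
print(round(point, 2), round(lo, 2), round(hi, 2))
```

Because the transformation of $(β, γ_{1})$ is nonlinear, sampling from the joint normal distribution avoids a separate delta-method calculation; the off-diagonal covariance entry matters in practice and is set to zero here only for lack of a reported value.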

By relaxing the Weibull assumption (log-linear transformation h) to a Cox proportional hazards model (nonlinear transformation h), we obtain a hazard ratio of 0.78 (95% confidence interval [0.64–0.96]) and a marginal probabilistic index of 0.56 (95% confidence interval [0.51–0.61]), meaning that over all study centers, a randomly selected patient receiving Oxaliplatin has a 56% probability of staying disease-free longer than a randomly selected patient receiving the standard treatment only, given that they both have the same lymph node involvement and tumor grading.
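The reported probabilistic index is consistent with the standard relationship between a hazard ratio and the probability of a longer event-free time. Under proportional hazards with continuous event times, $S_{1} (t) = S_{0} {(t)}^{HR}$, so $P (T_{1} > T_{0}) = \int_{0}^{1} u^{HR} d u = 1 / (1 + HR)$. The sketch below assumes this relationship applies to the hazard ratio quoted above; the function name is introduced here for illustration only.

```python
def probabilistic_index_from_hr(hazard_ratio):
    """P(T_treated > T_control) under proportional hazards without ties."""
    if hazard_ratio <= 0:
        raise ValueError("hazard ratio must be positive")
    return 1.0 / (1.0 + hazard_ratio)


# Point estimate and the two confidence limits quoted in the text.
for hr in (0.78, 0.64, 0.96):
    print(hr, round(probabilistic_index_from_hr(hr), 2))
# prints 0.56, 0.61 and 0.51, respectively
```

Note that the mapping is decreasing, so the lower limit of the hazard ratio corresponds to the upper limit of the probabilistic index.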

4 Empirical evaluation

Practitioners interested in inference for marginal effects will likely apply some form of GEE estimation when analyzing a binary response, or might integrate over random effects in a conditional mixed-effects model for more complex response distributions. In this section, we assess the quality of likelihood-based marginal transformation inference (model mtram) in comparison to GEEs for binary responses and to mixed-effects models for continuous responses.

4.1 Data generating process

We simulated N = 100 clusters of five repeated measurements ($N_{i} = 5$ and $U_{i} = {(1, 1, 1, 1, 1)}^{⊤}$) from a logistic model (2.2) with $F = {logit}^{- 1}$ and transformation function $h = \sqrt{1 + γ_{1}^{2}} \cdot {logit} \circ χ_{9}^{2}$, where $χ_{9}^{2}$ denotes the distribution function of a chi-squared distribution with nine degrees of freedom. The dependencies between repeated measurements in each cluster are described by $Σ_{i} = {(γ_{1}^{2})}_{5 \times 5} + I_{5}$. We are interested in inference for the marginal effects $μ := {(1 + γ_{1}^{2})}^{- 1 / 2} β$ for various values of $γ_{1} \in \{0, 0.5, 1, 1.5, 2, 3\}$. We simulated three uniform covariates X and defined $β = {(β_{1}, β_{2}, β_{3})}^{⊤} = {(0, 1, 2)}^{⊤}$. The baseline distribution (with $x = {(0, 0, 0)}^{⊤}$) induces the same marginal $χ_{9}^{2}$ laws for all five components with bivariate densities as depicted in Figure 1.
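One way to read this data-generating process is sketched below, using the joint distribution given in Appendix B: a latent vector $Z_{i} \sim N (0, Σ_{i})$ is scaled to unit variance, mapped to the uniform scale, and shifted by $x^{⊤} μ$ on the logit scale. This is an illustrative sketch rather than the authors' simulation code; function and variable names are made up here, and the final step $Y = {(χ_{9}^{2})}^{- 1} ({logit}^{- 1} (w))$ is omitted because it needs a chi-squared quantile function that the Python standard library does not provide.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def logit(p):
    return math.log(p / (1.0 - p))


def simulate_clusters(n_clusters=100, cluster_size=5, gamma1=1.0,
                      beta=(0.0, 1.0, 2.0), seed=1):
    """Return rows (cluster id, x, w) with w = logit(F_chisq9(y))."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 + gamma1 ** 2)        # sqrt of the diagonal of Sigma_i
    mu = [b / scale for b in beta]              # marginal effects
    rows = []
    for i in range(n_clusters):
        shared = rng.gauss(0.0, 1.0)            # term shared within the cluster
        for _ in range(cluster_size):
            # Z has variance gamma1^2 + 1 and within-cluster covariance gamma1^2.
            z = gamma1 * shared + rng.gauss(0.0, 1.0)
            u = STD_NORMAL.cdf(z / scale)       # marginally uniform on (0, 1)
            u = min(max(u, 1e-15), 1.0 - 1e-15)  # guard against logit(0) or logit(1)
            x = [rng.random() for _ in beta]    # uniform covariates
            w = logit(u) + sum(m * xj for m, xj in zip(mu, x))
            rows.append((i, x, w))
    return rows


rows = simulate_clusters(gamma1=1.0)
print(len(rows))
```

Working on the scale of $w$ is sufficient for checking marginal effects, since $w$ is a monotone transformation of the response and the marginal model is $logit (P (Y \leq y | x)) = logit (χ_{9}^{2} (y)) - x^{⊤} μ$.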

We report the mean-squared errors (MSEs) along with mean widths and coverages of 95% confidence intervals for $μ_{p}, p = 1, 2, 3$ based on 10 000 simulation iterations in Table 3.

Table 3

Simulations. MSE, widths, and coverages of 95% confidence intervals for three marginal effects. For dichotomized binary responses, results obtained from GEEs can be directly compared to results from marginal transformation models (first two blocks). The last block reports results of marginal transformation models fitted to continuous responses

Method	Measure	Effect	$γ_{1} = 0$	$γ_{1} = 0.5$	$γ_{1} = 1$	$γ_{1} = 1.5$	$γ_{1} = 2$	$γ_{1} = 3$
GEE (exchangeable)	MSE	μ₁	0.109	0.104	0.087	0.070	0.061	0.050
		μ₂	0.111	0.106	0.091	0.076	0.067	0.062
		μ₃	0.120	0.114	0.101	0.093	0.091	0.094
	CI width	μ₁	1.273	1.247	1.141	1.035	0.958	0.868
		μ₂	1.284	1.259	1.161	1.072	1.011	0.950
		μ₃	1.317	1.295	1.219	1.169	1.153	1.165
	Coverage	μ₁	0.948	0.947	0.947	0.950	0.948	0.947
		μ₂	0.944	0.946	0.945	0.947	0.950	0.945
		μ₃	0.942	0.944	0.947	0.945	0.942	0.941
mtram (binary)	MSE	μ₁	0.109	0.104	0.086	0.068	0.057	0.044
		μ₂	0.110	0.106	0.091	0.074	0.064	0.055
		μ₃	0.119	0.114	0.100	0.091	0.087	0.088
	CI width	μ₁	1.251	1.254	1.150	1.042	0.958	0.847
		μ₂	1.276	1.268	1.172	1.079	1.014	0.942
		μ₃	1.343	1.303	1.230	1.178	1.162	1.184
	Coverage	μ₁	0.953	0.951	0.949	0.953	0.953	0.952
		μ₂	0.948	0.950	0.949	0.951	0.954	0.953
		μ₃	0.947	0.948	0.950	0.951	0.953	0.955
mtram (continuous)	MSE	μ₁	0.074	0.067	0.045	0.029	0.019	0.009
		μ₂	0.079	0.070	0.048	0.033	0.024	0.015
		μ₃	0.082	0.075	0.056	0.046	0.038	0.033
	CI width	μ₁	1.040	1.005	0.827	0.659	0.535	0.382
		μ₂	1.061	1.020	0.853	0.705	0.602	0.485
		μ₃	1.119	1.059	0.926	0.826	0.766	0.710
	Coverage	μ₁	0.949	0.949	0.948	0.945	0.947	0.948
		μ₂	0.945	0.948	0.949	0.948	0.947	0.951
		μ₃	0.949	0.947	0.947	0.945	0.951	0.950

4.2 Binary responses

Binary responses were generated by dichotomization of the continuous response at the overall median. We fitted logistic GEEs with exchangeable working correlation structure and computed estimates and confidence intervals for all three marginal parameters $μ_{p}, p = 1, 2, 3$. Results are shown in the first block of Table 3. In addition, marginal transformation models were fitted to these binary responses. Joint maximum-likelihood estimates of γ₁ and $β$ were computed, from which we derived estimates and confidence intervals for the marginal effects $μ_{p}, p = 1, 2, 3$. We drew 10 000 samples from the asymptotic joint normal distribution of γ₁ and $β$ to derive confidence intervals for $μ_{p}, p = 1, 2, 3$ in each simulation iteration. These results, shown in the second block of Table 3, are practically equivalent to the results reported for GEEs. For μ₂ and μ₃, the coverage of confidence intervals computed from model (2.2) was slightly closer to the nominal 95% level.
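A short derivation, sketched here under the sign convention of the logit row of Table 1, indicates why both procedures target the same parameters. The cut point c and the indicator B are notation introduced only for this illustration. The marginal distribution function implied by the data-generating process of Section 4.1 is

$P (Y \leq y | x) = {logit}^{- 1} ({logit} (χ_{9}^{2} (y)) - x^{⊤} μ)$

so that, for the dichotomized response $B = 1 (Y > c)$,

${logit} (P (B = 1 | x)) = - {logit} (χ_{9}^{2} (c)) + x^{⊤} μ .$

The cut point only enters the intercept, and the slope of the marginal logistic model for B is the vector of marginal effects $μ$ that the GEE estimates directly.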

4.3 Continuous responses

Marginal transformation models fitted to data on the original scale, that is, without dichotomization of the response, performed better in terms of smaller MSEs and confidence interval widths (third block in Table 3). The coverage remained close to the nominal level.

In addition, we compared mtrams for continuous responses to two mixed-effects models: a normal (LMM) and a conditional logistic mixed-effects transformation model (tramME, Tamási and Hothorn, 2021). Unlike GEEs, these two additional competitors are misspecified and one has to integrate over normal random effects to obtain a marginal distribution given a specific configuration of x. For the normal LMM, the marginal distribution is again normal. Numerical integration was used to obtain marginal distributions from the tramME model.
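The numerical integration step for a conditional logistic model with a normal random intercept can be sketched as follows. This is an illustration of the general idea, not the tramME implementation: `eta` stands for $h (y) - x^{⊤} β$ on the conditional scale and `tau` for the random-intercept standard deviation, both hypothetical inputs, and a plain trapezoidal rule is used in place of whatever quadrature the package applies.

```python
import math


def expit(t):
    return 1.0 / (1.0 + math.exp(-t))


def marginal_cdf(eta, tau, n_nodes=2001, half_width=8.0):
    """Integrate expit(eta - tau * b) against the N(0, 1) density of b."""
    step = 2.0 * half_width / (n_nodes - 1)
    total = 0.0
    for k in range(n_nodes):
        b = -half_width + k * step
        weight = 0.5 if k in (0, n_nodes - 1) else 1.0   # trapezoidal rule
        density = math.exp(-0.5 * b * b) / math.sqrt(2.0 * math.pi)
        total += weight * expit(eta - tau * b) * density
    return total * step


# The marginal probability is pulled towards 0.5 relative to the
# conditional one, so the marginal model is no longer exactly logistic.
print(expit(1.0), marginal_cdf(1.0, tau=2.0))
```

For `tau = 0` the function returns the conditional probability itself, which mirrors the remark that results coincide under independence.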

For model (2.2), a conditional logistic mixed-effects transformation model with the same model complexity in terms of parameters for the transformation function and for the shift parameters, and a normal LMM, we derived the marginal distribution conditional on $x = {(0.5, 0.5, 0.5)}^{⊤}$ for 100 simulation iterations and present the difference $F (y | x) - \hat{F} (y | x)$ of the true and estimated marginal distribution functions for all three procedures in Figure 4. The normal linear mixed-effects model (LMM) led to biased marginal distributions, simply because the model is not able to adapt to the skewness of the marginal distributions. The results for the marginal (mtram) and conditional (tramME) transformation models were surprisingly similar, especially for smaller values of γ₁. For $γ_{1} = 0$, and thus independent measurements, results are expected to be identical. For $γ_{1} = 3$, and thus very large correlations among the five repeated measurements, the estimated marginal distribution functions obtained from tramME seemed to be slightly more biased than the marginal distribution functions obtained from the mtram.

Fig. 4

Difference between the true and estimated marginal distribution functions for a normal LMM (LMM), a conditional logistic mixed-effects transformation model (tramME) and a marginal transformation model (mtram). For mixed-effects models, the marginal distribution function was computed by integrating out the random effects (analytically for LMM and numerically for tramME).

This impression is also supported by Figure 5, where the integrated MSE of the difference in distributions $\int_{- \infty}^{\infty} {(F (y | x) - \hat{F} (y | x))}^{2} \cdot f (y | x) d y$ is presented for the conditional logistic mixed-effects transformation model (tramME) and the mtram (2.2). For $γ_{1} < 2$, the two procedures performed very similarly; for larger correlations, the misspecified tramME model exhibited slightly larger discrepancies between the true and estimated marginal distribution functions. Of course, it is not possible to derive marginal effects and corresponding confidence intervals from such numerically obtained marginal distributions.
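The integrated MSE can be approximated numerically. The sketch below is an assumption-laden illustration, not the code behind Figure 5: it substitutes $u = F (y | x)$, which turns the integral into $\int_{0}^{1} {(u - \hat{F} (F^{- 1} (u)))}^{2} d u$, and applies a midpoint rule. The function names and the logistic example distributions are chosen here for demonstration only.

```python
import math


def integrated_mse(true_cdf, true_quantile, est_cdf, n_nodes=10_000):
    """Midpoint-rule approximation of int (F - F_hat)^2 f dy on the u = F(y) scale."""
    total = 0.0
    for k in range(n_nodes):
        u = (k + 0.5) / n_nodes
        y = true_quantile(u)
        total += (true_cdf(y) - est_cdf(y)) ** 2
    return total / n_nodes


def logistic_cdf(y, shift=0.0):
    return 1.0 / (1.0 + math.exp(-(y - shift)))


def logistic_quantile(u):
    return math.log(u / (1.0 - u))


# Hypothetical example: the "estimate" is the true distribution shifted by 0.2.
print(integrated_mse(logistic_cdf, logistic_quantile,
                     lambda y: logistic_cdf(y, shift=0.2)))
```

Weighting by the true density means that discrepancies in regions where observations are rare contribute little to the criterion.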

Fig. 5

Integrated MSE between the true marginal distribution function and the estimated marginal distribution function for a conditional logistic mixed-effects transformation model (tramME) and a marginal logistic transformation model.

5 Discussion

There is a difference between a marginal and a marginally interpretable model. A marginal model, for example defined by generalized estimating equations (Zeger and others, 1988), does not specify the joint distribution. A marginally interpretable model is a model for the joint or conditional (given random effects) distribution from which one can infer the marginal distribution (Lee and Nelder, 2004). The models proposed here follow the latter approach with the important distinctive feature that very simple expressions for the marginal distribution function are available. Thus, there is no need to apply numerical integration to the joint or conditional model formulation. In our view, model (mtram) is especially attractive because it allows the interpretation of scaled regression coefficients as marginal effects acting on the marginal predictive distribution in terms of a log-odds ratio or a log-hazard ratio, for example. The Gaussian copula approach for obtaining marginally interpretable models has gained some interest in recent years (Zhang and others, 2021; Masarotto and Varin, 2012); however, the simple framework of transformation models allows estimation for a wide range of responses without the computational burdens or challenges that other methods typically encounter.

Naturally, the question arises which model is to be preferred: a marginal, a conditional, or a marginally interpretable one? In this case, the “right” model is not the model which most closely reflects the data generating process, which is usually unknown, but rather the model that allows the user to answer the research question at hand by interpreting the estimated parameters, as McGee and Stringer (2022) point out. An advantage of transformation models is that besides allowing for interpretation of the fixed effects on a marginal level, they also yield valid models for the whole marginal distribution (2.1) of the response given the covariates. An advantage of marginalized multilevel models (Heagerty and Zeger, 2000) over marginal transformation models is that the former models are parameterized in terms of marginal effects of interest, whereas effect shrinkage is part of the latter models. The distribution-free nature, general applicability to all types of responses, and the relative computational simplicity are, in our opinion, attractive features of transformation models compared to marginalized multilevel models.

The models and estimation procedures introduced here are limited by some practical and some conceptual constraints. Response-varying regression coefficients $β (y)$ define distribution regression models (Foresi and Peracchi, 1995; Chernozhukov and others, 2013), for which corresponding mixed-effects models have been presented recently (Garcia and others, 2019). Such coefficients would be relatively straightforward to implement in the framework presented here; in fact, stratification in Weibull models was parameterized in a similar way. A mix of continuous and censored observations within one cluster would require computing the likelihood by partial integration over an N_i-dimensional normal; this is currently not implemented. On a more conceptual level, it seems impossible to implement multilevel models for discrete or censored responses, because the likelihood (2.7) is only defined for contributions by independent clusters.

6 Computational details

The empirical analyses presented in Sections 3 and 4 are reproducible using the mtram package vignette (Online Appendix, Barbanti and Hothorn, 2022) in package tram (Hothorn and others, 2022). Infrastructure for transformation models from package mlt was used to define marginal models. Augmented Lagrangian Minimization implemented in the auglag() function of package alabama (Varadhan, 2022) was used for optimizing the log-likelihood. Numerical integration to compute the discrete and censored version of the log-likelihood was performed by SparseGrid (Ypma, 2013). GEEs were estimated using package geepack (Højsgaard and others, 2022) and conditional mixed-effects (LMM and tramME) models using package tramME (Tamasi, 2022). Packages lme4 (Bates and others, 2015) and glmmTMB (Brooks and others, 2017) were used to fit generalized mixed-effects models. All results were obtained using R version 4.2.2 (R Core Team, 2022).

Supplementary material

Supplementary material is available at https://CRAN.R-project.org/package=tram.

Acknowledgments

The authors would like to thank Leonhard Held, Thomas Kneib, Nadja Klein, and Bálint Tamási for interesting discussions.

Conflict of Interest: None declared.

Funding

Luisa Barbanti received a UZH Graduate Campus travel grant for a research stay in Berlin, during which this article was finalized. The Swiss National Science Foundation (200021_184603 to T.H.).

Appendix

A Likelihood function: censored and discrete case

The ith contribution to the likelihood (2.6) is given by the N_i-dimensional normal integral

exp (ℓ_{i} (ϑ, β, γ)) = \int_{z ({\underline{y}}_{i} | ϑ, β, γ)}^{z ({\bar{y}}_{i} | ϑ, β, γ)} ϕ_{N_{i}} (z, 0_{N_{i}}, U_{i} Λ (γ) Λ {(γ)}^{⊤} U_{i}^{⊤} + I_{N_{i}}) d z .

With

D_{i} (γ) = diag (U_{i} Λ (γ) Λ {(γ)}^{⊤} U_{i}^{⊤} + I_{N_{i}}) \cdot I_{N_{i}}

we obtain the corresponding correlation matrix as

\begin{matrix} C_{i} (γ) = D_{i} {(γ)}^{- 1 / 2} Σ_{i} (γ) D_{i} {(γ)}^{- 1 / 2} = V_{i} (γ) V_{i} {(γ)}^{⊤} + D_{i} {(γ)}^{- 1} \in R^{N_{i} \times N_{i}} \\ V_{i} (γ) = D_{i} {(γ)}^{- 1 / 2} U_{i} Λ (γ) \in R^{N_{i} \times R} \end{matrix}

and the integration limits become (again with $z ()$ defined in (2.5))

\underline{z} = D_{i} {(γ)}^{- 1 / 2} z ({\underline{y}}_{i} | ϑ, β, γ) and \bar{z} = D_{i} {(γ)}^{- 1 / 2} z ({\bar{y}}_{i} | ϑ, β, γ) .

According to Marsaglia (1963), the above normal probability can be written as

\int_{\underline{z}}^{\bar{z}} ϕ_{N_{i}} (z, 0_{N_{i}}, C_{i} (γ)) d z = \int_{R^{R}} ϕ_{R} (w, 0_{R}, I_{R}) \int_{\underline{z} - Vw}^{\bar{z} - Vw} ϕ_{N_{i}} (y, 0_{N_{i}}, D_{i} {(γ)}^{- 1}) d y d w

and can, following Genz and Bretz (2009) here, further be simplified to

\begin{matrix} = \int_{R^{R}} ϕ_{R} (w, 0_{R}, I_{R}) \prod_{ı = 1}^{N_{i}} [Φ (\frac{{\bar{z}}_{ı} - \sum_{r = 1}^{R} v_{ı r} w_{r}}{\sqrt{d_{ı}}}) - Φ (\frac{{\underline{z}}_{ı} - \sum_{r = 1}^{R} v_{ı r} w_{r}}{\sqrt{d_{ı}}})] d w \\ \overset{w = Φ_{R}^{- 1} (q)}{=} \int_{{[0, 1]}^{R}} \prod_{ı = 1}^{N_{i}} [Φ (\frac{{\bar{z}}_{ı} - \sum_{r = 1}^{R} v_{ı r} Φ^{- 1} (q_{r})}{\sqrt{d_{ı}}}) - Φ (\frac{{\underline{z}}_{ı} - \sum_{r = 1}^{R} v_{ı r} Φ^{- 1} (q_{r})}{\sqrt{d_{ı}}})] d q . \end{matrix}

The elements $d_{ı}$ are the diagonal elements of $D_{i} {(γ)}^{- 1}$, and thus the standardizations of $z$ and $U_{i} Λ (γ)$ cancel out in this case such that we get

= \int_{{[0, 1]}^{R}} \prod_{ı = 1}^{N_{i}} [Φ ({\tilde{\bar{z}}}_{ı} - \sum_{r = 1}^{R} {\tilde{v}}_{ı r} Φ^{- 1} (q_{r})) - Φ ({\tilde{\underline{z}}}_{ı} - \sum_{r = 1}^{R} {\tilde{v}}_{ı r} Φ^{- 1} (q_{r}))] d q

with ${\tilde{\bar{z}}}_{ı}$ and ${\tilde{\underline{z}}}_{ı}$ being the elements of $z ({\bar{y}}_{i} | ϑ, β, γ)$ and $z ({\underline{y}}_{i} | ϑ, β, γ)$, respectively, and ${\tilde{v}}_{ı r}$ being the elements of $U_{i} Λ (γ)$.

The latter expression is an R-dimensional integral over the unit cube (random intercept models have R = 1 and correlated random intercept/random slope models correspond to R = 3) of products of univariate normal probabilities. It should be noted that, unlike a Laplace or other approximation of the likelihood, the above term is the exact likelihood contribution. It can be approximated up to any desired accuracy using numerical integration procedures. An analytic expression for the score function seems quite challenging, and one thus has to rely on numerical approaches such as sparse grids (Heiss and Winschel, 2008).
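For the random intercept case (R = 1), the unit-cube integral is one-dimensional and can be approximated with a simple quadrature rule. The following sketch uses a midpoint rule purely for illustration; the paper relies on sparse grids, and the function names and inputs below are hypothetical. The arguments `lower` and `upper` play the role of ${\tilde{\underline{z}}}_{ı}$ and ${\tilde{\bar{z}}}_{ı}$, and `v` that of ${\tilde{v}}_{ı 1}$.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def phi_cdf(t):
    """Standard normal distribution function; handles -inf and +inf."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def interval_likelihood_r1(lower, upper, v, n_nodes=20_000):
    """Midpoint rule for the R = 1 likelihood contribution over q in (0, 1)."""
    total = 0.0
    for j in range(n_nodes):
        w = _STD.inv_cdf((j + 0.5) / n_nodes)     # w = Phi^{-1}(q)
        prod = 1.0
        for lo, hi, vk in zip(lower, upper, v):
            prod *= phi_cdf(hi - vk * w) - phi_cdf(lo - vk * w)
        total += prod
    return total / n_nodes


# Hypothetical cluster with one interval-censored and one right-censored
# observation and a random intercept with gamma_1 = 1.5.
print(interval_likelihood_r1(lower=[-0.5, 0.3], upper=[0.8, math.inf],
                             v=[1.5, 1.5]))
```

A quick plausibility check is the single-observation case with lower limit $- \infty$, where the integral must equal $Φ (\bar{z} / \sqrt{1 + v^{2}})$.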

B Likelihood and score function: continuous case

The joint probability of $y_{i} \in R^{N_{i}}$ is given by:

P (Y_{i} \leq y_{i} | X_{i}, U_{i}) = Φ_{0_{N_{i}}, Σ_{i} (γ)} (D_{i} (γ) Φ_{N_{i}}^{- 1} (F_{N_{i}} {D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β]})) .

To simplify the notation, we define $z ()$ as in (2.5):

z (y_{i} | ϑ, β, γ) = D_{i} (γ) Φ_{N_{i}}^{- 1} (F_{N_{i}} {D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β]}) .

We can derive the corresponding joint density for an arbitrary F:

\begin{matrix} f_{Y_{i}} (y_{i} | ϑ, β, γ) = {({(2 π)}^{N_{i}} | L_{i} (γ) L_{i} {(γ)}^{⊤} |)}^{- 1 / 2} \times \\ exp (- \frac{1}{2} {‖ z {(y_{i} | ϑ, β, γ)}^{⊤} L_{i} {(γ)}^{- 1} ‖}_{2}^{2}) \times \\ \prod_{ı = 1}^{N_{i}} \frac{D_{i} {(γ)}_{ı ı} f {{(D_{i} {(γ)}^{- 1})}_{ı ı} [a {(y_{i ı})}^{⊤} ϑ - X_{i} β]}}{ϕ (Φ^{- 1} (F {D_{i} {(γ)}^{- 1} [a {(y_{i ı})}^{⊤} ϑ - X_{i} β]}))} {(D_{i} {(γ)}^{- 1})}_{ı ı} a' {(y_{i ı})}^{⊤} ϑ \\ = {| L_{i} (γ) L_{i} {(γ)}^{⊤} |}^{- 1 / 2} \times \\ exp (- \frac{1}{2} {‖ z {(y_{i} | ϑ, β, γ)}^{⊤} L_{i} {(γ)}^{- 1} ‖}_{2}^{2}) \times \\ exp (\frac{1}{2} {‖ D_{i} {(γ)}^{- 1} z (y_{i} | ϑ, β, γ) ‖}_{2}^{2}) \times \\ \prod_{ı = 1}^{N_{i}} f {{(D_{i} {(γ)}^{- 1})}_{ı ı} [a {(y_{i ı})}^{⊤} ϑ - X_{i} β]} a' {(y_{i ı})}^{⊤} ϑ \\ = {| L_{i} (γ) L_{i} {(γ)}^{⊤} |}^{- 1 / 2} \times \\ exp (- \frac{1}{2} z {(y_{i} | ϑ, β, γ)}^{⊤} Σ_{i} {(γ)}^{- 1} z (y_{i} | ϑ, β, γ)) \times \\ exp (\frac{1}{2} z {(y_{i} | ϑ, β, γ)}^{⊤} D_{i} {(γ)}^{- 2} z (y_{i} | ϑ, β, γ)) \times \\ \prod_{ı = 1}^{N_{i}} f {{(D_{i} {(γ)}^{- 1})}_{ı ı} [a {(y_{i ı})}^{⊤} ϑ - X_{i} β]} a' {(y_{i ı})}^{⊤} ϑ . \end{matrix}

The resulting log-likelihood contribution (2.7) for the ith observation is given by:

\begin{matrix} ℓ_{i} (ϑ, β, γ) \approx log (f_{Y_{i}} (y_{i} | ϑ, β, γ)) \\ = - \frac{1}{2} log | L_{i} (γ) L_{i} {(γ)}^{⊤} | - \frac{1}{2} {‖ z {(y_{i} | ϑ, β, γ)}^{⊤} L_{i} {(γ)}^{- 1} ‖}_{2}^{2} + \\ \frac{1}{2} {‖ D_{i} {(γ)}^{- 1} z {(y_{i} | ϑ, β, γ)}^{⊤} ‖}_{2}^{2} + \\ {log}_{N_{i}} {(f_{N_{i}} {D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β]})}^{⊤} 1_{N_{i}} + \\ {log}_{N_{i}} {(A' (y_{i}) ϑ)}^{⊤} 1_{N_{i}} \\ = - \frac{1}{2} log | Σ_{i} (γ) | + \\ - \frac{1}{2} z {(y_{i} | ϑ, β, γ)}^{⊤} (Σ_{i} {(γ)}^{- 1} - D_{i} {(γ)}^{- 2}) z (y_{i} | ϑ, β, γ) + \\ {log}_{N_{i}} {(f_{N_{i}} {D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β]})}^{⊤} 1_{N_{i}} + \\ {log}_{N_{i}} {(A' (y_{i}) ϑ)}^{⊤} 1_{N_{i}} . \end{matrix}

The score function for all model parameters $ϑ, β$ ⁠, and $γ$ can be derived based on the results of Stroup (2012) as applied to normal linear mixed-effects models by Wang and Merkle (2018).

With the $M = R (R + 1) / 2$ unique elements $γ = {(γ_{1}, \dots, γ_{M})}^{⊤}$ of the lower Cholesky factor $Λ (γ)$ we get

\frac{\partial Σ_{i} (γ)}{\partial γ_{m}} = U_{i} \frac{\partial Λ (γ) Λ {(γ)}^{⊤}}{\partial γ_{m}} U_{i}^{⊤} = U_{i} (\frac{\partial Λ (γ)}{\partial γ_{m}} Λ {(γ)}^{⊤} + Λ (γ) \frac{\partial Λ {(γ)}^{⊤}}{\partial γ_{m}}) U_{i}^{⊤} .

The derivative of $Λ (γ)$ with respect to an element γ_m of $γ$ is a matrix of zeros with the exception of a single one at the position of γ_m. Moreover, we compute:

\begin{matrix} \frac{\partial D_{i} (γ)}{\partial γ_{m}} = \frac{1}{2} D_{i} {(γ)}^{- 1} diag (U_{i} (\frac{\partial Λ (γ)}{\partial γ_{m}} Λ {(γ)}^{⊤} + Λ (γ) \frac{\partial Λ {(γ)}^{⊤}}{\partial γ_{m}}) U_{i}^{⊤}) \cdot I_{N_{i}} \\ = \frac{1}{2} D_{i} {(γ)}^{- 1} diag (\frac{\partial Σ_{i} (γ)}{\partial γ_{m}}) \cdot I_{N_{i}} \\ \frac{\partial D_{i} {(γ)}^{- 1}}{\partial γ_{m}} = - \frac{1}{2} {(D_{i} {(γ)}^{- 1})}^{3} diag (\frac{\partial Σ_{i} (γ)}{\partial γ_{m}}) \cdot I_{N_{i}} \\ \frac{\partial {(D_{i} {(γ)}^{- 1})}^{2}}{\partial γ_{m}} = - {(D_{i} {(γ)}^{- 1})}^{4} diag (\frac{\partial Σ_{i} (γ)}{\partial γ_{m}}) \cdot I_{N_{i}} \\ = - {(diag (Σ_{i} (γ)))}^{- 2} diag (\frac{\partial Σ_{i} (γ)}{\partial γ_{m}}) \cdot I_{N_{i}} \end{matrix}

and

\begin{matrix} \frac{\partial z (y_{i} | ϑ, β, γ)}{\partial γ_{m}} = \frac{\partial D_{i} (γ) Φ_{N_{i}}^{- 1} (F_{N_{i}} {D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β]})}{\partial γ_{m}} \\ = (\frac{\partial}{\partial γ_{m}} D_{i} (γ)) Φ_{N_{i}}^{- 1} (F_{N_{i}} {D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β]}) + \\ + D_{i} (γ) (\frac{\partial}{\partial γ_{m}} Φ_{N_{i}}^{- 1} (F_{N_{i}} {D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β]})) \\ = \frac{1}{2} D_{i} {(γ)}^{- 1} diag (\frac{\partial Σ_{i} (γ)}{\partial γ_{m}}) \cdot I_{N_{i}} Φ_{N_{i}}^{- 1} (F_{N_{i}} {D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β]}) + \\ + D_{i} (γ) \frac{f_{N_{i}} {D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β]}}{ϕ_{N_{i}} [Φ_{N_{i}}^{- 1} (F_{N_{i}} {D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β]})]} [A (y_{i}) ϑ - X_{i} β] \frac{\partial D_{i} {(γ)}^{- 1}}{\partial γ_{m}} \\ = \frac{1}{2} D_{i} {(γ)}^{- 2} diag (\frac{\partial Σ_{i} (γ)}{\partial γ_{m}}) \cdot I_{N_{i}} z (y_{i} | ϑ, β, γ) + \\ - \frac{1}{2} D_{i} {(γ)}^{- 2} diag (\frac{\partial Σ_{i} (γ)}{\partial γ_{m}}) \cdot I_{N_{i}} \frac{f_{N_{i}} {D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β]} [A (y_{i}) ϑ - X_{i} β]}{ϕ_{N_{i}} [Φ_{N_{i}}^{- 1} (F_{N_{i}} {D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β]})]} \\ = \frac{1}{2} D_{i} {(γ)}^{- 2} diag (\frac{\partial Σ_{i} (γ)}{\partial γ_{m}}) \cdot I_{N_{i}} \times \\ [z (y_{i} | ϑ, β, γ) - \frac{f_{N_{i}} {D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β]} [A (y_{i}) ϑ - X_{i} β]}{ϕ_{N_{i}} [Φ_{N_{i}}^{- 1} (F_{N_{i}} {D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β]})]}] \end{matrix}

Thus,

\begin{matrix} \frac{\partial ℓ_{i} (ϑ, β, γ)}{\partial γ_{m}} = - \frac{1}{2} tr (Σ_{i} {(γ)}^{- 1} \frac{\partial Σ_{i} (γ)}{\partial γ_{m}}) + \\ - \frac{1}{2} [(\frac{\partial z {(y_{i} | ϑ, β, γ)}^{⊤}}{\partial γ_{m}}) (Σ_{i} {(γ)}^{- 1} - D_{i} {(γ)}^{- 2}) z (y_{i} | ϑ, β, γ) + \\ z {(y_{i} | ϑ, β, γ)}^{⊤} (\frac{\partial}{\partial γ_{m}} (Σ_{i} {(γ)}^{- 1} - D_{i} {(γ)}^{- 2})) z (y_{i} | ϑ, β, γ) + \\ z {(y_{i} | ϑ, β, γ)}^{⊤} (Σ_{i} {(γ)}^{- 1} - D_{i} {(γ)}^{- 2}) (\frac{\partial}{\partial γ_{m}} z (y_{i} | ϑ, β, γ))] + \\ + \frac{f_{N_{i}}' (D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β])}{f_{N_{i}} (D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β])} [A (y_{i}) ϑ - X_{i} β] \frac{\partial D_{i} {(γ)}^{- 1}}{\partial γ_{m}} \\ = - \frac{1}{2} tr (Σ_{i} {(γ)}^{- 1} \frac{\partial Σ_{i} (γ)}{\partial γ_{m}}) + \\ - \frac{1}{2} [(\frac{\partial z {(y_{i} | ϑ, β, γ)}^{⊤}}{\partial γ_{m}}) (Σ_{i} {(γ)}^{- 1} - D_{i} {(γ)}^{- 2}) z (y_{i} | ϑ, β, γ) + \\ - z {(y_{i} | ϑ, β, γ)}^{⊤} (Σ_{i} {(γ)}^{- 1} \frac{\partial Σ_{i} (γ)}{\partial γ_{m}} Σ_{i} {(γ)}^{- 1} - \\ {(D_{i} {(γ)}^{- 1})}^{4} diag (\frac{\partial Σ_{i} (γ)}{\partial γ_{m}}) \cdot I_{N_{i}}) z (y_{i} | ϑ, β, γ) + \\ z {(y_{i} | ϑ, β, γ)}^{⊤} (Σ_{i} {(γ)}^{- 1} - D_{i} {(γ)}^{- 2}) (\frac{\partial z (y_{i} | ϑ, β, γ)}{\partial γ_{m}})] + \\ + \frac{f_{N_{i}}' (D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β])}{f_{N_{i}} (D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β])} [A (y_{i}) ϑ - X_{i} β] \frac{\partial D_{i} {(γ)}^{- 1}}{\partial γ_{m}} \\ \frac{\partial ℓ_{i} (ϑ, β, γ)}{\partial β} = z {(y_{i} | ϑ, β, γ)}^{⊤} (Σ_{i} {(γ)}^{- 1} - D_{i} {(γ)}^{- 2}) \times \\ \frac{f_{N_{i}} (D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β])}{ϕ_{N_{i}} (Φ_{N_{i}}^{- 1} (F_{N_{i}} {D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β]}))} X_{i} + \\ - 1_{N_{i}}^{⊤} (\frac{f_{N_{i}}^{'} (D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β])}{f_{N_{i}} (D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β])} D_{i} {(γ)}^{- 1} X_{i}) \\ \frac{\partial ℓ_{i} (ϑ, β, γ)}{\partial ϑ} = - z {(y_{i} | ϑ, β, γ)}^{⊤} (Σ_{i} {(γ)}^{- 1} - D_{i} {(γ)}^{- 2}) \times \\ \frac{f_{N_{i}} (D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β])}{ϕ_{N_{i}} (Φ_{N_{i}}^{- 1} (F_{N_{i}} {D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β]}))} A (y_{i}) + \\ 1_{N_{i}}^{⊤} (\frac{f_{N_{i}}^{'} (D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β])}{f_{N_{i}} (D_{i} {(γ)}^{- 1} [A (y_{i}) ϑ - X_{i} β])} D_{i} {(γ)}^{- 1} A (y_{i})) + 1_{N_{i}}^{⊤} \frac{1}{A' (y_{i}) ϑ} A' (y_{i}) . \end{matrix}

For $F = Φ$, we have $z (y_{i} | ϑ, β, γ) = A (y_{i}) ϑ - X_{i} β$, and the joint distribution simplifies to

P (Y_{i} \leq y_{i} | X_{i}, U_{i}) = Φ_{0_{N_{i}}, Σ_{i} (γ)} (A (y_{i}) ϑ - X_{i} β)

and the joint density becomes:

\begin{matrix} f_{Y_{i}} (y_{i} | ϑ, β, γ) = {({(2 π)}^{N_{i}} | L_{i} (γ) L_{i} {(γ)}^{⊤} |)}^{- 1 / 2} \times \\ exp (- \frac{1}{2} {(A (y_{i}) ϑ - X_{i} β)}^{⊤} Σ_{i} {(γ)}^{- 1} (A (y_{i}) ϑ - X_{i} β)) \times \\ \prod_{ı = 1}^{N_{i}} a' {(y_{i ı})}^{⊤} ϑ . \end{matrix}

We obtain the corresponding log-likelihood:

\begin{matrix} ℓ_{i} (ϑ, β, γ) \approx log (f_{Y_{i}} (y_{i} | ϑ, β, γ)) \\ \propto - \frac{1}{2} log | Σ_{i} (γ) | + \\ - \frac{1}{2} {(A (y_{i}) ϑ - X_{i} β)}^{⊤} Σ_{i} {(γ)}^{- 1} (A (y_{i}) ϑ - X_{i} β) + \\ {log}_{N_{i}} {(A' (y_{i}) ϑ)}^{⊤} 1_{N_{i}} \\ = - \frac{1}{2} log | Σ_{i} (γ) | - \frac{1}{2} ϑ^{⊤} A {(y_{i})}^{⊤} Σ_{i} {(γ)}^{- 1} A (y_{i}) ϑ + ϑ^{⊤} A {(y_{i})}^{⊤} Σ_{i} {(γ)}^{- 1} X_{i} β + \\ - \frac{1}{2} β^{⊤} X_{i}^{⊤} Σ_{i} {(γ)}^{- 1} X_{i} β + {log}_{N_{i}} {(A' (y_{i}) ϑ)}^{⊤} 1_{N_{i}} . \end{matrix}

The scores are:

\begin{matrix} \frac{\partial ℓ_{i} (ϑ, β, γ)}{\partial γ_{m}} = - \frac{1}{2} tr (Σ_{i} {(γ)}^{- 1} \frac{\partial Σ_{i} (γ)}{\partial γ_{m}}) + \\ \frac{1}{2} {(A (y_{i}) ϑ - X_{i} β)}^{⊤} Σ_{i} {(γ)}^{- 1} \frac{\partial Σ_{i} (γ)}{\partial γ_{m}} Σ_{i} {(γ)}^{- 1} (A (y_{i}) ϑ - X_{i} β) \\ \frac{\partial ℓ_{i} (ϑ, β, γ)}{\partial β} = ϑ^{⊤} A {(y_{i})}^{⊤} Σ_{i} {(γ)}^{- 1} X_{i} - β^{⊤} X_{i}^{⊤} Σ_{i} {(γ)}^{- 1} X_{i} \\ \frac{\partial ℓ_{i} (ϑ, β, γ)}{\partial ϑ} = - ϑ^{⊤} A {(y_{i})}^{⊤} Σ_{i} {(γ)}^{- 1} A (y_{i}) + β^{⊤} X_{i}^{⊤} Σ_{i} {(γ)}^{- 1} A (y_{i}) + 1_{N_{i}}^{⊤} \frac{1}{A' (y_{i}) ϑ} A' (y_{i}) . \end{matrix}
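For a random intercept model with $F = Φ$, the log-likelihood and the score with respect to γ₁ have closed forms that are easy to check numerically. The sketch below makes several simplifying assumptions that are not part of the paper: a linear transformation $h (y) = ϑ_{0} + ϑ_{1} y$ instead of a Bernstein polynomial, $Σ_{i} = γ_{1}^{2} 1 1^{⊤} + I$ so that $| Σ_{i} | = 1 + N_{i} γ_{1}^{2}$ and $Σ_{i}^{- 1} = I - γ_{1}^{2} / (1 + N_{i} γ_{1}^{2}) \cdot 1 1^{⊤}$ (Sherman-Morrison), and hypothetical function names and data.

```python
import math


def loglik_probit_ri(y, x_beta, theta, gamma1):
    """Log-likelihood contribution for F = Phi, dropping the 2*pi constant."""
    n = len(y)
    r = [theta[0] + theta[1] * yk - xb for yk, xb in zip(y, x_beta)]
    s1, s2 = sum(r), sum(rk * rk for rk in r)
    det = 1.0 + n * gamma1 ** 2                    # |Sigma_i|
    quad = s2 - gamma1 ** 2 / det * s1 ** 2        # r' Sigma_i^{-1} r
    # h'(y) = theta[1] for every observation under the linear h assumed here.
    return -0.5 * math.log(det) - 0.5 * quad + n * math.log(theta[1])


def score_gamma_probit_ri(y, x_beta, theta, gamma1):
    """Analytic derivative of the log-likelihood with respect to gamma1."""
    n = len(y)
    s1 = sum(theta[0] + theta[1] * yk - xb for yk, xb in zip(y, x_beta))
    det = 1.0 + n * gamma1 ** 2
    trace_term = -gamma1 * n / det                 # -1/2 tr(Sigma^{-1} dSigma)
    quad_term = gamma1 * s1 ** 2 / det ** 2        # 1/2 r' S^{-1} dSigma S^{-1} r
    return trace_term + quad_term


# Hypothetical cluster of three observations.
y, x_beta, theta, g = [0.2, 1.1, -0.4], [0.1, 0.3, -0.2], (0.5, 1.2), 0.8
h = 1e-6
numeric = (loglik_probit_ri(y, x_beta, theta, g + h)
           - loglik_probit_ri(y, x_beta, theta, g - h)) / (2 * h)
print(score_gamma_probit_ri(y, x_beta, theta, g), numeric)
```

Comparing the analytic score with a central finite difference, as done in the last lines, is a cheap safeguard when implementing the general expressions above for other choices of F.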
