ABSTRACT

Zero-inflated data commonly arise in various fields, including economics, healthcare, and environmental sciences, where measurements frequently include an excess of zeros due to structural or sampling mechanisms. Traditional approaches, such as Zero-Inflated Poisson and Zero-Inflated Negative Binomial models, have been widely used to handle excess zeros in count data, but they rely on strong parametric assumptions that may not hold in complex real-world applications. In this paper, we propose a zero-inflated quantile single-index rank-score-based test (ZIQ-SIR) to detect associations between zero-inflated outcomes and covariates, particularly when nonlinear relationships are present. ZIQ-SIR offers a flexible, semi-parametric approach that accounts for the zero-inflated nature of the data and avoids the restrictive assumptions of traditional parametric models. Through simulations, we show that ZIQ-SIR outperforms existing methods by achieving higher power and better Type I error control, owing to its flexibility in modeling zero-inflated and overdispersed data. We apply our method to the real-world dataset: microbiome abundance from the Columbian Gut study. In this application, ZIQ-SIR identifies more significant associations than alternative approaches, while maintaining accurate type I error control.

1 INTRODUCTION

Zero-inflated data arise in many fields, including economics (Lewsey and Thomson, 2004), healthcare (Wang et al., 2015), and environmental sciences (Agarwal et al., 2002), where excess zeros result from structural or sampling mechanisms. These zeros may indicate true absences or undetected values due to measurement limitations, complicating statistical analyses of response-covariate relationships. Traditional methods, such as Zero-Inflated Poisson (ZIP) and Zero-Inflated Negative Binomial (ZINB) models, are widely used but rely on strong parametric assumptions that often fail in real-world applications (Horton et al., 2007). Violations of these assumptions can inflate type I errors, particularly when covariates exhibit nonlinear relationships with the response (Hawinkel et al., 2019).

To relax the parametric assumptions in ZIP and ZINB, Ling et al. (2021b) proposed the zero-inflated quantile rank-score-based test (ZIQRank), which mitigates strong assumptions on non-zero counts by incorporating ordinary quantile regression (Koenker and Bassett, 1978). ZIQRank has been shown to achieve comparable or higher power than existing tests while better controlling false positives (Ling et al., 2021b). Unlike parametric methods, quantile regression is a robust tool for modeling complex data without distributional assumptions. However, its assumption on linear models at each quantile level may fail to capture complex relationships between zero-inflated outcomes and multivariate covariates, potentially leading to uncontrolled type I error inflation, particularly in sparse data or small samples. A similar issue has been reported in linear regression tests (Schwarzer et al., 2002).

To highlight the limitations of ZIQRank, we present a motivating example from microbiome association studies. Human microbiomes play a crucial role in health and disease (Kinder-Haake, 2012), with alterations linked to conditions such as inflammatory bowel disease (Frank et al., 2007), cancer (Kostic et al., 2013), and major depressive disorder (Jiang et al., 2015). However, inflated zeros in microbial abundance and their complex relationships with biomedical factors pose significant challenges in analysis (Martin et al., 2020). To illustrate these complexities, we analyze a widely studied dataset from the Columbian Gut study (De la Cuesta-Zuluaga et al., 2018; Gonzalez et al., 2018), which includes 441 samples with gut microbiota profiled and health-related measures such as BMI and blood pressure (Kinder-Haake, 2012). Figure 1(a) reveals a high prevalence of zeros in microbial abundance, while Figure 1(b) shows that over 99% of taxa have a standard deviation-to-mean ratio greater than 1, indicating the overdispersion characteristic of microbiome data.

Histogram for Columbian’s gut dataset.
FIGURE 1

Histogram for Columbian’s gut dataset.

To illustrate the nonlinear relationship between zero-inflated outcomes and covariates, we examine the taxon Enterobacteriaceae, a key microbial family in the human body (De la Cuesta-Zuluaga et al., 2018), and model its abundance against low-density lipoprotein (LDL), which has been linked to Enterobacteriaceae levels (Liu et al., 2018). Since we focus on the relationship between positive taxon counts and covariates, zeros were excluded. We estimated the 25% and 75% quantile curves of non-zero taxon counts using both a B-spline-based nonparametric method (Eilers and Marx, 1996) and a linear quantile regression model (Koenker et al., 1994), the latter applied in Ling et al. (2021a). Bootstrap was used to construct 90% confidence intervals (Efron and Tibshirani, 1994). At both quantile levels, the linear quantile regression model detected no abundance changes with respect to LDL, whereas the nonparametric model indicated that high LDL levels are associated with increased Enterobacteriaceae abundance (Figure 2).

B-spline estimated 25% and 75% of quantile curves for taxon Enterobacteriaceae-unspecified fitted against the covariate LDL. The 90% confidence intervals (dashed lines) of the fitted quantile curves (solid lines) are constructed by bootstrap.
FIGURE 2

B-spline estimated 25% and 75% of quantile curves for taxon Enterobacteriaceae-unspecified fitted against the covariate LDL. The 90% confidence intervals (dashed lines) of the fitted quantile curves (solid lines) are constructed by bootstrap.

To detect associations between zero-inflated responses and covariates in the presence of non-linearity, we propose the zero-inflated quantile single-index rank-score-based test (ZIQ-SIR). Single-index models are widely used for handling complex data while ensuring interpretability (Neykov et al., 2016), making them suitable for various applications. This model assumes that covariates |${\bf X}$| influence the response |$Y$| through a link function |$G({\bf X}^\top \boldsymbol \beta )$|⁠, where |$\boldsymbol \beta$| is an unknown parameter and |$G$| is an unspecified function. Traditional quantile single-index models, like classical quantile regression, struggle with inflated zeros, requiring further refinement. Our approach explicitly accounts for zero inflation, offering greater flexibility and precision in modeling response-covariate relationships. We adopt a semi-parametric framework instead of traditional non-parametric quantile regression to maintain model generality—allowing |$G$| to remain unspecified—while enabling hypothesis testing for a subset of covariates of interest. In contrast, non-parametric methods often lack the ability to test specific covariates while adjusting for others.

The paper is organized as follows: Section 2 introduces our model and test statistics. Section 3 presents simulations comparing ZIQ-SIR with existing zero-inflated models, demonstrating its high power and well-controlled type I errors. In Section 4, we apply ZIQ-SIR to the Columbian Gut dataset, identifying more taxa associated with biological and dietary features than ZIQRank. Finally, Section 5 concludes with a discussion of potential future directions.

2 METHODS

We denote |$Y$| as a non-negative zero-inflated response, and |${\bf Z}$| as a set of |$p$| covariates of interest. We assume a set of additional |$q$| covariates that need to be adjusted in the model but not of interest, and denote it as |${\bf C}$|⁠. The intercept is contained in |${\bf C}$|⁠. Throughout the paper, we denote |$Q_Y(\tau \mid {\bf X})$| as the |$\tau$|th conditional quantile of |$Y$| given |${\bf X} = ({\bf Z}^\top ,{\bf C}^\top )^\top$|⁠.

2.1 Zero-inflated quantile single-index model

To interpret the distribution of |$Y$|⁠, we first decompose the conditional distribution of the zero-inflated response |$Y$| into the zero part and the positive part:

where |$P(Y = 0 \mid \mathbf {X})$| represents the probability mass at zero, and |$P(Y \le y \mid \mathbf {X}, Y > 0)$| corresponds to the conditional distribution of |$Y$| given that |$Y$| is positive. Then we model the two parts |$P(Y=0\mid {\bf X})$| and |$P(Y \le y \mid \mathbf {X}, Y > 0)$| separately. For |$P(Y>0\mid {\bf X})$|⁠, the conditional probability of observing a positive |$Y$|⁠, follows a logistic regression model,

(1)

where |$\boldsymbol {\eta }\in \mathbb {R}^p$|⁠, |$\boldsymbol {\gamma }\in \mathbb {R}^q$| are the effects of covariates on the presence-absence status of the response. Then, we model the |$\tau$|th conditional quantile function of |$Y$| given |${\bf X}$| and |$Y>0$| by a semiparametric single-index model:

(2)

where |$G_{\tau }(\cdot )$| is an unknown function, and |$\boldsymbol \alpha _{\tau }^0$|⁠, |$\boldsymbol \beta _{\tau }^0$| are unknown parameters. The coefficient |$\boldsymbol \alpha _{\tau }^0$| describes the effects of the covariates of interest on the response and |$\boldsymbol \beta _{\tau }^0$| captures the contribution of other covariates.

To test whether the covariates of interest |${\bf Z}$| are associated with |$Y$|⁠, the null hypothesis can be written as |$H_0: \boldsymbol \eta = {\bf 0}\quad \text{and}\quad \boldsymbol \alpha _\tau ^0 = {\bf 0},\ \forall \tau \in (0,1).$| The first part, |$\boldsymbol \eta = {\bf 0}$|⁠, can be tested through the likelihood ratio test (Nam et al., 2017). Thus, our main goal is to construct the test statistic for the second part, |$\boldsymbol \alpha _\tau ^0$|⁠. In the quantile regression literature, the rank score test has been widely considered as a robust and powerful test for the quantile coefficient |$\boldsymbol \alpha _\tau ^0$| (Gutenbrunner et al., 1993). To construct a rank score test statistic, we first introduce how to estimate the |$G_\tau (\cdot )$| function. We approximate |$G_{\tau }\left({\bf Z}^\top \boldsymbol \alpha _{\tau }^0+{\bf C}^\top \boldsymbol \beta _{\tau }^0\right)$| by B-spline as follows:

(3)

where |$\boldsymbol \theta (\tau )\in \mathbb {R}^{J_n}$| is an unknown parameter, |$\boldsymbol \alpha _{\tau }$|⁠, |$\boldsymbol \beta _{\tau }$| are unknown coefficients, and |$B(u) = \lbrace B_j(u):1\le j\le J_n,J_n=N_{n}+m\rbrace ^\top$|⁠, a normalized |$m$|th order B-spline basis with a partition of |$a=t_0 < t_1<...< t_{N_n}< b=t_{N_n+1}$| on a given interval |$[a,b]$| (De Boor, 2001).

Suppose we have independent and identically distributed random samples |$\lbrace \left({\bf X}_i,Y_i\right); i = 1,2,\cdots ,n\rbrace =\lbrace (({\bf Z}_i^\top ,{\bf C}_i^\top )^\top ,Y_i); i = 1,2,...,n\rbrace$|⁠. By models (2) and (3), we estimate the quantile coefficients by minimizing the following pseudo-likelihood function:

(4)

where |$\rho _\tau (u) = u \lbrace \tau - I(u < 0)\rbrace $| represents the quantile loss function.

2.2 Estimation

The parameters |$\boldsymbol \theta _\tau ,\boldsymbol \alpha _\tau ,\boldsymbol \beta _\tau$| do not have explicit forms of solutions and require numerical iteration methods for solving. We use the profile approach to iteratively estimate |$\boldsymbol \theta _\tau$|⁠, |$\boldsymbol \alpha _\tau$| and |$\boldsymbol \beta _\tau$| due to its stable estimation performance (Liang et al., 2010; Ma and He, 2016). We define the profile pseudo-likelihood function of |$\boldsymbol \alpha _\tau$| and |$\boldsymbol \beta _\tau$| as

(5)

where |$\boldsymbol {\tilde{\theta }}_n\left(\boldsymbol \alpha _\tau , \boldsymbol \beta _\tau ,\tau \right)$| is the minimizer of |$L_{\tau n}(\boldsymbol \alpha _\tau , \boldsymbol \beta _\tau ,\boldsymbol \theta _\tau )$| over |$\boldsymbol \theta _\tau \in \mathbb {R}^{J_n}$| for given |$\boldsymbol \alpha _\tau , \boldsymbol \beta _\tau$|⁠. Note that one cannot construct the rank score of the coefficient |$\boldsymbol \alpha _\tau$| by differentiating the quantile objective function |$L_{\tau n}^{*}(\boldsymbol \alpha _\tau ,\boldsymbol \beta _\tau )$| directly due to the lack of differentiability of |$\tilde{\boldsymbol \theta }_n({\boldsymbol \alpha }_\tau , {\boldsymbol \beta }_\tau , \tau )$| as a function of |${\boldsymbol \alpha }_\tau$|⁠. We consider a smooth version of the profile quantile loss function:

(6)

where the smoothness of |${\tilde{\tilde{\boldsymbol \theta }}}_{n}(\boldsymbol \alpha _\tau , \boldsymbol \beta _\tau ,\tau ) = \arg \min _{\boldsymbol \theta _\tau \in \mathbb {R}^{J_n}}\mathbb {E}\lbrace L_{\tau n}(\boldsymbol \theta _\tau ,\boldsymbol \alpha _\tau ,\boldsymbol \beta _\tau ) \mid \mathbb {X}\rbrace$| can be derived from Assumption (3.4), Supplement A.1.1, and |$\mathbb {X} = ({\bf X}_1,\cdots ,{\bf X}_n)$| are the given covariates. With the profile likelihood function defined in eq (5), we can obtain the estimation of |$ \left({\boldsymbol {\alpha }^0_\tau }^\top , {\boldsymbol \beta ^0_\tau }^\top \right)^\top$| under the null hypothesis: |$\widehat{\boldsymbol \alpha }^{N}_\tau = {\bf 0}_{p\times 1},\quad {\widehat{\boldsymbol \beta }^{N}_\tau }= \arg \min _{\Vert \boldsymbol \beta _\tau \Vert _2 = 1}L_{\tau n}^{*}({\bf 0}_{p\times 1},\boldsymbol \beta _\tau ),$| and the spline estimator of the |$G_\tau (\cdot )$| function is: |$\widehat{G}_{\tau }\left(u\right) = B(u)^\top \boldsymbol {\tilde{\theta }}_n\left(\widehat{\boldsymbol \alpha }^N_\tau , \widehat{\boldsymbol \beta }^N_\tau ,\tau \right)$|⁠, where |$\boldsymbol {\tilde{\theta }}_n\left(\widehat{\boldsymbol \alpha }^N_\tau , \widehat{\boldsymbol \beta }^N_\tau ,\tau \right)$| is the minimizer of |$L_{\tau n}(\widehat{\boldsymbol \alpha }^N_\tau , \widehat{\boldsymbol \beta }^N_\tau ,\boldsymbol \theta _\tau )$| over |$\boldsymbol \theta _\tau \in \mathbb {R}^{J_n}$|⁠.

2.3 Proposed test statistics and asymptotic properties

Since we have obtained the profile likelihood function and the estimator of |$G_\tau (\cdot )$|⁠, we can construct the rank score test for quantile regression coefficient(s) |$\boldsymbol \alpha _\tau ^0$| by differentiating the smooth quantile loss function |$\tilde{L}_{\tau n}^{*}\left(\boldsymbol \alpha _\tau ,\boldsymbol \beta _\tau \right)$| in eq (6): |${\bf s}\left(\widehat{\boldsymbol {\alpha }}_\tau ^N,\widehat{\boldsymbol {\beta }}_\tau ^N\right)_{p\times 1}= -\partial \tilde{L}^{*}_{\tau n} \left(\widehat{\boldsymbol {\alpha }}_\tau ^N,\widehat{\boldsymbol {\beta }}_\tau ^N\right)/\partial \boldsymbol \alpha _\tau .$| Though the exact value of |${\bf s}\left(\boldsymbol {\widehat{\alpha }}_\tau ^N,\boldsymbol {\widehat{\beta }}_\tau ^N\right)$| cannot be directly obtained, since |$\tilde{L}^{*}_{\tau n}$| involves the expectation of the pseudo-likelihood function, we can construct the empirical score to approximate: |$\widehat{{\bf s}}\left(\widehat{\boldsymbol {\alpha }}_\tau ^N,\widehat{\boldsymbol {\beta }}_\tau ^N\right)= n^{-1}\sum _{i = 1}^{n}\rho _{\tau }^{(1)}\left\lbrace Y_{i}-\widehat{G}_\tau \left({\bf C}_{i}^\top \widehat{\boldsymbol \beta }_\tau ^N\right)\right\rbrace \left\lbrace \widehat{G}_{\tau }^{(1)}\left({\bf C}_{i}^\top \widehat{\boldsymbol \beta }_\tau ^N\right)\widehat{{\bf Z}}_i\left(\widehat{\boldsymbol \beta }_\tau ^N\right)\right.\\ \left. I(Y_i>0)\right\rbrace ,$| where |$\rho _{\tau }^{(1)}(u) =\tau -I(u<0)$| denotes the first order derivative of the quantile loss function; |$\widehat{G}_{\tau }^{(1)}\left({\bf C}_{i}^\top \widehat{\boldsymbol \beta }_\tau ^N\right) = B^{(1)}\left({\bf C}_{i}^\top \widehat{\boldsymbol \beta }_\tau ^N\right)^\top \tilde{\boldsymbol \theta }_n\left({\bf 0}_{p\times 1},\widehat{\boldsymbol \beta }_\tau ^N,\tau \right)$| denotes the spline estimator of the first-order derivative of |$G_\tau \left(\cdot \right)$| under null hypothesis; and |$(\widehat{{\bf Z}}_{i}(\boldsymbol \beta _\tau )^\top ,\widehat{{\bf C}}_{i}(\boldsymbol \beta _\tau )^\top )^\top = \widehat{{\bf X}}_{i}(\boldsymbol \beta _\tau ) = {\bf X}_{i} - \widehat{E}\left({\bf X}_{i}\mid {\bf C}_{i}^\top \boldsymbol \beta _\tau \right),$| where |$\widehat{E}\left({\bf X}_{i}\mid {\bf C}_{i}^\top \widehat{\boldsymbol \beta }_{\tau }^N\right)$| is the estimator of |$E\left({\bf X}_i\mid {\bf C}_i^\top \boldsymbol \beta _\tau ^0 \right)$| with

We further denote |$\widehat{\boldsymbol \Omega }_{\tau }$| to approximate the variance-covariance matrix for the empirical score constructed above:

$$\widehat{\boldsymbol \Omega }_{\tau } = n^{-1}\sum _{i = 1}^{n}\widehat{{\bf g}}_{\tau ,i}\widehat{ {\bf g}}_{\tau , i}^\top = \begin{pmatrix}\widehat{\boldsymbol \Omega }_{\tau 11} \quad & \widehat{\boldsymbol \Omega }_{\tau 12} \\\widehat{\boldsymbol \Omega }_{\tau 21}\quad & \widehat{\boldsymbol \Omega }_{\tau 22} \end{pmatrix}$$
⁠, where |$\widehat{ {\bf g}}_{\tau ,i} = \widehat{G}_{\tau }^{(1)}\left({\bf C}_{i}^\top \widehat{\boldsymbol \beta }_\tau ^N\right)\widehat{{\bf X}}_{i}(\widehat{\boldsymbol \beta }_\tau ^N)I(Y_i>0) = (\widehat{ {\bf g}}_{\tau , i 1}^\top ,\widehat{ {\bf g}}_{\tau , i2}^\top )^\top ,$| and |$\widehat{\boldsymbol \Omega }_{\tau ll^{^{\prime }}} = n^{-1}\sum _{i = 1}^{n}\widehat{ {\bf g}}_{\tau , il}\widehat{ {\bf g}}_{\tau , il^{^{\prime }}}^\top$| for |$l,l^{^{\prime }} = 1,2.$| With the empirical score |$\widehat{{\bf s}}\left(\widehat{\boldsymbol {\alpha }}_\tau ^N,\widehat{\boldsymbol {\beta }}_\tau ^N\right)$| and the estimated covariance matrix |$\widehat{\boldsymbol \Omega }_{\tau }$|⁠, the test statistic for the null hypothesis |$\boldsymbol \alpha ^0_\tau = {\bf 0}_{p\times 1}$| can be constructed as

where |$\left(\widehat{\boldsymbol \Omega }_{\tau 22} - \widehat{\boldsymbol \Omega }_{\tau 21}^\top \widehat{\boldsymbol \Omega }_{\tau 11}^+\widehat{\boldsymbol \Omega }_{\tau 12}\right)^+$| is an approximation of the inverse of the variance-covariance matrix of the empirical score |$\widehat{{\bf s}}\left(\boldsymbol {\widehat{\alpha }}_\tau ^N,\boldsymbol {\widehat{\beta }}_\tau ^N\right)$|⁠. Note that |$E\left[\lbrace \widehat{{\bf X}}_{i}(\widehat{\boldsymbol \beta }_\tau ^N)I(Y_i>0)\rbrace \lbrace \widehat{{\bf X}}_{i}(\widehat{\boldsymbol \beta }_\tau ^N)I(Y_i>0)\rbrace ^\top \right] = E\left[\widehat{{\bf X}}_{i}(\widehat{\boldsymbol \beta }_\tau ^N)\lbrace \widehat{{\bf X}}_{i}(\widehat{\boldsymbol \beta }_\tau ^N)\rbrace ^\top P(Y_i>0\mid {\bf X}_i)\right].$| The variance-covariance matrix |$\left(\widehat{\boldsymbol \Omega }_{\tau 22} -\widehat{\boldsymbol \Omega }_{\tau 21}^\top \widehat{\boldsymbol \Omega }_{\tau 11}^+\widehat{\boldsymbol \Omega }_{\tau 12}\right)$| implicitly incorporates a “propensity score”, |$P(Y_i>0\mid {\bf X}_i)$|⁠, which captures the uncertainty associated with whether |$Y$| is observed as a positive count or remains zero. This reflects the randomness in the occurrence of positive counts in the data. With |$n\rightarrow \infty$|⁠, we can derive the asymptotic distribution of the test statistic |$\mathcal {T}_\tau$|⁠, which is proven in Supplement A.1.2.

 
Theorem 1:

Under the null hypothesis, with Assumptions 1-3 in Supplement A.1.1 hold, given that |$\left(Y,{\bf C}^\top \right)^\top$| is independent of |${\bf Z}$| given |${\bf C}^\top \boldsymbol \beta _\tau ^0$|⁠, as |$n\rightarrow \infty$|⁠, we have |$\mathcal {T}_{\tau }\stackrel{d}{\longrightarrow } \chi _{p}^2.$|

Combining the tests of |$\boldsymbol \eta$| and |$\boldsymbol \alpha _\tau ^0$|⁠, we conclude a unified test for the following hypothesis: |$H_0: \boldsymbol \eta = {\bf 0}\quad \text{and}\quad \boldsymbol \alpha ^0_\tau = {\bf 0},\ \forall \tau \in (0,1).$| We denote the |$p$|-value for logistic regression as |$p^L$|⁠. For the quantile regression part, we conduct hypothesis testing for the coefficient |$\boldsymbol \alpha _\tau ^0$| on a grid of |$K$| quantile levels, |$0<\tau _1<\cdots <\tau _K<1$|⁠, and obtain the |$p$|-values |$p_{\tau _1}^Q, \cdots ,p_{\tau _K}^Q$|⁠. For computational efficiency, a common practice in quantile regression is to test |$K=5$| levels |$\tau \in \lbrace 0.1,0.25,0.5,0.75, 0.9\rbrace$|⁠. Finally, we combine |$p^L$| and |$p_{\tau _1}^Q, \cdots ,p_{\tau _K}^Q$| by Cauchy combination (Liu and Xie, 2020), and the final test statistic |$\mathcal {T}_C$| takes the form |$\mathcal {T}_C = \widehat{r}_n \tan \left\lbrace (0.5-p^L)\pi \right\rbrace + (1-\widehat{r}_n)\sum _{s = 1}^K w_s \tan \left\lbrace (0.5-p^Q_{\tau _s})\pi \right\rbrace ,$| where |$\widehat{r}_n$| is the observed proportion of zero in |$Y_i$|’s and the weight |$w_s = \frac{\tau _sI(\tau _s\le 0.5)+(1-\tau _s)I(\tau _s>0.5)}{\sum _{s = 1}^K\tau _sI(\tau _s\le 0.5)+(1-\tau _s)I(\tau _s>0.5)}$| are set to ensure that |$p$|-values with central quantiles are assigned with larger weight while the extreme tails are assigned with smaller weight (Ling et al., 2021a). The Cauchy combination is a flexible approach that does not require strong assumptions in practice. As established in Liu and Xie (2020) (Theorem 1 and Remark 1), it follows that |$\mathcal {T}_C \stackrel{d}{\longrightarrow } {\it Cauchy}(0,1)$|⁠. Note that there are various options to combine multiple |$p$|-values other than the Cauchy combination, such as the minimum |$p$|-value method (Tippett et al., 1931), truncated product method (Zaykin et al., 2002), uniform combinations (Edgington, 1972). One can use any appropriate method, and we use the Cauchy combination here due to its fast computation.

The implementation details are provided in Supplement A.2, including a fast permutation strategy for small sample sizes and the knot selection for the B-spline component of ZIQ-SIR.

3 SIMULATIONS

Since the microbiome data often has a higher portion of zeros and smaller sample sizes than the dietary data, as reported in Section 1, modeling and testing become more challenging. Therefore, in this section, we simulate the dataset to mimic real microbiome data in Section 4. We compare ZIQ-SIR to four other methods: ZIQRank (Ling et al., 2021a), Quantile Single-index (Ma and He, 2016), ZINB (Ridout et al. (2001)), and ZIP (Yau et al. (2003)). For ZINB and ZIP, we first round the generated response to count data and use the Wald test in pscl package in R. The ZIQRank method can be viewed as a special case of ZIQ-SIR by specifying the function |$G_{\tau }(\cdot )$| as an identity link function. The Quantile Single-index method performs similarly to the positive part of ZIQ-SIR but does not model the inflated zeros. When the data includes a probability mass at zero, the Quantile Single-index method fails to converge. To address this, we added a small perturbation |$\left(N(0,10^{-10})\right)$| to the zero-valued responses before applying the Quantile Single-index method. We evaluate all methods based on their type I errors and power performance.

3.1 Simulation settings

We generate |$Y$| to mimic microbiome abundance and |${\bf X}= (x_{1},x_{2},x_{3},x_{4},x_{5})^\top$| as covariates, based on real data distribution. For the covariates |$x_{1},x_{2},x_{3},x_{4},x_{5}$|⁠, we generate |${\rm Setting}\ 1:\ x_{1}\sim Bernoulli(0.5), x_{2}\sim N(28,2^2), x_{3}\sim N(92.5,13^2), x_{4}\sim N(80,12^2), x_{5}\sim N(124,18.5^2);$| and |${\rm Setting}\ 2:\ x_{1}\sim Bernoulli(0.5),x_{2}\sim N(28,2^2),x_{3}\sim 2x_2+N(36.5,9^2), x_{4}\sim N(80,12^2)$|⁠, |$x_{5}\sim 1.3x_4+N(20,7.75^2).$| We generate these covariates to mimic the real data in Section 4: |$x_1$| represents gender, |$x_2$| BMI, |$x_3$| waist circumference, |$x_4$| diastolic blood pressure, and |$x_5$| systolic blood pressure. In Setting 1, all covariates are generated independently. In Setting 2, the correlations between |$x_2$| and |$x_3$|⁠, as well as |$x_4$| and |$x_5$|⁠, are simulated to mimic their real data correlations (see Table S.1 in Supplement B.1). For each subject |${\bf X}_i$|⁠, we first simulated its nominal quantile level |$\tau _i\sim \text{Unif}(0,1)$| given the presence of the taxon, while |$D_i$| is the presence-absence status from a Bernoulli distribution with a success probability defined as: |$P(D = 1\mid {\bf X}) = \frac{\exp (\gamma _0+\sum _{j=1}^{p+q}\gamma _{j}x_{j})}{1+\exp (\gamma _0+\sum _{j=1}^{p+q}\gamma _{j}x_{j})},$| where |$p$| denotes the dimension of covariates of interest, |$q$| denotes the dimension of additional covariates, and the set of parameters |$\boldsymbol \gamma = (-0.4,-0.480,-0.022,0.021,0.015, -0.009)^\top$| to mimic the taxon Blautia from real data analysis. We then set |$Y_i= 0$| if |$D_i=0$|⁠. Otherwise, we generated |$Y_i$| as |$Y_i = Q_Y(\tau _i\mid {\bf X}_i,Y>0) = G_{\tau _i}\left(\beta _0(\tau _i)+{\bf X}_i^\top \boldsymbol \beta (\tau _i)\right),$| where the true coefficients |$\boldsymbol \beta (\tau ) = \left(\beta _1(\tau ),\cdots ,\beta _{p+q}(\tau )\right)^\top$| and the quantile function |$G_{\tau }(\cdot )$| are generated as follows: |$ \beta _0(\tau ) = -147.7\tau - 50\tau ^2-20, \beta _{1}(\tau ) = 0.6\sqrt{\tau }-2\tau , \beta _{2}(\tau ) = 2.2\tau ^2,\beta _3(\tau ) = \frac{1}{30}\tau ^2+0.02,$||$\beta _{4}(\tau ) = 0.097\sin (2\pi \tau ), \beta _5(\tau ) = 0.086(-3\tau ^2+\tau ), G_{\tau }(u) = \frac{1}{12}\tau (0.1u)^4+13.5\tau (0.1u)^2 + 1.875\tau u.$| Setting 2 mimics the distribution of taxon Blautia as in the application from Section 4 (Figure S.1, Supplement B.1). Simulation results are presented based on 500 Monte Carlo replicates.

3.2 Type I error results

We first present the type I error results with sample sizes of |$n=500,2000$|⁠. The samples |$({\bf X}_i,Y_i)$|⁠, |$i=1,\cdots ,n$| are generated under Setting 1 and Setting 2, respectively. We evaluate the type I error under the null model with the coefficients of |$x_j$| set to zero, ie, |$\beta _j(\tau ) = 0$| and |$\gamma _j = 0$|⁠, for |$j = 1,\cdots , p$|⁠, respectively. Additionally, in Setting 2, we further test whether the coefficients for the entire groups of correlated covariates |$(x_2, x_3)$| and |$(x_4, x_5)$| are zero.

The sample size of |$n=500$| matches the real data size in Section 4 and can be viewed as a small sample case due to the inflated zeros in the data. On average, 36% of |$Y_i$| generated under Setting 1 and 2 are zero. We apply the proposed ZIQ-SIR method using the fast permutation approach described in Supplement A.2. To ensure a fair and direct comparison with existing methods, we opted to use publicly available versions of ZIQRank and other competing methods without modification, rather than introducing permutation adjustments. From Table 1, we observe that our proposed ZIQ-SIR method maintains a controlled type I error, regardless of the correlations between covariates, while other methods all exhibit inflation to some extent. The ZINB and ZIP methods exhibit severe type I error inflation under both settings, as these parametric models fail to adequately describe the relationship between the covariates and |$Y$| due to the overdispersion nature of the data. This inadequacy potentially leads to erroneous conclusions about the significant association between covariates of interest and |$Y$| at a given confidence level. The Quantile Single-index method shows much more severe inflation when the covariates are correlated, possibly due to the violation of independence assumptions in Ma and He (2016). We also conducted additional type I error evaluation with a smaller sample size (⁠|$n=200$|⁠). The proposed ZIQ-SIR still maintains the type I error; see Table S.3, Supplement B.2.

TABLE 1

Type I error result with the significant threshold |$\alpha =0.05$|⁠.

|$n=500$|PredictorZIQ-SIRZIQRankQuantile Single-indexZINBZIP
Setting 1|$x_1$|0.0560.0740.0900.8800.904
 |$x_2$|0.0420.0660.0780.8220.820
 |$x_3$|0.0540.0640.0900.9020.888
 |$x_4$|0.0520.0520.0920.8880.904
 |$x_5$|0.0460.0620.0820.9060.888
Setting 2|$x_2$|0.0440.0820.0880.8500.834
 |$x_3$|0.0400.0720.0960.8900.880
 |$x_4$|0.0440.0660.3140.9080.896
 |$x_5$|0.0340.072|$ 0.330$|0.8700.878
 |$x_2$|⁠, |$x_3$||$0.052$|0.0640.094|$0.970$||$0.946$|
 |$x_4$|⁠, |$x_5$|0.0420.0580.0680.9780.996
|$n=500$|PredictorZIQ-SIRZIQRankQuantile Single-indexZINBZIP
Setting 1|$x_1$|0.0560.0740.0900.8800.904
 |$x_2$|0.0420.0660.0780.8220.820
 |$x_3$|0.0540.0640.0900.9020.888
 |$x_4$|0.0520.0520.0920.8880.904
 |$x_5$|0.0460.0620.0820.9060.888
Setting 2|$x_2$|0.0440.0820.0880.8500.834
 |$x_3$|0.0400.0720.0960.8900.880
 |$x_4$|0.0440.0660.3140.9080.896
 |$x_5$|0.0340.072|$ 0.330$|0.8700.878
 |$x_2$|⁠, |$x_3$||$0.052$|0.0640.094|$0.970$||$0.946$|
 |$x_4$|⁠, |$x_5$|0.0420.0580.0680.9780.996
n = 2000PredictorZIQ-SIRZIQRankQuantile Single-indexZINBZIP
Setting 1|$x_1$|0.0460.0560.0480.9020.948
 |$x_2$|0.0460.0620.0620.8560.986
 |$x_3$|0.0440.0500.0720.9060.932
 |$x_4$|0.0500.0540.0420.8660.954
 |$x_5$|0.0500.0580.0680.9320.940
Setting 2|$x_2$|0.0580.0580.0660.7980.988
 |$x_3$|0.0460.0580.1100.8960.916
 |$x_4$|0.0560.0620.3260.9120.974
 |$x_5$|0.0560.0400.2480.8760.984
 |$x_2$|⁠, |$x_3$|0.0560.0660.0760.9650.998
 |$x_4$|⁠, |$x_5$|0.0560.0540.0620.8940.918
n = 2000PredictorZIQ-SIRZIQRankQuantile Single-indexZINBZIP
Setting 1|$x_1$|0.0460.0560.0480.9020.948
 |$x_2$|0.0460.0620.0620.8560.986
 |$x_3$|0.0440.0500.0720.9060.932
 |$x_4$|0.0500.0540.0420.8660.954
 |$x_5$|0.0500.0580.0680.9320.940
Setting 2|$x_2$|0.0580.0580.0660.7980.988
 |$x_3$|0.0460.0580.1100.8960.916
 |$x_4$|0.0560.0620.3260.9120.974
 |$x_5$|0.0560.0400.2480.8760.984
 |$x_2$|⁠, |$x_3$|0.0560.0660.0760.9650.998
 |$x_4$|⁠, |$x_5$|0.0560.0540.0620.8940.918
TABLE 1

Type I error result with the significant threshold |$\alpha =0.05$|⁠.

|$n=500$|PredictorZIQ-SIRZIQRankQuantile Single-indexZINBZIP
Setting 1|$x_1$|0.0560.0740.0900.8800.904
 |$x_2$|0.0420.0660.0780.8220.820
 |$x_3$|0.0540.0640.0900.9020.888
 |$x_4$|0.0520.0520.0920.8880.904
 |$x_5$|0.0460.0620.0820.9060.888
Setting 2|$x_2$|0.0440.0820.0880.8500.834
 |$x_3$|0.0400.0720.0960.8900.880
 |$x_4$|0.0440.0660.3140.9080.896
 |$x_5$|0.0340.072|$ 0.330$|0.8700.878
 |$x_2$|⁠, |$x_3$||$0.052$|0.0640.094|$0.970$||$0.946$|
 |$x_4$|⁠, |$x_5$|0.0420.0580.0680.9780.996
|$n=500$|PredictorZIQ-SIRZIQRankQuantile Single-indexZINBZIP
Setting 1|$x_1$|0.0560.0740.0900.8800.904
 |$x_2$|0.0420.0660.0780.8220.820
 |$x_3$|0.0540.0640.0900.9020.888
 |$x_4$|0.0520.0520.0920.8880.904
 |$x_5$|0.0460.0620.0820.9060.888
Setting 2|$x_2$|0.0440.0820.0880.8500.834
 |$x_3$|0.0400.0720.0960.8900.880
 |$x_4$|0.0440.0660.3140.9080.896
 |$x_5$|0.0340.072|$ 0.330$|0.8700.878
 |$x_2$|⁠, |$x_3$||$0.052$|0.0640.094|$0.970$||$0.946$|
 |$x_4$|⁠, |$x_5$|0.0420.0580.0680.9780.996
n = 2000PredictorZIQ-SIRZIQRankQuantile Single-indexZINBZIP
Setting 1|$x_1$|0.0460.0560.0480.9020.948
 |$x_2$|0.0460.0620.0620.8560.986
 |$x_3$|0.0440.0500.0720.9060.932
 |$x_4$|0.0500.0540.0420.8660.954
 |$x_5$|0.0500.0580.0680.9320.940
Setting 2|$x_2$|0.0580.0580.0660.7980.988
 |$x_3$|0.0460.0580.1100.8960.916
 |$x_4$|0.0560.0620.3260.9120.974
 |$x_5$|0.0560.0400.2480.8760.984
 |$x_2$|⁠, |$x_3$|0.0560.0660.0760.9650.998
 |$x_4$|⁠, |$x_5$|0.0560.0540.0620.8940.918
n = 2000PredictorZIQ-SIRZIQRankQuantile Single-indexZINBZIP
Setting 1|$x_1$|0.0460.0560.0480.9020.948
 |$x_2$|0.0460.0620.0620.8560.986
 |$x_3$|0.0440.0500.0720.9060.932
 |$x_4$|0.0500.0540.0420.8660.954
 |$x_5$|0.0500.0580.0680.9320.940
Setting 2|$x_2$|0.0580.0580.0660.7980.988
 |$x_3$|0.0460.0580.1100.8960.916
 |$x_4$|0.0560.0620.3260.9120.974
 |$x_5$|0.0560.0400.2480.8760.984
 |$x_2$|⁠, |$x_3$|0.0560.0660.0760.9650.998
 |$x_4$|⁠, |$x_5$|0.0560.0540.0620.8940.918

When the sample size increases to |$n=2000$|⁠, we use the asymptotic distribution of |$\mathcal {T}_\tau$| to obtain its |$p$|-value. We observe that the type I error of ZIQ-SIR is under control (Table 1), confirming the validity of asymptotic distribution. Note that type I error inflation persists for Quantile Single-Index and ZIQRank across different sample sizes due to the limited flexibility of their models. The type I error of ZIQRank is improved due to the increased sample size, similar to Quantile Single-index under Setting 1. However, their type I error is still inflated under Setting 2 due to the correlation among covariates and their model misspecifications.

The hypothesis testing results with a significance level of |$\alpha = 0.01$| are consistent with the above results, provided in Supplement B.2 (Tables S.2 and S.4). We have also conducted additional simulation studies using hurdle Poisson and hurdle negative binomial methods (Mullahy, 1986). Similar to ZIP and ZINB, the hurdle methods also exhibit type I error inflation due to their restrictive parametric assumptions (Supplement B.3).

3.3 Power results

Power results are also performed under Setting 1 and Setting 2 with sample sizes of |$n\in \lbrace 500,2000\rbrace$|⁠. We do not present the power results of ZINB and ZIP methods, since Table 1 shows their severe type I error inflation. The power for ZIQ-SIR, ZIQRank, and Quantile Single-index are presented in Table 2.

TABLE 2

Power results (without ZIP and ZINB) with the significant threshold |$\alpha =0.05$|⁠.

|$n=500$|PredictorZIQ-SIRZIQRankQuantile Single-index
 |$x_1$|0.4840.4740.124
 |$x_2$|0.0800.0820.080
Setting 1|$x_3$|0.5940.6080.120
 |$x_4$|0.2760.2780.098
 |$x_5$|0.2740.2600.112
Setting 2|$x_2$|0.0660.0680.090
 |$x_3$|0.3300.2860.126
 |$x_4$|0.0660.0680.366
 |$x_5$|0.0680.0820.384
 |$x_2$|⁠, |$x_3$|0.2920.2700.116
 |$x_4$|⁠, |$x_5$|0.0820.0740.110
|$n=500$|PredictorZIQ-SIRZIQRankQuantile Single-index
 |$x_1$|0.4840.4740.124
 |$x_2$|0.0800.0820.080
Setting 1|$x_3$|0.5940.6080.120
 |$x_4$|0.2760.2780.098
 |$x_5$|0.2740.2600.112
Setting 2|$x_2$|0.0660.0680.090
 |$x_3$|0.3300.2860.126
 |$x_4$|0.0660.0680.366
 |$x_5$|0.0680.0820.384
 |$x_2$|⁠, |$x_3$|0.2920.2700.116
 |$x_4$|⁠, |$x_5$|0.0820.0740.110
n = 2000PredictorZIQ-SIRZIQRankQuantile Single-index
 |$x_1$|0.9980.9920.386
 |$x_2$|0.1220.1560.078
Setting 1|$x_3$|1.0000.9980.384
 |$x_4$|0.9160.9100.210
 |$x_5$|0.8540.8240.298
Setting 2|$x_2$|0.1540.1620.084
 |$x_3$|0.9340.9060.180
 |$x_4$|0.2660.2620.430
 |$x_5$|0.1940.1900.426
 |$x_2$|⁠, |$x_3$|0.9140.8860.084
 |$x_4$|⁠, |$x_5$|0.2180.1980.096
n = 2000PredictorZIQ-SIRZIQRankQuantile Single-index
 |$x_1$|0.9980.9920.386
 |$x_2$|0.1220.1560.078
Setting 1|$x_3$|1.0000.9980.384
 |$x_4$|0.9160.9100.210
 |$x_5$|0.8540.8240.298
Setting 2|$x_2$|0.1540.1620.084
 |$x_3$|0.9340.9060.180
 |$x_4$|0.2660.2620.430
 |$x_5$|0.1940.1900.426
 |$x_2$|⁠, |$x_3$|0.9140.8860.084
 |$x_4$|⁠, |$x_5$|0.2180.1980.096
TABLE 2

Power results (without ZIP and ZINB) with the significant threshold |$\alpha =0.05$|⁠.

|$n=500$|PredictorZIQ-SIRZIQRankQuantile Single-index
 |$x_1$|0.4840.4740.124
 |$x_2$|0.0800.0820.080
Setting 1|$x_3$|0.5940.6080.120
 |$x_4$|0.2760.2780.098
 |$x_5$|0.2740.2600.112
Setting 2|$x_2$|0.0660.0680.090
 |$x_3$|0.3300.2860.126
 |$x_4$|0.0660.0680.366
 |$x_5$|0.0680.0820.384
 |$x_2$|⁠, |$x_3$|0.2920.2700.116
 |$x_4$|⁠, |$x_5$|0.0820.0740.110
|$n=500$|PredictorZIQ-SIRZIQRankQuantile Single-index
 |$x_1$|0.4840.4740.124
 |$x_2$|0.0800.0820.080
Setting 1|$x_3$|0.5940.6080.120
 |$x_4$|0.2760.2780.098
 |$x_5$|0.2740.2600.112
Setting 2|$x_2$|0.0660.0680.090
 |$x_3$|0.3300.2860.126
 |$x_4$|0.0660.0680.366
 |$x_5$|0.0680.0820.384
 |$x_2$|⁠, |$x_3$|0.2920.2700.116
 |$x_4$|⁠, |$x_5$|0.0820.0740.110
n = 2000PredictorZIQ-SIRZIQRankQuantile Single-index
 |$x_1$|0.9980.9920.386
 |$x_2$|0.1220.1560.078
Setting 1|$x_3$|1.0000.9980.384
 |$x_4$|0.9160.9100.210
 |$x_5$|0.8540.8240.298
Setting 2|$x_2$|0.1540.1620.084
 |$x_3$|0.9340.9060.180
 |$x_4$|0.2660.2620.430
 |$x_5$|0.1940.1900.426
 |$x_2$|⁠, |$x_3$|0.9140.8860.084
 |$x_4$|⁠, |$x_5$|0.2180.1980.096
n = 2000PredictorZIQ-SIRZIQRankQuantile Single-index
 |$x_1$|0.9980.9920.386
 |$x_2$|0.1220.1560.078
Setting 1|$x_3$|1.0000.9980.384
 |$x_4$|0.9160.9100.210
 |$x_5$|0.8540.8240.298
Setting 2|$x_2$|0.1540.1620.084
 |$x_3$|0.9340.9060.180
 |$x_4$|0.2660.2620.430
 |$x_5$|0.1940.1900.426
 |$x_2$|⁠, |$x_3$|0.9140.8860.084
 |$x_4$|⁠, |$x_5$|0.2180.1980.096

Under Setting 1, with sample sizes of |$n\in \lbrace 500,2000\rbrace$|⁠, we observe that our method demonstrates comparable power to the ZIQRank method, while the Quantile Single-index method shows significantly lower power. This suggests that ZIQ-SIR and ZIQRank better address the issue of zero inflation in the simulated data, thus achieving higher power. Under Setting 2, we observe that the power of the ZIQ-SIR method is generally higher than the ZIQRank method, suggesting that the ZIQ-SIR method better handles the correlation among covariates and results in a higher power. The Quantile Single-index method shows the highest power for certain covariates due to its type I error inflation (Table 1).

To better reflect real microbial taxonomic count data, we conduct additional simulations with count responses and present the results in Supplement B.4. These results, which account for the discrete nature of the data, remain consistent with our original findings.

4 APPLICATION IN COLUMBIAN’S GUT DATA

We demonstrate the performance of our ZIQ-SIR method using data from the Colombian Gut study (De la Cuesta-Zuluaga et al., 2018; Gonzalez et al., 2018). We present hypothesis testing results for taxa associated with health-related biological and dietary features using both ZIQ-SIR and ZIQRank. The results for ZIP, ZINB, and the Quantile Single-Index method are not included due to their significantly inflated type I error rates.

4.1 Data description

The dataset consists of microbiome counts for 425 taxa at the genus level for 441 adults, alongside covariates related to diet (eg, fiber, various fats, protein), anthropometric measures [age, body mass index (BMI)], lipid profile [high-density lipoprotein (HDL), LDL], glucose metabolism (glucose, insulin), blood pressure (diastolic and systolic), city, and medication usage. Categorical variables such as sex, medication, and city were treated as dummy variables. We excluded 3 subjects due to missing values and 2 others with triglyceride levels exceeding 800 mg/dL, resulting in a final sample size of 436.

We analyzed 109 taxa with observed zero proportions below 0.8, as higher percentages of zeros can lead to unreliable results (Zhang and Yi, 2020). Consistent with standard practice in microbiome studies (Xia et al., 2018), we adjusted for library size (ie, the total count of the 109 taxa per person) by including it as a covariate in our model. Other data normalization methods, such as rarefaction (Willis, 2019), relative abundance (Gloor et al., 2016), and CSS (McKnight et al., 2019), can also be applied. Additionally, since the quantile-based models are designed for continuous outcomes, we applied jittering (uniformly distributed between 0 and 1) to the non-zero microbiome counts.

The 24 covariates were divided into three groups: biological features, dietary features, and other covariates. Other covariates include 4 covariates: age, sex, city, and medication usage. The biological feature group comprises 12 covariates, including adiponectin, BMI, cholesterol, diastolic blood pressure, systolic blood pressure, glucose, glycosylated hemoglobin, HDL, LDL, insulin, triglycerides, and waist circumference. These covariates are closely linked to overall health (Ma and Shieh, 2006; Stefan et al., 2003). The dietary feature group includes 8 covariates related to macronutrient consumption: fiber, percentage of animal protein, carbohydrates, monounsaturated fat, polyunsaturated fat, saturated fat, total fat, and protein. We performed hypothesis testing for the 109 taxa against the biological and dietary feature groups using the proposed ZIQ-SIR method with a fast permutation approach (Supplement A.2), comparing the results with those by the ZIQRank method.

4.2 Permutation test for type I error evaluation

As the effective sample size is relatively small due to excessive zeros, before analyzing the data using the two methods, we first evaluate the type I errors of ZIQ-SIR and ZIQRank on the real microbiome data testing for the biological features and dietary features. We permute the covariates jointly for each subject to create 50 null datasets. The permutation maintains the association between all the covariates but removes the association between covariates and the response, microbial abundance. Therefore, none of the covariates should have an impact on microbial abundance, and microbiomes with small |$p$|-values are considered false positives. We use the null datasets to test for the biological and dietary features, respectively, and report the type I error by the proportion of taxa with nominal |$p$|-values less than 0.05 within each set. This evaluation procedure is widely adopted in real data analysis (Ling et al., 2021b; Soneson and Robinson, 2018). Results suggest that, in both analyses, our method has type I error controlled with around 5% taxa having |$p$|-values smaller than 0.05, while the ZIQRank method has inflation to some extent (Figure 3).

Boxplot of fraction of taxa with $p<0.05$ based on 50 null datasets.
FIGURE 3

Boxplot of fraction of taxa with |$p<0.05$| based on 50 null datasets.

4.3 Hypothesis testing results

We tested the relationship between microbial abundance and covariates of interest using the proposed ZIQ-SIR method and ZIQRank. The |$p$|-values are adjusted for multiple testing by controlling the False Discovery Rate (FDR) (Benjamini and Hochberg, 1995), and taxa with FDR-adjusted |$p$|-values less than 0.05 were considered significantly associated with the biological or dietary features. The detailed |$p$|-values are given in Table S.9, Supplement C.1.

Using the ZIQ-SIR method, we identified 3 taxa associated with the biological features at a 5% FDR threshold, including 1 taxon also identified by ZIQRank (Table S.9). Peptococcaceae-unspecified and Rhizobiales-unspecified-unspecified are exclusively discovered by our method. The literature suggests that Peptococcaceae-unspecified is closely related to LDL levels (Zhu et al., 2021) and lower triglyceride levels (Ejtahed et al., 2020), findings confirmed by our analysis. Similarly, Rhizobiales-unspecified-unspecified has been linked to glucose levels (Asensio et al., 2022). Both methods identified Actinomycetales-unspecified-unspecified, which has been reported to increase significantly in CAD patients with higher BMI, lower cholesterol, and higher glucose levels (Sawicka-Smiarowska et al., 2021).

For the dietary features, the ZIQ-SIR method uniquely identified the taxon Peptostreptococcus and found Peptococcaceae-unspecified in common with the ZIQRank method (Table S.9). Previous studies have indicated that low-fiber, high-protein diets influence the abundance of Peptostreptococcus (Martínez-López et al., 2021), while experimental evidence shows that Peptococcaceae-unspecified levels are significantly higher in individuals consuming high-fat diets (Wang et al., 2020). These findings are consistent with our analysis.

5 DISCUSSION

We proposed ZIQ-SIR to test associations between zero-inflated outcomes and covariates of interest, accommodating their potential nonlinear relationships. Our method introduces a new single-index model for zero-inflated data and provides detailed procedures for both large and small sample sizes. Numerical experiments show that ZIQ-SIR maintains well-controlled type I errors while achieving high power, whereas existing methods like ZINB, ZIP, and ZIQRank often exhibit inflated type I errors when their assumptions are violated.

While the single-index model in ZIQ-SIR provides greater flexibility for the positive component, the logistic regression component retains a linear structure, as real-world applications have not shown compelling evidence for a more complex formulation. Future work could explore more generalized logistic regression models (Stukel, 1988). Another promising future direction is adapting our method to test associations between covariates and groups of zero-inflated outcomes (eg, multiple taxa in microbiome studies), requiring methods for multivariate responses. This extension could reveal broader patterns and improve inference. Incorporating structural information, such as hierarchical or clustered relationships among outcomes, may further enhance model fit and statistical power (Washburne et al., 2018).

ACKNOWLEDGMENTS

We thank the editor, associate editor, and two referees for their valuable comments and constructive suggestions.

FUNDING

This work was supported, in part, by a grant from the National Heart, Lung, and Blood Institute [R01 HL155417] and two grants from the National Institute of General Medical Sciences [R01GM151301, R01GM155734].

CONFLICT OF INTEREST

None declared.

DATA AVAILABILITY

The dataset that supports the findings in this paper is available on https://qiita.ucsd.edu/ with Study ID 11993 (De la Cuesta-Zuluaga et al., 2018; Gonzalez et al., 2018).

REFERENCES

Agarwal
 
D. K.
,
Gelfand
 
A. E.
,
Citron-Pousty
 
S.
(
2002
).
Zero-inflated models with application to spatial count data
.
Environmental and Ecological Statistics
,
9
,
341
355
.

Asensio
 
E. M.
,
Ortega-Azorín
 
C.
,
Barragán
 
R.
,
Alvarez-Sala
 
A.
,
Sorlí
 
J. V.
,
C
 
P. E.
 et al. (
2022
).
Association between microbiome-related human genetic variants and fasting plasma glucose in a high-cardiovascular-risk mediterranean population
.
Medicina
,
58
,
1238
.

Benjamini
 
Y.
,
Hochberg
 
Y.
(
1995
).
Controlling the false discovery rate: a practical and powerful approach to multiple testing
.
Journal of the Royal Statistical Society. Series B (Methodological)
, ​​​​
57
,
289
300
.

De Boor
 
C.
(
2001
).
A practical guide to splines, Revised edition Applied Mathematical Sciences
.
New York
:
Springer New York
.

De la Cuesta-Zuluaga
 
J.
,
Corrales-Agudelo
 
V.
,
Velásquez-Mejía
 
E. P.
,
Carmona
 
J. A.
,
Abad
 
J. M.
,
Escobar
 
J. S.
(
2018
).
Gut microbiota is associated with obesity and cardiometabolic disease in a population in the midst of Westernization
.
Scientific Reports
,
8
,
1
14
.

Edgington
 
E. S.
(
1972
).
An additive method for combining probability values from independent experiments
.
The Journal of Psychology
,
80
,
351
363
.

Efron
 
B.
,
Tibshirani
 
R. J.
(
1994
).
An introduction to the bootstrap
.
New York
:
Chapman and Hall/CRC
.

Eilers
 
P. H.
,
Marx
 
B. D.
(
1996
).
Flexible smoothing with B-splines and penalties
.
Statistical Science
,
11
,
89
121
.

Ejtahed
 
H.-S.
,
Angoorani
 
P.
,
Soroush
 
A.-R.
,
Hasani-Ranjbar
 
S.
,
Siadat
 
S.-D.
,
Larijani
 
B.
(
2020
).
Gut microbiota-derived metabolites in obesity: a systematic review
.
Bioscience of Microbiota, Food and Health
,
39
,
65
76
.

Frank
 
D. N.
,
St. Amand
 
A. L.
,
Feldman
 
R. A.
,
Boedeker
 
E. C.
,
Harpaz
 
N.
,
Pace
 
N. R.
(
2007
).
Molecular-phylogenetic characterization of microbial community imbalances in human inflammatory bowel diseases
.
Proceedings of the National Academy of Sciences
,
104
,
13780
13785
.

Gloor
 
G. B.
,
Wu
 
J. R.
,
Pawlowsky-Glahn
 
V.
,
Egozcue
 
J. J.
(
2016
).
It’s all relative: analyzing microbiome data as compositions
.
Annals of Epidemiology
,
26
,
322
329
.

Gonzalez
 
A.
,
Navas-Molina
 
J. A.
,
Kosciolek
 
T.
,
McDonald
 
D.
,
Vázquez-Baeza
 
Y.
,
Ackermann
 
G.
 et al. (
2018
).
Qiita: rapid, web-enabled microbiome meta-analysis
.
Nature Methods
,
15
,
796
798
.

Gutenbrunner
 
C.
,
Jurečková
 
J.
,
Koenker
 
R.
,
Portnoy
 
S.
(
1993
).
Tests of linear hypotheses based on regression rank scores
.
Journal of Nonparametric Statistics
,
2
,
307
331
.

Hawinkel
 
S.
,
Mattiello
 
F.
,
Bijnens
 
L.
,
Thas
 
O.
(
2019
).
A broken promise: microbiome differential abundance methods do not control the false discovery rate
.
Briefings in Bioinformatics
,
20
,
210
221
.

Horton
 
N. J.
,
Kim
 
E.
,
Saitz
 
R.
(
2007
).
A cautionary note regarding count models of alcohol consumption in randomized controlled trials
.
BMC Medical Research Methodology
,
7
,
1
9
.

Jiang
 
H.
,
Ling
 
Z.
,
Zhang
 
Y.
,
Mao
 
H.
,
Ma
 
Z.
,
Yin
 
Y.
 et al. (
2015
).
Altered fecal microbiota composition in patients with major depressive disorder
.
Brain, Behavior, and Immunity
,
48
,
186
194
.

Kinder-Haake
 
S.
(
2012
).
A framework for human microbiome research
.
Nature
,
486
,
215
221
.

Koenker
 
R.
,
Bassett
 
G.
 Jr
(
1978
).
Regression quantiles
.
Econometrica: Journal of the Econometric Society
,
46
,
33
50
.

Koenker
 
R.
,
Ng
 
P.
,
Portnoy
 
S.
(
1994
).
Quantile smoothing splines
.
Biometrika
,
81
,
673
680
.

Kostic
 
A. D.
,
Chun
 
E.
,
Robertson
 
L.
,
Glickman
 
J. N.
,
Gallini
 
C. A.
,
Michaud
 
M.
 et al. (
2013
).
Fusobacterium nucleatum potentiates intestinal tumorigenesis and modulates the tumor-immune microenvironment
.
Cell Host and Microbe
,
14
,
207
215
.

Lewsey
 
J. D.
,
Thomson
 
W. M.
(
2004
).
The utility of the zero-inflated Poisson and zero-inflated negative binomial models: a case study of cross-sectional and longitudinal DMF data examining the effect of socio-economic status
.
Community Dentistry and Oral Epidemiology
,
32
,
183
189
.

Liang
 
H.
,
Liu
 
X.
,
Li
 
R.
,
Tsai
 
C.-L.
(
2010
).
Estimation and testing for partially linear single-index models
.
Annals of Statistics
,
38
,
3811
.

Ling
 
W.
,
Zhang
 
W.
,
Cheng
 
B.
,
Wei
 
Y.
(
2021a
).
Zero-inflated quantile rank-score based test (ZIQRank) with application to scRNA-seq differential gene expression analysis
.
The Annals of Applied Statistics
,
15
,
1673
.

Ling
 
W.
,
Zhao
 
N.
,
Plantinga
 
A. M.
,
Launer
 
L. J.
,
Fodor
 
A. A.
,
Meyer
 
K. A.
 et al. (
2021b
).
Powerful and robust non-parametric association testing for microbiome data via a zero-inflated quantile approach (ZINQ)
.
Microbiome
,
9
,
1
19
.

Liu
 
Y.
,
Song
 
X.
,
Zhou
 
H.
,
Zhou
 
X.
,
Xia
 
Y.
,
Dong
 
X.
 et al. (
2018
).
Gut microbiome associates with lipid-lowering effect of rosuvastatin in vivo
.
Frontiers in Microbiology
,
9
,
530
.

Liu
 
Y.
,
Xie
 
J.
(
2020
).
Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures
.
Journal of the American Statistical Association
,
115
,
393
402
.

Ma
 
H.
,
Shieh
 
K.-J.
(
2006
).
Cholesterol and human health
.
The Journal of American Science
,
2
,
46
50
.

Ma
 
S.
,
He
 
X.
(
2016
).
Inference For Single-Index Quantile Regression Models With Profile Optimization
.
The Annals of Statistics
,
44
, ​​​​​​
1234
1268
.

Martin
 
B. D.
,
Witten
 
D.
,
Willis
 
A. D.
(
2020
).
Modeling microbial abundances and dysbiosis with beta-binomial regression
.
The Annals of Applied Statistics
,
14
,
94
.

Martínez-López
 
L. M.
,
Pepper
 
A.
,
Pilla
 
R.
,
Woodward
 
A. P.
,
Suchodolski
 
J. S.
,
Mansfield
 
C.
(
2021
).
Effect of sequentially fed high protein, hydrolyzed protein, and high fiber diets on the fecal microbiota of healthy dogs: a cross-over study
.
Animal Microbiome
,
3
,
1
12
.

McKnight
 
D. T.
,
Huerlimann
 
R.
,
Bower
 
D. S.
,
Schwarzkopf
 
L.
,
Alford
 
R. A.
,
Zenger
 
K. R.
(
2019
).
Methods for normalizing microbiome data: an ecological perspective
.
Methods in Ecology and Evolution
,
10
,
389
400
.

Mullahy
 
J.
(
1986
).
Specification and testing of some modified count data models
.
Journal of Econometrics
,
33
,
341
365
.

Nam
 
K.
,
Henderson
 
N. C.
,
Rohan
 
P.
,
Woo
 
E. J.
,
Russek-Cohen
 
E.
(
2017
).
Logistic regression likelihood ratio test analysis for detecting signals of adverse events in post-market safety surveillance
.
Journal of Biopharmaceutical Statistics
,
27
,
990
1008
.

Neykov
 
M.
,
Liu
 
J. S.
,
Cai
 
T.
(
2016
).
L1-regularized least squares for support recovery of high dimensional single index models with gaussian designs
.
The Journal of Machine Learning Research
,
17
,
2976
3012
.

Ridout
 
M.
,
Hinde
 
J.
,
Demétrio
 
C. G.
(
2001
).
A score test for testing a zero-inflated Poisson regression model against zero-inflated negative binomial alternatives
.
Biometrics
,
57
,
219
223
.

Sawicka-Smiarowska
 
E.
,
Bondarczuk
 
K.
,
Bauer
 
W.
,
Niemira
 
M.
,
Szalkowska
 
A.
,
Raczkowska
 
J.
 et al. (
2021
).
Gut microbiome in chronic coronary syndrome patients
.
Journal of Clinical Medicine
,
10
,
5074
.

Schwarzer
 
G.
,
Antes
 
G.
,
Schumacher
 
M.
(
2002
).
Inflation of type I error rate in two statistical tests for the detection of publication bias in meta-analyses with binary outcomes
.
Statistics in Medicine
,
21
,
2465
2477
.

Soneson
 
C.
,
Robinson
 
M. D.
(
2018
).
Bias, robustness and scalability in single-cell differential expression analysis
.
Nature Methods
,
15
,
255
.

Stefan
 
N.
,
Stumvoll
 
M.
,
Vozarova
 
B.
,
Weyer
 
C.
,
Funahashi
 
T.
,
Matsuzawa
 
Y.
 et al. (
2003
).
Plasma adiponectin and endogenous glucose production in humans
.
Diabetes Care
,
26
,
3315
3319
.

Stukel
 
T. A.
(
1988
).
Generalized logistic models
.
Journal of the American Statistical Association
,
83
,
426
431
.

Tippett
 
L. H. C.
and
others
(
1931
).
The Methods of Statistics
.
London
:
Williams & Norgate Ltd
.

Wang
 
B.
,
Kong
 
Q.
,
Li
 
X.
,
Zhao
 
J.
,
Zhang
 
H.
,
Chen
 
W.
 et al. (
2020
).
A high-fat diet increases gut microbiota biodiversity and energy expenditure due to nutrient difference
.
Nutrients
,
12
,
3197
.

Wang
 
Z.
,
Ma
 
S.
,
Wang
 
C.-Y.
(
2015
).
Variable selection for zero-inflated and overdispersed data with application to health care demand in Germany
.
Biometrical Journal
,
57
,
867
884
.

Washburne
 
A. D.
,
Morton
 
J. T.
,
Sanders
 
J.
,
McDonald
 
D.
,
Zhu
 
Q.
,
Oliverio
 
A. M.
 et al. (
2018
).
Methods for phylogenetic analysis of microbiome data
.
Nature Microbiology
,
3
,
652
661
.

Willis
 
A. D.
(
2019
).
Rarefaction, alpha diversity, and statistics
.
Frontiers in Microbiology
,
10
,
2407
.

Xia
 
Y.
,
Sun
 
J.
,
Chen
 
D.-G.
,
Xia
 
Y.
,
Sun
 
J.
,
Chen
 
D.-G.
(
2018
).
Modeling zero-inflated microbiome data. Statistical Analysis of Microbiome Data with R
,
453
496
.
Singapore
:
Springer
.

Yau
 
K. K.
,
Wang
 
K.
,
Lee
 
A. H.
(
2003
).
Zero-inflated negative binomial mixed regression modeling of over-dispersed count data with extra zeros
.
Biometrical Journal: Journal of Mathematical Methods in Biosciences
,
45
,
437
452
.

Zaykin
 
D. V.
,
Zhivotovsky
 
L. A.
,
Westfall
 
P. H.
,
Weir
 
B. S.
(
2002
).
Truncated product method for combining P-values
.
Genetic Epidemiology: The Official Publication of the International Genetic Epidemiology Society
,
22
,
170
185
.

Zhang
 
X.
,
Yi
 
N.
(
2020
).
Fast zero-inflated negative binomial mixed modeling approach for analyzing longitudinal metagenomics data
.
Bioinformatics
,
36
,
2345
2351
.

Zhu
 
X.
,
Li
 
Y.
,
Jiang
 
Y.
,
Zhang
 
J.
,
Duan
 
R.
,
Liu
 
L.
 et al. (
2021
).
Prediction of gut microbial community structure and function in polycystic ovary syndrome with high low-density lipoprotein cholesterol
.
Frontiers in Cellular and Infection Microbiology
,
11
,
665406
.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.