ABSTRACT

Efron’s 2-group model is widely used in large-scale multiple testing. This model assumes that test statistics are drawn independently from a mixture of a null and a non-null distribution. The marginal local false discovery rate (locFDR) is the probability that the hypothesis is null given its test statistic. The procedure that rejects null hypotheses with marginal locFDRs below a fixed threshold maximizes power (the expected number of non-nulls rejected) while controlling the marginal false discovery rate in this model. However, in realistic settings the test statistics are dependent, and taking the dependence into account can boost power. Unfortunately, the resulting calculations are typically exponential in the number of hypotheses, which is impractical. Instead, we propose using |$\textrm {locFDR}_N$|⁠, which is the probability that the hypothesis is null given the test statistics in its |$N$|-neighborhood. We prove that rejecting for small |$\textrm {locFDR}_N$| is optimal in the restricted class where the decision for each hypothesis is only guided by its |$N$|-neighborhood, and that power increases with |$N$|⁠. The computational complexity of computing the |$\mathrm{ locFDR}_N$|s increases with |$N$|⁠, so the analyst should choose the largest |$N$|-neighborhood that is still computationally feasible. We show through extensive simulations that our proposed procedure can be substantially more powerful than alternative practical approaches, even with small |$N$|-neighborhoods. We demonstrate the utility of our method in a genome-wide association study of height.

1 INTRODUCTION

In high-dimensional studies, thousands of null hypotheses are tested simultaneously to discover the non-null hypotheses. A multiple testing procedure produces a list of |$R$| discoveries (ie, rejected null hypotheses) that are considered to be interesting. The analyst wants to minimize the number |$V$| of false discoveries in the list while maximizing the number |$R-V$| of true discoveries. To avoid inflation of false discoveries, the analyst typically aims to provide a multiple testing procedure that controls a relevant error rate, such as the false discovery rate (FDR, Benjamini and Hochberg, 1995), or its closely related variant, the marginal false discovery rate (mFDR, Genovese and Wasserman, 2002):

|$\mathrm{ mFDR} = \mathbb {E}(V)/\mathbb {E}(R).$|
The optimal level-|$\alpha$| procedure maximizes the expected number of true discoveries, |$\mathbb {E}(R-V)$|⁠, while controlling the error rate at level |$\alpha$|⁠.

Assuming that the test statistics come from a mixture of a null distribution and a non-null distribution, the optimal procedure is approximated by estimating the mixture distribution (Sun and Cai, 2007; Xie et al., 2011; Heller and Rosset, 2021). If the test statistics are independent, then the optimal procedure is to reject the hypotheses with a small marginal local FDR (locFDR), which is the probability that the hypothesis is null given the observed test statistic for that hypothesis. Inference based on estimated marginal locFDRs has been widely popular (Ferkingstad et al., 2008; Schwartzman et al., 2008; Cai and Sun, 2009; Heller and Yekutieli, 2014) ever since its introduction in Efron et al. (2001). This is so even though the test statistics are typically dependent in high-dimensional studies, so there is no optimality guarantee. The popularity of this approach is due to several important factors: the inference is robust, in the sense that the error rate is typically controlled (Efron, 2007, 2010; Heller and Rosset, 2021); it has good power compared with frequentist methods that do not assume the mixture model (Sun and Cai, 2007); and it is computationally feasible.

Rejecting hypotheses with small estimated marginal locFDRs is not optimal when the test statistics are dependent. Xie et al. (2011) showed that the optimal procedure is to reject the hypotheses with small oracle locFDR, which is the probability that the hypothesis is null given the entire vector of test statistics. However, computing the oracle locFDRs is computationally prohibitive, except for very special dependence structures, such as local dependence (Heller and Rosset, 2021). This is unfortunate since the power gain of using the oracle locFDRs instead of the marginal locFDRs can be extremely large, as demonstrated in numerical examples in Heller and Rosset (2021).

In this paper, we develop a multiple-testing procedure that is computationally feasible and uniformly more powerful than the inference based on marginal locFDRs. We introduce |$\mathrm{ locFDR}_N$|, which is the probability that the hypothesis is null given the observed test statistic for that hypothesis and the test statistics in its |$N$|-neighborhood. Here, |$N$| is a small number, so that the required computations given the vector of test statistics of size |$2N+1$| are feasible (in our examples, |$N\le 5$|). For example, for hypotheses ordered by location, as in genome-wide association studies (GWAS), the |$N$|-neighborhood for the test statistic in the |$i$|th position can be the test statistics in the |$N$| positions before and the |$N$| positions after the |$i$|th position. This is a reasonable neighborhood for GWAS: the dependence due to linkage disequilibrium (LD) implies that physically close genotypes are correlated, and the range of correlation is roughly similar throughout the genome, so the neighborhood for each genotype can be defined by the group of |$2N$| single nucleotide polymorphisms (SNPs) surrounding it.

We suggest rejecting the hypotheses with small |$\mathrm{ locFDR}_N$|. Let |$\mathcal {C}_N$| be the class of procedures that base the decision for test |$i$| only on its |$N$|-neighborhood. We show that our suggested procedure is optimal for mFDR control within the class |$\mathcal {C}_N$|. Therefore, our suggested procedure is uniformly more powerful than the procedure based on marginal locFDRs. We demonstrate that the potential power gain over using marginal locFDRs can be large even when the neighborhoods are small. This is important since the cost of computing the |$\mathrm{ locFDR}_N$|s increases exponentially with the neighborhood size |$N$|.

Advanced computational methods for estimating the oracle locFDR have been suggested in the literature in specific contexts. These methods implement a full Bayesian approach using Markov chain Monte Carlo (MCMC) to infer the posterior null probabilities, which are philosophically similar to the oracle locFDRs. Sun et al. (2015) suggested MCMC methods for spatial data, which require sampling from high-dimensional posteriors. Zhu and Stephens (2017) introduced a Bayesian approach for identifying associations using GWAS summary statistics, accounting for correlations among SNPs. Their method can be computationally expensive, as each MCMC iteration may have a computational cost that scales as high as cubic in the number of SNPs. Additionally, this approach relies on extensive modeling assumptions and does not offer control of the FDR. In contrast to the Bayesian approach, our method only requires computation or estimation of joint distributions in small neighborhoods, even when the number of parameters is large, so the modeling assumptions are weaker, in addition to our substantial computational advantage.

The rest of the paper is organized as follows. In Section 2, we define all the necessary notations and formalize our goal. In Section 3, we introduce our novel procedure and prove its optimality in class |$\mathcal {C}_N$|⁠. We also show that the power (ie, the expected number of true discoveries) is an increasing function of the neighborhood size indexed by |$N$|⁠, making explicit the trade-off between power and computation that the restriction to class |$\mathcal {C}_N$| provides. In Section 4, we show that the computation of the |$\mathrm{ locFDR}_N$|s is linear in the number of hypotheses. Furthermore, we show how we can estimate the |$\mathrm{ locFDR}_N$|s assuming a reasonable model for GWAS. Finally, we summarize our complete data-driven algorithm. In Section 5, we perform extensive simulations with different covariance structures, and we observe that the novel procedure achieves substantially higher power compared to other practical methods in a variety of different settings, including ones where there is no natural definition of “neighborhood.” In Section 6, we provide a complete analysis pipeline for GWAS using summary statistics and demonstrate that the use of |$\mathrm{ locFDR}_N$| indeed leads to more discoveries than existing practical approaches. We conclude with some final remarks in Section 7.

2 NOTATION AND GOAL

We assume the data are generated from the generalized 2-group model. The hypothesis states vector, |$\boldsymbol{h}=(h_1,\dots ,h_K)$|, has entries sampled independently from the |$\mathrm{ Ber}(\pi )$| distribution. Given the hypothesis states |$\boldsymbol{h}$|, the vector of test statistics |${\bf Z}$| for testing the |$K$| null hypotheses, |$H_i:h_i = 0$|, |$i=1,\ldots , K$|, has a joint distribution we denote by |$\mathbb {P}(\boldsymbol{z}\mid \boldsymbol{h})$|. The goal is to construct a multiple testing procedure, defined by the decision rule |$\boldsymbol{D}({\bf Z}) = (D_1({\bf Z}),\dots ,D_K({\bf Z})):\mathbb {R}^K \rightarrow \lbrace 0,1\rbrace ^K$|, that has good power while controlling a meaningful error. It is crucial to emphasize that the model under consideration in this context is a conditional dependence model (Xie et al., 2011; Zhu and Stephens, 2017; Heller and Rosset, 2021). This means that the z-scores |${\bf Z}$| given the hypothesis states |$\boldsymbol{h}$| are dependent, while |$\boldsymbol{h}$| themselves are independent. In contrast, there is another widely used model in the multiple testing literature, which we refer to as the conditional independence model (Sun and Cai, 2009; Shu et al., 2015). This model assumes that the z-scores |${\bf Z}$| are independent given the hypothesis states |$\boldsymbol{h}$|, while |$\boldsymbol{h}$| themselves may be dependent. In our current context, we do not consider the conditional independence model, since the conditional dependence model is more appropriate in the context of GWAS, where there is LD among the SNPs, as well as in the broader context of linear models.

Let |$T_i({\bf Z})$| denote the oracle locFDR, which is the statistic for optimal decisions (Xie et al., 2011; Heller and Rosset, 2021):

|$T_i({\bf Z}) = \mathbb {P}(h_i = 0\mid {\bf Z}), \quad i = 1,\dots ,K.$|

The expected numbers of true and false positives are simple expressions of the oracle locFDRs:

|$TP(\boldsymbol{D}) = \mathbb {E}\left(\sum _{i=1}^{K}{D_i({\bf Z})\left(1-T_i({\bf Z})\right)}\right), \qquad \mathbb {E}(V(\boldsymbol{D})) = \mathbb {E}\left(\sum _{i=1}^{K}{D_i({\bf Z})T_i({\bf Z})}\right).$|

A relevant error rate is the marginal FDR,

|$\mathrm{ mFDR}(\boldsymbol{D}) = \frac{\mathbb {E}_{{\bf Z},\boldsymbol{h}}(V(\boldsymbol{D}))}{\mathbb {E}_{{\bf Z},\boldsymbol{h}}(R(\boldsymbol{D}))},$|

where |$R(\boldsymbol{D})= \sum _{i=1}^{K}{D_i({\bf Z})}$| and |$\mathbb {E}_{{\bf Z},\boldsymbol{h}}(R(\boldsymbol{D})) = \mathbb {E}_{{\bf Z},\boldsymbol{h}}(V(\boldsymbol{D})) + TP(\boldsymbol{D})$| is the expected number of rejections.

Certain expectations mentioned above are taken over the joint distribution of |$({\bf Z},\boldsymbol{h})$|⁠, while others are over |${\bf Z}$| because |$T_i({\bf Z})$| already represents integration over the distribution of |$h_i\mid {\bf Z}$|⁠. Going forward, we will omit the explicit notation of the random variable with respect to which we take the expectation, as it can be inferred from the context. It has been shown in Xie et al. (2011) that the optimal rule over the class of all decision functions |$\boldsymbol{D}:\mathbb {R}^K \rightarrow \lbrace 0,1\rbrace ^K$| for the above problem is thresholding the oracle locFDR statistics with a fixed threshold. So the decision rule is |$D_i(\boldsymbol{z}) = \mathbb {I}(T_i(\boldsymbol{z})\le c)$|⁠, where |$c$| is a fixed constant depending on |$\alpha$|⁠. The complexity required for naively calculating all the locFDR statistics is |$O(K2^K)$|⁠, which is typically infeasible. Hence the optimal procedure is impractical, except for some very special types of dependencies across the |$z$|-scores, for example, block dependence with small block sizes (Heller and Rosset, 2021). Due to the difficulty in finding the globally optimal solution, we suggest considering a smaller class of decision functions where the decision function for each test depends on at most |$2N$| neighboring test statistics, for |$2N \in \lbrace 0,2,\ldots , 2\lfloor K/2\rfloor \rbrace$|⁠:

|$\mathcal {C}_N = \left\lbrace \boldsymbol{D}: D_i({\bf Z}) = D_i({\bf Z}_{i,N}),\ i=1,\dots ,K\right\rbrace ,$| where |${\bf Z}_{i,N} = (Z_{\max (i-N,1)},\dots ,Z_{\min (i+N,K)})$| is the |$N$|-neighborhood of |$Z_i$|. (1)

For simplicity, we assume that the neighborhood size |$N$| is fixed across hypotheses. The straightforward relaxation is detailed in Remark 1.

Given a choice of |$N$| and the resulting class |$\mathcal {C}_N,$| we show in Section 3 that the optimal multiple testing (OMT) procedure, which is limited to rules in |$\mathcal {C}_N$|⁠, relies on the locFDR after marginalization:

|$T_{i,N}({\bf Z}) = \mathbb {P}(h_i = 0\mid {\bf Z}_{i,N}), \quad i=1,\dots ,K.$| (2)

Henceforth, we refer to |$T_{i,N}$| as the |$\mathrm{ locFDR}_N$|⁠. For |$N=0$|⁠, |$\mathcal {C}_0$| includes only marginal rules and hence |$\mathrm{ locFDR}_0$| is the marginal locFDR (Efron et al., 2001; Sun and Cai, 2007), while for |$N=K$|⁠, |$\mathrm{ locFDR}_K$| is the oracle locFDR (Sun and Cai, 2009; Xie et al., 2011; Heller and Rosset, 2021). We can view increasing the size of |$N$| as offering a trade-off between computation and power: for small |$N$|⁠, calculating |$\mathrm{ locFDR}_N$| and the resulting optimal rule within the limited class |$\mathcal {C}_N$| is computationally efficient but can suffer a loss of power, and as |$N$| increases computations become exponentially more complex but power improves. We show in Section 5 that the power gain using the |$\mathrm{ locFDR}_N$| statistics, compared to existing methods, can be substantial. Specifically, in GWAS-like situations with a natural neighborhood structure and strong dependence only within the neighborhood, the power increase from |$N=0$| to |$N\in \lbrace 1,2,3\rbrace$| is large. We demonstrate the excellent power-computation trade-off for various correlation structures. Our objective is not to find the globally optimal solution, as it is not practically feasible, but to find a relaxation of the optimal solution that significantly enhances power when compared to existing marginal test statistics.

 
Remark 1

More generally, in Equation (2), the neighborhood size can depend on the particular feature: feature |$i$| (for |$i=1,\dots ,K$|) may have its own neighborhood size |$N_i$|. The only restriction on |$N_i$| is computational: we cannot choose |$N_i$| too large, due to the exponential growth of the computational complexity (as demonstrated in Section 4.1). For simplicity, we only consider a fixed |$N_i$| for all features and show the usefulness of our approach for a small, fixed |$N$|.

3 THE OMT PROCEDURE IN THE RESTRICTED CLASS |$\mathcal {C}_N$|

When the test statistics are independent, the |$T_{i,0}$| statistics for |$i = 1, \ldots , K$| serve as the optimal test statistics for controlling the mFDR (Sun and Cai, 2007). Heller and Rosset (2021, Proposition 2.2) demonstrated that these same statistics, |$T_{i,0}$|⁠, can be employed to establish a valid procedure for controlling the mFDR even when |${\bf Z}$| are generated from a conditional dependence model. In the following, we illustrate that the same principle outlined by Heller and Rosset (2021) can be applied to develop a valid method for controlling mFDR using |$T_{i,N}$| statistics within the context of a conditional dependence model.

 
Proposition 1
For |$N \ge 0$| and |$0 < \alpha < 1$|⁠, let
|$t_{\alpha ,N} = \sup \left\lbrace t: \mathbb {E}\left(\sum _{i=1}^{K}{\mathbb {I}(T_{i,N}\le t)T_{i,N}}\right) \le \alpha \,\mathbb {E}\left(\sum _{i=1}^{K}{\mathbb {I}(T_{i,N}\le t)}\right)\right\rbrace .$| (3)
Then, |$\mathrm{ mFDR}\le \alpha$| for the procedure that rejects |$H_i:h_i = 0$| if |$T_{i,N} \le t_{\alpha , N}$| for |$i = 1,\dots , K$|⁠.
 
Proof.
The mFDR of the procedure that rejects |$h_i = 0$| if |$T_{i,N} \le t_{\alpha ,N}$| is given by

|$\mathrm{ mFDR} = \frac{\mathbb {E}(V)}{\mathbb {E}(R)} = \frac{\mathbb {E}\left(\sum _{i=1}^{K}{\mathbb {I}(T_{i,N}\le t_{\alpha ,N})T_{i,N}}\right)}{\mathbb {E}\left(\sum _{i=1}^{K}{\mathbb {I}(T_{i,N}\le t_{\alpha ,N})}\right)} \le \alpha ,$|

where the last inequality follows from the definition of |$t_{\alpha ,N}$|.

The OMT procedure with mFDR control is a solution to the problem:

|$\max _{\boldsymbol{D}:\mathbb {R}^K \rightarrow \lbrace 0,1\rbrace ^K} TP(\boldsymbol{D}) \quad \text{subject to} \quad \mathrm{ mFDR}(\boldsymbol{D})\le \alpha .$| (4)

Now we concentrate on the class |$\mathcal {C}_N$| and find an optimal solution within this restricted class, so the optimization problem is

|$\max _{\boldsymbol{D}\in \mathcal {C}_N} TP(\boldsymbol{D}) \quad \text{subject to} \quad \mathrm{ mFDR}(\boldsymbol{D})\le \alpha .$| (5)
 
Theorem 1
The procedure detailed in Proposition 1 is the solution to the OMT problem in Equation (5), that is, the optimal policy is |$D_i({\bf Z}) = \mathbb {I}\left(T_{i,N}({\bf Z})\le t_{\alpha ,N}\right)$| for |$i=1,\dots ,K$|.
 
Proof.

see Section S1.

 
Remark 2

Though fixed-threshold rules based on marginal test statistics (⁠|$\mathrm{ locFDR}_0$|⁠) have been around for a long time, the optimality of such rules for dependent test statistics was, as far as we know, previously unestablished. Therefore, the above result is new not only for |$0< N < K$| but also for |$N=0$|, when the allowed rules are restricted to the class |$\mathcal {C}_N$|.

 
Corollary 1

The procedure detailed in Proposition 1 with |$N=L+1$| has at least as much power as with |$N=L$|, for |$L\in \lbrace 0,\ldots ,K-1 \rbrace$|.

 
Proof.

Theorem 1 states that the optimal fixed-threshold mFDR level-|$\alpha$| rule based on |$T_{i,N}$| is the solution to problem (5). But |$\mathcal {C}_N \subset \mathcal {C}_{N+1}$| for every |$N\ge 0$|, so the rule that is optimal in |$\mathcal {C}_{L+1}$| is at least as powerful as the rule that is optimal in |$\mathcal {C}_L$|.

4 DATA DRIVEN PROCEDURES

4.1 The computational complexity of the |$\mathrm{ locFDR}_N$| statistics

To implement the procedure described in Proposition 1, we need to compute the statistics |$\lbrace T_{1,N},\dots ,T_{K,N}\rbrace$|. We denote |$l_i = \min (i+N,K) - \max (i-N,1) + 1$|; clearly, |$l_i \le 2N + 1$|. Let |$\boldsymbol{h}_{i,N}$| denote the sub-vector of |$\boldsymbol{h}$| on the |$N$|-neighborhood of hypothesis |$i$| (a binary vector of length |$l_i$|). |$T_{i,N}$| can be written as

|$T_{i,N} = \frac{\sum _{\boldsymbol{h}_{i,N}: h_i = 0}{\mathbb {P}(\boldsymbol{h}_{i,N})\mathbb {P}({\bf Z}_{i,N}\vert \boldsymbol{h}_{i,N})}}{\sum _{\boldsymbol{h}_{i,N}}{\mathbb {P}(\boldsymbol{h}_{i,N})\mathbb {P}({\bf Z}_{i,N}\vert \boldsymbol{h}_{i,N})}},$| (6)

where |$\mathbb {P}(\boldsymbol{h}_{i,N})$| is the probability distribution of |$\boldsymbol{h}_{i,N}$| and |$\mathbb {P}({\bf Z}_{i,N}\vert \boldsymbol{h}_{i,N})$| is the conditional density of |${\bf Z}_{i,N}$| given |$\boldsymbol{h}_{i,N}$|. The sum in the denominator above is over |$l_i$|-dimensional binary vectors and hence requires |$2^{l_i}$| evaluations of |$\mathbb {P}({\bf Z}_{i,N} \vert \boldsymbol{h}_{i,N})$|. Each of these takes at most |$O(l_i^2)$| time for the model we consider in (7). Hence calculating the denominator requires |$O(2^{l_i} l_i^2)$| time. Note that all the summands in the denominator can be stored with |$O(2^{l_i})$| memory, so evaluating the numerator requires no extra computation time. Since |$O(2^{l_i} l_i^2) \le O(2^{2N+1}(2N + 1)^2)$|, the worst-case runtime for computing all the |$\mathrm{ locFDR}_N$|s is |$O(K2^{2N+1}(2N + 1)^2)=O(K4^NN^2)$|. This is a great reduction compared with the computational complexity of finding the globally optimal rule, for which it is necessary to calculate the oracle |$\mathrm{ locFDR}$|s |$T_1,\dots ,T_K$| with a runtime of |$O(K2^K)$| (unless the joint density of |${\bf Z}$| has a special structure).
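To make the computation in Equation (6) concrete, the following minimal R sketch (ours, for illustration only, not the implementation behind the reported results; the function name locfdr_N and its arguments are hypothetical) enumerates the |$2^{l_i}$| configurations of |$\boldsymbol{h}_{i,N}$| under model (7), where each neighborhood of |${\bf Z}$| is multivariate normal with mean |$b\boldsymbol{h}_{i,N}$| and covariance |$\Sigma _{i,N} + \tau ^2\mathrm{ diag}(\boldsymbol{h}_{i,N})$|:

```r
# Minimal sketch: Equation (6) under model (7). Requires 'mvtnorm'.
library(mvtnorm)

locfdr_N <- function(z, Sigma, pi1, b, tau2, N) {
  K <- length(z)
  T_N <- numeric(K)
  for (i in 1:K) {
    idx   <- max(i - N, 1):min(i + N, K)       # the N-neighborhood of test i
    l     <- length(idx)
    z_sub <- z[idx]
    S_sub <- Sigma[idx, idx, drop = FALSE]
    # enumerate all 2^l binary configurations of h in the neighborhood
    H <- as.matrix(expand.grid(rep(list(0:1), l)))
    num <- 0; den <- 0
    for (r in 1:nrow(H)) {
      h <- H[r, ]
      prior <- prod(ifelse(h == 1, pi1, 1 - pi1))           # P(h_{i,N})
      lik <- dmvnorm(z_sub, mean = b * h,
                     sigma = S_sub + tau2 * diag(h, nrow = l)) # P(Z_{i,N} | h_{i,N})
      den <- den + prior * lik
      if (h[which(idx == i)] == 0) num <- num + prior * lik # null configurations only
    }
    T_N[i] <- num / den
  }
  T_N
}
```

The double loop makes the |$O(K2^{l_i})$| cost visible: the outer loop is linear in |$K$|, while the inner enumeration grows exponentially with the neighborhood size.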

4.2 Estimation strategies

Although computations of the statistics of interest |$\lbrace T_{1,N},\dots ,T_{K,N}\rbrace$| are simple with known probabilistic structure, in real-life examples, we need to estimate the statistics from the data. For our examples, we assume the following model (see Remark 4 for a more general model)

|$h_1,\dots ,h_K \stackrel{\mathrm{ iid}}{\sim } \mathrm{ Ber}(\pi ), \qquad {\bf Z}\mid \boldsymbol{h} \sim \mathcal {N}_K\left(b\boldsymbol{h},\, \Sigma + \tau ^2\mathrm{ diag}(\boldsymbol{h})\right),$| (7)

where |$\mathcal{N}_K$| denotes the |$K$|-dimensional multivariate normal distribution, and |$\Sigma$| is known, with diagonal entries of one. This model assumes that the test statistic |$Z_i$| has a |$\mathcal {N}(0,1)$| distribution if |$h_i=0$|, and a |$\mathcal {N}(b,1+\tau ^2)$| distribution if |$h_i = 1$|. It is reasonable, for example, for testing the associations between covariates and an outcome in a linear model, where |$\Sigma$| is determined by the design matrix of the covariates; see Section 6 for details. This model is similar to the “RSS-BVSR” model considered in Zhu and Stephens (2017), and it reduces to the 2-group model of Efron et al. (2001) if |$\Sigma$| is a diagonal matrix.
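As an illustration of this data-generating process, a minimal R sketch (ours; the function name simulate_model7 is hypothetical) that draws one z-score vector from model (7) is:

```r
# Minimal sketch: one draw from model (7); Sigma is a K x K correlation matrix.
library(mvtnorm)

simulate_model7 <- function(K, Sigma, pi1, b, tau2) {
  h   <- rbinom(K, 1, pi1)                    # hypothesis states, h_i ~ Ber(pi)
  mu  <- b * h                                # non-null means shift to b
  Cov <- Sigma + tau2 * diag(h, nrow = K)     # non-null variances become 1 + tau2
  z   <- as.vector(rmvnorm(1, mean = mu, sigma = Cov))
  list(z = z, h = h)
}
```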

In practical applications, we want to estimate the parameters |$b, \tau ^2, \pi$| of model (7) from the observed vector of test statistics |${\bf Z}$|. Note that the parameters |$b, \tau ^2, \pi$| in model (7) are all marginal parameters, that is, they only depend on the marginal distribution |$Z_i\sim (1-\pi )\mathcal {N}(0,1)+\pi \mathcal {N}(b,1+\tau ^2)$|, which is the same for all |$i$|. To estimate the parameters in this type of setting, we adopt a composite marginal likelihood approach, which is a special case of general pseudolikelihood methods for estimating parameters in models having complex dependencies (for a detailed discussion, see Varin et al. 2011 and references therein). In this approach, we first construct the pseudolikelihood under the working independence assumption and then estimate the unknown parameters by optimizing the pseudolikelihood.

Our first step is to estimate the non-null proportion |$\pi$| from |${\bf Z}$| using the method of Jin and Cai (2007). Briefly, this method is motivated by a property of the characteristic function of the observations |$Z_i$|, which decomposes into two parts: amplitude and phase. The phase part contains all the relevant information about the non-null proportion |$\pi$|. Since the characteristic function is an unknown quantity that requires exact knowledge of the model parameters, the authors construct the empirical characteristic function of the observations |$Z_1,\dots ,Z_K$| and use it to estimate |$\pi$|. Let |$\hat{\pi }$| be the estimate of |$\pi$|. Our second step is to construct the pseudolikelihood |$\mathcal {L}_C({\bf Z},\boldsymbol{\psi }) = \prod _{i=1}^{K}{\mathcal {L}(Z_i,\boldsymbol{\psi })}$|, where |$\mathcal{L}(Z_i,\boldsymbol{\psi})$| is the marginal likelihood of |$Z_i$| and |$\boldsymbol{\psi }=(\hat{\pi }, b,\tau ^2)$|, and then optimize |$\mathcal {L}_C$| with respect to |$b$| and |$\tau ^2$|. Optimization of |$\mathcal {L}_C$| is facilitated via the iterative EM algorithm (Dempster et al., 1977). The updates of the parameters |$b$| and |$\tau ^2$| at the |$(t+1)$|th step are (see details in Section S2)

|$b^{(t+1)} = \frac{\sum _{i=1}^{K}{w_i^{(t)}Z_i}}{\sum _{i=1}^{K}{w_i^{(t)}}}, \qquad {\tau ^2}^{(t+1)} = \frac{\sum _{i=1}^{K}{w_i^{(t)}\left(Z_i - b^{(t+1)}\right)^2}}{\sum _{i=1}^{K}{w_i^{(t)}}} - 1, \qquad w_i^{(t)} = \frac{\hat{\pi }\,\phi \left(Z_i; b^{(t)}, 1+{\tau ^2}^{(t)}\right)}{(1-\hat{\pi })\,\phi (Z_i;0,1) + \hat{\pi }\,\phi \left(Z_i; b^{(t)}, 1+{\tau ^2}^{(t)}\right)},$| (8)

where |$\phi (\cdot \,;\mu ,\sigma ^2)$| denotes the |$\mathcal {N}(\mu ,\sigma ^2)$| density and |$\boldsymbol{\psi }^{(t)} = (\hat{\pi },b^{(t)},{\tau ^2}^{(t)})$| are the estimates of the parameters in the |$t$|th step. Finally, we take |$(\hat{\pi }, \hat{b}, \hat{\tau }^2)=\boldsymbol{\psi }^{(T)}$| as the estimate of the model parameters, where |$T=\min \left\lbrace t: |\boldsymbol{\psi }^{(t+1)}-\boldsymbol{\psi }^{(t)}|< c\right\rbrace$| and |$c$| is some small constant. This estimation method is expected to converge quickly, even for a relatively small number of z-scores, under short-range dependence [similar results have been established in Cox and Reid (2004)].
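A minimal R sketch of this EM iteration (ours, for illustration; it assumes the standard two-component mixture updates with the null component fixed at |$\mathcal {N}(0,1)$|, matching Equation (8), and it holds |$\hat{\pi }$| fixed) is:

```r
# Minimal sketch of Equation (8): EM for b and tau2 with pi_hat held fixed.
em_marginal <- function(z, pi_hat, b0 = 1, tau2_0 = 1, tol = 1e-6, maxit = 1000) {
  b <- b0; tau2 <- tau2_0
  for (t in 1:maxit) {
    # E-step: posterior non-null probabilities under working independence
    f1 <- dnorm(z, mean = b, sd = sqrt(1 + tau2))
    f0 <- dnorm(z, mean = 0, sd = 1)
    w  <- pi_hat * f1 / ((1 - pi_hat) * f0 + pi_hat * f1)
    # M-step: weighted mean, and weighted variance minus the unit null variance
    b_new    <- sum(w * z) / sum(w)
    tau2_new <- max(sum(w * (z - b_new)^2) / sum(w) - 1, 1e-8)
    converged <- abs(b_new - b) + abs(tau2_new - tau2) < tol
    b <- b_new; tau2 <- tau2_new
    if (converged) break
  }
  c(b = b, tau2 = tau2)
}
```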

With the estimated parameters |$\hat{\pi }, \hat{b}, \hat{\tau }^2$|, we can estimate the |$K$| |$\mathrm{ locFDR}_N$|s, and we denote these statistics by |$\lbrace \hat{T}_{1,N},\dots ,\hat{T}_{K,N}\rbrace$|. For inference using the estimated |$\mathrm{ locFDR}_N$|s, we need to determine the rejection cutoff, so that hypotheses with estimated |$\mathrm{ locFDR}_N$|s below that cutoff will be rejected. For simplicity, we apply the data-dependent procedure proposed by Sun and Cai (2007), for which the rejection cutoff is expected to converge to the fixed cutoff in Equation (3) whenever the estimated parameters |$\hat{\pi }, \hat{b}, \hat{\tau }^2$| converge to the true parameters |$\pi , b, \tau ^2$| (the relation to the computationally demanding method is detailed in Remark 3). The procedure starts by ordering the statistics |$\hat{T}_{(1),N}\le \hat{T}_{(2),N} \le \dots \le \hat{T}_{(K),N}$|, and finds |$\hat{k}=\max \left\lbrace i:\frac{1}{i}\sum _{j=1}^{i}{\hat{T}_{(j),N}}\le \alpha \right\rbrace$|. All hypotheses with estimated |$\mathrm{ locFDR}_N$| at most the cutoff |$\hat{T}_{(\hat{k}),N}$| are then rejected; the rejected indices are |$\mathcal {R}_\alpha = \left\lbrace i: \hat{T}_{i,N}\le \hat{T}_{(\hat{k}),N}\right\rbrace$|. We summarize the whole pipeline in Algorithm 1.

Algorithm 1

The level |$\alpha$| data-driven |$\mathrm{ locFDR}_N$| procedure.

• Step 1 - Input: The vector of |$z$|-scores |${\bf Z}$|, correlation among the |$z$|-scores |$\Sigma$|, a computationally feasible choice of |$N \ge 0$|.
• Step 2 - Estimate |$\pi$| using the method of Jin and Cai (2007) and |$b, \tau ^2$| using Equation (8). Let |$\hat{\pi }, \hat{b}, \hat{\tau }^2$| be the estimates.
• Step 3 - Calculate the |$\mathrm{ locFDR}_N$|s using the estimated parameters from Step 2 in formula (6). Let |$\lbrace \hat{T}_{1,N},\dots ,\hat{T}_{K,N}\rbrace$| be the test statistics.
• Step 4 - Order the estimated statistics |$\hat{T}_{(1),N}\le \hat{T}_{(2),N} \le \dots \le \hat{T}_{(K),N}$|, let |$H_{(1)},\dots ,H_{(K)}$| be the corresponding hypotheses, and find |$\hat{k}=\max \left\lbrace i:\frac{1}{i}\sum _{j=1}^{i}{\hat{T}_{(j),N}}\le \alpha \right\rbrace$|.
• Step 5 - Output: The rejected hypotheses are |$H_{(i)}, i=1,2,\dots ,\hat{k}$|.
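The cutoff selection in Steps 4 and 5 is a running average of the sorted statistics; a minimal R sketch (ours; the function name sc_cutoff_reject is hypothetical) is:

```r
# Minimal sketch of Steps 4-5 of Algorithm 1: the data-dependent cutoff of
# Sun and Cai (2007) applied to estimated locFDR_N statistics T_hat.
sc_cutoff_reject <- function(T_hat, alpha = 0.05) {
  ord <- order(T_hat)                           # sort statistics increasingly
  running_avg <- cumsum(T_hat[ord]) / seq_along(T_hat)
  k_hat <- max(which(running_avg <= alpha), 0)  # largest i with average <= alpha
  if (k_hat > 0) ord[1:k_hat] else integer(0)   # indices of rejected hypotheses
}
```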
 
Remark 3

We can calculate the fixed threshold in Equation (3), with the estimated parameters |$\hat{\pi }, \hat{b}, \hat{\tau }^2$|⁠, by Monte-Carlo simulation. This requires far greater computational resources than using the procedure suggested by Sun and Cai (2007). The difference in the list of discoveries from using the data-dependent cutoff of Sun and Cai (2007) is negligible in the numerical examples we considered.

 
Remark 4

We can assume that a non-null test statistic has a distribution that is a mixture of |$G$| normal distributions |$\sum _{c=1}^{G}{\pi _c \mathcal {N}(\mu _c,\sigma _c^2)}$|⁠, as done in Stephens (2017), and adapt the estimation technique described above in order to estimate |$\pi _c$|⁠, |$\mu _c$|⁠, and |$\sigma _c^2$|⁠. The estimation is expected to be fairly accurate for short-range dependency when |$G$| is not very large and the number of z-scores is large, which is typically the case in GWAS.

4.3 Extension to the multiple-group model

In large-scale multiple-testing scenarios, data are often gathered from diverse and heterogeneous sources, allowing hypotheses to be categorized into several groups, each with distinct characteristics (Efron, 2008; Cai and Sun, 2009). In such heterogeneous contexts, assuming the general 2-group model (7) for all hypotheses is inappropriate and can lead to biased inference (Efron, 2008). Efron (2008) introduced the multiple-group model, which assumes that the |$z$|-vector can be partitioned into |$B$| sub-vectors, each coming from a 2-group model with its own distinct parameters. Since we take dependency into account, we assume the general 2-group model for each group. We denote by |${\bf Z}_k$| the sub-vector of the |$k$|th group |$\mathcal {G}_k$| and assume that it is generated by the 2-group model (7), parameterized by |$\pi _k$|, |$b_k$|, |$\tau _k$|, and the known covariance matrix |$\Sigma _k$|. It is further assumed that z-scores from different groups are independent. In Algorithm 2, we present the level |$\alpha$| mFDR controlling procedure, which sorts the |$\mathrm{ locFDR}_N$|s computed from the multiple-group model in order to find the threshold for discovery, as suggested in Cai and Sun (2009).

Algorithm 2

The level |$\alpha$|, data driven, multiple-groups |$\mathrm{ locFDR}_N$| procedure.

• Step 1 - Input: the sub-vectors of z-scores and the covariance matrix for each of the |$B$| groups, a computationally feasible |$N$|.
• Step 2 - For each of the |$B$| groups separately, calculate the estimated |$\mathrm{ locFDR}_N$| statistics using Steps 1-3 in Algorithm 1.
• Step 3 - Combine and rank all estimated |$\mathrm{ locFDR}_N$| statistics found in Step 2. Denote by |$\hat{T}_{(1),N},\dots ,\hat{T}_{(K),N}$| the ranked values and by |$H_{(1)},\dots ,H_{(K)}$| the corresponding hypotheses, and find |$\hat{k}=\max \left\lbrace i:\dfrac{1}{i}\sum _{j=1}^{i}{\hat{T}_{(j),N}}\le \alpha \right\rbrace$|.
• Step 4 - Output: The rejected hypotheses are |$H_{(i)}$|, |$i=1,\dots ,\hat{k}$|.
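A minimal R sketch of Algorithm 2 (ours; it reuses the illustrative helpers sketched above, and the fixed pi_hat value below is only a placeholder standing in for the Jin and Cai (2007) estimate) is:

```r
# Minimal sketch of Algorithm 2: per-group estimation, then a pooled cutoff.
# z_list / Sigma_list hold the B independent groups; reuses em_marginal,
# locfdr_N, and sc_cutoff_reject from the sketches above.
multigroup_locfdrN <- function(z_list, Sigma_list, N, alpha = 0.05) {
  T_hat <- unlist(lapply(seq_along(z_list), function(k) {
    z <- z_list[[k]]
    pi_hat <- 0.1  # placeholder: in practice, estimate via Jin and Cai (2007)
    est <- em_marginal(z, pi_hat)
    locfdr_N(z, Sigma_list[[k]], pi_hat, est["b"], est["tau2"], N)
  }))
  # pooled ranking across all groups; returns indices into the pooled vector
  sc_cutoff_reject(T_hat, alpha)
}
```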

Asymptotic guarantees are possible if we possess consistent estimators for the |$\mathrm{ locFDR}_N$| in Algorithm 1 and Algorithm 2, e.g., if the z-scores exhibit short-range dependencies. Briefly, the procedure described in Algorithm 1 satisfies |$\lim _{K\rightarrow \infty } \mathrm{ mFDR}_K\le \alpha$|⁠, where |$\mathrm{ mFDR}_K$| is the marginal FDR of Algorithm 1 for a family of |$K$| hypotheses. Moreover, the procedure described in Algorithm 2 satisfies |$\lim _{K\rightarrow \infty } \mathrm{ mFDR}_K\le \alpha$| by directly adapting and combining the proofs presented in Cai and Sun (2009) and Xie et al. (2011). Furthermore, under additional regularity conditions, the difference between the mFDR and FDR goes to zero as |$K\rightarrow \infty$| by following a similar argument as in Xie et al. (2011), so the asymptotic FDR is controlled.

5 SIMULATIONS

5.1 The oracle setting: simulation from a known mixture of Gaussians

Here, we show all the results with known values of the parameters of the distribution described by (7). Henceforth, we consider the following four covariance structures for |$\Sigma =(\sigma _{ij})_{1\le i, j\le K}$|:

AR(1):

|$\sigma _{ij} = \rho ^{|i-j|}.$| (9)

Banded(1):

|$\sigma _{ij} = \mathbb {I}(i=j) + \rho \,\mathbb {I}(|i-j|=1).$| (10)

Long range (fractional Gaussian noise) (Fan and Han, 2017):

|$\sigma _{ij} = \frac{1}{2}\left((|i-j|+1)^{2H} - 2|i-j|^{2H} + \big ||i-j|-1\big |^{2H}\right), \quad H\in (1/2,1).$| (11)

Equicorrelated:

|$\sigma _{ij} = \mathbb {I}(i=j) + \rho \,\mathbb {I}(i\ne j).$| (12)
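For concreteness, a minimal R sketch (ours; the function name make_sigma is hypothetical) that builds these four matrices is:

```r
# Minimal sketch: the four covariance structures in Equations (9)-(12).
make_sigma <- function(K, type = c("ar1", "banded1", "longrange", "equi"),
                       rho = 0.5, H = 0.9) {
  type <- match.arg(type)
  d <- abs(outer(1:K, 1:K, "-"))                          # |i - j|
  switch(type,
    ar1       = rho^d,                                    # Equation (9)
    banded1   = ifelse(d == 0, 1, ifelse(d == 1, rho, 0)),# Equation (10)
    longrange = 0.5 * ((d + 1)^(2 * H) - 2 * d^(2 * H) +
                       abs(d - 1)^(2 * H)),               # Equation (11), FGN
    equi      = ifelse(d == 0, 1, rho)                    # Equation (12)
  )
}
```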

Figure 1 shows a scatter plot of |$T_{i,N}$| versus |$T_{i,N+1}$|, and the marginal density of |$T_{i,N}$|, for |$N=0,\ldots ,5$|. As |$N$| increases, the points concentrate around the 45-degree line. The concentration is particularly tight for the AR(1) and the long-range correlation structures. For all dependencies, the densities are similar for |$N>0$| and strikingly different from the density for |$N=0$|.

FIGURE 1

The left column displays scatter plots of |$(T_{i,N}, T_{i,N+1})$|, while the right column shows the density estimates of |$T_{i,N}$| for |$N=0,\dots ,5$|. The |$z$|-score vector is generated according to model (7) with |$K=1000$|, |$\tau =2$|, |$\pi =0.3$|, and |$b=0.2$|. Rows represent various covariance structures: row (a) corresponds to an AR(1) covariance with |$\rho = 0.8$|; row (b), a Banded(1) covariance with |$\rho = 0.5$|; row (c), a long-range covariance with |$H = 0.9$|; and row (d), an equicorrelated covariance with |$\rho = 0.8$|.

In Table 1, we refer to the testing procedure based on |$\lbrace {T}_{i,N}, 1\le i \le K\rbrace$| for |$N=0,1,2$| as the “|$\boldsymbol{T}_N$|” procedure. We make the following observations from the results in Table 1:

  • Power: as expected, the expected number of true positives (TP) increases with |$N$|.

  • Effect of N, the neighborhood size: what makes the |$\mathrm{ locFDR}_N$| most interesting is the rate of power increase as a function of |$N$|⁠. The power increase is large when moving from |$N=0$| to |$N=1$|⁠, and it is much smaller when moving from |$N=1$| to |$N=2$|⁠. This behavior is commonly observed under all dependency structures considered. The highest power increase was observed for the equicorrelated covariance structure.

  • Variability of the FDP: in all settings, the procedure with marginal statistics shows higher variability in the FDP (false discovery proportion: the number of false rejections divided by the total number of rejections, a random variable) compared with procedures that use |$\mathrm{ locFDR}_N$| statistics with |$N>0$|. We consistently observe from simulations that the variability of the FDP is decreasing in |$N$|, so this is an added advantage of the use of |$T_{i,N}$| with |$N$| as large as (computationally) possible.

  • mFDR vs FDR: The difference between mFDR and FDR is large only when the covariance matrix is equicorrelated and the |$\mathrm{ locFDR}_0$| statistics are used. Interestingly, with the |$\mathrm{ locFDR}_2$| statistics, more discoveries are made, and the difference between mFDR and FDR is small, regardless of the correlation structure (this is expected when the number of rejections is not too small).

TABLE 1

Results for different covariance structures (known parameters; table head) under model (7), using the procedure with |$T_{i,0}$| (|$\boldsymbol{T}_0$| columns), |$T_{i,1}$| (|$\boldsymbol{T}_1$| columns), and |$T_{i,2}$| (|$\boldsymbol{T}_2$| columns). Based on 3000 data generations, with |$K=1000, b=0, \tau = 2, \pi = 0.1$|. Standard errors for the marginal false discovery rate (mFDR) were determined by bootstrapping. In bold, the most powerful procedure.

Parameters  | AR(1), rho = 0.5       | Banded(1), rho = 0.5   | Long-range, H = 0.7    | Equi, rho = 0.5
Method      | T_0     T_1     T_2    | T_0     T_1     T_2    | T_0     T_1     T_2    | T_0     T_1     T_2
mFDR        | 0.0502  0.0497  0.0501 | 0.0505  0.0499  0.0495 | 0.0509  0.0502  0.0499 | 0.0579  0.0507  0.0506
s.e.(mFDR)  | 0.0011  0.0008  0.0008 | 0.0011  0.0008  0.0007 | 0.0011  0.0010  0.0009 | 0.0061  0.0013  0.0009
FDR         | 0.0500  0.0497  0.0501 | 0.0503  0.0499  0.0496 | 0.0505  0.0502  0.0498 | 0.0237  0.0482  0.0502
s.e.(FDR)   | 0.0011  0.0008  0.0008 | 0.0011  0.0008  0.0007 | 0.0011  0.0010  0.0010 | 0.0014  0.0011  0.0009
TP          | 13.3    22.5    22.9   | 13.2    24.6    28.4   | 13.3    16.6    16.9   | 13.3    21.0    24.0
s.e.(TP)    | 0.0664  0.0863  0.0874 | 0.0671  0.0892  0.0954 | 0.0659  0.0752  0.0757 | 0.0858  0.0838  0.0890

Additional simulation results for different numbers of hypotheses, |$K$|⁠, and varying non-null proportions, |$\pi$|⁠, are provided in Section S3.1.

5.2 Data-driven settings: simulations from (unknown to the analyst) mixture of Gaussians

Here, we show the performance of our procedure when we estimate the unknown parameters |$b, \pi , \tau$| in model (7) using Algorithm 1. From now on in all the tables, we denote by “|$\boldsymbol{\hat{T}}_N$|-rule,” “Sun and Cai,” “BH,” and “ABH,” respectively, the testing procedure based on estimated |$\lbrace \hat{T}_{i,N}, 1\le i \le K\rbrace$|⁠, the testing procedure developed in Sun and Cai (2007), the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995), and the adaptive Benjamini-Hochberg (Benjamini et al., 2006), which incorporates the plug-in estimate of |$\pi$| given by the method of Jin and Cai (2007). We used the R code implementation of the “Sun and Cai” procedure that was provided in Cai (2024).

Table 2 shows the results. We see that all the methods control the mFDR at the 0.05 nominal level except the “Sun and Cai” method, which has a slightly inflated mFDR level when the non-null proportion is small, |$\pi =0.1$|. This is because the estimation of the |$\mathrm{ locFDR}_0$| statistics in the “Sun and Cai” method uses a different approach from the pseudolikelihood approach we propose for |$\hat{\boldsymbol{T}}_N$|. As in the case of the oracle simulations, we get a substantial power benefit using |$\hat{\boldsymbol{T}}_2$| compared to the other methods.

TABLE 2

Results for different covariance structures (data driven) and different parameters of model (7). For each multiple testing procedure, we provide the false discovery rate (FDR), marginal false discovery rate (mFDR), expected true positives (TP), and their standard errors. Based on 500 data generations, with |$K=15000, b=0, \tau = \sqrt{2}$|. The |$\hat{\boldsymbol{T}}_N$|-rule columns use the estimated statistics |$\hat{T}_{i,N}$|. In bold, the most powerful procedure.

AR(1), pi = 0.3, rho = 0.8:
Method      | T_0-rule | T_1-rule | T_2-rule | Sun and Cai | BH     | ABH
mFDR        | 0.0362   | 0.0425   | 0.0377   | 0.0437      | 0.0355 | 0.0420
s.e.(mFDR)  | 0.0007   | 0.0003   | 0.0002   | 0.0007      | 0.0007 | 0.0007
FDR         | 0.0358   | 0.0425   | 0.0377   | 0.0432      | 0.0351 | 0.0415
s.e.(FDR)   | 0.0007   | 0.0003   | 0.0002   | 0.0007      | 0.0006 | 0.0007
TP          | 241.8    | 1100.3   | 1274.4   | 273.7       | 242.2  | 269.6
s.e.(TP)    | 1.1656   | 1.6391   | 2.0912   | 1.1889      | 1.1164 | 1.2358

Banded(1), pi = 0.3, rho = 0.5:
Method      | T_0-rule | T_1-rule | T_2-rule | Sun and Cai | BH     | ABH
mFDR        | 0.0353   | 0.0380   | 0.0422   | 0.0432      | 0.0351 | 0.0415
s.e.(mFDR)  | 0.0005   | 0.0004   | 0.0003   | 0.0006      | 0.0005 | 0.0005
FDR         | 0.0352   | 0.0380   | 0.0422   | 0.0430      | 0.0348 | 0.0413
s.e.(FDR)   | 0.0005   | 0.0004   | 0.0003   | 0.0005      | 0.0005 | 0.0005
TP          | 241.0    | 532.0    | 621.7    | 273.2       | 241.8  | 269.6
s.e.(TP)    | 1.0267   | 1.3419   | 1.2943   | 1.1536      | 1.1097 | 1.1918

AR(1), pi = 0.1, rho = 0.8:
Method      | T_0-rule | T_1-rule | T_2-rule | Sun and Cai | BH     | ABH
mFDR        | 0.0461   | 0.0491   | 0.0438   | 0.0542      | 0.0471 | 0.0495
s.e.(mFDR)  | 0.0018   | 0.0019   | 0.0018   | 0.0018      | 0.0017 | 0.0019
FDR         | 0.0435   | 0.0489   | 0.0439   | 0.0516      | 0.0441 | 0.0463
s.e.(FDR)   | 0.0017   | 0.0019   | 0.0019   | 0.0018      | 0.0017 | 0.0017
TP          | 41.3     | 353.5    | 394.8    | 46.5        | 43.7   | 45.2
s.e.(TP)    | 0.4515   | 1.2378   | 1.5819   | 0.4714      | 0.4880 | 0.4980

Banded(1), pi = 0.1, rho = 0.5:
Method      | T_0-rule | T_1-rule | T_2-rule | Sun and Cai | BH     | ABH
mFDR        | 0.0457   | 0.0425   | 0.0467   | 0.0549      | 0.0479 | 0.0507
s.e.(mFDR)  | 0.0015   | 0.0008   | 0.0007   | 0.0015      | 0.0015 | 0.0015
FDR         | 0.0451   | 0.0423   | 0.0466   | 0.0538      | 0.0462 | 0.0490
s.e.(FDR)   | 0.0015   | 0.0007   | 0.0006   | 0.0015      | 0.0015 | 0.0015
TP          | 41.5     | 137.4    | 188.1    | 46.3        | 43.3   | 44.7
s.e.(TP)    | 0.3878   | 0.6367   | 0.6770   | 0.4403      | 0.4561 | 0.4606
 
Remark 5

In Table 2, we conducted simulations using only AR(1) and Banded(1) covariance structures, both of which represent short-range dependencies. Our estimation method utilizes a pseudolikelihood approach, which is effective for short-range covariance structures even when the number of hypotheses |$K$| is relatively small. However, this estimation strategy may not perform well for moderate values of |$K$| when dealing with long-range or equicorrelated covariances. Achieving reliable performance with long-range covariance may require a very large |$K$|⁠, and in the case of equicorrelated covariance, parameter estimates may be inconsistent regardless of |$K$|⁠. Thus more assumptions are needed in order to estimate the parameters, and this is outside the scope of our manuscript. Assuming short-range dependence is reasonable and practical for many real-world applications, including GWAS.

5.3 Comparisons with the optimal locFDRs

In many practical GWAS situations, it is assumed that the LD matrices have a block diagonal structure, with block sizes ranging from roughly 50 to as large as roughly 2000. Furthermore, each block has a short-range dependence structure. We observed from extensive simulations that our method for fixed small |$N\in \lbrace 2,3 \rbrace$| already gives a significant power advantage, even though |$N$| is not chosen according to the LD block patterns. Below, we design a simulation study to demonstrate that we do not lose much power if we base our inference on the |$\mathrm{ locFDR}_2$| statistics instead of the optimal |$\mathrm{ locFDR}_K$| statistics. Here, we choose a block diagonal covariance matrix of block size |$B=7$|, and within each block we assume an equicorrelated covariance (ie, with entries as in (12), for various values of |$\rho$|) in model (7). The simulations are performed using the rejection cutoff specified in Equation (3). Under this setup, we can calculate the oracle |$\mathrm{ locFDR}_K$| statistics in |$O(K2^B)$| time, as shown in the sketch below. Table 3 describes the results for various parameter values. As expected, the highest possible power is achieved using the optimal test statistic |$\boldsymbol{T}_K$|. Using |$\boldsymbol{T}_0$| leads to substantially lower power, whereas using |$\boldsymbol{T}_2$| recovers a substantial proportion of the “optimal power” even though it is suboptimal (within a block, the optimal test statistic corresponds to |$N=6$| rather than |$N=2$|). Comparing the |$\mathrm{ locFDR}_N$| statistics (for small |$N$|) with the |$\mathrm{ locFDR}_K$| statistics is not very meaningful when the correlations among the z-scores are low, as the z-scores behave nearly independently in such cases, and the difference between |$\mathrm{ locFDR}_N$| and |$\mathrm{ locFDR}_K$| will be minimal. Hence the most interesting cases in Table 3 are the columns corresponding to |$\rho =0.5$| and |$\rho =0.7$|, where we observe a substantial power gain compared to the |${\bf T}_0$|-rule. Additional simulation results comparing the performance of the oracle locFDR and locFDR|$_N$| are provided in Section S3.2. The simulations show that the power gain diminishes as |$N$| increases. In certain settings, such as the AR(1) structure, the benefits of increasing |$N$| beyond 2 or 3 are negligible, whereas in other structures, like the equicorrelated model, the gains are small but noticeable. Ultimately, determining whether the modest increase in power justifies the additional computational effort is a decision left to the researcher.
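A minimal R sketch (ours; the function name block_equi_sigma is hypothetical) of this block diagonal covariance, from which the oracle statistics can be computed block by block, is:

```r
# Minimal sketch: block diagonal covariance with B x B equicorrelated blocks,
# built via a Kronecker product; used with K = 700 and B = 7 in Table 3.
block_equi_sigma <- function(K, B = 7, rho = 0.5) {
  stopifnot(K %% B == 0)
  block <- matrix(rho, B, B); diag(block) <- 1   # one equicorrelated block
  kronecker(diag(K / B), block)                  # K x K block diagonal matrix
}
```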

TABLE 3

Power, false discovery rate (FDR), and computation time (in seconds) of the oracle local false discovery rate (locFDR) rule (Optimal(|$\boldsymbol{T}_K$|)) and the locFDR|$_N$| rules (|$\boldsymbol{T}_0,\dots ,\boldsymbol{T}_5$|). The data were generated with a block diagonal covariance with block size 7, where each block has an equicorrelated covariance with correlation |$\rho$|; |$K=700, b=0, \tau = \sqrt{2.5}, \pi = 0.3$|. Based on 1000 data generations. In bold, the most powerful procedure.

rho = 0.3:
Method             | T_0    | T_1    | T_2    | T_3     | T_4     | T_5      | Optimal(T_K)
mFDR               | 0.0493 | 0.0494 | 0.0491 | 0.0498  | 0.0498  | 0.0496   | 0.0501
s.e.(mFDR)         | 0.0015 | 0.0014 | 0.0013 | 0.0013  | 0.0013  | 0.0012   | 0.0013
FDR                | 0.0494 | 0.0496 | 0.0491 | 0.0500  | 0.0502  | 0.0498   | 0.0503
s.e.(FDR)          | 0.0015 | 0.0014 | 0.0013 | 0.0013  | 0.0013  | 0.0012   | 0.0012
TP                 | 21.8   | 25.5   | 27.5   | 28.7    | 29.4    | 29.9     | 30.1
s.e.(TP)           | 0.1511 | 0.1587 | 0.1647 | 0.1658  | 0.1708  | 0.1695   | 0.1718
Computation (in s) | 0.2257 | 0.6942 | 2.6074 | 15.3609 | 90.2856 | 374.4175 | 14.0325
s.e.(Computation)  | 0.0014 | 0.0050 | 0.0160 | 0.2818  | 0.9156  | 1.5941   | 0.1121

rho = 0.5:
Method             | T_0    | T_1    | T_2    | T_3     | T_4     | T_5      | Optimal(T_K)
mFDR               | 0.0504 | 0.0499 | 0.0499 | 0.0495  | 0.0497  | 0.0501   | 0.0495
s.e.(mFDR)         | 0.0016 | 0.0012 | 0.0012 | 0.0011  | 0.0010  | 0.0010   | 0.0010
FDR                | 0.0499 | 0.0498 | 0.0497 | 0.0495  | 0.0498  | 0.0502   | 0.0496
s.e.(FDR)          | 0.0016 | 0.0012 | 0.0011 | 0.0011  | 0.0011  | 0.0010   | 0.0010
TP                 | 21.8   | 32.5   | 37.6   | 40.5    | 42.1    | 43.1     | 43.5
s.e.(TP)           | 0.1530 | 0.1724 | 0.1843 | 0.1930  | 0.1944  | 0.1995   | 0.2000
Computation (in s) | 0.2269 | 0.6879 | 2.5974 | 15.0927 | 91.5824 | 375.0319 | 13.9282
s.e.(Computation)  | 0.0014 | 0.0048 | 0.0152 | 0.3591  | 0.9558  | 1.5692   | 0.0950

rho = 0.7:
Method             | T_0    | T_1    | T_2    | T_3     | T_4     | T_5      | Optimal(T_K)
mFDR               | 0.0498 | 0.0497 | 0.0488 | 0.0485  | 0.0497  | 0.0493   | 0.0492
s.e.(mFDR)         | 0.0018 | 0.0010 | 0.0009 | 0.0009  | 0.0008  | 0.0008   | 0.0008
FDR                | 0.0479 | 0.0497 | 0.0488 | 0.0484  | 0.0497  | 0.0494   | 0.0492
s.e.(FDR)          | 0.0018 | 0.0010 | 0.0009 | 0.0009  | 0.0008  | 0.0008   | 0.0008
TP                 | 21.8   | 46.0   | 56.9   | 62.3    | 65.3    | 67.0     | 67.7
s.e.(TP)           | 0.1524 | 0.2069 | 0.2248 | 0.2336  | 0.2347  | 0.2387   | 0.2401
Computation (in s) | 0.2265 | 0.6858 | 2.5797 | 14.5598 | 89.7155 | 371.0658 | 13.7856
s.e.(Computation)  | 0.0014 | 0.0048 | 0.0160 | 0.2158  | 0.9357  | 1.6502   | 0.1015

6 UK BIOBANK HEIGHT DATA ANALYSIS

6.1 Data preparation

We have a centered phenotype vector |$\boldsymbol{y}^{n\times 1}$| and a centered genotype matrix |$X^{n \times p}$| (so the column means are zero) from |$n$| individuals and |$p$| different genotypes, henceforth SNPs, with |$n>p$|⁠. We want to find the SNPs that are conditionally associated with the phenotype given the other SNPs, while controlling the mFDR at level |$\alpha = 0.05$|⁠, using the |$\mathrm{ locFDR}_N$| statistics for |$N>0$|⁠. To compute the z-scores, we use the linear model |$\boldsymbol{y} = X\boldsymbol{\beta } + \boldsymbol{\epsilon }$|⁠. The least squares estimate of |$\boldsymbol{\beta }$| is |$\hat{\boldsymbol{\beta }} = (X^TX)^{-1}X^T\boldsymbol{y}$|⁠, with |$\hat{\boldsymbol{\beta }} \sim \mathcal {N}(\boldsymbol{\beta },\sigma ^2(X^TX)^{-1})$|⁠. Let |$S = \mathrm{ diag}((X^TX)^{-1})$|⁠, and let the estimated variance of |$\epsilon _i$| be |$\hat{\sigma }^2 = \frac{1}{n-p}{(\boldsymbol{y}-X\hat{\boldsymbol{\beta }})^T(\boldsymbol{y}-X\hat{\boldsymbol{\beta }})}$|⁠. Our z-score vector is |${\bf Z}= \frac{1}{\hat{\sigma }} S^{-1/2}\hat{\boldsymbol{\beta }}$|⁠, and the correlation matrix of the z-scores is |$\Sigma = S^{-1/2}(X^TX)^{-1}S^{-1/2}$|⁠. For the GWAS of height, we detail all data pre-processing steps in Section S4.1; the result is 486,573 z-scores |${\bf Z}$| and their correlation matrix |$\Sigma$|⁠.
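To make the construction concrete, the following minimal numpy sketch (function and variable names are ours) computes the z-scores and their correlation matrix from centered inputs:

```python
import numpy as np

def zscores_from_linear_model(y, X):
    """Z-scores and their correlation matrix from a centered phenotype
    vector y (n,) and a centered genotype matrix X (n, p), with n > p."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y                 # least squares estimate
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)         # estimated noise variance
    s = np.sqrt(np.diag(XtX_inv))                # square root of diag((X'X)^{-1})
    z = beta_hat / (np.sqrt(sigma2_hat) * s)     # Z = S^{-1/2} beta_hat / sigma_hat
    Sigma = XtX_inv / np.outer(s, s)             # S^{-1/2} (X'X)^{-1} S^{-1/2}
    return z, Sigma
```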

6.2 Data analysis

The analysis covers the 22 autosomal chromosomes. We assume that the test statistics in each chromosome follow the multiple-group model described in Section 4.3: each chromosome is partitioned into multiple independent blocks, and each block has its own set of z-scores and correlation structure, as detailed in Section S4.1.
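As an illustration, one simple way to obtain such a partition is to cut the index range wherever all cross-correlations with the remaining positions are negligible. This is a sketch under our own assumption that approximate independence corresponds to small cross-block correlations; the actual construction is in Section S4.1:

```python
import numpy as np

def split_into_blocks(Sigma, cutoff=0.01):
    """Cut the index range into blocks whenever every correlation between
    the current block and all later positions falls below `cutoff`."""
    K = Sigma.shape[0]
    blocks, start = [], 0
    for i in range(K - 1):
        if np.all(np.abs(Sigma[start:i + 1, i + 1:]) < cutoff):
            blocks.append(list(range(start, i + 1)))
            start = i + 1
    blocks.append(list(range(start, K)))
    return blocks
```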

6.2.1 Genome-wide analysis

We test all SNPs across the entire genome by applying Algorithm 2 at level |$\alpha = 0.05$|⁠. Figure 2 shows the number of rejections by our rules alongside the rejection counts of the competitor method of Sun and Cai (Cai and Sun, 2009). We excluded the BH (344 rejections) and ABH (403 rejections) methods from Figure 2, since they reject far fewer hypotheses than Sun and Cai and |$\hat{\boldsymbol{T}}_0$|⁠.

FIGURE 2

UpSet plot showing the number of significant single nucleotide polymorphisms (SNPs) identified by each method (rows), combining all chromosomes, together with the intersections (columns) of the hypotheses rejected by the different methods (only intersections of size at least 15 are shown).

As shown in Figure 2, the novel |$\hat{\boldsymbol{T}}_3$| method yields the most discoveries, and the number of discoveries increases with |$N$|⁠. The |$\hat{\boldsymbol{T}}_3$| method uniquely identified 125 SNPs. We conducted a manual search in the NHGRI-EBI GWAS Catalog (2024) for literature support for some of these uniquely identified SNPs. We identified 12 SNPs, which are listed in Table S7 along with their chromosome locations and P-values (for each SNP, the smallest P-value among all reported studies is tabulated). Notably, these SNPs have extremely small P-values, underscoring the importance of accounting for dependence in the analysis.

6.2.2 Per chromosome analysis

We applied Algorithm 2 at level |$\alpha = 0.05$| to each chromosome separately. Table 4 shows the number of rejections in each chromosome, alongside the number of rejections by the off-the-shelf competitors: Sun and Cai (Cai and Sun, 2009), BH, and ABH. The BH and ABH methods reject far fewer hypotheses than Sun and Cai and |$\hat{\boldsymbol{T}}_0$|⁠, because they are not designed to address group heterogeneity. All chromosomes except 18 and 22 exhibit a marked increase in the number of rejections when using the |$\boldsymbol{\hat{T}}_3$|-rule instead of the |$\boldsymbol{\hat{T}}_0$|-rule. In particular, chromosomes 10 and 21 show nearly double the number of rejections, suggesting that these chromosomes may harbor multiple distinct signals that are discovered once the highly correlated neighboring test statistics are taken into account, but are missed otherwise.

TABLE 4

Number of rejections by each method (columns) in each chromosome (rows); each chromosome is analyzed separately.

Chromosome   BH   ABH   Sun and Cai   |$\hat{\boldsymbol{T}}_0$|   |$\hat{\boldsymbol{T}}_1$|   |$\hat{\boldsymbol{T}}_2$|   |$\hat{\boldsymbol{T}}_3$|
chr1         32   32    50            51    57    69    72
chr2         24   29    43            42    45    50    55
chr3         18   23    34            36    37    41    51
chr4         24   25    40            36    43    42    41
chr5         33   36    52            55    59    64    64
chr6         21   23    37            36    45    48    49
chr7         19   22    32            29    33    36    37
chr8         12   13    28            22    27    27    29
chr9         25   27    39            39    36    37    42
chr10         6    6    18            17    22    25    31
chr11        15   17    26            26    30    31    33
chr12        23   27    39            37    42    50    56
chr13         8   14    19            13    15    16    18
chr14         3    3    13             9     8    13    13
chr15        21   21    34            38    43    47    50
chr16         3    5    15            16    15    18    17
chr17        12   15    21            25    28    35    39
chr18        16   17    26            29    28    27    27
chr19        34   35    40            41    47    49    51
chr20        11   15    19            23    29    33    31
chr21         2    2     9             7    11    11    13
chr22         0    0     3             1     1     1     0

7 DISCUSSION

In this paper, we suggest practical multiple-testing procedures for the general 2-group model. Since computing the oracle locFDRs is computationally intractable, approximation is necessary. We suggest finding an optimal solution in the restricted decision space |$\mathcal {C}_N$|⁠, and we show empirically that little is lost by concentrating on this smaller class. We provide practical algorithms. We apply our methodology to a GWAS of height and find more associations than existing practical methods. Our method works in GWAS without resorting to the extensive modeling assumptions and the heavy computational load of a full Bayesian framework such as that of Zhu and Stephens (2017).

We opted to focus on the class |$\mathcal {C}_N$|⁠, defined by physical neighborhoods, for the sake of simplicity and clarity. However, our framework can readily accommodate natural extensions with minimal adjustments. In some setups, for example under model (7), a possible approach is to define the |$i$|th locFDR statistic by conditioning on the |$N$| |$z$|-scores most highly correlated with |$Z_i$|⁠. Additionally, |$N$| may be chosen separately for each hypothesis |$i$|⁠, based on the strength of the correlation of its test statistic with the other test statistics.
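For instance, a minimal sketch of the correlation-based neighborhood, assuming a known z-score correlation matrix |$\Sigma$| (the function name is ours):

```python
import numpy as np

def top_correlated_neighbors(Sigma, i, N):
    """Indices of the N z-scores most correlated (in absolute value) with Z_i."""
    corr = np.abs(Sigma[i].copy())
    corr[i] = -np.inf                  # exclude Z_i itself
    return np.argsort(corr)[::-1][:N]  # descending order, take the top N
```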

Proposition 1 shows how to devise a method for mFDR control, optimal in |$\mathcal {C}_N$|⁠, using the statistics |$T_{i,N}$|⁠. If we want to control other error measures, such as the FDR, positive FDR (pFDR), or false discovery exceedance (FDX), and the entire joint distribution of the |$z$|-score vector is known, we can always define a policy of the form |$\boldsymbol{D}^t = (D_1^t({\bf Z}_{1,N}),\ldots ,D_K^t({\bf Z}_{K,N}))$|⁠, with |$D_i^t({\bf Z}_{i,N}) = \mathbf {I}(T_{i,N}\le t)$|⁠, and determine

|$t_{\alpha ,N} = \max \lbrace t : \mathrm{ Err}(\boldsymbol{D}^t) \le \alpha \rbrace ,$|

with |$\mathrm{ Err}\in \lbrace \mathrm{ FDR},\mathrm{ pFDR},\mathrm{ FDX}\rbrace$|⁠. This procedure |$\boldsymbol{D}^{t_{\alpha ,N}}$| controls the chosen error rate by construction. However, the single-threshold policy |$\boldsymbol{D}^{t_{\alpha ,N}}$| with these error rates need not enjoy an optimality result like that of Theorem 1. To find optimal solutions, we need to solve the generic problem, with |$\mathrm{ Err }\in \lbrace \mathrm{ FDR},\mathrm{ pFDR},\mathrm{ FDX}\rbrace$|⁠,

|$\max _{\boldsymbol{D} \in \mathcal {C}_N} \mathbb {E}\left(\sum _{i=1}^K D_i h_i\right) \text{ subject to } \mathrm{ Err}(\boldsymbol{D}) \le \alpha .$| (13)

Finding such a solution is difficult because error measures such as FDR, pFDR, and FDX do not decompose in the way the mFDR does, as shown in Equation (1) of Section S1 in the SM; finding an analogous solution for these error measures remains an open challenge.
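When the joint distribution is known, the threshold |$t_{\alpha ,N}$| of the single-threshold policy above can be approximated by Monte Carlo. The following sketch does this for |$\mathrm{ Err}=\mathrm{ FDR}$|; `simulate` and `locfdr_N` are placeholders for sampling |$({\bf Z},\boldsymbol{h})$| from the known joint model and evaluating the |$T_{i,N}$| statistics, not library functions:

```python
import numpy as np

def calibrate_threshold(simulate, locfdr_N, alpha, grid, B=1000):
    """Largest t in `grid` whose Monte Carlo FDR estimate is <= alpha."""
    fdp = {t: [] for t in grid}
    for _ in range(B):
        z, h = simulate()              # h[i] = 1 if hypothesis i is non-null
        T = locfdr_N(z)                # vector of T_{i,N} statistics
        for t in grid:
            R = T <= t                 # rejection set at threshold t
            V = ((h == 0) & R).sum()   # false discoveries
            fdp[t].append(V / max(R.sum(), 1))
    ok = [t for t in grid if np.mean(fdp[t]) <= alpha]
    return max(ok) if ok else None
```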

Sun and Cai (2009) developed a method that assumes the hypothesis states follow a hidden Markov model (HMM) and that the z-scores are independent given the hypothesis states; hence, their method does not use the correlation among the |$z$|-scores. Interestingly, Proposition 1 holds for any joint distribution of |$({\bf Z},\boldsymbol{h})$| on |$\mathbb {R}^K \times \lbrace 0,1\rbrace ^K$|⁠. Although it is commonly assumed that |$h_i \overset{\mathrm{ i.i.d.}}{\sim } \mathrm{ Ber}(\pi )$|⁠, we can instead assume that |$\boldsymbol{h}$| follows an HMM, as in Sun and Cai (2009), and draw inference based on the statistics |$T_{i,N}$| in that setting as well. An interesting extension of our approach is thus to allow an HMM structure on the hypothesis states in addition to the correlation among the |$z$|-scores. We expect that a practical solution can be derived for this setting; in particular, our estimation method extends naturally to the HMM-based model, which would broaden the applicability of our approach to additional real data scenarios.

The proposed approach for defining the |$\mathrm{ locFDR}_N$| statistic assumes a natural one-dimensional ordering of the hypothesis states. However, the method generalizes to spatial multiple testing. The optimal inference at a location |$s^{*}$| is based on the statistic |$\mathbb {P}_{\Psi }\left(\theta (s^{*}) = 0 \vert {\bf Z}^n\right)$|⁠, where |$\theta (s)$| indicates whether location |$s$| is truly non-null, |$\Psi$| denotes the parameters of the spatial model, and |${\bf Z}^n = (Z_1, \dots , Z_n)$| are the test statistics at all |$n$| locations. The approach of Sun et al. (2015) is to estimate the optimal decision functions, which can be computationally and statistically challenging for complex spatial models. An alternative is to base the inference at location |$s^{*}$| on the statistic |$\mathbb {P}_{\Psi _S}(\theta (s^{*}) = 0 \vert {\bf Z}^S)$|⁠, where |$S \subset \lbrace s_1, \dots , s_n\rbrace$|⁠, |${\bf Z}^S$| contains the observations only at the locations in |$S$|⁠, and |$\Psi _S$| (⁠|$\subset \Psi$|⁠) comprises the parameters of the spatial sub-model restricted to the region |$S$|⁠. The set |$S$| for a location of interest |$s^{*}$| can be chosen in various ways, such as taking the |$N$| nearest neighbors of |$s^{*}$| in Euclidean distance, or using prior knowledge.
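For example, choosing |$S$| as the |$N$| nearest neighbors of |$s^{*}$| in Euclidean distance is immediate with a KD-tree; a sketch, where `locations` is the assumed |$n \times d$| array of coordinates:

```python
from scipy.spatial import cKDTree

def nearest_neighbor_set(locations, s_star, N):
    """Indices of the N locations closest to s_star in Euclidean distance."""
    tree = cKDTree(locations)
    _, idx = tree.query(s_star, k=N)   # note: for N = 1, idx is a scalar
    return idx
```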

Another related extension is to use additional side information |$\boldsymbol{x}_i$| (alongside the z-scores |${\bf Z}$|⁠) available for the |$i$|th hypothesis state |$h_i$| to improve power. The model considered in Scott et al. (2015) can be formulated with |$\boldsymbol{x}_i$| as follows:

|$h_i \overset{\mathrm{ ind}}{\sim } \mathrm{ Ber}(c(\boldsymbol{x}_i)), \qquad Z_i \mid h_i \overset{\mathrm{ ind}}{\sim } (1-h_i)f_0 + h_i f_1,$|

where |$c(\boldsymbol{x}_i) = \dfrac{1}{1+\exp {(-\boldsymbol{x}_i^{^{\prime }} \boldsymbol{\beta })}}$|⁠. This model assumes that the z-scores are independent conditional on |$h_i$|⁠, and inference can be based on the estimated statistic |$\mathbb {P}(h_i = 1 \vert Z_i, \boldsymbol{x}_i)$| using the EM algorithm, as demonstrated in Scott et al. (2015). We can naturally extend this independent model to account for dependencies among the z-scores as follows:

|$h_i \overset{\mathrm{ ind}}{\sim } \mathrm{ Ber}(c(\boldsymbol{x}_i)), \qquad {\bf Z} \mid \boldsymbol{h} \sim F_{\boldsymbol{h}},$|

where |$F_{\boldsymbol{h}}$| is a joint distribution that does not factor into its marginals (eg, a multivariate normal with correlation matrix |$\Sigma$|⁠).
In this dependent model, the optimal inference is based on the statistic |$\mathbb {P}(h_i = 1 \vert {\bf Z}, \boldsymbol{x}_i)$|⁠, which presents computational challenges. Instead, it would be more feasible to adapt our method and make inferences using the statistic |$\mathbb {P}(h_i = 0 \vert {\bf Z}_{i,N}, \boldsymbol{x}_i)$| rather than |$\mathbb {P}(h_i = 0 \vert {\bf Z}, \boldsymbol{x}_i)$|⁠.
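To fix ideas, the following sketch computes |$\mathbb {P}(h_i = 0 \vert {\bf Z}_{i,N}, \boldsymbol{x}_i)$| by direct enumeration under one assumed working model: |$h_j \sim \mathrm{ Ber}(c(\boldsymbol{x}_j))$| independently, and |${\bf Z}_{i,N} \mid \boldsymbol{h}$| multivariate normal with mean |$\tau \boldsymbol{h}$| (restricted to the neighborhood) and known covariance. The mean structure, `tau`, and the inputs are our assumptions for illustration, not the paper's exact model:

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

def locfdr_N_with_covariates(z_iN, Sigma_iN, prior_iN, tau, center):
    """P(h_i = 0 | Z_{i,N}, x) by enumerating all 2^m neighborhood
    configurations; prior_iN[j] = c(x_j) for each neighbor j, and
    `center` is the position of hypothesis i within the neighborhood."""
    m = len(z_iN)
    num = den = 0.0
    for h in itertools.product([0, 1], repeat=m):
        h = np.array(h)
        prior = np.prod(np.where(h == 1, prior_iN, 1 - prior_iN))
        lik = multivariate_normal.pdf(z_iN, mean=tau * h, cov=Sigma_iN)
        den += prior * lik
        if h[center] == 0:
            num += prior * lik
    return num / den
```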

Our approach for parameter estimation in the general 2-group model is based on the pseudolikelihood. Under short-range dependence among the z-scores, the parameter estimates from our method are expected to converge (as |$K \rightarrow \infty$|⁠) to the true parameters. It would be interesting to investigate the performance of the pseudolikelihood approach under other, more general, dependence structures.
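A minimal sketch of the pseudolikelihood objective over |$N$|-neighborhoods, assuming a user-supplied `neighborhood_logpdf` for the log marginal density of |${\bf Z}_{i,N}$| implied by the working model with parameter vector `psi` (both names are ours):

```python
import numpy as np
from scipy.optimize import minimize

def neg_pseudo_loglik(psi, z, Sigma, N, neighborhood_logpdf):
    """Negative sum of the log marginal densities of the N-neighborhoods."""
    K = len(z)
    total = 0.0
    for i in range(K):
        lo, hi = max(0, i - N), min(K, i + N + 1)
        total += neighborhood_logpdf(psi, z[lo:hi], Sigma[lo:hi, lo:hi])
    return -total

# Example usage, where psi0 is an initial parameter guess:
# psi_hat = minimize(neg_pseudo_loglik, psi0,
#                    args=(z, Sigma, N, neighborhood_logpdf)).x
```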

A major observation from our simulation results is that taking even part of the (short-range) dependence into account improves power: we observe large power gains even when the dependence is limited in extent and moderate in strength. Hence, we recommend using rules based on the |$\mathrm{ locFDR}_N$| statistics, with |$N>0$|⁠, whenever the dependence is known or can be well approximated within neighborhoods.

ACKNOWLEDGMENTS

The authors thank the reviewers, associate editor, and editor for useful comments that improved the presentation of the paper. We utilized ChatGPT, an AI-driven natural language processing (NLP) tool, to assist with refining the English and correcting code in this paper.

FUNDING

This research was partially supported by the Israeli Science Foundation Grant ISF 2180/20.

CONFLICT OF INTEREST

None declared.

DATA AVAILABILITY

This research has been conducted using the UK Biobank Resource under Application Number 56885. To protect the privacy of the study participants, we are unable to share the data. Researchers interested in accessing the dataset should contact the UK Biobank directly. More information about the UK Biobank can be found at http://www.ukbiobank.ac.uk.

References

Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57, 289–300.

Benjamini, Y., Krieger, A. M., Yekutieli, D. (2006). Adaptive linear step-up procedures that control the false discovery rate. Biometrika, 93, 491–507.

Cai, T. T. (2024). Codes.

Cai, T. T., Sun, W. (2009). Simultaneous testing of grouped hypotheses: finding needles in multiple haystacks. Journal of the American Statistical Association, 104, 1467–1481.

Cox, D. R., Reid, N. (2004). A note on pseudolikelihood constructed from marginal densities. Biometrika, 91, 729–737.

Dempster, A. P., Laird, N. M., Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39, 1–22.

Efron, B. (2007). Correlation and large-scale simultaneous significance testing. Journal of the American Statistical Association, 102, 93–103.

Efron, B. (2008). Simultaneous inference: when should hypothesis testing problems be combined? The Annals of Applied Statistics, 2, 197–223.

Efron, B. (2010). Correlated z-values and the accuracy of large-scale statistical estimates. Journal of the American Statistical Association, 105, 1042–1055.

Efron, B., Tibshirani, R., Storey, J. D., Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96, 1151–1160.

Fan, J., Han, X. (2017). Estimation of the false discovery proportion with unknown dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 1143–1164.

Ferkingstad, E., Frigessi, A., Rue, H., Thorleifsson, G., Kong, A. (2008). Unsupervised empirical Bayesian multiple testing with external covariates. The Annals of Applied Statistics, 2, 714–735.

Genovese, C., Wasserman, L. (2002). Operating characteristics and extensions of the false discovery rate procedure. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 499–517.

Heller, R., Rosset, S. (2021). Optimal control of false discovery criteria in the two-group model. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83, 133–155.

Heller, R., Yekutieli, D. (2014). Replicability analysis for genome-wide association studies. The Annals of Applied Statistics, 8, 481–498.

Jin, J., Cai, T. T. (2007). Estimating the null and the proportion of nonnull effects in large-scale multiple comparisons. Journal of the American Statistical Association, 102, 495–506.

NHGRI-EBI GWAS Catalog (2024). NHGRI-EBI catalog of human genome-wide association studies. [Accessed 9 August 2024].

Schwartzman, A., Dougherty, R. F., Taylor, J. E. (2008). False discovery rate analysis of brain diffusion direction maps. The Annals of Applied Statistics, 2, 153.

Scott, J. G., Kelly, R. C., Smith, M. A., Zhou, P., Kass, R. E. (2015). False discovery rate regression: an application to neural synchrony detection in primary visual cortex. Journal of the American Statistical Association, 110, 459–471.

Shu, H., Nan, B., Koeppe, R. (2015). Multiple testing for neuroimaging via hidden Markov random field. Biometrics, 71, 741–750.

Stephens, M. (2017). False discovery rates: a new deal. Biostatistics, 18, 275–294.

Sun, W., Cai, T. T. (2007). Oracle and adaptive compound decision rules for false discovery rate control. Journal of the American Statistical Association, 102, 901–912.

Sun, W., Cai, T. T. (2009). Large-scale multiple testing under dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71, 393–424.

Sun, W., Reich, B. J., Cai, T. T., Guindani, M., Schwartzman, A. (2015). False discovery control in large-scale spatial multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77, 59–83.

Varin, C., Reid, N., Firth, D. (2011). An overview of composite likelihood methods. Statistica Sinica, 21, 5–42.

Xie, J., Cai, T. T., Maris, J., Li, H. (2011). Optimal false discovery rate control for dependent data. Statistics and its Interface, 4, 417.

Zhu, X., Stephens, M. (2017). Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. The Annals of Applied Statistics, 11, 1561.
