Summary

Two-sample hypothesis testing is a fundamental statistical problem for inference about two populations. In this paper, we construct a novel test statistic to detect high-dimensional distributional differences based on the max-sliced Wasserstein distance to mitigate the curse of dimensionality. By exploiting an intriguing link between the distance and suprema of empirical processes, we develop an effective bootstrapping procedure to approximate the null distribution of the test statistic. One distinctive feature of the proposed test is the ability to construct simultaneous confidence intervals for the max-sliced Wasserstein distances of projected distributions of interest. This enables not only the detection of global distributional differences but also the identification of significantly different marginal distributions between the two populations, without the need for additional tests. We establish the convergence of Gaussian and bootstrap approximations of the proposed test, based on which we show that the test is asymptotically valid and powerful as long as the considered max-sliced Wasserstein distance is adequately large. The merits of our approach are illustrated via simulated and real data examples.

1. Introduction

The two-sample hypothesis test for distributions, also known as the test of homogeneity, aims to identify whether two samples come from the same distribution. Formally, given two independent random samples |$ \smash{X_{1},\dots,X_{n_{1}}\stackrel{{\scriptstyle{\rm i.i.d.}}}{{\sim}}P} $| and |$ \smash{Y_{1},\dots,Y_{n_{2}}\stackrel{{\scriptstyle{\rm i.i.d.}}}{{\sim}}Q} $|⁠, the goal is to test the hypothesis

<display>\begin{align} H_{0}\colon P=Q\quad\text{versus}\quad H_{1}\colon P\neq Q,\tag{1} \end{align}</display>

where |$ P $| and |$ Q $| are two probability distributions on |$ {\mathbb{R}}^{p} $| for some positive dimension |$ p $|⁠, and |$ n_{1} $| and |$ n_{2} $| respectively denote the sizes of the two samples. While this classic problem has been extensively investigated in the literature, most existing solutions are challenged by the high dimensions of modern data of which dimension |$ p $| is often large and may grow with the sample sizes |$ n_{1} $| and |$ n_{2} $|⁠. This paper aims to address this challenge by developing a powerful test for (1) in the high-dimensional regime where dimension |$ p $| is allowed to be comparable to, or much larger than, the sample size.

Research on the two-sample problem dates back to the univariate run test (Wald & Wolfowitz, 1940), the Kolmogorov–Smirnov test (Smirnov, 1948) and the Cramér–von Mises test (Anderson, 1962) for the univariate case. The multivariate extensions of these methods can be found in Bickel (1969), Friedman & Rafsky (1979) and Kim et al. (2020), among others. Other classic approaches in the multivariate case include methods based on kernel density estimation (Anderson et al., 1994; Cao & Van Keilegom, 2006), density ratios (Hu & Lei, 2024) and the ball divergence (Pan et al., 2018). Additionally, graph-based methods have been extensively studied (Schilling, 1986; Henze, 1988; Rosenbaum, 2005; Biswas et al., 2014; Chen & Friedman, 2017; Bhattacharya, 2019, 2020).

In the high-dimensional realm, interpoint distance-based procedures are particularly popular, partially due to their computational efficiency and simplicity. For example, Székely & Rizzo (2004) proposed a nonparametric test based on the energy distance, and Biswas & Ghosh (2014) developed a test comparing between-group and within-group distances. In the seminal works of Gretton et al. (2007, 2012), a new distance between two distributions, nowadays known as the maximum mean discrepancy (MMD), was proposed and a test based on the MMD was developed. The link between the energy distance and MMD in the two-sample testing was studied by Sejdinovic et al. (2013). Recently, Zhu & Shao (2021), Gao & Shao (2023) and Yan & Zhang (2023) provided novel insights into the theoretical behaviour of the MMD and energy distance-based tests in high-dimensional settings. These distance-based methods are powerful in detecting mean and variance shifts that grow with |$ p $| in a polynomial order, but show relatively low power in detecting high-order distributional differences in high dimensions, and are observed to exhibit fast power decay with an increasing dimension against fair alternatives (Ramdas et al., 2015); see also § 4. This observation is consistent with the theoretical results of Gao & Shao (2023) and Yan & Zhang (2023), who showed that MMD-based tests require a larger |$ n $|-to-|$ p $| ratio in order to have nontrivial power for detecting higher-order moment differences; see also Remark 2.

The Wasserstein distance (Villani, 2009) is another popular measure of discrepancy between two distributions. Mathematically, the |$ q $|-Wasserstein distance (|$ q\geq 1 $|) is

<display>\begin{align*} W_{q}(P,Q)=\inf_{\gamma\in\Gamma(P,Q)}\biggl\{\int_{{\mathbb{R}}^{p}\times{\mathbb{R}}^{p}}\|x-y\|_{2}^{q}\,{\rm d}\gamma(x,y)\biggr\}^{1/q}, \end{align*}</display>
where |$ \smash{\|\cdot\|_{2}} $| is the Euclidean norm of |$ {\mathbb{R}}^{p} $| and |$ \Gamma(P,Q) $| represents the set of joint distributions on |$ {\mathbb{R}}^{p}\times{\mathbb{R}}^{p} $| with marginals |$ P $| and |$ Q $|⁠. For comparing univariate distributions, Ramdas et al. (2017) constructed a distribution-free test using the Wasserstein distance. In the multivariate case, this type of distance, despite its advantages in capturing distributional discrepancies and its abundant successful applications in machine learning such as generative models (Arjovsky et al., 2017), remains largely unexploited for hypothesis testing. One main challenge is that the empirical Wasserstein distance, although approximately computable by sublinear time algorithms (Ba et al., 2011) when the dimension is fixed, suffers from the curse of dimensionality (see Dereich et al., 2013 among others). For instance, the test proposed by Imaizumi et al. (2022), which relies on an approximation to the Wasserstein distance, exhibits a sharp degradation in convergence rate as the dimensionality increases. To alleviate the curse of dimensionality, alternatives to the Wasserstein distance were proposed, including sliced (Rabin et al., 2012) and smoothed (Goldfeld & Greenewald, 2020) Wasserstein distances. The study of their statistical convergence and distributional limits has gained increasing attention; see Nietert et al. (2021, 2022), Goldfeld et al. (2022), Sadhu et al. (2022), Xi & Niles-Weed (2022) and the references therein. In the two-sample setting, Nietert et al. (2021), Goldfeld et al. (2022) and Sadhu et al. (2022) proposed tests based on the smoothed Wasserstein distance. Wang et al. (2021) constructed a test based on the projected Wasserstein distance and the related concentration inequalities; however, they found that those inequalities led to conservative results and resorted to a permutation approach in practice. In summary, the development of powerful hypothesis tests utilizing the Wasserstein distance and its variants for high-dimensional data is still in its infancy.

A major methodological contribution of this paper is developing a novel test for the high-dimensional two-sample distribution problem based on the max-sliced Wasserstein distance (Deshpande et al., 2019) and bootstrapping. This type of distance, due to its various advantages, has started to attract broad attention, but is less explored for hypothesis testing. Utilizing the Kantorovich duality (Villani, 2009), we represent our test statistic, which is the empirical max-sliced Wasserstein distance, as the supremum of an empirical process indexed jointly by the 1-Lipschitz functions and one-dimensional projection directions in |$ {\mathbb{R}}^{p} $|⁠. This representation motivates us to adapt the technique of bootstrapping the max statistic for approximating the null distribution of our test statistic; see § 2. Computationally, we develop an efficient algorithm to compute the test statistic, which has a worst-case computational complexity of |$ O(n\log n+np+p^{2}) $| per iteration and an observed complexity of |$ O(n\log n+np) $| in practical scenarios. In addition, by a suitable choice of multipliers, the bootstrap counterparts of the test statistic can be computed in the same manner.

A distinctive feature of our test is that it can be equivalently conducted via constructing simultaneous confidence intervals for the max-sliced Wasserstein distances of projected marginals. With the simultaneous confidence intervals, the proposed test not only detects global distributional discrepancies but also identifies the subsets of coordinates on which the marginal distributions of |$ P $| and |$ Q $| are significantly different, without the need for additional tests. Another feature is that the proposed test enjoys substantially larger power in high dimensions against sparse alternatives, and its power degrades with dimension |$ p $| much more slowly, as evidenced by the numeric results in § 4.

As a major theoretical contribution, in § 3 we establish the convergence of Gaussian and bootstrap approximations for our test statistic, where dimension |$ p $| can be of any polynomial order of the sample size. To the best of our knowledge, these approximations are the first of their kind in relation to the Wasserstein distance family in the high-dimensional two-sample distribution problem. The convergence further implies that the proposed test is asymptotically valid and consistent as long as the considered max-sliced Wasserstein distance is adequately large. The theoretical analysis of our problem is nontrivial and challenging in several aspects. First, we encounter empirical processes indexed by the 1-Lipschitz function class that is not of Vapnik–Chervonenkis type. Consequently, existing results for Vapnik–Chervonenkis-type classes (for example, Chernozhukov et al., 2016) are not directly applicable. Second, fundamental techniques such as anticoncentration inequalities (Chernozhukov et al., 2015) and Gaussian lower-tail bounds (Lopes & Yao, 2022), which are key ingredients for high-dimensional central limit theorems (Chernozhukov et al., 2017, 2022; Lopes et al., 2020), require assumptions on the variance and/or correlation structures. However, these assumptions, such as a positive absolute lower bound on the minimum variance, are not satisfied by the empirical process under consideration. To address these issues, we develop a Gaussian comparison result to allow decaying variances and delicately construct a suitably large index subset that possesses certain desirable properties, making it possible to leverage the aforementioned tools and the refined Gaussian comparison result.

2. Two-sample tests based on the max-sliced Wasserstein distance

2.1. Test statistics and bootstrapping

Let |$ X\sim P $| and |$ Y\sim Q $| for two distributions |$ P $| and |$ Q $| on |$ {\mathbb{R}}^{p} $|. The 1-Wasserstein distance (hereafter, the Wasserstein distance) between |$ P $| and |$ Q $|, which captures the intrinsic geometry of the space of distributions and is sensitive to the discrepancy of distributions, can be represented by

<display>\begin{align*} W_{1}(X,Y)=\sup_{f\in{\rm Lip}_{1}}[{E}\{f(X)\}-{E}\{f(Y)\}] \end{align*}</display>
according to the Kantorovich duality, where |$ {\rm Lip}_{1} $| denotes the class of 1-Lipschitz functions. Here, a function |$ f $| is a 1-Lipschitz function if |$ \sup_{x\neq y}{|f(y)-f(x)|}/{\|x-y\|_{2}}\leq 1 $|⁠, where |$ \|\cdot\|_{2} $| stands for the canonical Euclidean norm of |$ {\mathbb{R}}^{p} $|⁠. The 1-Wasserstein distance is zero if and only if |$ P=Q $|⁠. It is then possible to construct a test for (1) based on the empirical version of |$ W_{1}(X,Y) $|⁠. However, a drawback of this straightforward approach in the high-dimensional setting is that the empirical version converges rather slowly, especially for large |$ p $| (Dereich et al., 2013; Fournier & Guillin, 2015; Lei, 2020).

To tackle the curse of dimensionality, an intuitive approach is to reduce the dimension, for example, by projecting both |$ X $| and |$ Y $| onto a direction |$ v\in\mathcal{V}=\{u\in{\mathbb{R}}^{p}\colon\|u\|_{2}=1\} $| and then constructing a test based on the Wasserstein distance |$ W_{1}(v^{\mathrm{\scriptscriptstyle T}}X,v^{\mathrm{\scriptscriptstyle T}}Y) $| of the one-dimensional projected variables |$ v^{\mathrm{\scriptscriptstyle T}}X $| and |$ v^{\mathrm{\scriptscriptstyle T}}Y $|. It is sensible to select the direction |$ v $| that maximizes the Wasserstein distance |$ W_{1}(v^{\mathrm{\scriptscriptstyle T}}X,v^{\mathrm{\scriptscriptstyle T}}Y) $| so that the resulting distance is more sensitive to the discrepancy between |$ P $| and |$ Q $|. Specifically, we consider the distance

<display>\begin{align*} D(X,Y)=\sup_{v\in\mathcal{V}}W_{1}(v^{\mathrm{\scriptscriptstyle T}}X,v^{\mathrm{\scriptscriptstyle T}}Y), \end{align*}</display>
which is termed the max-sliced Wasserstein distance (Deshpande et al., 2019). The following lemma asserts that |$ D(X,Y) $| vanishes if and only if |$ P=Q $|⁠.

 
Lemma 1.

We have |$ D(X,Y)\geq 0 $| for |$ X\sim P $| and |$ Y\sim Q $|, where the equality holds if and only if |$ P=Q $|.

Recall that the marginal distribution of |$ X=(X^{1},\ldots,X^{p})\in{\mathbb{R}}^{p} $| over a subset |$ \mathcal{S}\subset\{1,\ldots,p\} $| of coordinates refers to the joint distribution of |$ \{X^{k}\colon k\in\mathcal{S}\} $| obtained by restricting |$ P $| to the coordinates in |$ \mathcal{S} $|. For a direction |$ v $| vanishing in a subset of its coordinates, distance |$ W_{1}(v^{\mathrm{\scriptscriptstyle T}}X,v^{\mathrm{\scriptscriptstyle T}}Y) $| involves only the marginal distributions of |$ P $| and |$ Q $| over those nonzero coordinates. In high dimensions, it is often reasonable to expect that, when |$ P\neq Q $|, there are only a few coordinates on which the marginal distributions of |$ P $| and |$ Q $| differ. This motivates us to introduce an |$ \ell_{0} $| sparsity constraint on the direction |$ v $|. Specifically, for a tuning parameter |$ s $|, we restrict the class of directions to |$ \mathcal{V}_{0,s}=\{v\in{\mathbb{R}}^{p}\colon\|v\|_{2}=1,\,\|v\|_{0}\leq s\} $|, where |$ \|v\|_{0} $| denotes the number of nonzero elements of |$ v $|. The set |$ \mathcal{V}_{0,s} $| is nonempty when |$ s\geq 1 $| and the sparsity constraint |$ \|v\|_{0}\leq s $| is active when |$ s \lt p $|.

To implement the above idea, given independent random samples |$ \smash{X_{1},\dots,X_{n_{1}}\stackrel{{\scriptstyle{\rm i.i.d.}}}{{\sim}}P} $| and |$ \smash{Y_{1},\dots,Y_{n_{2}}\stackrel{{\scriptstyle{\rm i.i.d.}}}{{\sim}}Q} $|⁠, we consider the empirical |$ s $|-sparse max-sliced Wasserstein distance

<display>\begin{align} D_{n}=\sup_{v\in\mathcal{V}_{0,s}}\sup_{f\in{\rm Lip}_{1}}\biggl\{\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}f(v^{\mathrm{\scriptscriptstyle T}}X_{i})-\frac{1}{n_{2}}\sum_{j=1}^{n_{2}}f(v^{\mathrm{\scriptscriptstyle T}}Y_{j})\biggr\}\tag{2} \end{align}</display>

and propose the test statistic

<display>\begin{align*} T_{n}=\{n_{1}n_{2}/(n_{1}+n_{2})\}^{1/2}D_{n}. \end{align*}</display>
For a given significance level |$ \alpha\in(0,1) $|⁠, we then reject the null hypothesis of (1) if the observed |$ T_{n} $| exceeds its |$ 1-\alpha $| quantile under the null hypothesis.
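To make this concrete, the following minimal Python sketch approximates the statistic by a random search over |$ s $|-sparse unit directions, with the inner supremum over 1-Lipschitz functions computed as the univariate Wasserstein distance between projected samples. The function name sparse_msw_statistic and the random-search scheme are illustrative assumptions, not the projected subgradient algorithm of § 2.3 or the authors' released implementation.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def sparse_msw_statistic(X, Y, s, n_directions=2000, seed=0):
    # Approximate D_n in (2) by a random search over s-sparse unit
    # directions; for each candidate v, the supremum over 1-Lipschitz f
    # equals the univariate 1-Wasserstein distance between the projected
    # samples, computed here via scipy.
    rng = np.random.default_rng(seed)
    n1, p = X.shape
    n2 = Y.shape[0]
    d_n = 0.0
    for _ in range(n_directions):
        support = rng.choice(p, size=s, replace=False)
        v = np.zeros(p)
        v[support] = rng.standard_normal(s)
        v /= np.linalg.norm(v)
        d_n = max(d_n, wasserstein_distance(X @ v, Y @ v))
    return np.sqrt(n1 * n2 / (n1 + n2)) * d_n  # the statistic T_n
```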

 
Remark 1.

The |$ \ell_{0} $| constraint will not affect the characteristic property of the resulting distance |$ \check{D}(X,Y)=\sup_{v\in\mathcal{V}_{0,s}}W_{1}(v^{{\mathrm{\scriptscriptstyle T}}}X,v^{{\mathrm{\scriptscriptstyle T}}}Y) $| when |$ s $| is chosen properly. For example, if we take any |$ s\geq\min\{\|\check{v}\|_{0}\colon\check{v}\in\operatorname*{arg\,sup}_{v\in\mathcal{V}}W_{1}(v^{{\mathrm{\scriptscriptstyle T}}}X,v^{{\mathrm{\scriptscriptstyle T}}}Y)\} $| then |$ \check{D}(X,Y)=D(X,Y) $| and thus Lemma 1 holds for |$ \check{D}(X,Y) $|. Otherwise, |$ \check{D}(X,Y)\leq D(X,Y) $|, so that the validity of the test remains intact. In general, a smaller |$ s $| leads to a smaller class of considered directions, which may reduce the detection capability of the test and thus its power. On the other hand, while a larger value of |$ s $| could increase the detection capability, our theoretical investigation suggests that a larger value of |$ s $| may reduce the rate in approximating the distribution of the test statistic |$ T_{n} $|. In practice, we select an appropriate |$ s $| in a data-driven manner, as discussed at the end of this section. We also explore the popular |$ \ell_{1} $| sparsity constraint, whose numeric results are similar to those of the |$ \ell_{0} $| constraint in our experiments. Compared with the |$ \ell_{0} $| constraint, the |$ \ell_{1} $| constraint is computationally more tractable, but results in a restrictive condition on |$ p $| in relation to |$ n $| in our theoretical analysis. We relegate the details related to the |$ \ell_{1} $| constraint to the Supplementary Material.

It remains to approximate the quantiles of |$ T_{n} $| under the null hypothesis |$ H_{0} $|. A key observation is that the test statistic |$ T_{n} $| takes the form of the supremum of an empirical process indexed by both |$ v\in\mathcal{V}_{0,s} $| and |$ f\in{\rm Lip}_{1} $|, which motivates us to adopt a bootstrap strategy to approximate its null distribution and quantile function. While it is possible to bootstrap |$ D_{n} $| with Gaussian multipliers, we adopt multinomial multipliers, which ease the computation discussed in § 2.3, as follows. Let |$ M=(M_{1},\dots,M_{n_{1}}) $| be sampled from the multinomial distribution |$ \mathrm{Mu}(n_{1};1/n_{1},\ldots,1/n_{1}) $| with |$ n_{1} $| trials and equal event probabilities |$ 1/n_{1} $|, and let |$ M^{\prime}=(M^{\prime}_{1},\dots,M^{\prime}_{n_{2}}) $| be sampled from |$ \mathrm{Mu}(n_{2};1/n_{2},\ldots,1/n_{2}) $|, where |$ M $| and |$ M^{\prime} $| are independent. Define the bootstrap counterpart of |$ D_{n} $| by

<display>\begin{align} D^{\star}=\sup_{v\in\mathcal{V}_{0,s}}\sup_{f\in{\rm Lip}_{1}}\biggl\{\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}(M_{i}-1)f(v^{\mathrm{\scriptscriptstyle T}}X_{i})-\frac{1}{n_{2}}\sum_{j=1}^{n_{2}}(M^{\prime}_{j}-1)f(v^{\mathrm{\scriptscriptstyle T}}Y_{j})\biggr\}\tag{3} \end{align}</display>

and let |$ T^{\star}=\{n_{1}n_{2}/(n_{1}+n_{2})\}^{1/2}D^{\star} $|⁠. The quantiles of |$ T_{n} $| can then be approximated through bootstrapping |$ T^{\star} $|⁠. Specifically, given a sufficiently large integer |$ B $|⁠, for each |$ b=1,\dots,B $|⁠, we independently draw |$ (M_{1}^{b},\dots,M_{n_{1}}^{b}) $| and |$ (M_{1}^{\prime b},\dots,M_{n_{2}}^{\prime b}) $|⁠, and then compute |$ T^{\star,b} $|⁠. For a significance level |$ \alpha $|⁠, we reject the null hypothesis if |$ T_{n} \gt \hat{q}_{T^{\star}}(1-\alpha) $|⁠, where |$ \hat{q}_{T^{\star}}(1-\alpha) $| is the empirical quantile function of |$ T^{\star,1},\dots,T^{\star,B} $| and serves as an estimate of the quantile function of |$ T_{n} $|⁠.
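A minimal sketch of this bootstrap loop is given below. It assumes a user-supplied collection `directions` of |$ s $|-sparse unit vectors (for example, from the random search sketched earlier), and it exploits the discrete-distribution representation of |$ D^{\star} $| derived in § 2.3 below, under which each replicate reduces to weighted univariate Wasserstein distances.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def bootstrap_quantile(X, Y, directions, alpha=0.05, B=300, seed=0):
    # Multinomial-multiplier bootstrap for T*: each replicate reweights
    # the pooled projected samples as in Section 2.3 and reduces to
    # weighted univariate 1-Wasserstein distances.
    rng = np.random.default_rng(seed)
    n1, n2 = len(X), len(Y)
    scale = np.sqrt(n1 * n2 / (n1 + n2))
    t_star = np.empty(B)
    for b in range(B):
        M = rng.multinomial(n1, np.full(n1, 1.0 / n1))
        Mp = rng.multinomial(n2, np.full(n2, 1.0 / n2))
        # Probability masses of the two discrete distributions whose
        # max-sliced Wasserstein distance equals D*/2.
        pi = np.concatenate([M / (2.0 * n1), np.full(n2, 1.0 / (2.0 * n2))])
        pi_p = np.concatenate([np.full(n1, 1.0 / (2.0 * n1)), Mp / (2.0 * n2)])
        d_star = 0.0
        for v in directions:
            z = np.concatenate([X @ v, Y @ v])
            d_star = max(d_star, 2.0 * wasserstein_distance(z, z, pi, pi_p))
        t_star[b] = scale * d_star
    return np.quantile(t_star, 1.0 - alpha)
```

The test then rejects the null hypothesis when the observed statistic exceeds the returned quantile estimate.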

2.2. Simultaneous confidence intervals

We can construct the approximate |$ 1-\alpha $| one-sided simultaneous confidence intervals (SCIs) for |$ {E}\{f(v^{\mathrm{\scriptscriptstyle T}}X)\}-{E}\{f(v^{\mathrm{ \scriptscriptstyle T}}Y)\} $| for |$ f\in{\rm Lip}_{1} $| and |$ v\in\mathcal{V}_{0,s} $|⁠, given by

<display>\begin{align} {\rm SCI}(f,v)=\biggl[\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}f(v^{\mathrm{\scriptscriptstyle T}}X_{i})-\frac{1}{n_{2}}\sum_{j=1}^{n_{2}}f(v^{\mathrm{\scriptscriptstyle T}}Y_{j})-\Bigl(\frac{n_{1}+n_{2}}{n_{1}n_{2}}\Bigr)^{1/2}\hat{q}_{T^{\star}}(1-\alpha),\ \infty\biggr).\tag{4} \end{align}</display>

These SCIs are asymptotically valid according to the theory in § 3 below and Proposition S2 in the Supplementary Material. Since |$ T_{n} \gt \hat{q}_{T^{\star}}(1-\alpha) $| if and only if |$ 0\notin{\rm SCI}(f,v) $| for some |$ f\in{\rm Lip}_{1} $| and |$ v\in\mathcal{V}_{0,s} $|⁠, we can alternatively and equivalently conduct the test using these SCIs, specifically by rejecting the null hypothesis if |$ 0\notin{\rm SCI}(f,v) $| for some |$ f\in{\rm Lip}_{1} $| and |$ v\in\mathcal{V}_{0,s} $|⁠; see the Supplementary Material for further details. When the null hypothesis is rejected, the nonzero coordinates of the optimal direction |$ \hat{v} $|⁠, which maximizes |$ D_{n} $| in (2), can provide insights into the locations of distributional differences.

Moreover, the SCIs help us identify directions |$ v $| along which the resulting univariate distributions are significantly different. Specifically, given a projection vector |$ v\in\mathcal{V}_{0,s} $|⁠, at the significance level |$ \alpha $|⁠, if |$ 0\notin{\rm SCI}(f,v) $| for some |$ f\in{\rm Lip}_{1} $| then we conclude that |$ v^{{\mathrm{\scriptscriptstyle T}}}X $| and |$ v^{{\mathrm{\scriptscriptstyle T}}}Y $| have significantly different distributions. In fact, an approximate |$ 1-\alpha $| one-sided SCI for the Wasserstein distance |$ W_{1}(v^{{\mathrm{\scriptscriptstyle T}}}X,v^{{\mathrm{\scriptscriptstyle T}}}Y) $| between |$ v^{{\mathrm{\scriptscriptstyle T}}}X $| and |$ v^{{\mathrm{\scriptscriptstyle T}}}Y $| for each |$ v\in\mathcal{V}_{0,s} $| is provided by

<display>\begin{align} {\rm SCI}\{W_{1}(v^{{\mathrm{\scriptscriptstyle T}}}X,v^{{\mathrm{\scriptscriptstyle T}}}Y)\}=\biggl[\hat{W}_{1}(v^{{\mathrm{\scriptscriptstyle T}}}X,v^{{\mathrm{\scriptscriptstyle T}}}Y)-\Bigl(\frac{n_{1}+n_{2}}{n_{1}n_{2}}\Bigr)^{1/2}\hat{q}_{T^{\star}}(1-\alpha),\ \infty\biggr),\tag{5} \end{align}</display>

with |$ \smash{\hat{W}_{1}(v^{{\mathrm{\scriptscriptstyle T}}}X,v^{{\mathrm{ \scriptscriptstyle T}}}Y)=\sup_{f\in{\rm Lip}_{1}}\{({1}/{n_{1}})\sum_{i=1}^{n _{1}}f(v^{{\mathrm{\scriptscriptstyle T}}}X_{i})-({1}/{n_{2}})\sum_{j=1}^{n_{2 }}f(v^{{\mathrm{\scriptscriptstyle T}}}Y_{j})\}} $|⁠. The event |$ 0\not\in{\rm SCI}\{W_{1}(v^{{\mathrm{\scriptscriptstyle T}}}X,v^{{\mathrm{ \scriptscriptstyle T}}}Y)\} $| is then equivalent to |$ 0\not\in{\rm SCI}(f,v) $| for some |$ f\in{\rm Lip}_{1} $|⁠, suggesting a distributional difference between |$ X $| and |$ Y $| along direction |$ v $|⁠. Immediate examples of directions of interest are the canonical unit vectors |$ e_{1},\ldots,e_{p} $|⁠, where |$ e_{k} $| is the |$ p $|-dimensional vector having 1 in the |$ k $|th coordinate and 0 elsewhere. By taking |$ v=e_{k} $| in (5), we can use the interval |$ {\rm SCI}\{W_{1}(e^{{\mathrm{\scriptscriptstyle T}}}_{k}X,e^{{\mathrm{ \scriptscriptstyle T}}}_{k}Y)\} $| to detect whether there is a significant difference between the marginal univariate distributions of |$ P $| and |$ Q $| on the |$ k $|th coordinate. Other than the canonical directions, practitioners may choose their projection directions of interest for their applications at hand.
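As an illustration, the sketch below computes the coordinate-wise lower confidence bounds implied by (5) for the canonical directions; the helper name marginal_sci_lower_bounds is hypothetical, and q_hat denotes a bootstrap quantile estimate such as the one produced by the earlier sketch.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def marginal_sci_lower_bounds(X, Y, q_hat):
    # Lower endpoints of the one-sided SCIs (5) along the canonical
    # directions e_1, ..., e_p; q_hat is a bootstrap estimate of the
    # (1 - alpha) quantile of T_n under the null. Coordinate k is flagged
    # as significantly different when the lower endpoint exceeds zero.
    n1, n2 = len(X), len(Y)
    margin = np.sqrt((n1 + n2) / (n1 * n2)) * q_hat
    w1 = np.array([wasserstein_distance(X[:, k], Y[:, k])
                   for k in range(X.shape[1])])
    return w1 - margin
```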

More generally, for each subset |$ \mathcal{S}\subset\{1,\ldots,p\} $| of the coordinates such that the set |$ \mathcal{V}_{0,s,\mathcal{S}}=\{(v_{1},\ldots,v_{p})^{\top}\in\mathcal{V}_{0,s }\colon v_{k}=0\text{ for }k\not\in\mathcal{S}\} $| is not empty, we can construct an approximate |$ 1-\alpha $| one-sided SCI for the max-sliced Wasserstein distance |$ D(P_{\mathcal{S}},Q_{\mathcal{S}}) $| of the marginal distributions |$ P_{\mathcal{S}} $| and |$ Q_{\mathcal{S}} $| of |$ P $| and |$ Q $| over the coordinates in |$ \mathcal{S} $|⁠:

<display>\begin{align} {\rm SCI}(P_{\mathcal{S}},Q_{\mathcal{S}})=\biggl[\hat{D}_{0,s}(P_{\mathcal{S}},Q_{\mathcal{S}})-\Bigl(\frac{n_{1}+n_{2}}{n_{1}n_{2}}\Bigr)^{1/2}\hat{q}_{T^{\star}}(1-\alpha),\ \infty\biggr),\tag{6} \end{align}</display>

with |$ \smash{\hat{D}_{0,s}(P_{\mathcal{S}},Q_{\mathcal{S}})=\sup_{v\in\mathcal{V}_{0 ,s,\mathcal{S}}}\hat{W}_{1}(v^{{\mathrm{\scriptscriptstyle T}}}X,v^{{\mathrm{ \scriptscriptstyle T}}}Y)} $|⁠. If the interval |$ {\rm SCI}(P_{\mathcal{S}},Q_{\mathcal{S}}) $| does not contain zero then we conclude that there is a significant difference between the marginal distributions |$ P_{\mathcal{S}} $| and |$ Q_{\mathcal{S}} $|⁠; see § 5 below for an illustration. Although, due to the |$ \ell_{0} $| constraint, set |$ \mathcal{V}_{0,s,\mathcal{S}} $| may not contain all projection directions involving only the coordinates in |$ \mathcal{S} $|⁠, the SCI in (6) is still valid, that is, with probability asymptotically at least |$ 1-\alpha $|⁠, the distance |$ D(P_{\mathcal{S}},Q_{\mathcal{S}}) $| is contained in |$ {\rm SCI}(P_{\mathcal{S}},Q_{\mathcal{S}}) $|⁠. The above feature of identifying significant directions and marginal distributions without additional tests, not readily provided by the existing two-sample testing methods, is advantageous in practice. Further discussions on this point, as well as our consideration of employing bootstrapping rather than a permutation method, are available in the Supplementary Material.
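Since a direction in |$ \mathcal{V}_{0,s,\mathcal{S}} $| is simply an |$ s $|-sparse unit vector supported on |$ \mathcal{S} $|, the lower endpoint of the SCI in (6) can be computed by running a direction search on the data restricted to |$ \mathcal{S} $|. A sketch, assuming a generic `fit_direction` routine that returns a sparse unit direction (the random search sketched in § 2.1, or the subgradient method of § 2.3, would both serve):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def subset_sci_lower_bound(X, Y, coords, s, fit_direction, q_hat):
    # Lower endpoint of the one-sided SCI (6) for D(P_S, Q_S): restrict
    # the data to the coordinates in `coords` and search for the best
    # sparse direction there. `fit_direction(Xs, Ys, s)` is any routine
    # returning a unit vector with at most s nonzero entries.
    Xs, Ys = X[:, coords], Y[:, coords]
    v = fit_direction(Xs, Ys, min(s, len(coords)))
    d_hat = wasserstein_distance(Xs @ v, Ys @ v)
    n1, n2 = len(X), len(Y)
    return d_hat - np.sqrt((n1 + n2) / (n1 * n2)) * q_hat
```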

2.3. Computation

Thanks to the multinomial multipliers, the bootstrap statistics |$ T^{\star,1},\ldots,T^{\star,B} $| can be represented as the distances between two discrete distributions as follows, and can thus be computed essentially in the same manner as the test statistic |$ T_{n} $|. We equivalently rewrite (3) as

<display>\begin{align*} D^{\star}=2\sup_{v\in\mathcal{V}_{0,s}}\sup_{f\in{\rm Lip}_{1}}\biggl\{\sum_{i=1}^{n_{1}}\frac{M_{i}}{2n_{1}}f(v^{\mathrm{\scriptscriptstyle T}}X_{i})+\sum_{j=1}^{n_{2}}\frac{1}{2n_{2}}f(v^{\mathrm{\scriptscriptstyle T}}Y_{j})-\sum_{i=1}^{n_{1}}\frac{1}{2n_{1}}f(v^{\mathrm{\scriptscriptstyle T}}X_{i})-\sum_{j=1}^{n_{2}}\frac{M^{\prime}_{j}}{2n_{2}}f(v^{\mathrm{\scriptscriptstyle T}}Y_{j})\biggr\}. \end{align*}</display>
Since |$ M_{i},M^{\prime}_{i}\geq 0 $|⁠, |$ \smash{\sum_{i=1}^{n_{1}}M_{i}=n_{1}} $| and |$ \smash{\sum_{i=1}^{n_{2}}M^{\prime}_{i}=n_{2}} $|⁠, quantity |$ D^{\star}/2 $| can be viewed as the max-sliced Wasserstein distance between two discrete distributions with probability masses |$ \{\pi_{i}\}_{i=1}^{n} $| and |$ \{\pi_{i}^{\prime}\}_{i=1}^{n} $|⁠, respectively, where |$ \pi_{i}=M_{i}/(2n_{1}) $| for |$ i=1,\dots,n_{1} $|⁠, |$ \pi_{n_{1}+j}=1/(2n_{2}) $| for |$ j=1,\dots,n_{2} $|⁠, |$ \pi_{i}^{\prime}=1/(2n_{1}) $| for |$ i=1,\dots,n_{1} $| and |$ \pi_{n_{1}+j}^{\prime}=M^{\prime}_{j}/(2n_{2}) $| for |$ j=1,\dots,n_{2} $| and |$ n=n_{1}+n_{2} $|⁠.

It remains to develop a practical algorithm for computing the max-sliced Wasserstein distance between two discrete distributions. We utilize the equivalent formula of |$ W_{1} $| for univariate distributions in terms of quantile functions to facilitate the computation, and adopt the projected subgradient descent method. The optimization requires care, as it involves a nontrivial projection onto the intersection of an |$ \ell_{0} $| ball and the |$ \ell_{2} $| unit sphere. For example, via numeric experiments we found that simple iterative hard thresholding (Blumensath & Davies, 2009; Jain et al., 2014) does not produce satisfactory estimates. To address this computational issue, we propose a sequential |$ \ell_{1} $| approximation strategy to find a solution satisfying both the |$ \ell_{0} $| and |$ \ell_{2} $| constraints; see the Supplementary Material for algorithmic details. Our numeric results show that this strategy has promising performance in practice. The proposed computational method has a worst-case complexity of |$ O(n\log n+np+p^{2}) $| per iteration, comprising the subgradient calculation with complexity |$ O(n\log n+np) $| and the projection with complexity |$ O(p^{2}) $|. In practice, the observed complexity is |$ O(n\log n+np) $| per iteration, because the projection algorithm has an observed complexity of |$ O(p) $| rather than |$ O(p^{2}) $|, as demonstrated by Liu et al. (2020).
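For illustration, the sketch below performs one subgradient ascent step in the special case |$ n_{1}=n_{2} $|, where the quantile-function formula pairs the order statistics of the projected samples; the exact top-|$ s $| projection used here is a deliberately simplified stand-in for the sequential |$ \ell_{1} $| approximation strategy, whose details are in the Supplementary Material.

```python
import numpy as np

def subgradient_step(X, Y, v, s, step):
    # One projected-subgradient ascent step on v -> W1_hat(v'X, v'Y) for
    # n1 = n2, where the quantile formula gives W1_hat as the average of
    # |x_(i) - y_(i)| over sorted projected samples. With the sorting
    # permutations held fixed, a subgradient is the signed average of the
    # paired observation differences.
    ix = np.argsort(X @ v)
    iy = np.argsort(Y @ v)
    diff = X[ix] - Y[iy]                      # pairs of order statistics
    grad = np.mean(np.sign(diff @ v)[:, None] * diff, axis=0)
    w = v + step * grad
    # Simplified projection onto the s-sparse part of the unit sphere:
    # keep the s largest entries in magnitude and renormalize.
    keep = np.argsort(np.abs(w))[-s:]
    v_new = np.zeros_like(w)
    v_new[keep] = w[keep]
    return v_new / np.linalg.norm(v_new)
```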

2.4. Parameter tuning

In practice, we adopt |$ k $|-fold cross-validation to select a value of parameter |$ s $| for achieving higher power. Specifically, we randomly divide each sample into |$ k $| folds, namely, |$ \mathcal{D}_{1}^{X},\dots,\mathcal{D}_{k}^{X}\subset\{1,\dots,n_{1}\} $| and |$ \mathcal{D}_{1}^{Y},\dots,\mathcal{D}_{k}^{Y}\subset\{1,\dots,n_{2}\} $|. Let |$ \hat{v}_{s,-r} $| be the optimal solution of (S2) in the Supplementary Material using parameter |$ s $| and the data other than |$ \mathcal{D}_{r}^{X} $| and |$ \mathcal{D}_{r}^{Y} $|. We choose |$ s $| out of a set |$ \Lambda $| of candidate values by maximizing the average of the empirical Wasserstein distances between the projected univariate distributions induced by |$ \hat{v}_{s,-1},\ldots,\hat{v}_{s,-k} $| on the held-out samples, that is,

<display>\begin{align*} \hat{s}=\operatorname*{arg\,max}_{s\in\Lambda}\frac{1}{k}\sum_{r=1}^{k}\sup_{f\in{\rm Lip}_{1}}\biggl\{\frac{1}{|\mathcal{D}_{r}^{X}|}\sum_{i\in\mathcal{D}_{r}^{X}}f(\hat{v}_{s,-r}^{\mathrm{\scriptscriptstyle T}}X_{i})-\frac{1}{|\mathcal{D}_{r}^{Y}|}\sum_{j\in\mathcal{D}_{r}^{Y}}f(\hat{v}_{s,-r}^{\mathrm{\scriptscriptstyle T}}Y_{j})\biggr\}, \end{align*}</display>
where |$ |\mathcal{D}| $| denotes the cardinality of |$ \mathcal{D} $|⁠. This cross-validation procedure is numerically effective, as demonstrated in § 4 below.
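A sketch of this selection rule, again assuming a generic `fit_direction` routine standing in for the optimizer of (S2):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def select_sparsity(X, Y, candidates, fit_direction, k=5, seed=0):
    # k-fold cross-validation for s: fit the optimal direction on the
    # training folds, then score it by the empirical W1 between the
    # projected held-out samples, averaged over folds.
    rng = np.random.default_rng(seed)
    folds_x = np.array_split(rng.permutation(len(X)), k)
    folds_y = np.array_split(rng.permutation(len(Y)), k)
    scores = {}
    for s in candidates:
        vals = []
        for r in range(k):
            train_x = np.setdiff1d(np.arange(len(X)), folds_x[r])
            train_y = np.setdiff1d(np.arange(len(Y)), folds_y[r])
            v = fit_direction(X[train_x], Y[train_y], s)
            vals.append(wasserstein_distance(X[folds_x[r]] @ v,
                                             Y[folds_y[r]] @ v))
        scores[s] = float(np.mean(vals))
    return max(scores, key=scores.get)
```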

3. Theoretical analysis

In the sequel, the |$ \psi_{1} $|-Orlicz norm is denoted and defined by |$ \|X\|_{\psi_{1}}\,{=}\,\inf\{t\,{ \gt }\,0\colon{E}[\exp(|X|/t)]\leq 2\} $|⁠. For a probability measure |$ m $| on |$ {\mathbb{R}}^{p} $| and a real-valued measurable function |$ f $|⁠, write |$ \|f\|_{m,q}=\{\int|f(z)|^{q}\,{\rm d}m(z)\}^{1/q} $| and |$ \|f\|_{\infty}=\sup_{z\in{\mathbb{R}}^{p}}|f(z)| $|⁠. Define the pseudometric |$ e_{m}(f,g)=\|f-g\|_{m,2} $|⁠; in our development, we primarily take |$ m=(P+Q)/2 $|⁠. For a random variable |$ U $|⁠, its distribution is denoted |$ \mathcal{L}(U) $|⁠. The Kolmogorov distance between two random variables |$ U $| and |$ V $| is |$ d_{K}\{\mathcal{L}(U),\mathcal{L}(V)\}=\sup_{t\in{\mathbb{R}}}|{\mathrm{pr}}(U \leq t)-{\mathrm{pr}}(V\leq t)| $|⁠. For |$ \epsilon \gt 0 $|⁠, an |$ \epsilon $| net of a pseudometric space |$ (S,d) $| is a subset of |$ S $| such that, for every |$ z\in S $|⁠, there exists a point |$ z_{\epsilon} $| with |$ d(z,z_{\epsilon}) \lt \epsilon $| in the subset, where |$ d $| is the pseudometric on |$ S $|⁠. The |$ \epsilon $|-covering number |$ N(\epsilon,S,d) $| of |$ S $| is the infimum of the cardinality of all |$ \epsilon $| nets of |$ S $|⁠. For a stochastic process |$ \mathbb{G} $| indexed by |$ \mathcal{A} $|⁠, define |$ \mathbb{G}\mathcal{A}=\sup_{f\in\mathcal{A}}\mathbb{G}f $| and |$ \|\mathbb{G}\|_{\mathcal{A}}=\sup_{f\in\mathcal{A}}|\mathbb{G}f| $|⁠. For two sequences |$ \{a_{n}\} $| and |$ \{b_{n}\} $| of nonnegative real numbers, write |$ a_{n}\lesssim b_{n} $| or |$ a_{n}=O(b_{n}) $| if there is a constant |$ c \gt 0 $| and an integer |$ n_{0}\geq 1 $|⁠, not depending on |$ n $|⁠, such that |$ a_{n}\leq cb_{n} $| for all |$ n\geq n_{0} $|⁠. We write |$ a_{n}\asymp b_{n} $| if |$ a_{n}\lesssim b_{n} $| and |$ b_{n}\lesssim a_{n} $|⁠. The notation |$ a_{n}\ll b_{n} $| means that |$ a_{n}/b_{n}\to 0 $| as |$ n\to\infty $|⁠.

Let |$ \mathcal{G}=\{f\circ v\colon f\in\mathcal{F},\,v\in\mathcal{V}_{0,s}\} $|, where |$ \mathcal{F}=\{f\colon{\mathbb{R}}\to{\mathbb{R}}\mid f\text{ is 1-Lipschitz and }f(0)=0\} $| and |$ v $| also denotes the induced linear function |$ v(x)=v^{\mathrm{\scriptscriptstyle T}}x $| for all |$ x\in{\mathbb{R}}^{p} $|. Consider the following empirical process indexed by |$ g\in\mathcal{G} $| and its supremum:

<display>\begin{align*} {\mathbb{G}}_{n}g=\Bigl(\frac{n_{1}n_{2}}{n_{1}+n_{2}}\Bigr)^{1/2}\Bigl[\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}g(X_{i})-\frac{1}{n_{2}}\sum_{j=1}^{n_{2}}g(Y_{j})-{E}\{g(X)\}+{E}\{g(Y)\}\Bigr],\qquad {\mathbb{G}}_{n}\mathcal{G}=\sup_{g\in\mathcal{G}}{\mathbb{G}}_{n}g. \end{align*}</display>
As |$ T_{n}={\mathbb{G}}_{n}\mathcal{G} $| under the null hypothesis |$ H_{0} $|⁠, it is sufficient to study the distribution of |$ {\mathbb{G}}_{n}\mathcal{G} $|⁠. For this purpose, we require the following conditions.

 
Assumption 1.

For some constant |$ \nu \gt 0 $|, |$ \nu_{1}=\sup_{\|u\|_{2}=1}\|X^{{\mathrm{\scriptscriptstyle T}}}u\|_{\psi_{1}}\leq\nu $| and |$ \nu_{2}=\sup_{\|u\|_{2}=1}\|Y^{{\mathrm{\scriptscriptstyle T}}}u\|_{\psi_{1}}\leq\nu $|.

 
Assumption 2.

For universal constants |$ 0 \lt c_{1}\leq c_{2} \lt 1 $|⁠, |$ n_{1}/(n_{1}+n_{2})\in[c_{1},c_{2}] $|⁠.

The subexponential tail condition in Assumption 1 is common in high-dimensional statistics (Negahban et al., 2012; Lopes et al., 2020; Lin et al., 2023). Assumption 2 does not require the ratio |$ n_{1}/n_{2} $| to converge, contrasting with the standard assumption (e.g., Pan et al., 2018; Zhu & Shao, 2021; Gao & Shao, 2023; Yan & Zhang, 2023) in the literature of two-sample tests. Assumption 3 below is satisfied in many settings, for example, when at least one coordinate of |$ X $| has a probability density that is bounded away from zero on an open interval. This assumption may be relaxed to accommodate the case that all coordinates of |$ X $| and |$ Y $| are discrete random variables; however, this relaxation requires much more involved arguments without adding further insight, and is therefore not pursued here.

 
Assumption 3.

There exists one projection |$ v_{\circ}\in\mathcal{V}_{0,s} $|⁠, a constant |$ C_{\circ} \gt 0 $| and a measurable subset |$ S\subset{\mathbb{R}} $| containing an open interval, such that |$ {\mathrm{pr}}(v_{\circ}^{\mathrm{\scriptscriptstyle T}}X\in A\cap S)\geq C_{ \circ}{\mathrm{Leb}}(A\cap S) $| for all Borel sets |$ A\subset{\mathbb{R}} $| or |$ {\mathrm{pr}}(v_{\circ}^{\mathrm{\scriptscriptstyle T}}Y\in A\cap S)\geq C_{ \circ}{\mathrm{Leb}}(A\cap S) $| for all Borel sets |$ A\subset{\mathbb{R}} $|⁠, where |$ {\mathrm{Leb}}(\cdot) $| denotes the Lebesgue measure.

Class |$ \mathcal{F} $| and hence class |$ \mathcal{G} $| are not compact under the norm |$ \|\cdot\|_{\infty} $|⁠, which impedes applications of the empirical process theory toolbox for studying properties of |$ {\mathbb{G}}_{n}\mathcal{G} $|⁠. A key to overcoming this hurdle is the following lemma showing that, with high probability, it is sufficient to consider the sequence of smaller classes |$ \mathcal{F}_{n}=\{f\in\mathcal{F}\colon f(x)=f(-\kappa_{n})\text{ if }x \lt - \kappa_{n},\,f(x)=f(\kappa_{n})\text{ if }x \gt \kappa_{n}\} $| and the corresponding classes |$ \mathcal{G}_{n}=\{f\circ v\colon f\in\mathcal{F}_{n},\,v\in\mathcal{V}_{0,s}\} $| for the sequence of positive constants |$ \kappa_{n}=2s^{1/2}\nu\log(pn) $| with |$ n=n_{1}+n_{2} $|⁠.

 
Lemma 2.

Under Assumption 1, with probability at least |$ 1-2/n $|⁠, |$ {\mathbb{G}}_{n}\mathcal{G}={\mathbb{G}}_{n}\mathcal{G}_{n} $|⁠.

Following the implication of Lemma 2, we consider the centred Gaussian process |$ {\mathbb{Z}} $| indexed by |$ g\in\mathcal{G}_{n} $| with covariance

<display>\begin{align*} {E}({\mathbb{Z}}g_{1}\,{\mathbb{Z}}g_{2})=\frac{n_{2}}{n_{1}+n_{2}}\,{\rm cov}\{g_{1}(X),g_{2}(X)\}+\frac{n_{1}}{n_{1}+n_{2}}\,{\rm cov}\{g_{1}(Y),g_{2}(Y)\}\quad(g_{1},g_{2}\in\mathcal{G}_{n}). \end{align*}</display>
Also, let |$ G(\cdot) $| be a measurable envelope function of |$ \mathcal{G}_{n} $|⁠, that is, |$ \smash{G(x)\geq\sup_{g\in\mathcal{G}_{n}}|g(x)|} $| for all |$ x\in{\mathbb{R}}^{p} $|⁠.

We begin with upper bounding the Gaussian approximation error |$ d_{K}\{\mathcal{L}({\mathbb{G}}_{n}\mathcal{G}),\mathcal{L}({\mathbb{Z}}\mathcal{G}_{n})\} $|. An effective way of dealing with suprema of empirical processes is through discretization (Chernozhukov et al., 2014, 2016). Specifically, let |$ \mathcal{I}=\{g_{1},\dots,g_{N}\} $| be an |$ \epsilon\|G\|_{(P+Q)/2,2} $| net of |$ \mathcal{G}_{n} $| under the pseudometric |$ e_{(P+Q)/2} $|, where |$ N=|\mathcal{I}| $|; see Lemma S1 in the Supplementary Material. Define |$ \mathcal{G}_{\epsilon}=\{f-g\colon f,g\in\mathcal{G}_{n},\,e_{(P+Q)/2}(f,g)\leq\epsilon\|G\|_{(P+Q)/2,2}\} $|. For any |$ \eta \gt 0 $|, the error is then characterized by the Gaussian approximation error |$ d_{K}\{\mathcal{L}({\mathbb{G}}_{n}\mathcal{I}),\mathcal{L}({\mathbb{Z}}\mathcal{I})\} $|, the discretization errors |$ {\mathrm{pr}}(\|{\mathbb{Z}}\|_{\mathcal{G}_{\epsilon}} \gt \eta) $| and |$ {\mathrm{pr}}(\|{\mathbb{G}}_{n}\|_{\mathcal{G}_{\epsilon}} \gt \eta) $|, and the anticoncentration bound |$ \sup_{t\in{\mathbb{R}}}{\mathrm{pr}}(|{\mathbb{Z}}\mathcal{I}-t|\leq\eta) $|, that is,

<display>\begin{align*} d_{K}\{\mathcal{L}({\mathbb{G}}_{n}\mathcal{G}),\mathcal{L}({\mathbb{Z}}\mathcal{G}_{n})\}\lesssim d_{K}\{\mathcal{L}({\mathbb{G}}_{n}\mathcal{I}),\mathcal{L}({\mathbb{Z}}\mathcal{I})\}+{\mathrm{pr}}(\|{\mathbb{Z}}\|_{\mathcal{G}_{\epsilon}} \gt \eta)+{\mathrm{pr}}(\|{\mathbb{G}}_{n}\|_{\mathcal{G}_{\epsilon}} \gt \eta)+\sup_{t\in{\mathbb{R}}}{\mathrm{pr}}(|{\mathbb{Z}}\mathcal{I}-t|\leq\eta)+\frac{1}{n}, \end{align*}</display>
where the term |$ 1/n $| stems from Lemma 2.

The discretization errors can be handled by tools for empirical processes, for example, Dudley’s maximal inequality and Talagrand’s inequality for empirical processes (van der Vaart & Wellner, 1996; Wainwright, 2019). To analyse the other two terms, a bottleneck is the lack of a lower bound on the variances, since |$ \min_{j=1,\dots,N}{E}\{({\mathbb{Z}}g_{j})^{2}\} $| could be arbitrarily small, which hinders the application of anticoncentration inequalities. To address this issue, our strategy is to find a smaller subset |$ \mathcal{J}=\{g\in\mathcal{I}\colon{E}[({\mathbb{Z}}g)^{2}]\geq\tau_{n}^{2}\} $| over which the variances of the induced variables are lower bounded by a decaying constant |$ \tau_{n}^{2} $|. The problem is then translated into bounding the discrepancy between |$ \sup_{t\in{\mathbb{R}}}{\mathrm{pr}}(|{\mathbb{Z}}\mathcal{J}-t|\leq\eta) $| and |$ \sup_{t\in{\mathbb{R}}}{\mathrm{pr}}(|{\mathbb{Z}}\mathcal{I}-t|\leq\eta) $|, and further into bounding the Gaussian lower tail |$ {\mathrm{pr}}({\mathbb{Z}}\mathcal{J}\leq t) $| for some |$ t \gt 0 $|. However, this typically requires conditions on the correlation structure (Lopes et al., 2020; Lopes & Yao, 2022). For this, we turn to constructing a further subset |$ \mathcal{J}^{\circ}\subset\mathcal{J} $| in view of |$ {\mathrm{pr}}({\mathbb{Z}}\mathcal{J}\leq t)\leq{\mathrm{pr}}({\mathbb{Z}}\mathcal{J}^{\circ}\leq t) $|, such that |$ \mathcal{J}^{\circ} $| is sufficiently large and possesses the desired correlation structure; see Lemma S2 in the Supplementary Material. Moreover, by adapting various techniques and ideas from the high-dimensional statistics literature, we develop a Gaussian comparison result that allows the minimum variance to decay to zero; see Lemma S7 in the Supplementary Material. All of these lead to the following results on the convergence of Gaussian and bootstrap approximations.

 
Theorem 1 (Gaussian approximation).
Suppose that Assumptions 1–3 hold. If |$ s\lesssim 1 $| and |$ p\lesssim n^{\gamma} $| for any |$ 0\leq\gamma \lt \infty $| then

<display>\begin{align*} d_{K}\{\mathcal{L}({\mathbb{G}}_{n}\mathcal{G}),\mathcal{L}({\mathbb{Z}}\mathcal{G}_{n})\}\to 0\quad\text{as } n\to\infty. \end{align*}</display>

Let |$ \mathcal{D}=\{X_{1},\ldots,X_{n_{1}},Y_{1},\ldots,Y_{n_{2}}\} $| denote the data, and let |$ \mathcal{L}(T^{\star}\mid\mathcal{D}) $| be the conditional distribution of |$ T^{\star} $| given the data. The following theorem controls the bootstrap approximation error between |$ T^{\star} $| and |$ {\mathbb{Z}}\mathcal{G}_{n} $|.

 
Theorem 2 (Bootstrap approximation).
Suppose that Assumptions 1–3 hold. If |$ s\lesssim 1 $| and |$ p\lesssim n^{\gamma} $| for any |$ 0\leq\gamma \lt \infty $| then

<display>\begin{align*} d_{K}\{\mathcal{L}(T^{\star}\mid\mathcal{D}),\mathcal{L}({\mathbb{Z}}\mathcal{G}_{n})\}\to 0\quad\text{in probability as } n\to\infty. \end{align*}</display>

The above theorems jointly imply that |$ d_{K}\{\mathcal{L}({\mathbb{G}}_{n}\mathcal{G}),\mathcal{L}(T^{\star}\mid \mathcal{D})\}\to 0 $| in probability, where dimension |$ p $| is allowed to grow at any polynomial order of the sample size |$ n $|⁠. To the best of our knowledge, these approximation results are the first of their kind in relation to the Wasserstein distance and its variants for the two-sample distribution problem with dimension |$ p $| allowed to be much larger than |$ n $|⁠. The convergence of |$ d_{K}\{\mathcal{L}({\mathbb{G}}_{n}\mathcal{G}),\mathcal{L}(T^{\star}\mid \mathcal{D})\} $| implies that the SCIs constructed in § 2 are asymptotically valid. In addition, Theorem 3 below is a direct corollary of Theorems 1 and 2. It implies that the size of the proposed test is asymptotically controlled at the nominal level.

 
Theorem 3 (Validity).
Suppose that Assumptions 1–3 hold. If |$ s\lesssim 1 $| and |$ p\lesssim n^{\gamma} $| for any |$ 0\leq\gamma \lt \infty $| then, under the null hypothesis,

<display>\begin{align*} {\mathrm{pr}}\{T_{n} \gt q_{T^{\star}}(1-\alpha)\}\to\alpha\quad\text{as } n\to\infty, \end{align*}</display>

where|$ q_{T^{\star}}(\cdot) $|  is the quantile function of|$ T^{\star} $|⁠.

The theorem below investigates the power of the proposed test against local alternatives, where the difference between |$ P $| and |$ Q $| is characterized by the |$ s $|-sparse max-sliced Wasserstein distance |$ \smash{\sup_{g\in\mathcal{G}}[{E}\{g(X)\}-{E}\{g(Y)\}]} $|⁠. It shows that the test is consistent even when the considered difference between |$ P $| and |$ Q $| decreases with |$ n $| on the order of |$ n^{-1/2}\{s\log(pn)\}^{1/2} $|⁠.

 
Theorem 4 (Consistency).

Suppose that Assumptions 1 and 2 hold and that |$ n^{-1}s^{2}\log^{4}(pn)\lesssim 1 $|. If there is some sufficiently large constant |$ c_{a} \gt 0 $| such that |$ \sup_{g\in\mathcal{G}}[{E}\{g(X)\}-{E}\{g(Y)\}]\geq c_{a}n^{-1/2}\{s\log(pn)\}^{1/2} $| then the null hypothesis will be rejected with probability tending to 1.

The condition |$ n^{-1}s^{2}\log^{4}(pn)\,{\lesssim}\,1 $| automatically holds if |$ s\,{\lesssim}\,1 $| and |$ p\,{\lesssim}\,n^{\gamma} $| for any |$ 0\leq\gamma\,{ \lt }\,\infty $|⁠. Given the prevalent sparsity principle, in high-dimensional contexts a few directions/coordinates are often sufficient for detecting the essential distributional discrepancy of the two samples. Therefore, it is reasonable to focus on a small and bounded |$ s $| in practice, although our theoretical analysis indeed allows |$ s $| to grow with |$ n $| at a certain rate.

The sparsity parameter |$ s $| plays a role in all of the above theorems; however, it has different impacts on the size and power. For the size, any fixed value of |$ s $| leads to an asymptotically valid test. In terms of the power, on the one hand, set |$ \mathcal{V}_{0,s} $| with a larger |$ s $| can capture a broader range of potential differences of the distributions. On the other hand, a larger |$ s $| increases the estimation complexity, variability and difficulty. This difficulty is also implied by Theorem 4, which indicates that a larger |$ s $| requires a stronger detection signal for consistency. If the major difference between |$ P $| and |$ Q $| involves only a few coordinates, which is often the case in high-dimensional contexts, then a small value of |$ s $| is sufficient to detect such a difference, resulting in faster convergence and higher power. For example, if there exists a coordinate on which the marginal distributions of |$ P $| and |$ Q $| are different, then |$ s=1 $| is sufficient. The effect of parameter |$ s $| on the test performance is further explored through the numeric studies in the Supplementary Material.

 
Remark 2.

Both Gao & Shao (2023) and Yan & Zhang (2023) indicate that, when the distributional discrepancy lies in high-order moments, MMD-based tests require |$ n $| to grow much faster relative to dimension |$ p $| in order to have nontrivial power (that is, power larger than the significance level). In contrast, our results show that the max-sliced Wasserstein distance approach does not require this additional explicit constraint on |$ p $| and |$ n $| for detecting discrepancies in moments of any order, as long as the considered |$ s $|-sparse max-sliced Wasserstein distance is adequately large. Such a feature of the max-sliced Wasserstein distance is also demonstrated by the numeric results in the Supplementary Material.

 
Remark 3.

Ramdas et al. (2015) claimed that kernel-based hypothesis tests suffer from decaying power against fair alternatives in high dimensions. This aligns with the theoretical findings of Gao & Shao (2023) and Yan & Zhang (2023), which show that, when |$ P $| and |$ Q $| differ in the mean, variance or higher-order moments, consistency of the MMD method requires the magnitude of the difference to grow with |$ p $| in a polynomial order. For instance, if |$ P $| and |$ Q $| differ only in the mean then the MMD method requires |$ \|\mu_{P}-\mu_{Q}\|\gtrsim n^{-1/2}p^{1/4} $| in order to have nontrivial power (Yan & Zhang, 2023), where |$ \mu_{P} $| and |$ \mu_{Q} $| are respectively the means of |$ P $| and |$ Q $|⁠. For our test, according to Theorem 4, consistency only requires the distributional difference to be of the order |$ n^{-1/2}\{s\log(pn)\}^{1/2} $|⁠, indicating a logarithmic dependence on |$ p $| for a bounded |$ s $|⁠. See Table 1 in § 4 below for a numeric demonstration.

Table 1: The power under different dimensions with |$ n_{1}=n_{2}=250 $|

Model  p    Proposed  MMD-G  MMD-L  ED2    BG     PW
A      60   1.000     0.923  0.923  0.917  0.033  0.687
A      200  1.000     0.573  0.573  0.577  0.057  0.363
A      500  0.997     0.327  0.327  0.323  0.040  0.143
B      60   1.000     0.950  0.923  0.530  1.000  1.000
B      200  1.000     0.167  0.163  0.080  0.870  0.970
B      500  1.000     0.080  0.080  0.070  0.543  0.480
C      60   0.980     0.993  0.997  0.360  0.153  1.000
C      200  0.823     0.183  0.157  0.100  0.097  0.370
C      500  0.567     0.107  0.113  0.100  0.083  0.123
D      60   0.973     0.933  1.000  0.300  0.393  1.000
D      200  0.943     0.070  0.073  0.063  0.087  0.997
D      500  0.493     0.067  0.070  0.067  0.050  0.157

Proposed, our proposed method; MMD-G, maximum mean discrepancy with Gaussian kernel; MMD-L, maximum mean discrepancy with Laplacian kernel; |$ \mathrm{ED}_{2} $|, energy distance in Székely & Rizzo (2004); BG, the method of Biswas & Ghosh (2014); PW, projected Wasserstein distance in Wang et al. (2021).


4. Simulation study

In this section, we investigate the numerical performance of the proposed method and compare it with other high-dimensional nonparametric tests, including the method of Biswas & Ghosh (2014), the method of Wang et al. (2021), the energy distance (Székely & Rizzo, 2004; Zhu & Shao, 2021) and the MMD (Gretton et al., 2012) with Gaussian and Laplacian kernels. We refer to these as BG, PW, |$ \mathrm{ED}_{2} $|, MMD-G and MMD-L, respectively. The studentized MMD (Gao & Shao, 2023) and the sliced Wasserstein distance with permutation have performance comparable to that of the MMD and are thus not included here for conciseness. As in Gretton et al. (2012), for the MMD we choose the bandwidth with the median heuristic, that is, the median distance between points in the aggregated samples. For PW, we set the projection dimension to 3, consistent with the projection dimension adopted in the original paper. We take |$ B=300 $| bootstrap replicates.
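For reference, the median-heuristic bandwidth can be computed as follows; this is the standard recipe rather than code from the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_bandwidth(X, Y):
    # Median heuristic: the median pairwise Euclidean distance among the
    # pooled samples, used as the kernel bandwidth for the MMD.
    return np.median(pdist(np.vstack([X, Y])))
```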

Let |$ 0_{p} $| and |$ 1_{p} $| represent the |$ p $|-dimensional vectors with all entries equal to 0 and 1, respectively. Let |$ I_{p} $| denote the |$ p\times p $| identity matrix. We consider four different types of model.

  • (i)

    Model A (mean shift). Let |$ X\,{\sim}\,N(0_{p},\Sigma) $| and |$ Y\,{\sim}\,N(\mu,\Sigma) $|⁠, where |$ \Sigma\,{=}\,(r_{ij})_{i,j=1}^{p} $| with |$ r_{ij}=0.5^{|i-j|} $|⁠, and |$ \mu=(\mu_{1},\dots,\mu_{p})^{{\mathrm{\scriptscriptstyle T}}} $| with |$ \mu_{j}=0.8\beta j^{-3} $|⁠. This is similar to a Gaussian simulation setting of Gao & Shao (2023) except that here |$ \mu_{j} $| decays with |$ j $|⁠.

  • (ii)

    Model B (variance shift). Let |$ X\sim N(0_{p},\Sigma) $| and |$ Y\sim N(0_{p},\Sigma^{\prime}) $|⁠, where |$ \Sigma $| is defined in Model A and |$ \Sigma^{\prime} $| is defined by replacing the diagonal entries of |$ \Sigma $| with |$ \Sigma^{\prime}_{jj}=1\,+\,8\beta j^{-3} $|⁠.

  • (iii)

    Model C (same mean and variance, different marginal distributions). Let |$ X\,{\sim}\,0.5N(-1_{p},I_{p})+0.5N(1_{p},I_{p}) $| and |$ Y\,{=}\,(Y_{1}^{{\mathrm{\scriptscriptstyle T}}},Y_{2}^{{\mathrm{ \scriptscriptstyle T}}})^{{\mathrm{\scriptscriptstyle T}}} $| with independent |$ Y_{1} $| and |$ Y_{2} $| sampled from |$ Y_{1}\sim N(0_{s_{0}},2R) $| and |$ \smash{Y_{2}\sim 0.5N(-1_{p-s_{0}},I_{p-s_{0}})+0.5N(1_{p-s_{0}},I_{p-s_{0}})} $|⁠, where |$ R=\smash{(r_{ij})_{i,j=1}^{s_{0}}} $| and |$ r_{ij}=0.5^{|i-j|} $|⁠. Set |$ s_{0}\,{=}\,\left\lfloor 10\beta\right\rfloor $|⁠, where |$ \left\lfloor\cdot\right\rfloor $| denotes the integer part of a number.

  • (iv)

    Model D (same univariate marginals, different joint distributions). Let |$ X\,{\sim}\,0.5N(-1_{p},I_{p})+0.5N(1_{p},I_{p}) $| and |$ Y\sim 0.5N(-1_{p},\Sigma)+0.5N(1_{p},\Sigma) $|⁠, where <display>\begin{align*} \Sigma=\begin{pmatrix}R&0 \\ 0&I_{p-s_{0}}\end{pmatrix}, \end{align*}</display> |$ \smash{R=(r_{ij})_{i,j=1}^{s_{0}}} $| and |$ r_{ij}=0.9^{|i-j|} $|⁠. Set |$ s_{0}=\left\lfloor 200\beta\right\rfloor $|⁠.

Set the significance level to |$ \alpha=0.05 $| and the sample sizes to |$ n_{1}=n_{2}=250 $|; the total sample size |$ n_{1}+n_{2}=500 $| is roughly of the same order as that of the real data examined in § 5 below. Smaller sample sizes, such as |$ n_{1}=n_{2}=100 $|, and the case of |$ n_{1}\neq n_{2} $| were also investigated; they yield similar performance patterns and are thus not presented, to save space. To examine the performance of the methods under different signal levels, we consider |$ \beta=0,0.2,0.4,0.6,0.8,1 $|, where a larger value of |$ \beta $| corresponds to a stronger signal. In particular, the two distributions are identical when |$ \beta=0 $| and different when |$ \beta\neq 0 $|. In Models A and B, the signals of the mean and variance decay with coordinate |$ j $|, mimicking the more realistic and perhaps more challenging scenarios in high-dimensional data analysis. For Models C and D, parameter |$ s_{0} $|, as a function of |$ \beta $|, signifies the number of coordinates on which the marginal or joint distributions are different. In addition, we note that the distributions in Models C and D are not Gaussian, since a mixture of Gaussian distributions is no longer Gaussian.
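For concreteness, a sketch of generating data from Models A and B at a given signal level |$ \beta $| is shown below (the other models are analogous); the function names are illustrative.

```python
import numpy as np

def sample_model_a(n1, n2, p, beta, seed=0):
    # Model A: X ~ N(0, Sigma), Y ~ N(mu, Sigma), with Sigma_ij = 0.5^|i-j|
    # and mu_j = 0.8 * beta * j^{-3}.
    rng = np.random.default_rng(seed)
    j = np.arange(1, p + 1).astype(float)
    sigma = 0.5 ** np.abs(j[:, None] - j[None, :])
    mu = 0.8 * beta * j ** (-3)
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n1)
    Y = rng.multivariate_normal(mu, sigma, size=n2)
    return X, Y

def sample_model_b(n1, n2, p, beta, seed=0):
    # Model B: X ~ N(0, Sigma) as above; Y ~ N(0, Sigma') where Sigma'
    # replaces the diagonal of Sigma with 1 + 8 * beta * j^{-3}.
    rng = np.random.default_rng(seed)
    j = np.arange(1, p + 1).astype(float)
    sigma = 0.5 ** np.abs(j[:, None] - j[None, :])
    sigma_p = sigma.copy()
    np.fill_diagonal(sigma_p, 1.0 + 8.0 * beta * j ** (-3))
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n1)
    Y = rng.multivariate_normal(np.zeros(p), sigma_p, size=n2)
    return X, Y
```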

The simulations are each replicated 300 times independently to compute the empirical size and power. Figure 1 shows the size and power of the tests for |$ p=500 $|⁠; a comparison under different values of |$ p $| is provided in Table 1. The figure indicates that all methods are successful in controlling the Type-I error in all of the settings. A more comprehensive study on the size of the tests across various combinations of |$ p $| and |$ n $| is presented in the Supplementary Material.

Figure 1:

Empirical size (|$ \beta=0 $|) and power |$ (\beta \gt 0) $| of the proposed method (red, solid line), the MMD-G method (black, dashed line), the MMD-L method (blue, dotted line), the |$ {\rm ED}_{2} $| method (green, long-dash line), the BG method (purple, dash-dot line) and the PW method (orange, long-dash–short-dash line) for different models with |$ p=500 $|. In Models A and B, MMD-G and MMD-L have almost identical performance and thus the power curve of MMD-L is masked by that of MMD-G.

For the power, when the two distributions differ only in their means, as in Model A, the proposed method substantially outperforms the other methods when the signal is suitably large. There is nearly no visible difference in performance between the MMD-G, MMD-L and |$ \mathrm{ED}_{2} $| methods. The PW method has smaller power than the other methods except the BG test. When the two distributions have the same mean but different variances, as in Model B, the proposed test continues to enjoy the highest power. The BG method performs decently, and its power grows steadily as |$ \beta $| increases. When the signal is sufficiently strong, the PW method begins to match the BG method, and both outperform the MMD-G, MMD-L and |$ \mathrm{ED}_{2} $| methods.

The signals in Models A and B decay fast with the index and are thus considered relatively sparse. As suggested by a referee, we also investigate the effect of the sparsity level of the underlying model on the power by using variants of Models A and B; see the Supplementary Material for details. The results suggest that, overall, the proposed method exhibits significant advantages against sparse alternatives, while the other methods have higher power against dense models. The findings align with the expectation that sparse models favour the proposed method, as it explicitly exploits the sparsity structure via a max-type statistic and an |$ \ell_{0} $| sparsity constraint, while dense models favour MMD-G, MMD-L, |$ \mathrm{ED}_{2} $| and the smoothed Wasserstein distance, as these methods average differences over a range of directions.

Beyond the Gaussian distribution and the mean/variance shift, Model C provides a non-Gaussian example where |$ X $| and |$ Y $| have the same mean and variance, but different marginal univariate distributions, while Model D examines the case where |$ X $| and |$ Y $| have the same marginal univariate distributions, but different joint distributions. In these two challenging scenarios, the proposed method still exhibits superior performance, while the other methods show relatively limited power, except that the power of the PW method increases at a moderate rate in Model D.

In addition, we implement the method of Imaizumi et al. (2022), which yields nearly zero power in all four models under |$ p=500 $| and is thus not shown in Fig. 1; in that paper, the authors demonstrated their method only in the low-dimensional case of |$ p=2 $|. The limited performance of Imaizumi et al. (2022) in high dimensions is perhaps due to the extremely slow convergence rate |$ n^{-1/p} $| of the empirical Wasserstein distance. Also, since Nietert et al. (2021), Goldfeld et al. (2022) and Sadhu et al. (2022) do not consider the case of diverging |$ p $| in the two-sample problem, we have opted to exclude them from the comparison to ensure fairness; note that Nietert et al. (2021) provided a numerical demonstration only for |$ p=1 $| and |$ p=2 $|.

To illustrate the effect of dimensionality, we investigate the power under |$ p=60,200,500 $|⁠, with a fixed signal level for each model. Specifically, we set |$ \mu_{j}=0.64j^{-3} $| for Model A and |$ \smash{\Sigma^{\prime}_{jj}=1\,+\,6.4j^{-3}} $| for Model B. We set |$ s_{0}=5 $| and |$ s_{0}=60 $| for Models C and D, respectively. As shown in Table 1, the power of the proposed method appears almost immune to the dimension in Models A and B, and yields more robust performance in Models C and D. By contrast, the other methods exhibit a sharp decay of the power as dimension |$ p $| diverges. In addition, the fast power decay of the MMD method in Models A and B corroborates the comments in Remark 3, since in these two models the difference in the mean or covariance between |$ P $| and |$ Q $| does not grow with |$ p $| in a polynomial order. When the difference grows in a polynomial order of |$ p $|⁠, the MMD method does have strong power according to the simulation studies of Gao & Shao (2023) and Yan & Zhang (2023).

In summary, the proposed test is valid and powerful against sparse alternative hypotheses for high-dimensional data. Therefore, the proposed method is well suited for applications where the signal is likely sparse, given the prevalence of the sparsity principle in high-dimensional contexts. However, if there is evidence or prior knowledge indicating a dense and uniform signal across all directions, average-based methods such as the MMD and the smoothed Wasserstein distance could serve as viable alternatives.

5. Real data analysis

Gliomas, which originate from glial cells, are the most common type of primary brain tumor. They can be subdivided into grade II, grade III and grade IV (glioblastoma multiforme, GBM) according to the degree of aggressiveness and histological criteria (Louis et al., 2007; Hsu et al., 2019). In The Cancer Genome Atlas project (Tomczak et al., 2015), grades II and III are classified as lower-grade gliomas (LGGs). DNA methylation is a biological process by which methyl groups are added to the DNA molecule; it regulates gene expression and is associated with various types of cancer (Das & Singal, 2004; Moore et al., 2013). Therefore, it is of interest to investigate patterns of DNA methylation in various tumor stages for prognosis and treatment. In such investigations, it is important yet challenging to control the false discovery rate given the large number of genes.

In this section, we aim to identify potential differences in DNA methylation between LGG and GBM by the proposed test, using the DNA methylation data (Cerami et al., 2012) from the lower-grade glioma and glioblastoma multiforme studies in The Cancer Genome Atlas project. The data can be accessed through the R package cgdsr (R Development Core Team, 2025). Specifically, we focus on candidate prognostic genes related to brain cancer in the Human Pathology Atlas (Uhlen et al., 2017). After excluding the missing values, the dataset consists of |$ n_{1}=511 $| LGG and |$ n_{2}=150 $| GBM tumor samples with |$ p=207 $| genes.

The |$ p $|-value of the proposed test is less than |$ 10^{-3} $|, indicating a significant difference in the DNA methylation levels between LGG and GBM. The result is consistent with existing studies linking the heterogeneity of DNA methylation to the progression of LGG to GBM (Mazor et al., 2015; Klughammer et al., 2018). By the SCIs constructed according to (5) for the Wasserstein distances of the marginal univariate distributions (where the projection |$ v $| is taken to be each of the canonical unit vectors |$ e_{1},\ldots,e_{p} $|), without additional tests we identify 139 genes that show significant marginal differences. In other words, there is statistical evidence suggesting that, out of the 207 genes related to brain cancer, 139 genes participate in the progression of LGG to GBM. These significant genes include BIRC3, G0S2 and RARRES2, which are related to the grades of glioma (Noushmehr et al., 2010; Wang et al., 2016; Fukunaga et al., 2018).

Moreover, the proposed method allows for the investigation of whether a specific subset of genes exhibits distinct distributions, without extra tests. Here, we focus on the genes related to some biological processes in gene ontology terms (Ashburner et al., 2000), namely, mitotic cell cycle, cell cycle process, cell cycle, DNA metabolic process, cell division and mitotic nuclear division. For the selected genes in each gene ontology term, we compute the max-sliced Wasserstein distance of the marginal distributions over these genes, obtaining the corresponding maximal direction and constructing an SCI for this distance according to (6). For each gene ontology term, we show in Fig. 2 the quantile-quantile plot of the univariate distributions of the LGG and GBM groups along the corresponding maximal direction. Each plot suggests a visible difference between the corresponding marginal distributions of the LGG and GBM groups, evidenced by the substantial departure of the quantile-quantile curve from the straight line |$ y=x $| in the plot. Indeed, the constructed SCI for each of these gene ontology terms does not contain zero, indicating a significant distributional shift between the LGG and GBM groups on the methylation levels of the examined genes of each gene ontology term. This discovery matches the previous study (Mazor et al., 2015) showing that the genes in each of these gene ontology terms are significantly hypomethylated and overexpressed during the progression to GBM. In conclusion, the proposed method not only suggests a distributional difference in the methylation levels between the LGG and GBM groups but also identifies individual genes and groups of genes that are strongly related to GBM progression.

Figure 2: The quantile-quantile plots of the resulting two univariate distributions for different subsets of genes. The dashed line corresponds to the identity line |$ y=x $|.

6. Concluding remarks

In practice, the tuning parameter is selected through cross-validation, which introduces additional data dependency. A strategy to overcome this issue is sample splitting. Specifically, the dataset can be divided as |$ \mathcal{D}^{X}\,{=}\,\mathcal{D}_{1}^{X}\cup\mathcal{D}_{2}^{X} $| and |$ \mathcal{D}^{Y}\,{=}\,\mathcal{D}_{1}^{Y}\cup\mathcal{D}_{2}^{Y} $|. The first part |$ \mathcal{D}_{1}^{X}\cup\mathcal{D}_{1}^{Y} $| is used to tune the sparsity parameter, while the second part |$ \mathcal{D}_{2}^{X}\cup\mathcal{D}_{2}^{Y} $| is used to implement the test. By a conditioning argument, the selected sparsity parameter can then be treated as a fixed quantity with respect to |$ \mathcal{D}_{2}^{X}\cup\mathcal{D}_{2}^{Y} $|. Since the proposed cross-validation procedure leads to satisfactory performance, as demonstrated in the numerical results, we opt not to employ sample splitting in practice, because the reduced sample size may diminish the finite-sample power. We leave the challenging theoretical investigation of the data-driven tuning parameter selected via cross-validation for future research.
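A minimal sketch of this splitting scheme follows; tune_by_cv and mswd_test are hypothetical stand-ins for the cross-validation routine and the test procedure of this paper.

```python
# Minimal sketch of the sample-splitting strategy described above; tune_by_cv
# and mswd_test are hypothetical stand-ins for the paper's procedures.
import numpy as np

def split_and_test(X, Y, tune_by_cv, mswd_test, seed=0):
    rng = np.random.default_rng(seed)
    ix, iy = rng.permutation(len(X)), rng.permutation(len(Y))
    X1, X2 = X[ix[:len(X) // 2]], X[ix[len(X) // 2:]]
    Y1, Y2 = Y[iy[:len(Y) // 2]], Y[iy[len(Y) // 2:]]
    s = tune_by_cv(X1, Y1)       # sparsity parameter chosen on the first halves
    return mswd_test(X2, Y2, s)  # test on the held-out halves; s is now fixed
```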

While we focus on the two-sample problem in this paper, the proposed test can be extended to the multiple-sample problem by adopting the technique of Lin et al. (2023). In contrast, many other two-sample testing methods, such as those based on the MMD, may not extend to the multiple-sample case in an intrinsically straightforward way. A comprehensive treatment of the multiple-sample testing problem is beyond the scope of this paper and is left for future investigation.

Acknowledgement

The authors wish to express their appreciation to the editor, associate editor and two reviewers for their insightful and constructive comments, including the suggestions of exploring the |$ \ell_{0} $| constraint, discussing computational complexity and conducting additional numerical studies; all of these helped to significantly strengthen the paper. This research was partially supported by an NUS startup grant.

Supplementary material

The Supplementary Material contains the algorithms to compute the test statistic, proofs of the lemmas and theorems of this paper, and other useful lemmas and auxiliary results for the theoretical analysis. A Python implementation of the proposed test can be found at https://github.com/hxyXiaoyuHu/mswd-bootstrapping.

References

Anderson, N. H., Hall, P. & Titterington, D. M. (1994). Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. J. Mult. Anal. 50, 41–54.
Anderson, T. W. (1962). On the distribution of the two-sample Cramér-von Mises criterion. Ann. Math. Statist. 33, 1148–59.
Arjovsky, M., Chintala, S. & Bottou, L. (2017). Wasserstein generative adversarial networks. In Proc. 34th Int. Conf. Mach. Learn., vol. 70, pp. 214–23. PMLR.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T. et al. (2000). Gene ontology: tool for the unification of biology. Nature Genet. 25, 25–9.
Ba, K. D., Nguyen, H. L., Nguyen, H. N. & Rubinfeld, R. (2011). Sublinear time algorithms for earth mover's distance. Theory Comp. Syst. 48, 428–42.
Bhattacharya, B. B. (2019). A general asymptotic framework for distribution-free graph-based two-sample tests. J. R. Statist. Soc. B 81, 575–602.
Bhattacharya, B. B. (2020). Asymptotic distribution and detection thresholds for two-sample tests based on geometric graphs. Ann. Statist. 48, 2879–903.
Bickel, P. J. (1969). A distribution free version of the Smirnov two sample test in the |$ p $|-variate case. Ann. Math. Statist. 40, 1–23.
Biswas, M. & Ghosh, A. K. (2014). A nonparametric two-sample test applicable to high dimensional data. J. Mult. Anal. 123, 160–71.
Biswas, M., Mukhopadhyay, M. & Ghosh, A. K. (2014). A distribution-free two-sample run test applicable to high-dimensional data. Biometrika 101, 913–26.
Blumensath, T. & Davies, M. E. (2009). Iterative hard thresholding for compressed sensing. Appl. Comp. Harmon. Anal. 27, 265–74.
Cao, R. & Van Keilegom, I. (2006). Empirical likelihood tests for two-sample problems via nonparametric density estimation. Can. J. Statist. 34, 61–77.
Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., Jacobsen, A., Byrne, C. J., Heuer, M. L., Larsson, E. et al. (2012). The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2, 401–4.
Chen, H. & Friedman, J. H. (2017). A new graph-based two-sample test for multivariate and object data. J. Am. Statist. Assoc. 112, 397–409.
Chernozhukov, V., Chetverikov, D. & Kato, K. (2014). Gaussian approximation of suprema of empirical processes. Ann. Statist. 42, 1564–97.
Chernozhukov, V., Chetverikov, D. & Kato, K. (2015). Comparison and anti-concentration bounds for maxima of Gaussian random vectors. Prob. Theory Rel. Fields 162, 47–70.
Chernozhukov, V., Chetverikov, D. & Kato, K. (2016). Empirical and multiplier bootstraps for suprema of empirical processes of increasing complexity, and related Gaussian couplings. Stoch. Proces. Appl. 126, 3632–51.
Chernozhukov, V., Chetverikov, D. & Kato, K. (2017). Central limit theorems and bootstrap in high dimensions. Ann. Prob. 45, 2309–52.
Chernozhukov, V., Chetverikov, D., Kato, K. & Koike, Y. (2022). Improved central limit theorem and bootstrap approximations in high dimensions. Ann. Statist. 50, 2562–86.
Das, P. M. & Singal, R. (2004). DNA methylation and cancer. J. Clin. Oncol. 22, 4632–42.
Dereich, S., Scheutzow, M. & Schottstedt, R. (2013). Constructive quantization: approximation by empirical measures. Ann. Inst. H. Poincaré Prob. Statist. 49, 1183–203.
Deshpande, I., Hu, Y.-T., Sun, R., Pyrros, A., Siddiqui, N., Koyejo, S., Zhao, Z., Forsyth, D. & Schwing, A. G. (2019). Max-sliced Wasserstein distance and its use for GANs. In Proc. IEEE Conf. Comp. Vis. Pat. Recog., pp. 10640–8. Piscataway, NJ: IEEE Press.
Fournier, N. & Guillin, A. (2015). On the rate of convergence in Wasserstein distance of the empirical measure. Prob. Theory Rel. Fields 162, 707–38.
Friedman, J. H. & Rafsky, L. C. (1979). Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Ann. Statist. 7, 697–717.
Fukunaga, T., Fujita, Y., Kishima, H. & Yamashita, T. (2018). Methylation dependent down-regulation of G0S2 leads to suppression of invasion and improved prognosis of IDH1-mutant glioma. PLoS One 13, e0206552.
Gao, H. & Shao, X. (2023). Two sample testing in high dimension via maximum mean discrepancy. J. Mach. Learn. Res. 24, 1–33.
Goldfeld, Z. & Greenewald, K. (2020). Gaussian-smoothed optimal transport: metric structure and statistical efficiency. In Proc. 23rd Int. Conf. Artif. Intel. Statist., vol. 108, pp. 3327–37. PMLR.
Goldfeld, Z., Kato, K., Nietert, S. & Rioux, G. (2022). Limit distribution theory for smooth |$ p $|-Wasserstein distances. arXiv: 2203.00159v1.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B. & Smola, A. J. (2007). A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, Ed. Schölkopf, B., Platt, J. and Hoffman, T., pp. 513–20. Cambridge, MA: MIT Press.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B. & Smola, A. J. (2012). A kernel two-sample test. J. Mach. Learn. Res. 13, 723–73.
Henze, N. (1988). A multivariate two-sample test based on the number of nearest neighbor type coincidences. Ann. Statist. 16, 772–83.
Hsu, J. B.-K., Chang, T.-H., Lee, G. A., Lee, T.-Y. & Chen, C.-Y. (2019). Identification of potential biomarkers related to glioma survival by gene expression profile analysis. BMC Med. Genomics 11, 1–18.
Hu, X. & Lei, J. (2024). A two-sample conditional distribution test using conformal prediction and weighted rank sum. J. Am. Statist. Assoc. 119, 1136–54.
Imaizumi, M., Ota, H. & Hamaguchi, T. (2022). Hypothesis test and confidence analysis with Wasserstein distance on general dimension. Neural Comp. 34, 1448–87.
Jain, P., Tewari, A. & Kar, P. (2014). On iterative hard thresholding methods for high-dimensional M-estimation. In Proc. 28th Int. Conf. Neural Info. Proces. Syst., vol. 1, pp. 685–93. Cambridge, MA: MIT Press.
Kim, I., Balakrishnan, S. & Wasserman, L. (2020). Robust multivariate nonparametric tests via projection averaging. Ann. Statist. 48, 3417–41.
Klughammer, J., Kiesel, B., Roetzer, T., Fortelny, N., Nemc, A., Nenning, K.-H., Furtner, J., Sheffield, N. C., Datlinger, P., Peter, N. et al. (2018). The DNA methylation landscape of glioblastoma disease progression shows extensive heterogeneity in time and space. Nature Med. 24, 1611–24.
Lei, J. (2020). Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. Bernoulli 26, 767–98.
Lin, Z., Lopes, M. E. & Müller, H.-G. (2023). High-dimensional MANOVA via bootstrapping and its application to functional and sparse count data. J. Am. Statist. Assoc. 118, 177–91.
Liu, H., Wang, H. & Song, M. (2020). Projections onto the intersection of a one-norm ball or sphere and a two-norm ball or sphere. J. Optimiz. Theory Appl. 187, 520–34.
Lopes, M. E., Lin, Z. & Müller, H.-G. (2020). Bootstrapping max statistics in high dimensions: near-parametric rates under weak variance decay and application to functional and multinomial data. Ann. Statist. 48, 1214–29.
Lopes, M. E. & Yao, J. (2022). A sharp lower-tail bound for Gaussian maxima with application to bootstrap methods in high dimensions. Electron. J. Statist. 16, 58–83.
Louis, D. N., Ohgaki, H., Wiestler, O. D., Cavenee, W. K., Burger, P. C., Jouvet, A., Scheithauer, B. W. & Kleihues, P. (2007). The 2007 WHO classification of tumours of the central nervous system. Acta Neuropathol. 114, 97–109.
Mazor, T., Pankov, A., Johnson, B. E., Hong, C., Hamilton, E. G., Bell, R. J., Smirnov, I. V., Reis, G. F., Phillips, J. J., Barnes, M. J. et al. (2015). DNA methylation and somatic mutations converge on the cell cycle and define similar evolutionary histories in brain tumors. Cancer Cell 28, 307–17.
Moore, L. D., Le, T. & Fan, G. (2013). DNA methylation and its basic function. Neuropsychopharmacology 38, 23–38.
Negahban, S. N., Ravikumar, P., Wainwright, M. J. & Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statist. Sci. 27, 538–57.
Nietert, S., Goldfeld, Z. & Kato, K. (2021). Smooth p-Wasserstein distance: structure, empirical approximation, and statistical applications. In Proc. 38th Int. Conf. Mach. Learn., vol. 139, pp. 8172–83. PMLR.
Nietert, S., Goldfeld, Z., Sadhu, R. & Kato, K. (2022). Statistical, robustness, and computational guarantees for sliced Wasserstein distances. In Proc. 36th Int. Conf. Neural Info. Proces. Syst., pp. 28179–93. Red Hook, NY: Curran Associates.
Noushmehr, H., Weisenberger, D. J., Diefes, K., Phillips, H. S., Pujara, K., Berman, B. P., Pan, F., Pelloski, C. E., Sulman, E. P., Bhat, K. P. et al. (2010). Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma. Cancer Cell 17, 510–22.
Pan, W., Tian, Y., Wang, X. & Zhang, H. (2018). Ball divergence: nonparametric two sample test. Ann. Statist. 46, 1109–37.
R Development Core Team (2025). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0, http://www.R-project.org.
Rabin, J., Peyré, G., Delon, J. & Bernot, M. (2012). Wasserstein barycenter and its application to texture mixing. In Scale Space and Variational Methods in Computer Vision (Lecture Notes Comp. Sci. 6667), Ed. Bruckstein, A. M., ter Haar Romeny, B. M., Bronstein, A. M. and Bronstein, M. M., pp. 435–46. Berlin: Springer.
Ramdas, A., García Trillos, N. & Cuturi, M. (2017). On Wasserstein two-sample testing and related families of nonparametric tests. Entropy 19, 47.
Ramdas, A., Reddi, S. J., Póczos, B., Singh, A. & Wasserman, L. (2015). On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Proc. 29th AAAI Conf. Artif. Intel., pp. 3571–7. Washington, DC: AAAI Press.
Rosenbaum, P. R. (2005). An exact distribution-free test comparing two multivariate distributions based on adjacency. J. R. Statist. Soc. B 67, 515–30.
Sadhu, R., Goldfeld, Z. & Kato, K. (2022). Limit distribution theory for the smooth 1-Wasserstein distance with applications. arXiv: 2107.13494v5.
Schilling, M. F. (1986). Multivariate two-sample tests based on nearest neighbors. J. Am. Statist. Assoc. 81, 799–806.
Sejdinovic, D., Sriperumbudur, B., Gretton, A. & Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Statist. 41, 2263–91.
Smirnov, N. (1948). Table for estimating the goodness of fit of empirical distributions. Ann. Math. Statist. 19, 279–81.
Székely, G. J. & Rizzo, M. L. (2004). Testing for equal distributions in high dimension. InterStat 5, 1249–72.
Tomczak, K., Czerwińska, P. & Wiznerowicz, M. (2015). Review: The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp. Oncol. (Pozn) 2015, 68–77.
Uhlen, M., Zhang, C., Lee, S., Sjöstedt, E., Fagerberg, L., Bidkhori, G., Benfeitas, R., Arif, M., Liu, Z., Edfors, F. et al. (2017). A pathology atlas of the human cancer transcriptome. Science 357, eaan2507.
van der Vaart, A. W. & Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer.
Villani, C. (2009). Optimal Transport: Old and New, vol. 338. Berlin: Springer.
Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge: Cambridge University Press.
Wald, A. & Wolfowitz, J. (1940). On a test whether two samples are from the same population. Ann. Math. Statist. 11, 147–62.
Wang, D., Berglund, A., Kenchappa, R. S., Forsyth, P. A., Mulé, J. J. & Etame, A. B. (2016). BIRC3 is a novel driver of therapeutic resistance in glioblastoma. Sci. Rep. 6, 1–13.
Wang, J., Gao, R. & Xie, Y. (2021). Two-sample test using projected Wasserstein distance. In 2021 IEEE Int. Symp. Inf. Theory (ISIT), pp. 3320–5. Piscataway, NJ: IEEE Press.
Xi, J. & Niles-Weed, J. (2022). Distributional convergence of the sliced Wasserstein process. arXiv: 2206.00156v1.
Yan, J. & Zhang, X. (2023). Kernel two-sample tests in high dimensions: interplay between moment discrepancy and dimension-and-sample orders. Biometrika 110, 411–30.
Zhu, C. & Shao, X. (2021). Interpoint distance based two sample tests in high dimension. Bernoulli 27, 1189–211.
