Megavariate methods capture complex genotype-by-environment interactions

Average runtime in minutes (SE) for the balanced experimental design based on 10 simulated replicates.

Model	Solver	10/500	10/2,000	50/2,000	200/2,000	2,000/2,000	200/20,000
GREML	REML	46.75 (0.37)	172.61 (17.93)	—	—	—	—
D-GREML	REML	0.06 (<0.1)	0.19 (<0.1)	8.32 (3.51)	—	—	—
MegaLMM	MCMC	0.31 (0.01)	4.38 (0.06)	7.23 (1.19)	17.71 (4.02)	130.77 (11.51)	—
MegaSEM	PEGS	<0.01 (<0.01)	0.01 (<0.01)	0.04 (<0.01)	0.14 (<0.01)	2.92 (0.02)	5.26 (0.07)
MV	PEGS	<0.01 (<0.01)	<0.1 (<0.01)	0.02 (<0.01)	9.12 (1.62)	97.14 (1.29)	82.22 (5.71)
XFA	PEGS	<0.01 (<0.01)	<0.1 (<0.01)	0.03 (<0.01)	0.49 (0.09)	—	81.46 (1.38)
HCS	PEGS	<0.01 (<0.01)	<0.01 (<0.01)	0.02 (<0.01)	0.22 (0.04)	38.74 (3.60)	37.74 (4.45)
SCT	PEGS	<0.01 (<0.01)	0.01 (<0.01)	0.04 (<0.01)	0.15 (0.01)	1.65 (0.01)	5.25 (0.05)
UV	PEGS	<0.01 (<0.01)	0.01 (<0.01)	0.04 (<0.01)	0.14 (<0.01)	1.44 (0.01)	5.20 (0.06)

Six scenarios vary in the number of environments and individuals (no. environments/no. individuals). Models are ordered based on computational performance. The SE is shown in parentheses.

Table 2.

Within environment accuracy for the balanced experimental design based on 10 simulated replicates.

Model	Solver	10/500	10/2,000	50/2,000	200/2,000	2,000/2,000	200/20,000
GREML	REML	0.81 (0.03)	0.89 (<0.01)	—	—	—	—
MegaLMM	MCMC	0.78 (0.04)	0.87 (<0.01)	0.87 (<0.01)	0.89 (<0.01)	0.90 (<0.01)	—
MegaSEM	PEGS	0.79 (0.04)	0.88 (<0.01)	0.89 (<0.01)	0.89 (<0.01)	0.89 (<0.01)	0.96 (<0.01)
MV	PEGS	0.81 (0.03)	0.89 (<0.01)	0.89 (<0.01)	0.90 (<0.01)	0.88 (<0.01)	0.96 (<0.01)
XFA	PEGS	0.80 (0.04)	0.89 (<0.01)	0.89 (<0.01)	0.89 (<0.01)	—	0.96 (<0.01)
HCS	PEGS	0.81 (0.03)	0.88 (<0.01)	0.88 (<0.01)	0.88 (<0.01)	0.88 (<0.01)	0.96 (<0.01)
SCT	PEGS	0.81 (0.03)	0.89 (<0.01)	0.88 (<0.01)	0.87 (<0.01)	0.87 (<0.01)	0.95 (<0.01)
UV	PEGS	0.78 (0.04)	0.87 (<0.01)	0.87 (<0.01)	0.87 (<0.01)	0.87 (<0.01)	0.95 (<0.01)

Six scenarios vary in the number of environments and individuals (no. environments/no. individuals). Models are ordered based on computational performance. The SE is shown in parentheses.

The accuracy of UV was insensitive to the number of environments, as it does not capture any GxE information. All methods that capture GxE information were as predictive as or more predictive than UV, although the difference in the accuracy of GEBVs declined as the number of individuals increased or GxE correlation decreased. In scenarios with 10 environments, only SCT and MV provided the same accuracy as GBLUP, but the accuracy of SCT decreased as the number of environments increased. MV was the most accurate model in all scenarios with up to 200 environments, but its accuracy dropped in the scenario with 2,000 environments, due to the number of parameters estimated in $Σ_{β}$ and the need to bend this matrix to obtain its inverse. In the scenario with 2,000 environments, the highest accuracy was obtained by MegaLMM, followed by MegaSEM. MegaSEM provided either the highest or second-highest accuracy in all scenarios except the one with the lowest dimensionality.

When taking into account both runtime and accuracy, our results indicate that the best method depends on the dimensionality of the data. MegaSEM suits scenarios with a large number of individuals and traits, providing high accuracy and low runtime. SCT and diagonalized GBLUP should be considered when data are balanced and the number of environments is modest. MegaLMM suits datasets with thousands or more traits but with a moderate number of individuals. HCS, MV, and XFA are suitable for datasets with up to 200 environments.

Real data benchmark

Results are displayed in Table 3. The predictive ability of overall averages was always greater than that of regional averages, indicating low genotype-by-region interactions. For overall averages, the highest predictive abilities were obtained by UVA, XFA, and HCS, whereas MegaLMM and MegaSEM had the same predictive ability as UVW.

Table 3.

Predictive ability from the 2022 G2F GxE prediction competition.

Model	Pairwise	Region	Overall
UVW	0.08 (0.03)	0.22 (0.14)	0.27 (0.11)
MV	0.12 (0.05)	0.27 (0.12)	0.30 (0.11)
MegaSEM	0.13 (0.05)	0.25 (0.15)	0.27 (0.11)
MegaLMM	0.18 (0.06)	0.24 (0.19)	0.27 (0.10)
XFA	0.21 (0.07)	0.31 (0.13)	0.35 (0.12)
HCS	0.24 (0.09)	0.34 (0.11)	0.36 (0.11)
UVA	—	—	0.35 (0.12)

Corn grain yield observed in 4,836 hybrids across 217 locations (2014–2021) was used to predict 548 hybrids observed across 21 environments (2022). Models are ordered based on the pairwise metric. The SE is shown in parentheses.

Results from real data aligned with sparse testing simulations with moderate GxE (Fig. 1). This was supported by a GxE correlation of 0.4 estimated by HCS. Because UVA was among the most predictive models, we fitted UVW on the residuals of UVA to investigate how much GxE was left from UVA. This led to a slight improvement in the predictive ability of overall averages, from 0.35 (0.12) to 0.36 (0.12), which indicates low GxE correlations after fitting the main genetic term.

The precision of estimated GxE correlations, and thereby the accuracy of predictions from models with more complex covariance structures, is lower for the following reasons. First, a large number of environments had a small number of observations, and 3 environments had as few as 22 individuals. Second, the number of individuals that overlapped across environments was limited, with a median overlap between pairs of environments of 19 individuals. Third, the relatedness between individuals within and across environments may not be high enough to provide more accurate genetic parameter estimates. Thus, collectively, if there was stronger variation in GxE correlations between pairs of environments, these factors may not have allowed the more complex models to detect it reliably. This may be different in commercial breeding data, where the relatedness between individuals and the number of individuals across environments are expected to be higher. Despite those challenges, even the models with complex covariance structures converged. In conclusion, these results help select the right model for the data structure in a given dataset.

In this study, the G2F dataset was solely evaluated based on its predictive ability. However, models that capture complex GxE interactions, such as MV, XFA, MegaSEM, and MegaLMM, have additional benefits. These models allow for environment-specific predictions, which can be used to create selection indices aimed at improving performance under specific conditions or for broad adaptation. Beyond prediction and selection, MV models provide pairwise estimates of GxE correlations and genomic heritabilities for each environment. These correlations can reveal patterns and clusters of environments, while heritability estimates inform the location quality. The insights from GxE correlations and genomic heritability estimates are valuable for planning new trials, redesigning experiments, reallocating resources, and optimizing trial networks.

Scalability of different parameterizations

A general summary of the scalability of the different parameterizations is provided in Table 4. The table shows that no method is completely scalable in all scenarios. For example, as the number of markers increases, SVD ($Q α$) and genomic relationship-based methods ($Z Z^{'}$) are preferred over SNP-based regressions ($Z β$ and $Z^{'} Z$). That is the case when genotyping-by-sequencing (GBS) data are deployed. In contrast, datasets with more genotypes than markers are common when SNP arrays are utilized in experiments across multiple breeding programs and large populations (Allen et al. 2017; Song et al. 2017); in that case, SNP regressions can provide more efficient computation of the genomic models. When the dataset contains a large number of genotypes and markers, dimensionality reduction (e.g. $Q_{n | \tilde{n}} α$) provides computational feasibility without loss in accuracy, as long as there are enough principal components to capture the genetic diversity (Pocrnic et al. 2019).
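The reason an SVD parameterization helps when markers outnumber genotypes can be checked numerically. The sketch below is an illustration under stated assumptions, not the implementation benchmarked here: it assumes $Q = U D$ from the thin SVD $Z = U D V^{'}$, a single trait, and a fixed ridge penalty `lam` standing in for a known variance ratio. Both systems are solved directly rather than with PEGS, and all data are simulated toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, lam = 30, 200, 10.0            # toy sizes: fewer genotypes than markers
Z = rng.normal(size=(n, m))          # hypothetical centered marker matrix
y = rng.normal(size=n)               # hypothetical centered phenotype

# Marker regression: an m x m system is solved for beta.
beta_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T @ y)

# SVD regression: Z = U D V' and Q = U D, so only min(n, m) coefficients are needed.
U, d, Vt = np.linalg.svd(Z, full_matrices=False)
Q = U * d                            # scales column j of U by singular value d[j]
alpha_hat = np.linalg.solve(Q.T @ Q + lam * np.eye(Q.shape[1]), Q.T @ y)

# Genomic values from the two parameterizations.
gebv_markers = Z @ beta_hat
gebv_svd = Q @ alpha_hat
max_abs_diff = np.max(np.abs(gebv_markers - gebv_svd))
```

In this toy setting the two sets of genomic values agree up to rounding error, while the SVD form solves a 30 x 30 system instead of a 200 x 200 one.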

Table 4.

Scalability rating by parameterization and compatible solver.

	Parameterization	Solver	No. of genotypes	No. of markers	No. of traits
1	$Z^{'} Z$	REML/BGS	****	*	*
2	$Z Z^{'}$, $K$ [equation (20)]	REML/BGS	**	****	*
3	$Z β$ (BayesABC)	BGS	**	**	*
4	$U θ$ [equation (30)]	REML/BGS	**	****	**
5	$Y Ψ$ [equation (13)]	BGS	**	**	**
6	$Q α$ [equations (22) and (28)]	PEGS	**	***	***
7	$Q_{\tilde{n} | n} α$ [equation (27)]	PEGS	****	***	***
8	$Z β$ [equations (4) and (31)]	PEGS	**	**	***
9	$F λ$ [equation (10)]	BGS	*	****	****
10	$F_{0} α$ [equation (19)]	PEGS	***	***	****

Technical guidelines for modeling large datasets with multiple traits have been provided by Misztal (2008). In that study, the author recommends starting with UV analyses and subsequently progressing to MV models. At the time, the modeling of hundreds of traits was not considered possible because the computational cost of REML methods increases with $n^{2}$ and $k^{3}$ (Misztal 2008), where n is the number of genotypes and k is the number of traits, while more efficient methods, such as CT, require balanced data.
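As a rough worked illustration of that scaling, assume only that the cost is proportional to $n^{2} k^{3}$, with an unspecified constant $c$ that is introduced here for illustration. Holding the number of genotypes fixed and moving from 10 traits to 200 or 2,000 traits multiplies the cost by

$$\frac{c \, n^{2} \, 200^{3}}{c \, n^{2} \, 10^{3}} = 20^{3} = 8{,}000, \qquad \frac{c \, n^{2} \, 2{,}000^{3}}{c \, n^{2} \, 10^{3}} = 200^{3} = 8 \times 10^{6}.$$

Under that assumption, adding traits becomes limiting much faster than adding genotypes, since doubling n multiplies the cost by 4 whereas doubling k multiplies it by 8.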

BGS is an alternative to REML (Sorensen and Gianola 2002), as it provides a computationally stable method to estimate variance components and regression coefficients at low memory cost. However, BGS may take a long time to run, as it requires a large number of MCMC samples to reach satisfactory convergence. For some Bayesian methods (Jia and Jannink 2012), variational Bayesian approaches have been proposed to avoid MCMC sampling (Hayashi and Iwata 2013).

Efficient alternatives to MCMC are available when the variance components are known a priori and only coefficients need to be inferred; these include Preconditioned Conjugate Gradient and GS (Legarra and Misztal 2008; Misztal and Legarra 2017). Only a few software packages implement GS as the main approach to estimate marker effects (Legarra et al. 2011). Here, we assessed the PEGS solver from Xavier and Habier (2022) as an approach to use GS while estimating variance components. While computationally efficient, the PEGS solver does not compute accuracies or confidence intervals because neither the inverse of the left-hand side of the mixed-model equations nor samples of effects from MCMC algorithms are available.
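To make the known-variance case concrete, the sketch below solves a single-trait ridge regression on markers by Gauss-Seidel with residual updates, in the spirit of the strategies reviewed by Legarra and Misztal (2008). It assumes that GS denotes Gauss-Seidel, that phenotypes and genotypes are centered, and that the penalty `lam` is a known variance ratio. It is not the PEGS solver, which also estimates variance components; the function name and the toy data are hypothetical.

```python
import numpy as np

def gauss_seidel_ridge(Z, y, lam, max_iter=1000, tol=1e-12):
    """Solve (Z'Z + lam * I) b = Z'y one marker at a time."""
    n, m = Z.shape
    b = np.zeros(m)
    e = np.asarray(y, dtype=float).copy()   # residuals y - Z b, with b = 0 at the start
    zz = np.einsum("ij,ij->j", Z, Z)         # diagonal of Z'Z
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(m):
            z_j = Z[:, j]
            # Update marker j given the current values of all other markers.
            b_new = (z_j @ e + zz[j] * b[j]) / (zz[j] + lam)
            # Keep residuals in sync so that Z'Z is never formed.
            e -= z_j * (b_new - b[j])
            max_change = max(max_change, abs(b_new - b[j]))
            b[j] = b_new
        if max_change < tol:
            break
    return b

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 20))     # hypothetical centered genotypes
y = rng.normal(size=50)           # hypothetical centered phenotypes
lam = 5.0                         # assumed known ratio of residual to marker variance

b_gs = gauss_seidel_ridge(Z, y, lam)
b_direct = np.linalg.solve(Z.T @ Z + lam * np.eye(20), Z.T @ y)
```

The iterative solution only touches one column of `Z` at a time, which is what keeps memory use low; the direct solve is included solely as a check on the toy example.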

Prediction of unobserved environments

In Table 3, a new environment was predicted using averaged predictions from previously observed environments. However, if the covariance between the training and prediction sets were known, new environments could be predicted as a linear combination of observed environments. Such covariances may be inferred from data when GxE interactions can be explained by a set of variables that are available for both training and validation datasets, such as management and environmental variables.

A schematic evolution of methods integrating genomics and environmental information is provided by Crossa et al. (2022), including crop-growth and reaction norm models. In those models, the covariance is inferred from environmental variables and is thus confined to their sample space. Alternatively, the associations between environmental variables and marker effects can be inferred in subsequent analyses. For instance, in Della Coletta et al. (2023), GxE interaction networks are generated from the correlation of principal components of marker effects and principal components of environmental variables.

Post hoc modeling of the factors responsible for GxE interactions can be built from the output of unstructured models. Unlike crop-growth and reaction norm models, this approach does not assume that all interactions can be explained by the environmental variables available for modeling. The approach described here works by modeling covariances and subsequently generating predictions using conditional expectations.

Consider a scenario where a set of individuals A, observed in a set of environments X, is used to predict a new set of individuals B observed in a set of environments Z. The estimated marker effects from prediction models are based on the observed data AX. The prediction of B individuals in observed environments is given by

$${\hat{G}}_{B X} = Z_{B} {\hat{B}}_{A X}, \tag{44}$$

where ${\hat{G}}_{B X}$ is the matrix of GEBV of B individuals on X environments, $Z_{B}$ is the marker information for B individuals, and ${\hat{B}}_{A X}$ is the matrix of marker effects for X environments. The next step consists of projecting B individuals into Z environments. That is attained with the conditional expectation, where Z environments are predicted from a linear combination of X environments. Thus,

$${\hat{G}}_{B Z | B X} = Σ_{Z X} {\hat{Σ}}_{X}^{- 1} {\hat{G}}_{B X}, \tag{45}$$

where $Σ_{X}$ is the genetic variance–covariance matrix of X environments, and $Σ_{Z X}$ is the covariance matrix between X and Z environments. Note that $Σ_{X}$ is estimated from the MV model [equation (38)] that estimated the marker effects, since $B_{A X} \sim N (0, Σ_{X} \otimes I)$. Obtaining $Σ_{Z X}$ requires prediction if Z has not been observed.
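Equations (44) and (45) can be traced with a small numerical sketch. The code below is an illustration under stated assumptions: individuals are stored in rows and environments in columns, so the projection is written in the row-oriented form ${\hat{G}}_{B X} {\hat{Σ}}_{X}^{- 1} Σ_{Z X}^{'}$, which is the transpose arrangement of equation (45). The function name, dimensions, marker scores, marker effects, and covariance matrix are all hypothetical toy values, and $Σ_{Z X}$ is taken as known.

```python
import numpy as np

def project_to_new_environments(G_BX, Sigma_X, Sigma_ZX):
    """Row-oriented form of equation (45): G_BX Sigma_X^{-1} Sigma_ZX'."""
    # solve() avoids forming the explicit inverse of Sigma_X.
    return G_BX @ np.linalg.solve(Sigma_X, Sigma_ZX.T)

rng = np.random.default_rng(2)
n_B, n_markers, n_X, n_Z = 5, 40, 4, 2                      # toy dimensions

Z_B = rng.choice([0.0, 1.0, 2.0], size=(n_B, n_markers))    # hypothetical marker scores
B_AX = rng.normal(scale=0.1, size=(n_markers, n_X))         # stand-in for estimated marker effects

# Hypothetical joint covariance among the X and Z environments.
L = rng.normal(size=(n_X + n_Z, 3))
Sigma = L @ L.T + 0.5 * np.eye(n_X + n_Z)
Sigma_X = Sigma[:n_X, :n_X]
Sigma_ZX = Sigma[n_X:, :n_X]

G_BX = Z_B @ B_AX                                             # equation (44)
G_BZ = project_to_new_environments(G_BX, Sigma_X, Sigma_ZX)   # equation (45)
```

A simple sanity check on the design is that projecting onto the observed environments themselves (passing `Sigma_X` as the cross-covariance) returns `G_BX` unchanged.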

The prediction of $Σ_{Z X}$ can be inferred from parameters that drive GxE interactions. Let

$$\begin{aligned} Σ_{X} & = U_{X} D_{X}^{2} U_{X}^{'} \\ & = Q_{X} Q_{X}^{'}, \end{aligned} \tag{46}$$

where $Q_{X} = U_{X} D_{X}$. Now, assuming that the principal components $Q_{X}$ can be modeled as a linear function of parameters that drive GxE interactions ($W_{X}$), we obtain

$$Q_{X} = W_{X} Ω_{X} + E_{X}, \tag{47}$$

where $W_{X}$ is the design matrix of explanatory variables, $Ω_{X}$ is the matrix of regression coefficients, and $E_{X}$ is the matrix of residuals. Note that equation (47) acts as a post hoc modeling of environmental reaction norms, such that if the same set of variables is known for environments Z, principal components can be predicted using $W_{Z}$. Thus,

$${\hat{Q}}_{Z X} = W_{Z} {\hat{Ω}}_{X} \tag{48}$$

and the covariance between environments X and Z can be inferred by

$${\hat{Σ}}_{Z X} = {\hat{Q}}_{Z X} Q_{X}^{'} . \tag{49}$$

Note that equation (47) utilizes a linear model to fit the eigenstructure of the GxE covariance; however, the evaluation of nonparametric models (e.g. random forest) is encouraged when the interaction patterns are complex beyond additivity (Alves et al. 2020; Waters et al. 2023; Resende et al. 2024).
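The chain from equation (46) to equation (49) can be sketched numerically as follows. This is a minimal illustration under stated assumptions: $Σ_{X}$ is a made-up positive-definite matrix standing in for the MV estimate, the explanatory variables in `W_X` and `W_Z` are simulated with an intercept column, and equation (47) is fitted by ordinary least squares. The function name is hypothetical.

```python
import numpy as np

def predict_cross_covariance(Sigma_X, W_X, W_Z):
    """Return (Sigma_ZX_hat, Q_X) following equations (46) to (49)."""
    # Equation (46): Sigma_X = U D^2 U' = Q Q', with Q = U D.
    eigval, U = np.linalg.eigh(Sigma_X)
    Q_X = U * np.sqrt(np.clip(eigval, 0.0, None))   # clip guards against tiny negative eigenvalues
    # Equation (47): ordinary least squares of Q_X on W_X.
    Omega_hat, *_ = np.linalg.lstsq(W_X, Q_X, rcond=None)
    # Equation (48): predicted components for the new environments.
    Q_ZX_hat = W_Z @ Omega_hat
    # Equation (49): covariance between new and observed environments.
    return Q_ZX_hat @ Q_X.T, Q_X

rng = np.random.default_rng(3)
n_X, n_Z, n_cov = 6, 2, 2                            # toy dimensions

A = rng.normal(size=(n_X, 3))
Sigma_X = A @ A.T + 0.5 * np.eye(n_X)                # hypothetical estimate from the MV model
W_X = np.column_stack([np.ones(n_X), rng.normal(size=(n_X, n_cov))])  # intercept + covariates
W_Z = np.column_stack([np.ones(n_Z), rng.normal(size=(n_Z, n_cov))])

Sigma_ZX_hat, Q_X = predict_cross_covariance(Sigma_X, W_X, W_Z)
```

The resulting `Sigma_ZX_hat` has one row per new environment and one column per observed environment, so it can take the place of $Σ_{Z X}$ in equation (45). Swapping the least-squares step for a nonparametric learner would change only the fit of equation (47).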

Conclusion

Scalable MV approaches increase the accuracy of GEBVs within environments compared to a UV approach. Specialized models, parameterizations, and solvers enable the analysis of an increasing number of individuals, markers, and environments.

In the sparse testing simulation, XFA and MegaLMM were the most accurate methods across scenarios, along with HCS when GxE was constant. In the balanced simulation where runtime and accuracy were recorded, MV was the most accurate model up to 200 environments and was then surpassed by MegaLMM in the scenario with 2,000 environments. PEGS-based models were considerably more efficient than REML-based GBLUP, with MegaSEM, SCT, and UV providing the lowest runtimes. In the real data analysis, predictions from UVA and overall prediction averages from HCS and XFA were the most predictive approaches. The capability of MV and MegaSEM to capture unstructured GxE patterns did not translate into higher accuracy than models with simpler covariance structures. Future studies should consider fitting multiple trait–environment combinations.

Data availability

The soybean genomic data utilized to simulate sparse testing are available in the R package mas, and also in the R package SoyNAM and on the project website: https://www.soybase.org/SoyNAM/. The corn Genomes-to-Field dataset utilized for the real data benchmark is available on the project website: https://www.maizegxeprediction.org/. Data and R scripts to reproduce simulations are available on GitHub: https://github.com/alenxav/GXE24.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Literature cited

Allen, Winfield, Burridge, Downie, Benbow, Barker, Wilkinson, Coghill, Waterfall, Davassi, et al. 2017. Characterization of a Wheat Breeders’ Array suitable for high-throughput SNP genotyping of global accessions of hexaploid bread wheat (Triticum aestivum). Plant Biotechnol J. 15(3):390–401.

Alves, de Resende MDV, Azevedo, Silva FFe, Rocha JRASC, Nunes ACP, Carneiro APS, dos Santos. 2020. Optimization of Eucalyptus breeding through random regression models allowing for reaction norms in response to environmental gradients. Tree Genet Genomes. doi:10.1007/s11295-020-01431-5

Bermann, Lourenco, Forneris, Legarra, Misztal. 2022. On the equivalence between marker effect models and breeding value models and direct genomic values with the algorithm for proven and young. Genet Sel Evol. doi:10.1186/s12711-022-00741-7

Bustos-Korts, Malosetti, Chapman, van Eeuwijk. 2016. Modelling of genotype by environment interaction and prediction of complex traits across multiple environments as a synthesis of crop growth modelling, genetics and statistics. In: Crop Systems Biology: Narrowing the Gaps between Crop Modelling and Genetics. Springer.

Crossa, Fritsche-Neto, Montesinos-Lopez, Costa-Neto, Dreisigacker, Montesinos-Lopez, Bentley. 2021. The modern plant breeding triangle: optimizing the use of genomics, phenomics, and enviromics data. Front Plant Sci. 651480. doi:10.3389/fpls.2021.651480

Crossa, Montesinos-Lopez, Pérez-Rodríguez, Costa-Neto, Fritsche-Neto, Ortiz, Martini, Lillemo, Montesinos-Lopez, Jarquin, et al. 2022. Genome and environment based prediction models and methods of complex traits incorporating genotype × environment interaction. In: Genomic Prediction of Complex Traits: Methods and Protocols. Springer. p. 245–283.

Cuevas, Crossa, Soberanis, Pérez-Elizalde, Pérez-Rodríguez, Campos, Montesinos-López, Burgueño. 2016. Genomic prediction of genotype × environment interaction kernel regression models. Plant Genome. doi:10.3835/plantgenome2016.03.0024

Della Coletta, Liese, Fernandes, Mikel, Bohn, Lipka, Hirsch. 2023. Linking genetic and environmental factors through marker effect networks to understand trait plasticity. Genetics. 224:iyad103. doi:10.1093/genetics/iyad103

de Los Campos, Gianola, Rosa, Weigel, Crossa. 2010. Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genet Res (Camb). 295–308. doi:10.1017/S0016672310000285

de Los Campos, Hickey, Pong-Wong, Daetwyler, Calus. 2013. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics. 193:327–345. doi:10.1534/genetics.112.143313

Diers, Specht, Rainey, Cregan, Song, Ramasubramanian, Graef, Nelson, Schapaugh, Wang, et al. 2018. Genetic architecture of soybean yield and agronomic traits. G3 (Bethesda). 3367–3375. doi:10.1534/g3.118.200332

Elias, Robbins, Doerge, Tuinstra. 2016. Half a century of studying genotype × environment interactions in plant breeding experiments. Crop Sci. 2090–2105. doi:10.2135/cropsci2015.01.0061

Falconer. 1952. The problem of environment and selection. Am Nat. 293–298.

Falconer, Mackay. 1983. Quantitative Genetics. Longman.

Gianola, Sorensen. 2004. Quantitative genetic models for describing simultaneous and recursive relationships between phenotypes. Genetics. 167:1407–1424. doi:10.1534/genetics.103.025734

Gilmour, Butler, Cullis, Gogel, Thompson. 2017. ASReml-R reference manual version 4. VSN International Ltd, Hemel Hempstead, HP1 1ES, UK.

Habier, Fernando, Dekkers. 2007. The impact of genetic relationship information on genome-assisted breeding values. Genetics. 177:2389–2397. doi:10.1534/genetics.107.081190

Habier, Fernando, Garrick. 2013. Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics. 194:597–607. doi:10.1534/genetics.113.152207

Hardner. 2017. Exploring opportunities for reducing complexity of genotype-by-environment interaction models. Euphytica. 213:248. doi:10.1007/s10681-017-2023-0

Hayashi, Iwata. 2013. A Bayesian method and its variational approximation for prediction of genomic breeding values in multiple traits. BMC Bioinformatics. doi:10.1186/1471-2105-14-34

Hayes, Hill. 1981. Modification of estimates of parameters in the construction of genetic selection indices (‘bending’). Biometrics. 483–493.

Heslot, Akdemir, Sorrells, Jannink. 2014. Integrating environmental covariates and crop modeling into the genomic selection framework to predict genotype by environment interactions. Theor Appl Genet. 127:463–480. doi:10.1007/s00122-013-2231-5

Jarquín, Crossa, Lacaze, Du Cheyron, Daucourt, Lorgeou, Piraux, Guerreiro, Pérez, Calus, et al. 2014. A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theor Appl Genet. 127:595–607. doi:10.1007/s00122-013-2243-1

Jia, Jannink. 2012. Multiple-trait genomic selection methods increase genetic value prediction accuracy. Genetics. 192:1513–1522. doi:10.1534/genetics.112.144246

Konstantinov, Erasmus. 1993. Using transformation algorithms to estimate (co) variance components by REML in models with equal design matrices. S Afr J Anim Sci. 187–191.

Legarra, Misztal. 2008. Computing strategies in genome-wide selection. J Dairy Sci. 360–366. doi:10.3168/jds.2007-0403

Legarra, Ricard, Filangi. 2011. GS3: Genomic Selection, Gibbs Sampling, Gauss-Seidel (and BayesCπ). INRA.

Ma, Needell, Ramdas. 2015. Convergence properties of the randomized extended Gauss–Seidel and Kaczmarz methods. SIAM J Matrix Anal Appl. 1590–1604.

Malosetti, Ribaut, van Eeuwijk. 2013. The statistical analysis of multi-environment data: modeling genotype-by-environment interaction and its genetic basis. Front Physiol. 37433. doi:10.3389/fphys.2013.00044

Martini, Crossa, Toledo, Cuevas. 2020. On Hadamard and Kronecker products in covariance structures for genotype × environment interaction. Plant Genome. e20033.

Meuwissen THE, Hayes, Goddard. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 157:1819–1829. doi:10.1093/genetics/157.4.1819

Meyer. 1985. Maximum likelihood estimation of variance components for a multivariate mixed model with equal design matrices. Biometrics. 153–165.

Meyer. 2009a. Factor-analytic models for genotype × environment type problems and structured covariance matrices. Genet Sel Evol. doi:10.1186/1297-9686-41-21

Meyer. 2009b. Factor-analytic models for genotype × environment type problems and structured covariance matrices. Genet Sel Evol. doi:10.1186/1297-9686-41-21

Meyer. 2019. “Bending” and beyond: better estimates of quantitative genetic parameters? J Anim Breed Genet. 136(4):243–251.

Misztal. 2008. Reliable computing in estimation of variance components. J Anim Breed Genet. 125(6):363–370.

Misztal, Legarra. 2017. Invited review: efficient computation strategies in genomic selection. Animal. 731–736. doi:10.1017/S1751731116002366

Möhring, Piepho. 2009. Comparison of weighting in two-stage analysis of plant breeding trials. Crop Sci. 1977–1988. doi:10.2135/cropsci2009.02.0083

Montesinos-López, Flores-Cortes, de la Rosa, Crossa. 2021. A guide for kernel generalized regression methods for genomic-enabled prediction. Heredity (Edinb). 126:577–596. doi:10.1038/s41437-021-00412-1

Ødegård, Indahl, Strandén, Meuwissen. 2018. Large-scale genomic prediction using singular value decomposition of the genotype matrix. Genet Sel Evol. doi:10.1186/s12711-018-0373-2

Piepho, Möhring, Schulz-Streeck, Ogutu. 2012. A stage-wise approach for the analysis of multi-environment trials. Biom J. 844–860.

Pocrnic, Lourenco, Masuda, Misztal. 2016. Dimensionality of genomic information and performance of the algorithm for proven and young for different livestock species. Genet Sel Evol. doi:10.1186/s12711-016-0261-6

Pocrnic, Lourenco, Masuda, Misztal. 2019. Accuracy of genomic BLUP when considering a genomic relationship matrix based on the number of the largest eigenvalues: a simulation study. Genet Sel Evol. doi:10.1186/s12711-019-0516-0

Resende, Xavier, Silva PIT, Resende, Jarquin, Marcatti. 2024. GIS-based G × E modeling of maize hybrids through enviromic markers engineering. New Phytologist. doi:10.1111/nph.19951

Runcie, Cheng, Crawford. 2021. MegaLMM: mega-scale linear mixed models for genomic predictions with thousands of traits. Genome Biol. doi:10.1186/s13059-021-02416-w

Schaeffer. 1986. Pseudo expectation approach to variance component estimation. J Dairy Sci. 2884–2889. doi:10.3168/jds.S0022-0302(86)80743-3

Song, Yan, Quigley, Jordan, Fickus, Schroeder, Song, Charles An, Hyten, Nelson, et al. 2017. Genetic characterization of the soybean nested association mapping population. Plant Genome. doi:10.3835/plantgenome2016.10.0109

Sorensen, Gianola. 2002. Likelihood, Bayesian and MCMC Methods in Quantitative Genetics. Springer.

Strandén, Garrick. 2009. Derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. J Dairy Sci. 2971–2975. doi:10.3168/jds.2008-1929

Thompson, Cullis, Smith, Gilmour. 2003. A sparse implementation of the average information algorithm for factor analytic and reduced rank variance models. Aust N Z J Stat. 45(4):445–459.

Thompson, Shaw. 1990. Pedigree analysis for quantitative traits: variance components without matrix inversion. Biometrics. 399–413.

Valente, Rosa, Gianola, Weigel. 2013. Is structural equation modeling advantageous for the genetic improvement of multiple traits? Genetics. 194:561–572. doi:10.1534/genetics.113.151209

VanRaden, Jung. 1988. A general purpose approximation to restricted maximum likelihood: the tilde-hat approach. J Dairy Sci. 187–194. doi:10.3168/jds.S0022-0302(88)79541-7

Waters, van der Werf, Robinson, Hickey, Clark. 2023. Partitioning the forms of genotype-by-environment interaction in the reaction norm analysis of stability. Theor Appl Genet. 136. doi:10.1007/s00122-023-04319-9

Xavier, Habier. 2022. A new approach fits multivariate genomic prediction models efficiently. Genet Sel Evol. doi:10.1186/s12711-022-00730-w

Xavier, Jarquin, Howard, Ramasubramanian, Specht, Graef, Beavis, Diers, Song, Cregan, et al. 2018. Genome-wide analysis of grain yield stability and environmental interactions in a multiparental soybean population. G3 (Bethesda). 519–529. doi:10.1534/g3.117.300300

Xavier, Muir, Rainey. 2019. bWGR: Bayesian whole-genome regression. Bioinformatics. 1957–1959. doi:10.1093/bioinformatics/btz794