Abstract

Genomic prediction models that capture genotype-by-environment (GxE) interaction are useful for predicting site-specific performance by leveraging information among related individuals and correlated environments, but implementing such models is computationally challenging. This study describes the algorithm of these scalable approaches, including 2 models with latent representations of GxE interactions, namely MegaLMM and MegaSEM, and an efficient multivariate mixed-model solver, namely Pseudo-expectation Gauss–Seidel (PEGS), fitting different covariance structures [unstructured, extended factor analytic (XFA), Heteroskedastic compound symmetry (HCS)]. Accuracy and runtime are benchmarked on simulated scenarios with varying numbers of genotypes and environments. MegaLMM and PEGS-based XFA and HCS models provided the highest accuracy under sparse testing with 100 testing environments. PEGS-based unstructured model was orders of magnitude faster than restricted maximum likelihood (REML) based multivariate genomic best linear unbiased predictions (GBLUP) while providing the same accuracy. MegaSEM provided the lowest runtime, fitting a model with 200 traits and 20,000 individuals in ∼5 min, and a model with 2,000 traits and 2,000 individuals in less than 3 min. With the genomes-to-fields data, the most accurate predictions were attained with the univariate model fitted across environments and by averaging environment-level genomic estimated breeding values (GEBVs) from models with HCS and XFA covariance structures.

Introduction

Multienvironment trials (METs) are the main source of data for plant breeding decision-making. Genotype-by-environment (GxE) interactions occur when the performance of a genotype varies across trials due to different environmental conditions. Predicting genotype performance within environments is essential for enhanced adaptation and selection of superior genotypes. Understanding how genotypes perform in diverse environments allows breeders to tailor cultivars to specific conditions or select genotypes that perform consistently well across different environments. This leads to more effective development of cultivars and ensures new cultivars thrive in the intended conditions (Elias et al. 2016).

Genomic prediction (Meuwissen et al. 2001) is used to predict genotype performance within environments. It exploits genomic relationships between genotypes to predict breeding values (Habier et al. 2007), which accelerates genetic progress by increasing the accuracy of predictions and shortening breeding cycles, while it also allows breeders to reduce phenotyping costs (Crossa et al. 2021). Genomic prediction models that fit genetic correlations between pairs of environments, thereby capturing GxE interaction, can be more accurate than univariate (UV) models that fit phenotypes of each environment independently (Xavier and Habier 2022).

Genomic prediction models that account for GxE interactions are computationally demanding (Heslot et al. 2014). They are more complex than UV genomic models due to many covariance parameters that need to be estimated for solving large and dense mixed-model equations (Hardner 2017; Martini et al. 2020). The most complex one is the unstructured model that estimates a different covariance for each pair of environments (Bustos-Korts et al. 2016; Crossa et al. 2022). The computational burden of such a model is further exacerbated by the progress in high-throughput phenotyping, which increases the number of agronomic traits and environments substantially. One solution for reducing model complexity is fitting compound symmetry models that assume a constant genetic correlation for all pairs of environments (Cuevas et al. 2016). Another solution is fitting GxE interaction kernels that compute the Hadamard product of genetic and environment covariance matrices and scale it by a single variance (Jarquín et al. 2014). Often, the analysis of MET even follows simpler parameterizations by using Finlay–Wilkinson approaches or GGE models (Malosetti et al. 2013). All of these solutions may have lower accuracy in capturing GxE interactions and predictions than unstructured models. Thus, computationally efficient methods for estimating parameters and solving mixed-model equations of these complex models are needed when the number of environments and traits is large.

The purpose of this study is to review computationally efficient methods that can be utilized for genomic prediction within environments while capturing complex GxE patterns. We rate the scalability of these methods for applying them to varying numbers of genotypes, markers, and environments. Benchmarks of accuracy and runtime are performed on multiple simulated scenarios and real data.

Materials and Methods

In environment-specific models, phenotypes in environment k are modeled as

(1)

with variance

(2)

where the vector of phenotypes (yk) is modeled by the overall mean (μk), a vector of genetic signal (gk) and a vector of residuals (ek). The phenotypic covariance matrix Vk is the sum of genetic (Gk) and residual (Rk) covariance matrices. In an additive model, the genetic covariance matrix is calculated as

(3)

where σβ(k)2 denotes the variance of marker effects and Z is a matrix that contains the molecular allele dosage coded as 0, 1, or 2 for different genotypes in rows and different marker loci in columns of the matrix. Residuals in each environment k are assumed independent and identically distributed as Rk=Iσe(k)2. The genetic term can be described as a linear combination of marker effects (Habier et al. 2007) as

(4)

where Var(βk)=Iσβ(k)2. In MET analyses, the covariance between a pair of environments is denoted as Cov(βk,βk), which is a function of the GxE correlation between environments k and k, ρβ(k,k), as

(5)

Realistically, every environment has its unique signal (βk) and every pair of environments has a unique correlation (Falconer 1952). Approaches designed to accommodate such assumptions are presented next. Collectively, these correspond to efficient (1) models, (2) parameterizations, and (3) solvers.

(1) Efficient models

Univariate by environment

By treating environments as independent, the UV model is the simplest approach to obtaining environment-specific predictions. They are obtained by fitting location-year combinations from either raw or spatially adjusted phenotypes (Möhring and Piepho 2009; Piepho et al. 2012). Accurate predictions can be achieved when genotypes in the model are related, but accuracy can be constrained by the dataset size (Xu 2003) and genetic scope (Habier et al. 2013).

Multivariate across environment

Environments are treated as different traits that are fitted with an unstructured genetic covariance matrix (Falconer 1952; Falconer and Mackay 1983). This requires the estimation of genetic variances and covariances between pairs of environments [equation (5)].

Simplified canonical transformation

For complete data across environments, i.e. balanced without missing phenotypes, canonical transformation (CT) is an efficient method to fit all environments simultaneously, while capturing the shared genetic information across correlated environments (Meyer 1985; Konstantinov and Erasmus 1993). The traditional CT method (see  Appendix A) iterates through transforming trait phenotypes and solving transformed variance components until convergence.

A simplified canonical transformation (SCT) that avoids the iterative estimation of covariances is described below. Phenotypes are transformed once by singular value decomposition (SVD)

(6)

where Y={y1,,yK} is a matrix of phenotypes with rows and columns representing observations and traits, respectively, K is the number of environments, F=UD is a latent traits matrix that equals the principal components of Y, and V is a rotation matrix. The latent traits are orthogonal with cov(fi,fj)=0 for every pair of latent traits i and j. Subsequently, each latent trait is fitted using a UV model

(7)

and estimated breeding values of the original traits are calculated by

(8)

where G^={g^1,,g^K}, B^=Γ^V, Γ^={γ^1,,γ^K}, and V is rotating marker effects back to the natural scale of the phenotypes.

MegaLMM

MegaLMM (Runcie et al. 2021) extends the decomposition of SCT with a more parsimonious latent representation of the phenotypes, with the addition of trait-specific model terms (see  Appendix B). This enables efficient handling of missing values in Y while permitting the model to accommodate more traits. Latent spaces are inferred from a stochastic matrix decomposition of Y based on the following statistical model:

(9)

The matrix of phenotypes is decomposed into latent spaces F rotated by Λ and residuals J. This residual matrix J contains genetic signal (ZΔ) not captured by FΛ and true error (E), thus J=ZΔ+E. In this approach, each trait is fitted by a UV model as

(10)

and, then, each latent space is modeled by

(11)

The shared genetic signal is captured by Fλk, while the environment-specific genetic signal is captured by Zkδk. Multivariate marker effects are estimated by β^k=Γλ^k+δ^k, so that the complete matrix of estimated breeding values can be calculated with

(12)

The latent spaces are strongly shrunken according to their relative importance (Runcie et al. 2021). The original implementation of MegaLMM uses Markov Chain Monte Carlo (MCMC) for estimating F^ and Λ^ by alternating between the conditional models (Y|F)=FΛ+J and (Y|Λ)=ΛF+J for updating the coefficients.

Structural equation models

Structural equation models (SEMs) are used for modeling directionally correlated response variables (Gianola and Sorensen 2004) and causal trait networks (Valente et al. 2013). For MET analyses, a fully connected SEM fits the phenotypes of environment k as a function of phenotypes at other environments and model terms specific to environment k:

(13)

where μk is the intercept, ψk is a vector of length k that quantifies the linear associations between phenotypes of trait k and phenotypes from other traits (ψk=0), and δk is a vector of environment-specific marker effects. Marker effects are estimated from

(14)

where Ψ^={ψ^1,ψ^2,,ψ^K} and Δ^={δ^1,δ^2,,δ^K}.

The SEM solution is straightforward for balanced datasets with a small number of traits. SEM differs from the MegaLMM model by explicitly parameterizing phenotypic traits (YΨ) in the model as opposed to using a latent representation (Fλ).

MegaSEM

An intermediate parameterization between SEM and MegaLMM, here referred to as MegaSEM, can be achieved by utilizing a matrix of UV genomic estimated breeding values (GEBVs) from all traits to parameterize the model for every trait. The matrix of UV GEBVs is given by G0=ZB0, using UV marker effects B0={b0(1),b0(2),,b0(K)}. The matrix of UV GEBVs is then decomposed by SVD as

(15)

Once principal components (F0=U0D0) have been computed, MegaSEM fits each trait as

(16)

which has the same form as equation (10). Similar to the MegaLMM model, the shared genetic signal is captured by F0αk and environment-specific genetic signal is captured by Zkδk. Estimated breeding values are calculated by

(17)

Note that F0=ZB0V0, because

(18)

as V0V0=I. As shown in equation (18), the estimated marker effects for trait k, β^k, are a linear combination of marker effects from all environments estimated by UV analyses in B0, and environment-specific effects α^k and δ^k, i.e. β^k=B0V0α^k+δ^k.When all principal components of G0 are utilized (i.e. no dimensionality reduction), the MegaSEM model can be simplified further by

(19)

where the environment-specific term can be omitted if the model is parameterized with all linear combinations of environments, which yields B^=B0V0A^.

The computational cost of MegaSEM consists of running a UV model twice, first to estimate B0 and then to calculate B^, in addition to the SVD of G0.

(2) Efficient parameterizations

Kernels and rotations

Kernel-based genomic prediction is used to parameterize linear and nonlinear relationships (de Los Campos et al. 2010, 2013; Montesinos-López et al. 2021). Rotation of kernels by spectral, eigenvalue decomposition (EVD), or singular-value decomposition (SVD), enables solving such models by a Gauss–Seidel (GS) algorithm (Legarra and Misztal 2008). Rotations are also useful when the number of parameters far exceeds the number of genotypes because they can reduce the dimensionality of the problem substantially (Ødegård et al. 2018; Xavier and Habier 2022). Genomic prediction based on a kernel K can be described as

(20)

The kernel, which describes genetic relationships, can be decomposed by EVD (Thompson and Shaw 1990) as

(21)

where Q=UD. The model can be reparameterized as

(22)

where genetic effects are fitted as g=Qα with variance–covariance matrix Var(g)=QQσg2=Kσg2 and σg2 is the genetic variance. To calculate predictions for individuals who were not in the training dataset (TS) used in the analysis, a rotation matrix is defined as

(23)

Let KPS,TS be the kernel that contains the genetic relationships between prediction (PS) and training dataset (TS) individuals. Then, the matrix QPS,TS can be calculated as

(24)

and estimated breeding values can be obtained by

(25)

For verification, it can be shown that Q=KR=UD.

Rotation for reducing dimensions

When fitting a linear additive model (g=Zβ) with a large number of parameters (pn), the kernel trick can be used to reduce model dimensionality by kernalizing the genomic information with K=ZZ, and then decomposing K via EVD [equation (21)]. Marker effects are calculated with

(26)

where the rotation matrix [R, equation (23)] originates from the decomposition of K. This approach reduces the number of parameters from p to n. When n is also large, the rotation matrix R~ can be created from the subset (n~<n) as described in equation (23). The full design matrix is created as

(27)

where Z~ and R~ herein represent the genotypes and rotation matrix generated from a subset of individuals. The dimensionality of the matrix is hereby reduced to less than the original number of parameters and observations. Alternatively, the sparse inversion of K is another computationally efficient way to solve kernel models (see  Appendix D).

Rotations using SVD

The kernel rotation parameterizations aforementioned, Q and Qn|n~ can also be derived directly by SVD of Z as

(28)

which yields the same U and D from equation (21). Principal components are obtained either with Q=UD or Q=ZV. Subset rotation [equation (27)] are obtained with Qn|n~=ZnVn~, where Vn~ comes from the SVD of a population subset. Models parameterized by Qα [equation (22)] using SVD recover the marker effects [equation (4)] with β=Vα.

Diagonalization

Diagonalization refers to converting a dense matrix into a diagonal matrix, which makes it useful for genetic relationship models. Using EVD for the decomposition of K as in equation (21), the genetic term g of the model in equation (20) can be written as a regression of eigenvectors, U, resulting in

(29)

Note that the covariance matrix for θ is Var(θ)=D2σg2, whereas the covariance matrix for α in equation (22) is Var(α)=Iσg2. Exploiting UU=I, equation (29) can be further transformed into

(30)

where the residual variances are unaffected because Var(e)=UUσe2=Iσe2. Computational advantages of equation (30) are the sparsity of the mixed-model equations and a reduction in dimensionality by not using all eigenvectors. For the multivariate (MV) case, diagonalization makes it feasible to solve variance components by REML using commercial software (Gilmour et al. 2017), but only when data are balanced (i.e. all genotypes are observed in all environments). This approach has become particularly useful for genome-wide association methods, both UV and MV (Zhou and Stephens 2012, 2014).

(3) Efficient solvers

Efficient univariate solver

An efficient algorithm for solving mixed-model equations is called the pseudo-expectation Gauss–Seidel (PEGS) method (Xavier and Habier 2022). PEGS is iterative based on the pseudo-expectation (PE) estimator of variance components (Schaeffer 1986) along with the estimation of coefficients using GS residual updating (Legarra and Misztal 2008). A boost in convergence is achieved by randomizing the order in which coefficients are updated within each iteration (Ma et al. 2015). PE is an approximation of REML (VanRaden and Jung 1988) that provides unbiased and invariant variance components (Xavier and Habier 2022). In terms of implementation, PEGS resembles a non-MCMC version of the Bayesian Gibbs sampler. It estimates breeding values from a single nuclear polymorphism (SNP)-best linear unbiased prediction (BLUP) model, or ridge regression, as

(31)

where βN(0,Iσβ2) and eN(0,Iσe2). The variance components are estimated as

(32)
(33)

where S is the fixed effect absorption matrix defined as S=IX(XX)1X. When the intercept is the only fixed effect, β~=Z(yy¯) and Tr(ZSZ)=j=1Ji=1I(zijz¯j)2, where zij denotes the allele dosage of genotype i and z¯j denotes the average allele dosage, both at locus j. The marker effects are solved with

(34)

where zj is a vector of allele dosage at marker j. This is followed by the update of the residuals as

(35)

Similarly, the intercept updates, followed by the residual update, are achieved with

(36)
(37)

This solver is suitable for estimating coefficients and variance components for UV models and different MV parameterizations, including those that do not explicitly involve the computation of covariances, such as CT, SEMs, and MegaSEM.

Efficient multivariate solver

In unstructured MV models (Xavier and Habier 2022), the genetic covariance of a pair of environments k and k, and the residual variance for environment k, have the following solution using PE:

(38)
(39)

where Σ^β(k,k) denotes the estimated variance–covariance matrix of marker effects in different environments and Σ^e(k) is a matrix that contains estimated residual variances for different environments on its diagonal.

The MV solution for updating coefficients for the jth marker is

(40)

where β^j is a vector of estimated marker effects for different environments at locus j, Z˙j=k=1Kzjk, zjk is a vector of allele dosage for locus j and environment k, and denotes the direct matrix product. The update of each vector of marker effects is followed by the update of the residuals, as

(41)

Depending on the number of environments, the inversion of Σ^β may require bending (Hayes and Hill 1981; Meyer 2019), which is a technique that forces a covariance matrix to be positive-definite. Inversions can also be avoided using alternative systems of equations (see  Appendix C).

Simplified covariance structures

Equation (38) provides a solution for unstructured covariance matrices. Their structure can be simplified similar to covariance matrices from extended factor analytic (XFA) models, which is meant to capture the main GxE interaction patterns (Thompson et al. 2003; Meyer 2009a, 2009b). The XFA covariance matrix is obtained by decomposing the unstructured Σβ via EVD, and then reassembling it using a few eigenpairs that explain most of the variation while reinstating the original diagonal elements to avoid changes in heritability. Let

(42)

and consider that U~ and D~ are subsets of U and D. The XFA covariance matrix is obtained with

(43)

where N is a diagonal matrix that recovers the original diagonal elements of Σβ.

Heteroskedastic compound symmetry (HCS) is another covariance structure attainable by the simplification of Σβ. In an HCS structure, all pairs of environments have the same GxE correlation, and environments have different variances. Here, this structure is derived within the PEGS algorithm from Σβ by calculating the average GxE correlation of pairs of environments and then covariances of Σβ(HCS) are calculated using equation (5).

Benchmark analyses

Sparse testing benchmark

One hundred environments were simulated to test the predictive performance with varying amounts of missing values per environment (75% and 90%), levels of heritability (h2={0.2,0.4,0.6}), and levels of correlation among environments (ρGxE={0.2,0.4,0.6}). Two genetic models for simulating phenotypes were used: (1) HCS with a constant GxE correlation and (2) an unstructured MV model with different GxE correlations between environments. For the latter, correlations ranged from 0.5 to 0.8, with a mean of 0.0 and a SD of 0.3. The phenotypes were simulated using the function SimGC from the R package bWGR (Xavier et al. 2019).

The accuracy was measured as the correlation between true and estimated breeding values within environments. The simulated dataset used real genomic information from 9 soybean families from the SoyNAM panel (Song et al. 2017; Diers et al. 2018; Xavier et al. 2018), where each family consists of ∼140 fullsibs genotyped with 4,240 markers.

MegaLMM is implemented in the R package MegaLMM (Runcie et al. 2021), and the other prediction methods are implemented in the R package bWGR (Xavier et al. 2019). The function MRR3F was used to run the UV and MV models with unstructured, HCS and XFA covariance structures. The function ZSEMF was used to run the full-rank MegaSEM [equation (19)]. Three eigenpairs were used to construct the XFA covariance matrices. Coefficients and variance components were estimated with PEGS for all methods except for MegaLMM, which uses Bayesian Gibbs sampling (BGS).

Runtime benchmark

Runtime was assessed by 5 simulated scenarios that vary the number of individuals from a biparental family (n={500,2,000,20,000}) and the number of environments (k={10,50,200,2,000}). The within-environment heritability was set to 0.2. Simulated covariance matrices were unstructured, with GxE correlations ranging from 0.5 to 0.8 with a mean of 0.0 and a SD of 0.3. The genomic information was based on 10 chromosomes of 500 cM with 1 marker per cM. Accuracy was measured as the mean correlation between predicted and true breeding values within environments. Genomic information, correlation matrices, and phenotypes were simulated using the functions SimZ, SimGC, and SimY from the R package bWGR (Xavier et al. 2019), respectively.  Appendix E provides an example of the simulation.

The complete dataset allows us to assess approaches that require balanced testing, including diagonalization [equation (30)] and CT [equations (6), (7), and (8)]. The following methods were evaluated: REML-based genomic best linear unbiased prediction (GREML) and its diagonalized version (D-GREML), MegaLMM, MegaSEM, UV by environment, MV models with unstructured, XFA and HCS covariance structures, and SCT. GREML, D-GREML, and MegaLMM were parameterized with genomic relationships, whereas other methods were fit as ridge regression [equation (4)], using the SVD parameterization [equation (28)] in scenarios with more markers than individuals. MegaSEM, UV, MV, XFA, HCS, and SCT utilized the PEGS solver, whereas MegaLMM was fitted using BGS. GREML and D-GREML were fit using asreml-r 4.2 (Gilmour et al. 2017). REML-based approaches were not tested for scenarios with more than 50 environments due to the runtime and lack of algorithmic stability. MegaLMM was not evaluated in the scenario with 20,000 individuals due to computational requirements, and XFA was not evaluated in the scenario with 2,000 traits due to numeric instability.

Real data benchmark

Predictive ability was evaluated with a corn dataset from the 2022 Genomes-to-Fields (G2F) GXE prediction competition. This dataset provides grain yield from 217 locations in the United States, harvested from 2014 to 2021, and from 4,836 unique hybrids. The validation dataset contains phenotypes from 23 locations harvested in 2022 and 548 hybrids that were not observed in the training dataset. SNP genotypes were available for hybrids. For this study, a subset of 10,000 SNPs was utilized.

The following methods were evaluated: MegaLMM, MegaSEM, univariate model fitted within each environment (UVW), univariate model that fitted BLUEs estimated across all environments (UVA), MV models with unstructured, XFA, and HCS covariance structure. These models were evaluated based on predictive ability calculated as the correlation between predicted and observed values for individuals in the validation dataset. We distinguished 3 different predictive ability values:

  1. Pairwise: Average of by-environment correlation between predicted and observed values of an environment. This metric informs the expectation of how any given environment from the training dataset would predict any given environment from the validation dataset.

  2. Region: Predictions were averaged across environments within a state and then correlated with validation phenotypes from environments of the same state. This metric assesses how the average of environments located in close proximity can predict a new environment within the same geographical scope.

  3. Overall: Predictions were averaged across all environments in the training dataset and correlated with phenotypes from each environment in the validation set. This metric assesses how well averages across all environments, which is thought of as a proxy of the targeted population of environments (TPE), predict a new environment.

Results and discussion

Sparse testing benchmark

Results are presented in Fig. 1. Heritability is the most influential factor for the accuracy of predictions. It was not influenced by the proportion of missing data, based on the 2 simulation parameters (75% and 90%). The difference in the accuracy of predictions between UV and methods that capture GxE interactions increased with increasing GxE correlations. The accuracy of UV did not vary with changes in GxE correlation because the GxE information is not captured when each environment is analyzed separately.

Prediction accuracy within-family and within-environment using 100 simulated environments with varying heritability, percentage of missing values, and GxE correlation. Genomic information was sourced from 9 soybean biparental families.
Fig. 1.

Prediction accuracy within-family and within-environment using 100 simulated environments with varying heritability, percentage of missing values, and GxE correlation. Genomic information was sourced from 9 soybean biparental families.

XFA was among the top predictive models in all scenarios. The HCS model had the highest accuracy in scenarios with constant, positive GxE correlation because it was the model used to simulate phenotypes, but HCS was equivalent to UV in the unstructured GxE scenario. MegaLMM was a top-performing model in scenarios with constant, positive GxE, and slightly lower performing in the unstructured scenario with heritability of 0.2 and 0.4. MV was a top-performing model for the unstructured GxE scenario, but was the least predictive model in the scenario with a constant GxE correlation of 0.2, and had intermediate performance when the constant GxE correlation was 0.4 and 0.6. MegaSEM provided a predictive performance between UV and MV.

In the unstructured simulation scenario, when MV was the true model, XFA and MegaLMM had similar accuracies as MV. The reason may be that covariances estimated by MV had decreasing precision with decreasing heritability and true GxE correlation. Thus, lowering the dimensionality of the covariance structure reduces the number of parameters to be estimated, which may reduce the impact of this precision on the accuracy of GEBVs. This trend was also observed by Xavier and Habier (2022).

Runtime benchmark

Results are displayed in Tables 1 and 2. All methods were faster than GBLUP solved by REML. Diagonalization considerably decreased the runtime of GBLUP, but its accuracy was not included in Table 2 due to odd results caused by singularities in the Average Information matrix. MegaLMM had a longer runtime than methods solved by PEGS because it uses an MCMC solver. MV took longer than HCS and XFA in the scenario with 200 and 2,000 environments. The lowest runtimes were obtained by UV, SCT, and MegaSEM. SCT provided approximately the same runtime as UV because it consists of running UV models in transformed spaces. The runtime of MegaLMM was more sensitive to the number of individuals than the number of traits. The runtime of MV, XFA, and HCS were more sensitive to the number of environments.

Table 1.

Average runtime in minutes (SE) for the balanced experimental design based on 10 simulated replicates.

ModelSolver10/50010/2,00050/2,000200/2,0002,000/2,000200/20,000
GREMLREML46.75 (0.37)172.61 (17.93)
D-GREMLREML0.06 (<0.1)0.19 (<0.1)8.32 (3.51)
MegaLMMMCMC0.31 (0.01)4.38 (0.06)7.23 (1.19)17.71 (4.02)130.77 (11.51)
MegaSEMPEGS<0.01 (<0.01)0.01 (<0.01)0.04 (<0.01)0.14 (<0.01)2.92 (0.02)5.26 (0.07)
MVPEGS<0.01 (<0.01)<0.1 (<0.01)0.02 (<0.01)9.12 (1.62)97.14 (1.29)82.22 (5.71)
XFAPEGS<0.01 (<0.01)<0.1 (<0.01)0.03 (<0.01)0.49 (0.09)81.46 (1.38)
HCSPEGS<0.01 (<0.01)<0.01 (<0.01)0.02 (<0.01)0.22 (0.04)38.74 (3.60)37.74 (4.45)
SCTPEGS<0.01 (<0.01)0.01 (<0.01)0.04 (<0.01)0.15 (0.01)1.65 (0.01)5.25 (0.05)
UVPEGS<0.01 (<0.01)0.01 (<0.01)0.04 (<0.01)0.14 (<0.01)1.44 (0.01)5.20 (0.06)
ModelSolver10/50010/2,00050/2,000200/2,0002,000/2,000200/20,000
GREMLREML46.75 (0.37)172.61 (17.93)
D-GREMLREML0.06 (<0.1)0.19 (<0.1)8.32 (3.51)
MegaLMMMCMC0.31 (0.01)4.38 (0.06)7.23 (1.19)17.71 (4.02)130.77 (11.51)
MegaSEMPEGS<0.01 (<0.01)0.01 (<0.01)0.04 (<0.01)0.14 (<0.01)2.92 (0.02)5.26 (0.07)
MVPEGS<0.01 (<0.01)<0.1 (<0.01)0.02 (<0.01)9.12 (1.62)97.14 (1.29)82.22 (5.71)
XFAPEGS<0.01 (<0.01)<0.1 (<0.01)0.03 (<0.01)0.49 (0.09)81.46 (1.38)
HCSPEGS<0.01 (<0.01)<0.01 (<0.01)0.02 (<0.01)0.22 (0.04)38.74 (3.60)37.74 (4.45)
SCTPEGS<0.01 (<0.01)0.01 (<0.01)0.04 (<0.01)0.15 (0.01)1.65 (0.01)5.25 (0.05)
UVPEGS<0.01 (<0.01)0.01 (<0.01)0.04 (<0.01)0.14 (<0.01)1.44 (0.01)5.20 (0.06)

Six scenarios vary in terms of the number of environments and individuals (no. environments/no. individuals). Models are ordered based on computational performance. The SE is shown in parenthesis.

Table 1.

Average runtime in minutes (SE) for the balanced experimental design based on 10 simulated replicates.

ModelSolver10/50010/2,00050/2,000200/2,0002,000/2,000200/20,000
GREMLREML46.75 (0.37)172.61 (17.93)
D-GREMLREML0.06 (<0.1)0.19 (<0.1)8.32 (3.51)
MegaLMMMCMC0.31 (0.01)4.38 (0.06)7.23 (1.19)17.71 (4.02)130.77 (11.51)
MegaSEMPEGS<0.01 (<0.01)0.01 (<0.01)0.04 (<0.01)0.14 (<0.01)2.92 (0.02)5.26 (0.07)
MVPEGS<0.01 (<0.01)<0.1 (<0.01)0.02 (<0.01)9.12 (1.62)97.14 (1.29)82.22 (5.71)
XFAPEGS<0.01 (<0.01)<0.1 (<0.01)0.03 (<0.01)0.49 (0.09)81.46 (1.38)
HCSPEGS<0.01 (<0.01)<0.01 (<0.01)0.02 (<0.01)0.22 (0.04)38.74 (3.60)37.74 (4.45)
SCTPEGS<0.01 (<0.01)0.01 (<0.01)0.04 (<0.01)0.15 (0.01)1.65 (0.01)5.25 (0.05)
UVPEGS<0.01 (<0.01)0.01 (<0.01)0.04 (<0.01)0.14 (<0.01)1.44 (0.01)5.20 (0.06)
ModelSolver10/50010/2,00050/2,000200/2,0002,000/2,000200/20,000
GREMLREML46.75 (0.37)172.61 (17.93)
D-GREMLREML0.06 (<0.1)0.19 (<0.1)8.32 (3.51)
MegaLMMMCMC0.31 (0.01)4.38 (0.06)7.23 (1.19)17.71 (4.02)130.77 (11.51)
MegaSEMPEGS<0.01 (<0.01)0.01 (<0.01)0.04 (<0.01)0.14 (<0.01)2.92 (0.02)5.26 (0.07)
MVPEGS<0.01 (<0.01)<0.1 (<0.01)0.02 (<0.01)9.12 (1.62)97.14 (1.29)82.22 (5.71)
XFAPEGS<0.01 (<0.01)<0.1 (<0.01)0.03 (<0.01)0.49 (0.09)81.46 (1.38)
HCSPEGS<0.01 (<0.01)<0.01 (<0.01)0.02 (<0.01)0.22 (0.04)38.74 (3.60)37.74 (4.45)
SCTPEGS<0.01 (<0.01)0.01 (<0.01)0.04 (<0.01)0.15 (0.01)1.65 (0.01)5.25 (0.05)
UVPEGS<0.01 (<0.01)0.01 (<0.01)0.04 (<0.01)0.14 (<0.01)1.44 (0.01)5.20 (0.06)

Six scenarios vary in terms of the number of environments and individuals (no. environments/no. individuals). Models are ordered based on computational performance. The SE is shown in parenthesis.

Table 2.

Within environment accuracy for the balanced experimental design based on 10 simulated replicates.

ModelSolver10/50010/2,00050/2,000200/2,0002,000/2,000200/20,000
GREMLREML0.81 (0.03)0.89 (<0.01)
MegaLMMMCMC0.78 (0.04)0.87 (<0.01)0.87 (<0.01)0.89 (<0.01)0.90 (<0.01)
MegaSEMPEGS0.79 (0.04)0.88 (<0.01)0.89 (<0.01)0.89 (<0.01)0.89 (<0.01)0.96 (<0.01)
MVPEGS0.81 (0.03)0.89 (<0.01)0.89 (<0.01)0.90 (<0.01)0.88 (<0.01)0.96 (<0.01)
XFAPEGS0.80 (0.04)0.89 (<0.01)0.89 (<0.01)0.89 (<0.01)0.96 (<0.01)
HCSPEGS0.81 (0.03)0.88 (<0.01)0.88 (<0.01)0.88 (<0.01)0.88 (<0.01)0.96 (<0.01)
SCTPEGS0.81 (0.03)0.89 (<0.01)0.88 (<0.01)0.87 (<0.01)0.87 (<0.01)0.95 (<0.01)
UVPEGS0.78 (0.04)0.87 (<0.01)0.87 (<0.01)0.87 (<0.01)0.87 (<0.01)0.95 (<0.01)
ModelSolver10/50010/2,00050/2,000200/2,0002,000/2,000200/20,000
GREMLREML0.81 (0.03)0.89 (<0.01)
MegaLMMMCMC0.78 (0.04)0.87 (<0.01)0.87 (<0.01)0.89 (<0.01)0.90 (<0.01)
MegaSEMPEGS0.79 (0.04)0.88 (<0.01)0.89 (<0.01)0.89 (<0.01)0.89 (<0.01)0.96 (<0.01)
MVPEGS0.81 (0.03)0.89 (<0.01)0.89 (<0.01)0.90 (<0.01)0.88 (<0.01)0.96 (<0.01)
XFAPEGS0.80 (0.04)0.89 (<0.01)0.89 (<0.01)0.89 (<0.01)0.96 (<0.01)
HCSPEGS0.81 (0.03)0.88 (<0.01)0.88 (<0.01)0.88 (<0.01)0.88 (<0.01)0.96 (<0.01)
SCTPEGS0.81 (0.03)0.89 (<0.01)0.88 (<0.01)0.87 (<0.01)0.87 (<0.01)0.95 (<0.01)
UVPEGS0.78 (0.04)0.87 (<0.01)0.87 (<0.01)0.87 (<0.01)0.87 (<0.01)0.95 (<0.01)

Six scenarios vary in terms of the number of environments and individuals (no. environments/no. individuals). Models are ordered based on computational performance. SE is shown in parenthesis.

Table 2.

Within environment accuracy for the balanced experimental design based on 10 simulated replicates.

ModelSolver10/50010/2,00050/2,000200/2,0002,000/2,000200/20,000
GREMLREML0.81 (0.03)0.89 (<0.01)
MegaLMMMCMC0.78 (0.04)0.87 (<0.01)0.87 (<0.01)0.89 (<0.01)0.90 (<0.01)
MegaSEMPEGS0.79 (0.04)0.88 (<0.01)0.89 (<0.01)0.89 (<0.01)0.89 (<0.01)0.96 (<0.01)
MVPEGS0.81 (0.03)0.89 (<0.01)0.89 (<0.01)0.90 (<0.01)0.88 (<0.01)0.96 (<0.01)
XFAPEGS0.80 (0.04)0.89 (<0.01)0.89 (<0.01)0.89 (<0.01)0.96 (<0.01)
HCSPEGS0.81 (0.03)0.88 (<0.01)0.88 (<0.01)0.88 (<0.01)0.88 (<0.01)0.96 (<0.01)
SCTPEGS0.81 (0.03)0.89 (<0.01)0.88 (<0.01)0.87 (<0.01)0.87 (<0.01)0.95 (<0.01)
UVPEGS0.78 (0.04)0.87 (<0.01)0.87 (<0.01)0.87 (<0.01)0.87 (<0.01)0.95 (<0.01)
ModelSolver10/50010/2,00050/2,000200/2,0002,000/2,000200/20,000
GREMLREML0.81 (0.03)0.89 (<0.01)
MegaLMMMCMC0.78 (0.04)0.87 (<0.01)0.87 (<0.01)0.89 (<0.01)0.90 (<0.01)
MegaSEMPEGS0.79 (0.04)0.88 (<0.01)0.89 (<0.01)0.89 (<0.01)0.89 (<0.01)0.96 (<0.01)
MVPEGS0.81 (0.03)0.89 (<0.01)0.89 (<0.01)0.90 (<0.01)0.88 (<0.01)0.96 (<0.01)
XFAPEGS0.80 (0.04)0.89 (<0.01)0.89 (<0.01)0.89 (<0.01)0.96 (<0.01)
HCSPEGS0.81 (0.03)0.88 (<0.01)0.88 (<0.01)0.88 (<0.01)0.88 (<0.01)0.96 (<0.01)
SCTPEGS0.81 (0.03)0.89 (<0.01)0.88 (<0.01)0.87 (<0.01)0.87 (<0.01)0.95 (<0.01)
UVPEGS0.78 (0.04)0.87 (<0.01)0.87 (<0.01)0.87 (<0.01)0.87 (<0.01)0.95 (<0.01)

Six scenarios vary in terms of the number of environments and individuals (no. environments/no. individuals). Models are ordered based on computational performance. SE is shown in parenthesis.

The accuracy of UV was insensitive to the number of environments, as it does not capture any GxE information. All methods that capture GxE information were as predictive or better than UV, although the difference in the accuracy of GEBVs declined as the number of individuals increased or GxE correlation decreased. In scenarios with 10 environments, only SCT and MV provided the same accuracy as GBLUP but the accuracy of SCT decreased as the number of environments increased. MV was the most accurate model in all scenarios under 200 environments but its accuracy dropped in the scenario with 2,000 environments, due to the number of parameters estimated in Σβ and a need for bending this matrix to obtain its inverse. In the scenario with 2,000 environments, the highest accuracy was obtained by MegaLMM followed by MegaSEM. MegaSEM provided either the highest or second-highest accuracy in all scenarios except the one with the lowest dimensionality.

When taking into account both runtime and accuracy, our results indicate that the best method depends on the dimensionality of the data. MegaSEM suits scenarios with a large number of individuals and traits, providing high accuracy and low runtime. SCT and diagonalized GBLUP should be considered when data are balanced and the number of environments is modest. MegaLMM suits datasets with thousands or more traits but with a moderate number of individuals. HCS, MV, and XFA are suitable for datasets with up to 200 environments.

Real data benchmark

Results are displayed in Table 3. The predictive ability of overall averages was always greater than regional averages, indicating low genotype-by-region interactions. For overall averages, the highest predictive abilities were obtained by UVA, XFA, and HCS, whereas MegaLMM and MegaSEM had the same predictive ability as UVW.

Table 3.

Predictive ability from the 2022 G2F GxE prediction competition.

ModelPairwiseRegionOverall
UVW0.08 (0.03)0.22 (0.14)0.27 (0.11)
MV0.12 (0.05)0.27 (0.12)0.30 (0.11)
MegaSEM0.13 (0.05)0.25 (0.15)0.27 (0.11)
MegaLMM0.18 (0.06)0.24 (0.19)0.27 (0.10)
XFA0.21 (0.07)0.31 (0.13)0.35 (0.12)
HCS0.24 (0.09)0.34 (0.11)0.36 (0.11)
UVA0.35 (0.12)
ModelPairwiseRegionOverall
UVW0.08 (0.03)0.22 (0.14)0.27 (0.11)
MV0.12 (0.05)0.27 (0.12)0.30 (0.11)
MegaSEM0.13 (0.05)0.25 (0.15)0.27 (0.11)
MegaLMM0.18 (0.06)0.24 (0.19)0.27 (0.10)
XFA0.21 (0.07)0.31 (0.13)0.35 (0.12)
HCS0.24 (0.09)0.34 (0.11)0.36 (0.11)
UVA0.35 (0.12)

Corn grain yield was observed in 4,836 hybrids across 217 locations (2014–2021) predicting 548 hybrids observed across 21 environments (2022). Models are ordered based on the pairwise metric. The SE is shown in parentheses.

Table 3.

Predictive ability from the 2022 G2F GxE prediction competition.

ModelPairwiseRegionOverall
UVW0.08 (0.03)0.22 (0.14)0.27 (0.11)
MV0.12 (0.05)0.27 (0.12)0.30 (0.11)
MegaSEM0.13 (0.05)0.25 (0.15)0.27 (0.11)
MegaLMM0.18 (0.06)0.24 (0.19)0.27 (0.10)
XFA0.21 (0.07)0.31 (0.13)0.35 (0.12)
HCS0.24 (0.09)0.34 (0.11)0.36 (0.11)
UVA0.35 (0.12)
ModelPairwiseRegionOverall
UVW0.08 (0.03)0.22 (0.14)0.27 (0.11)
MV0.12 (0.05)0.27 (0.12)0.30 (0.11)
MegaSEM0.13 (0.05)0.25 (0.15)0.27 (0.11)
MegaLMM0.18 (0.06)0.24 (0.19)0.27 (0.10)
XFA0.21 (0.07)0.31 (0.13)0.35 (0.12)
HCS0.24 (0.09)0.34 (0.11)0.36 (0.11)
UVA0.35 (0.12)

Corn grain yield was observed in 4,836 hybrids across 217 locations (2014–2021) predicting 548 hybrids observed across 21 environments (2022). Models are ordered based on the pairwise metric. The SE is shown in parentheses.

Results from real data aligned with sparse testing simulations with moderate GxE (Fig. 1). This was supported by a GxE correlation of 0.4 estimated by HCS. Because UVA was among the most predictive models, we fitted UVW on the residuals of UVA to investigate how much GxE was left from UVA. This led to a slight improvement in the predictive ability of overall averages, from 0.35 (0.12) to 0.36 (0.12), which indicates low GxE correlations after fitting the main genetic term.

The precision of estimated GxE correlations and thereby accuracies of predictions of models with more complex covariance structures are lower for the following reasons. Firstly, a large number of environments had a small number of observations, whereas 3 environments had as few as 22 individuals. Secondly, the number of individuals that overlap across environments was limited with a median overlap between pairs of environments of 19 individuals. Thirdly, the relatedness between individuals within and across environments may be not high enough to provide more accurate genetic parameter estimates. Thus, collectively, if there was a stronger variation in GxE correlations between pairs of environments, these reasons may not allow the more complex models to reliably detect it. This may be different in commercial breeding data where the relatedness between individuals and the number of individuals across environments is expected to be higher. Despite those challenges, even the models with complex covariance structures converged. In conclusion, these results help select the right model for the data structure in a given dataset.

In this study, the G2F dataset was solely evaluated based on its predictive ability. However, models that capture complex GxE interactions, such as MV, XFA, MegaSEM, and MegaLMM, have additional benefits. These models allow for environment-specific predictions, which can be used to create selection indices aimed at improving performance under specific conditions or for broad adaptation. Beyond prediction and selection, MV models provide pairwise estimates of GxE correlations and genomic heritabilities for each environment. These correlations can reveal patterns and clusters of environments, while heritability estimates inform the location quality. The insights from GxE correlations and genomic heritability estimates are valuable for planning new trials, redesigning experiments, reallocating resources, and optimizing trial networks.

Scalability of different parameterizations

A general summary of the scalability of the different parameterizations is provided in Table 4. The table shows that no method is completely scalable in all scenarios. For example, as the number of markers increases, SVD (Qα) and genomic relationship-based methods (ZZ) are preferred over SNP-based regressions (Zβ and ZZ). That is the case when genotype-by-sequencing (GBS) data are deployed. In contrast, datasets with more genotypes than markers are common when SNP arrays are utilized in experiments across multiple breeding programs and large populations (Allen et al. 2017; Song et al. 2017), as SNP regressions can provide more efficient computation of the genomic models. When the dataset contains a large number of genotypes and markers, dimensionality reduction (e.g. Qn|n~α) provides computational feasibility without loss in accuracy, as long as there are enough principal components to capture the genetic diversity (Pocrnic et al. 2019).

Table 4.

Scalability rating by parameterization and compatible solver.

 ParameterizationSolverNo. of genotypesNo. of markersNo. of traits
1ZZREML/BGS******
2ZZ, K [equation (20)]REML/BGS*******
3Zβ (BayesABC)BGS*****
4Uθ [equation (30)]REML/BGS********
5YΨ [equation (13)]BGS******
6Qα [equations (22) and (28)]PEGS********
7Qn~|nα [equation (27)]PEGS**********
8Zβ [equations (4) and (31)]PEGS*******
9Fλ [equation (10)]BGS*********
10F0α [equation (19)]PEGS**********
 ParameterizationSolverNo. of genotypesNo. of markersNo. of traits
1ZZREML/BGS******
2ZZ, K [equation (20)]REML/BGS*******
3Zβ (BayesABC)BGS*****
4Uθ [equation (30)]REML/BGS********
5YΨ [equation (13)]BGS******
6Qα [equations (22) and (28)]PEGS********
7Qn~|nα [equation (27)]PEGS**********
8Zβ [equations (4) and (31)]PEGS*******
9Fλ [equation (10)]BGS*********
10F0α [equation (19)]PEGS**********
Table 4.

Scalability rating by parameterization and compatible solver.

 ParameterizationSolverNo. of genotypesNo. of markersNo. of traits
1ZZREML/BGS******
2ZZ, K [equation (20)]REML/BGS*******
3Zβ (BayesABC)BGS*****
4Uθ [equation (30)]REML/BGS********
5YΨ [equation (13)]BGS******
6Qα [equations (22) and (28)]PEGS********
7Qn~|nα [equation (27)]PEGS**********
8Zβ [equations (4) and (31)]PEGS*******
9Fλ [equation (10)]BGS*********
10F0α [equation (19)]PEGS**********
 ParameterizationSolverNo. of genotypesNo. of markersNo. of traits
1ZZREML/BGS******
2ZZ, K [equation (20)]REML/BGS*******
3Zβ (BayesABC)BGS*****
4Uθ [equation (30)]REML/BGS********
5YΨ [equation (13)]BGS******
6Qα [equations (22) and (28)]PEGS********
7Qn~|nα [equation (27)]PEGS**********
8Zβ [equations (4) and (31)]PEGS*******
9Fλ [equation (10)]BGS*********
10F0α [equation (19)]PEGS**********

Technical guidelines for modeling large datasets with multiple traits have been provided by Misztal (2008). In that study, the author recommends starting by running UV analysis and subsequently progressing to MV models. At the time, the modeling of hundreds of traits has not been considered possible because the computational cost of REML methods increases n2 and k3 (Misztal 2008) with the number of genotypes (n) and traits (k), while more efficient methods, such as CT, require balanced data.

BGS is an alternative to REML (Sorensen et al. 2002), as it provides a computationally stable method to estimate variance components and regression coefficients at low memory cost. However, BGS may take a long time to run as it requires a large number of MCMC samples to provide satisfying convergence. For some Bayesian methods (Jia and Jannink 2012), variational Bayesian approaches have been proposed to avoid MCMC sampling (Hayashi and Iwata 2013).

Efficient alternatives to MCMC are available when the variance components are known a priori and only coefficients need to be inferred, those include Preconditioned Conjugate Gradient and GS (Legarra and Misztal 2008; Misztal and Legarra 2017). Only a few software implement GS as the main approach to estimate marker effects (Legarra et al. 2011). Here, we assessed the PEGS solver from Xavier and Habier (2022) as an approach to use GS while estimating variance components. While computationally efficient, the PEGS solver does not compute accuracies or confidence intervals because the inverse of the left-hand side of mixed-model equations or samples of effects from MCMC algorithms is not available.

Prediction of unobserved environments

In Table 3, a new environment was predicted using averaged predictions from previously observed environments. However, if the covariance between the training and prediction set were known, the prediction of new environments could be predicted as a linear combination of observed environments. Such covariances may be inferred from data when GxE interactions can be explained by a set of variables that are available for both training and validation datasets, such as management and environmental variables.

A schematic evolution of methods integrating genomics and environmental information is provided by Crossa et al. (2022), including crop-growth and reaction norm models. The covariance is inferred by environmental variables and, thus confined to their sample space. Alternatively, the associations between environmental variables and marker effects can be inferred in subsequent analyses. For instance, Della Coletta et al. (2023) GxE interaction networks are generated from the correlation of principal components of marker effects and principal components of environmental variables.

Post hoc modeling of the factors responsible for GxE interactions can be built from the output of unstructured models. Unlike crop-growth and reaction norm models, it does not assume that all interactions can be explained by the environmental variables available for modeling. The approach described here works by modeling covariances and, subsequently, generates predictions using conditional expectations.

Consider a scenario where a set of individuals A, observed in a set of environments X, is used to predict a new set of individuals B observed in a set of environments Z. The estimated marker effects from prediction models are based on the observed data AX. The prediction of B individuals in observed environments is given by

(44)

where G^BX is the matrix of GEBV of B individuals on X environments, ZB is the marker information for B individuals, and B^AX is the matrix of marker effects for X environments. The next step consists of projecting B individuals into Z environments. That is attained with the conditional expectation, where Z environments are predicted from a linear combination of X environments. Thus,

(45)

where ΣX is the genetic variance–covariance matrix of X environments, and ΣZX is the covariance matrix between X and Z environments. Note that ΣX is estimated from the MV model [equation (38)] that estimated the marker effects, since BAXN(0,ΣXI). Estimating σZX requires prediction if Z has not been observed.

The prediction of ΣZX can be inferred from parameters that drive GxE interactions. Let

(46)

where QX=UXDX. Now, assuming that the principal components QX can be modeled as a linear function of parameters that drive GxE interactions (WX), we obtain

(47)

where WX is the design matrix of explanatory variables, ΩX is the matrix of regression coefficients, EX is the matrix of residuals. Note that equation (47) acts as a post hoc modeling of environmental reaction norms, such that if the same set of variables is known for environments Z, principal components can be predicted using WZ. Thus,

(48)

and the covariance between environments X and Z can be inferred by

(49)

Note that equation (47) utilizes a linear model to fit the eigenstructure of the GxE covariance; however, the evaluation of nonparametric models (e.g. random forest) is encouraged when the interaction patterns are complex beyond additivity (Alves et al. 2020; Waters et al. 2023; Resende et al. 2024).

Conclusion

Scalable MV approaches increase the accuracy of GEBVs within environments compared to a UV approach. Specialized models, parameterizations, and solvers enable an increasing number of individuals, markers, and environments.

In the sparse testing simulation, XFA and MegaLMM were the most accurate methods across scenarios, and HCS when GxE was constant. In the balanced simulation where runtime and accuracy were recorded, the MV was the most accurate model up to 200 environments, then surpassed by MegaLMM in the scenario with 2,000 environments. PEGS-based models were considerably more efficient than REML-based GBLUP, where MegaSEM, SCT, and UV provided the lowest runtime. In the real data analysis, predictions from UVA and overall prediction averages from HCS and XFA were the most predictive approaches. The capability of MV and MegaSEM to capture unstructured GxE patterns did not translate in higher accuracy than models with simpler covariance structures. Future studies should consider fitting multiple trait–environment combinations.

Data availability

Soybean genomic data utilized to simulate sparse testing is available in the R package mas, also available in the R package SoyNAM and project website: https://www.soybase.org/SoyNAM/. Corn Genomes-to-Field dataset utilized for real data benchmark is available on the project website: https://www.maizegxeprediction.org/. Data and R scripts to reproduce simulations are available on GitHub: https://github.com/alenxav/GXE24.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Literature cited

Allen
AM
,
Winfield
MO
,
Burridge
AJ
,
Downie
RC
,
Benbow
HR
,
Barker
GL
,
Wilkinson
PA
,
Coghill
J
,
Waterfall
C
,
Davassi
A
, et al.
2017
.
Characterization of a Wheat Breeders’ Array suitable for high-throughput SNP genotyping of global accessions of hexaploid bread wheat (Triticum aestivum)
.
Plant Biotechnol J
.
15
(
3
):
390
401
.

Alves
RS
,
de Resende
MDV
,
Azevedo
CF
,
Silva
FFe
,
Rocha
JRASC
,
Nunes
ACP
,
Carneiro
APS
,
dos Santos
GA
.
2020
.
Optimization of Eucalyptus breeding through random regression models allowing for reaction norms in response to environmental gradients
.
Tree Genet Genomes
.
16
(
1
):
1
8
.

Bermann
M
,
Lourenco
D
,
Forneris
NS
,
Legarra
A
,
Misztal
I
.
2022
.
On the equivalence between marker effect models and breeding value models and direct genomic values with the algorithm for proven and young
.
Genet Sel Evol
.
54
(
1
):
52
.

Bustos-Korts
D
,
Malosetti
M
,
Chapman
S
,
van Eeuwijk
F
.
2016
. Modelling of genotype by environment interaction and prediction of complex traits across multiple environments as a synthesis of crop growth modelling, genetics and statistics, In:
Crop Systems Biology: Narrowing the Gaps between Crop Modelling and Genetics
.
Springer
. p.
55
82
.

Crossa
J
,
Fritsche-Neto
R
,
Montesinos-Lopez
OA
,
Costa-Neto
G
,
Dreisigacker
S
,
Montesinos-Lopez
A
,
Bentley
AR
.
2021
.
The modern plant breeding triangle: optimizing the use of genomics, phenomics, and enviromics data
.
Front Plant Sci
.
12
:
651480
.

Crossa
J
,
Montesinos-Lopez
OA
,
Pérez-Rodríguez
P
,
Costa-Neto
G
,
Fritsche-Neto
R
,
Ortiz
R
,
Martini
JW
,
Lillemo
M
,
Montesinos-Lopez
A
,
Jarquin
D
, et al.
2022
. Genome and environment based prediction models and methods of complex traits incorporating genotype × environment interaction. In:
Genomic Prediction of Complex Traits: Methods and Protocols
.
Springer
. p.
245
283
.

Cuevas
J
,
Crossa
J
,
Soberanis
V
,
Pérez-Elizalde
S
,
Pérez-Rodríguez
P
,
Campos
G
,
Montesinos-López
O
,
Burgueño
J
.
2016
.
Genomic prediction of genotype × environment interaction kernel regression models
.
Plant Genome
.
9
(
3
).

Della Coletta
R
,
Liese
SE
,
Fernandes
SB
,
Mikel
MA
,
Bohn
MO
,
Lipka
AE
,
Hirsch
CN
.
2023
.
Linking genetic and environmental factors through marker effect networks to understand trait plasticity
.
Genetics
.
224
(
4
):
iyad103
.

de Los Campos
G
,
Gianola
D
,
Rosa
GJ
,
Weigel
KA
,
Crossa
J
.
2010
.
Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods
.
Genet Res (Camb)
.
92
(
4
):
295
308
.

de Los Campos
G
,
Hickey
JM
,
Pong-Wong
R
,
Daetwyler
HD
,
Calus
MP
.
2013
.
Whole-genome regression and prediction methods applied to plant and animal breeding
.
Genetics
.
193
(
2
):
327
345
.

Diers
BW
,
Specht
J
,
Rainey
KM
,
Cregan
P
,
Song
Q
,
Ramasubramanian
V
,
Graef
G
,
Nelson
R
,
Schapaugh
W
,
Wang
D
, et al.
2018
.
Genetic architecture of soybean yield and agronomic traits
.
G3 (Bethesda)
.
8
:
3367
3375
.

Elias
AA
,
Robbins
KR
,
Doerge
R
,
Tuinstra
MR
.
2016
.
Half a century of studying genotype × environment interactions in plant breeding experiments
.
Crop Sci
.
56
:
2090
2105
.

Falconer
DS
.
1952
.
The problem of environment and selection
.
Am Nat
.
86
:
293
298
.

Falconer
DS
,
Mackay
TF
.
1983
.
Quantitative Genetics
.
Longman
.

Gianola
D
,
Sorensen
D
.
2004
.
Quantitative genetic models for describing simultaneous and recursive relationships between phenotypes
.
Genetics
.
167
:
1407
1424
.

Gilmour
A
,
Butler
D
,
Cullis
B
,
Gogel
B
,
Thompson
R
.
2017
.
Asreml-r reference manual version 4.
VSN International Ltd, Hemel Hempstead, HP1 1ES, UK
.

Habier
D
,
Fernando
R
,
Dekkers
JC
.
2007
.
The impact of genetic relationship information on genome-assisted breeding values
.
Genetics
.
177
:
2389
2397
. doi:

Habier
D
,
Fernando
RL
,
Garrick
DJ
.
2013
.
Genomic blup decoded: a look into the black box of genomic prediction
.
Genetics
.
194
:
597
607
. doi:

Hardner
C
.
2017
.
Exploring opportunities for reducing complexity of genotype-by-environment interaction models
.
Euphytica
.
213
:
248
. doi:

Hayashi
T
,
Iwata
H
.
2013
.
A Bayesian method and its variational approximation for prediction of genomic breeding values in multiple traits
.
BMC Bioinformatics
.
14
:
1
14
.

Hayes
J
,
Hill
W
.
1981
.
Modification of estimates of parameters in the construction of genetic selection indices (‘bending’)
.
Biometrics
.
37
:
483
493
.

Heslot
N
,
Akdemir
D
,
Sorrells
ME
,
Jannink
JL
.
2014
.
Integrating environmental covariates and crop modeling into the genomic selection framework to predict genotype by environment interactions
.
Theor Appl Genet
.
127
(
2
):
463
480
.

Jarquín
D
,
Crossa
J
,
Lacaze
X
,
Du Cheyron
P
,
Daucourt
J
,
Lorgeou
J
,
Piraux
F
,
Guerreiro
L
,
Pérez
P
,
Calus
M
, et al.
2014
.
A reaction norm model for genomic selection using high-dimensional genomic and environmental data
.
Theor Appl Genet
.
127
(
3
):
595
607
.

Jia
Y
,
Jannink
JL
.
2012
.
Multiple-trait genomic selection methods increase genetic value prediction accuracy
.
Genetics
.
192
(
4
):
1513
1522
.

Konstantinov
K
,
Erasmus
G
.
1993
.
Using transformation algorithms to estimate (co) variance components by REML in models with equal design matrices
.
S Afr J Anim Sci
.
23
:
187
191
.

Legarra
A
,
Misztal
I
.
2008
.
Computing strategies in genome-wide selection
.
J Dairy Sci
.
91
(
1
):
360
366
.

Legarra
A
,
Ricard
A
,
Filangi
O
.
2011
.
Gs3: Genomic Selection, Gibbs Sampling, Gauss-Seidel (and Bayescπ)
.
INRA
.

Ma
A
,
Needell
D
,
Ramdas
A
.
2015
.
Convergence properties of the randomized extended Gauss–Seidel and Kaczmarz methods
.
SIAM J Matrix Anal Appl
.
36
(
4
):
1590
1604
.

Malosetti
M
,
Ribaut
JM
,
van Eeuwijk
FA
.
2013
.
The statistical analysis of multi-environment data: modeling genotype-by-environment interaction and its genetic basis
.
Front Physiol
.
4
:
37433
.

Martini
JW
,
Crossa
J
,
Toledo
FH
,
Cuevas
J
.
2020
.
On Hadamard and Kronecker products in covariance structures for genotype × environment interaction
.
Plant Genome
.
13
:
e20033
.

Meuwissen
THE
,
Hayes
BJ
,
Goddard
ME
.
2001
.
Prediction of total genetic value using genome-wide dense marker maps
.
Genetics
.
157
:
1819
1829
.

Meyer
K
.
1985
.
Maximum likelihood estimation of variance components for a multivariate mixed model with equal design matrices
.
Biometrics
.
41
:
153
165
.

Meyer
K
.
2009a
.
Factor-analytic models for genotype × environment type problems and structured covariance matrices
.
Genet Sel Evol
.
41
:
1
11
.

Meyer
K
.
2009b
.
Factor-analytic models for genotype × environment type problems and structured covariance matrices
.
Genet Sel Evol
.
41
(
1
):
1
11
.

Meyer
K
.
2019
.
“Bending” and beyond: better estimates of quantitative genetic parameters?
J Anim Breed Genet
.
136
:
243
251
.

Misztal
I
.
2008
.
Reliable computing in estimation of variance components
.
J Anim Breed Genet
.
125
:
363
370
.

Misztal
I
,
Legarra
A
.
2017
.
Invited review: efficient computation strategies in genomic selection
.
Animal
.
11
:
731
736
.

Möhring
J
,
Piepho
HP
.
2009
.
Comparison of weighting in two-stage analysis of plant breeding trials
.
Crop Sci
.
49
:
1977
1988
.

Montesinos-López
A
,
Montesinos-López
OA
,
Montesinos-López
JC
,
Flores-Cortes
CA
,
de la Rosa
R
,
Crossa
J
.
2021
.
A guide for kernel generalized regression methods for genomic-enabled prediction
.
Heredity (Edinb)
.
126
:
577
596
.

Ødegård
J
,
Indahl
U
,
Strandén
I
,
Meuwissen
TH
.
2018
.
Large-scale genomic prediction using singular value decomposition of the genotype matrix
.
Genet Sel Evol
.
50
:
1
12
.

Piepho
HP
,
Möhring
J
,
Schulz-Streeck
T
,
Ogutu
JO
.
2012
.
A stage-wise approach for the analysis of multi-environment trials
.
Biom J
.
54
:
844
860
.

Pocrnic
I
,
Lourenco
DA
,
Masuda
Y
,
Misztal
I
.
2016
.
Dimensionality of genomic information and performance of the algorithm for proven and young for different livestock species
.
Genet Sel Evol
.
48
:
1
9
.

Pocrnic
I
,
Lourenco
DA
,
Masuda
Y
,
Misztal
I
.
2019
.
Accuracy of genomic blup when considering a genomic relationship matrix based on the number of the largest eigenvalues: a simulation study
.
Genet Sel Evol
.
51
:
1
10
.

Resende
RT
,
Xavier
A
,
Silva
PIT
,
Resende
MP
,
Jarquin
D
,
Marcatti
GE
.
2024
.
Gis-based G × E modeling of maize hybrids through enviromic markers engineering
.
New Phytologist
.
1
.

Runcie
DE
,
Qu
J
,
Cheng
H
,
Crawford
L
.
2021
.
MegaLMM: mega-scale linear mixed models for genomic predictions with thousands of traits
.
Genome Biol
.
22
(
1
):
1
25
.

Schaeffer
L
.
1986
.
Pseudo expectation approach to variance component estimation
.
J Dairy Sci
.
69
(
11
):
2884
2889
.

Song
Q
,
Yan
L
,
Quigley
C
,
Jordan
BD
,
Fickus
E
,
Schroeder
S
,
Song
BH
,
Charles An
YQ
,
Hyten
D
,
Nelson
R
, et al.
2017
.
Genetic characterization of the soybean nested association mapping population
.
Plant Genome
.
10
(
2
):
1
14
.

Sorensen
D
,
Gianola
D
,
Gianola
D
.
2002
.
Likelihood, Bayesian and MCMC Methods in Quantitative Genetics
.
Springer
.

Strandén
I
,
Garrick
D
.
2009
.
Derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit
.
J Dairy Sci
.
92
(
6
):
2971
2975
.

Thompson
R
,
Cullis
B
,
Smith
A
,
Gilmour
A
.
2003
.
A sparse implementation of the average information algorithm for factor analytic and reduced rank variance models
.
Aust N Z J Stat
.
45
(
4
):
445
459
.

Thompson
E
,
Shaw
R
.
1990
.
Pedigree analysis for quantitative traits: variance components without matrix inversion
.
Biometrics
.
46
(
2
):
399
413
.

Valente
BD
,
Rosa
GJ
,
Gianola
D
,
Wu
XL
,
Weigel
K
.
2013
.
Is structural equation modeling advantageous for the genetic improvement of multiple traits?
Genetics
.
194
:
561
572
.

VanRaden
P
,
Jung
Y
.
1988
.
A general purpose approximation to restricted maximum likelihood: the tilde-hat approach
.
J Dairy Sci
.
71
:
187
194
.

Waters
DL
,
van der Werf
JH
,
Robinson
H
,
Hickey
LT
,
Clark
SA
.
2023
.
Partitioning the forms of genotype-by-environment interaction in the reaction norm analysis of stability
.
Theor Appl Genet
.
136
:
99
.

Xavier
A
,
Habier
D
.
2022
.
A new approach fits multivariate genomic prediction models efficiently
.
Genet Sel Evol
.
54
:
1
15
.

Xavier
A
,
Jarquin
D
,
Howard
R
,
Ramasubramanian
V
,
Specht
JE
,
Graef
GL
,
Beavis
WD
,
Diers
BW
,
Song
Q
,
Cregan
PB
, et al.
2018
.
Genome-wide analysis of grain yield stability and environmental interactions in a multiparental soybean population
.
G3 (Bethesda)
.
8
:
519
529
.

Xavier
A
,
Muir
W
,
Rainey
K
.
2019
.
bWGR: Bayesian whole-genome regression
.
Bioinformatics
.
36
(
6
):
1957
1959
.

Xu
S
.
2003
.
Theoretical basis of the Beavis effect
.
Genetics
.
165
:
2259
2268
. doi:

Zhou
X
,
Stephens
M
.
2012
.
Genome-wide efficient mixed-model analysis for association studies
.
Nat Genet
.
44
:
821
824
.

Zhou
X
,
Stephens
M
.
2014
.
Efficient multivariate linear mixed-model algorithms for genome-wide association studies
.
Nat Methods
.
11
(
4
):
407
409
.

Appendix A: Canonical transformation

Under CT, the matrix of phenotypes is converted into a series of orthogonal canonical traits (F) as

(A1 )

where M has the following properties:

(A2 )

The matrix M is obtained using EVD by first decomposing Σe0=UeDe2Ue, then building T=ReΣβ0Re where Re=UeDe1, subsequently decomposing T=UTDT2UT, resulting on

(A3 )

The covariances are converted back to the original scale with

(A4 )

where covariances in transformed scale are notates as ΣβF,ΣeF and the starting values of each iteration as Σβ0,Σe0. CT iterates on (1) transforming the phenotypes; (2) solving the canonical trait model; and (3) transforming covariances back to the phenotypic. These steps (1–3) are repeated up to convergence (ΣΣ0).

Appendix B: Connection between MegaLMM and SCT

A relation between the latent spaces from MegaLMM with the SCT can be drawn by defining the variance not capture latent spaces J in terms of leftover rotation, as J=FH and V=Λ+H, such that

(B1 )

making the decomposition from equation (9) feasible and parsimonious approximation of FV when it is not possible to compute the exact rotation via SVD due to the presence of missing value in the phenotypic matrix.

Appendix C: Avoiding inversions with PEGS

The nonvector operations involved in the multivariate PEGS are (1) the inversion of Σ^β and (2) solving the marker effects [equation (40)]. The former issue can be mitigated by multiplying both sides of the equation by Σ^β, as proposed by Strandén and Garrick (2009). Thus,

(C1 )

For the latter issue, one may address the inversion of the left-hand side by running an inner GS, for the single site update of coefficients. Since the algorithm convergence is computed across coefficients for all marker–environment combinations, a single iteration of inner GS per marker update suffices. Let

(C2 )

aiming to solve Cβ=r. The single site update of the jth marker effect, at kth trait consists of

(C3 )

Appendix D: Sparse inversion of kernels

Kernel-based models [equation (20)] are commonly solved through the inversion of K. When the number of genotyped individuals is large, the algorithm for proven and young (APY) provides a sparse representation of the inverse genomic relationship matrices (Pocrnic et al. 2016; Bermann et al. 2022). Under APY, dense inversion is performed only on a subset of the relationship matrix containing a representative sample of individuals referred to as the core set. Thus,

(D1 )

where c and n describe the core and noncore set of genotypes, respectively, P=KncKcc1, and M is a diagonal matrix with the element-wise Schur complement, as mii={kiikicKcc1kci}.

Appendix E: Sample code

An example of how data was simulated and how PEGS-based models were fitted in R is provided in Fig. E1. It simulated 500 F2 individuals with 10 chromosomes of 100 SNPs. It fits MegaSEM, MV, XFA, and HCS. Last, it computed the average accuracy within the environment for one of the models as the correlations between estimated and true breeding values.

Code to simulate data and fit efficient multivariate models in R using the bWGR package (version 2.2.9).
Fig. E1.

Code to simulate data and fit efficient multivariate models in R using the bWGR package (version 2.2.9).

Author notes

Conflicts of interest: The author(s) declare no conflicts of interest.

This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial reproduction and distribution of the work, in any medium, provided the original work is not altered or transformed in any way, and that the work is properly cited. For commercial re-use, please contact [email protected] for reprints and translation rights for reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site—for further information please contact [email protected].
Editor: M Calus
M Calus
Editor
Search for other works by this author on: