A New Comparative Framework for Estimating Selection on Synonymous Substitutions

Abstract

Selection on synonymous codon usage is a well-known and widespread phenomenon, yet existing models often do not account for it or its effect on synonymous substitution rates. In this article, we develop and expand the capabilities of multiclass synonymous substitution (MSS) models, which account for such selection by partitioning synonymous substitutions into 2 or more classes and estimating a relative substitution rate for each class, while accounting for important confounders like mutation bias. We identify extensive heterogeneity among relative synonymous substitution rates in an empirical dataset of ∼12,000 gene alignments from 12 Drosophila species. We validate model performance using data simulated under a forward population genetic simulation, demonstrating that MSS models are robust to model misspecification. MSS rates are significantly correlated with other covariates of selection on codon usage (population-level polymorphism data and tRNA abundance data), suggesting that models can detect weak signatures of selection on codon usage. With the MSS model, we can now study selection on synonymous substitutions in diverse taxa, independent of any a priori assumptions about the forces driving that selection.

molecular evolution, codon usage bias, synonymous substitutions, codon substitution models

Introduction

A fundamental goal of evolutionary biology is to identify drivers of natural selection acting on genes and genomes. Synonymous or silent substitutions shape codon usage across the genome, and although they were once widely regarded as neutral, the field is gaining an increasing appreciation that such a view is unjustifiably simplistic. Fitness considerations underlying selective forces acting on synonymous changes include translational efficiency (tRNA abundances), successful mRNA folding, and maintenance of CpG sites for methylation (Ikemura 1981; Kanaya et al. 2001; Trotta 2013). Selection on codon usage—a compelling indicator of selection—has been long recognized as widespread in Drosophila (Marais et al. 2004), Enterobacteriaceae (Sharp and Li 1987a), and other taxonomic groups (Marais and Duret 2001; Trotta 2013). Synonymous substitution rates, corrected for nucleotide rate biases and codon frequencies, should not, therefore, be expected to be the same for all codon pairs. This would be conceptually similar to expecting the F81 (Felsenstein 1981) nucleotide substitution model to be biologically adequate. Instead, rates should be allowed to vary among synonymous codon pairs, and be partially informed by key biological traits like gene expression. Such correlations can inform researchers about the relative importance of underlying mechanisms driving selection on synonymous codon usage.

Multiple classic studies conducted in the pregenomic era found evidence for purifying selection on codon usage as a way to maintain translational efficiency and accuracy in highly expressed genes (Ikemura 1981; Moriyama and Powell 1997; Iriarte et al. 2013). The link between codon usage bias and gene expression is well-studied and supported, but does not (nor should it) fully explain selection on synonymous codon usage. For example, Goetz and Fuglsang (2005) found that while gene transcript abundance and various measures of codon usage bias in Escherichia coli were correlated, the largest correlation (with Sharp and Li’s (1987b) codon adaptation index) was only ∼0.43. Clearly, studies that consider how selection for transcriptional/translational efficiency shapes codon usage bias cannot account for other selective pressures on synonymous codon usage. Synonymous substitution rates, corrected for nucleotide rate biases and codon frequencies, should not, therefore, be expected to be the same for all codon pairs. This would be conceptually similar to expecting the F81 (Felsenstein 1981) nucleotide substitution model to be biologically adequate. Instead, rates should be allowed to vary among synonymous codon pairs, and be partially informed by key biological traits like gene expression. Such correlations can inform researchers about the relative importance of underlying mechanisms driving selection on synonymous codon usage.

Kosiol et al. (2007) took a comparative genomic approach to studying how substitution rates vary among synonymous codon pairs. The authors directly estimated substitution rates for all exchangeable codon pairs from a large dataset of protein-coding gene sequences, and incorporated the numerically estimated rates into an empirical codon evolutionary model. Their model is advantageous in that they directly estimate synonymous substitution rates from input data without assuming an underlying selective pressure, allowing one to explore different biological correlates with substitution rate. However, large training datasets of protein-coding genes may be difficult and time-consuming to obtain for some taxa, and could hinder similar studies in other organisms. The model is also difficult to generalize across genes and across taxa because it aggregates substitution rates across both.

More recently, Rahman et al. (2021) explored selection on synonymous substitutions in Enterobacteriaceae. In this dataset, alanine and valine were the 2 amino acids whose codon usage bias was least correlated with expression, and thus were assumed to be least likely to be affected by selection on translational efficiency. Defining a multiple synonymous substitution (MSS) codon model which treated synonymous substitutions between A and V codons (both 4-fold degenerate) as neutral, and all others as selected, Rahman et al. were able to show that a weak purifying selection operates on other synonymous substitutions, conditioned on the externally defined set of “neutral” synonymous substitutions. However, it remains unclear how the choice of this neutral set affects the model and inferences, and whether or not all genes are equally agreeable to it.

Here, we expand the class of data-driven MSS evolutionary models. An MSS model is a codon-substitution model that includes 2 or more substitution rate parameters for synonymous codon pairs, in addition to differentiating synonymous and nonsynonymous substitutions. Rates are estimated directly from sequence alignments using phylogenetic maximum likelihood. The most complex MSS model, called SynREVCodon, partitions synonymous substitutions into 67 classes—1 for each pair of synonymous codons reachable by a single nucleotide substitution, assuming the universal genetic code—and estimates a relative substitution rate for each class. Like other codon models, MSS accounts for important confounders, such as phylogenetic relatedness, nucleotide substitution biases, and unequal codon frequencies. MSS enables a rigorous statistical comparison of synonymous substitution rates within a gene or across a set of genes. We define and apply MSS models of different scope and complexity to a collection of ∼12,000 gene alignments from 12 Drosophila species to investigate the extent and correlates of rate variation among synonymous substitutions. We use simulations, population-level data, and tRNA abundance as extrinsic covariates to validate model performance.

Materials and Methods

Multiclass Synonymous Substitution Models

Multiclass synonymous substitution (MSS) models extend the standard Muse-Gaut (MG94) class of codon substitution models (Muse and Gaut 1994), by allowing synonymous substitutions to occur at different relative rates. The instantaneous rate of substitution from 1 sense (x) codon to another (y), $x \to y$ is defined in Table 1.

Table 1

Open in new tab

Instantaneous substitution rates from one sense codon (x) to another (y)

One nucleotide?	Synonymous?	Substitution rate	Example
Yes	Yes	$α (x, y) θ (x, y) π (x, y)$	$A G \underline{C} \to A G \underline{T}$ ⁠: $α_{A G C : A G T} θ_{C T} π_{T}^{3}$
Yes	No	$β (x, y) θ (x, y) π (x, y)$	$A \underline{G} C \to A \underline{C} C$ ⁠: $β θ_{C G} π_{C}^{2}$
No	*	0	$\underline{T C} C \to \underline{A G} C : 0$

One nucleotide?	Synonymous?	Substitution rate	Example
Yes	Yes	$α (x, y) θ (x, y) π (x, y)$	$A G \underline{C} \to A G \underline{T}$ ⁠: $α_{A G C : A G T} θ_{C T} π_{T}^{3}$
Yes	No	$β (x, y) θ (x, y) π (x, y)$	$A \underline{G} C \to A \underline{C} C$ ⁠: $β θ_{C G} π_{C}^{2}$
No	*	0	$\underline{T C} C \to \underline{A G} C : 0$

Table 1

Open in new tab

Instantaneous substitution rates from one sense codon (x) to another (y)

One nucleotide?	Synonymous?	Substitution rate	Example
Yes	Yes	$α (x, y) θ (x, y) π (x, y)$	$A G \underline{C} \to A G \underline{T}$ ⁠: $α_{A G C : A G T} θ_{C T} π_{T}^{3}$
Yes	No	$β (x, y) θ (x, y) π (x, y)$	$A \underline{G} C \to A \underline{C} C$ ⁠: $β θ_{C G} π_{C}^{2}$
No	*	0	$\underline{T C} C \to \underline{A G} C : 0$

One nucleotide?	Synonymous?	Substitution rate	Example
Yes	Yes	$α (x, y) θ (x, y) π (x, y)$	$A G \underline{C} \to A G \underline{T}$ ⁠: $α_{A G C : A G T} θ_{C T} π_{T}^{3}$
Yes	No	$β (x, y) θ (x, y) π (x, y)$	$A \underline{G} C \to A \underline{C} C$ ⁠: $β θ_{C G} π_{C}^{2}$
No	*	0	$\underline{T C} C \to \underline{A G} C : 0$

We follow the standard assumption that multinucleotide substitutions must be realized via a sequence of single-step substitutions, although this assumption can be relaxed (Lucaci et al. 2021). $θ (x, y)$ encodes nucleotide substitutional biases, and follows the MG94xREV convention (Wisotsky et al. 2020), with $θ (x, y) = θ_{n m}$ ⁠, where n and m are the 2 nucleotides being exchanged. Because of time-reversibility, $θ (x, y) = θ (y, x)$ ⁠, and because of identifiability constraints (only products of rates and times can be inferred), we assert $θ_{A G} = 1$ ⁠. In the Muse-Gaut framework, $π (x, y)$ is the estimated equilibrium frequency of the target nucleotide of the substitution; we use the corrected empirical F3 × 4 (CF3 × 4) estimator (Pond et al. 2010). MSS models differ from standard codon-substitution models in that they allow synonymous substitution rates, $α (x, y)$ ⁠, to depend on which codons are being exchanged. We consider several parameterizations of this function and the corresponding models (Table 2).

Table 2

Open in new tab

The way that different MSS models model variation among synonymous substitution rates, $α (x, y)$

Model	Functional form	Parameters in addition to basic MG94
Standard codon	$α (x, y) := 1$	0
A priori	$α (x, y) = α_{i}, i = 1.. K$	K−1, 1 < K < N, where N is the number of amino acids that have synonymous codons that can be exchanged by a single nucleotide substitution, e.g. K = 2 in Rahman et al.
Genetic algorithm (GA)	$α (x, y) = α_{⟨ i ⟩}, i = 1.. K$	K−1, 1 < K < N. Same as above, except the allocation of specific (x,y) pairs to the ith rate class is not fixed a priori but inferred using a heuristic discrete space search.
SynREV	$α (x, y) = α_{A}$ A is the amino acid encoded by x and y	N−1 (18 for the Universal genetic code)
SynREVCodon	$α (x, y) = α_{x y} = α_{y x}$	P−1, where P is the number of synonymous codon pairs within one nucleotide substitution of one another (67 for the Universal genetic code)

Model	Functional form	Parameters in addition to basic MG94
Standard codon	$α (x, y) := 1$	0
A priori	$α (x, y) = α_{i}, i = 1.. K$	K−1, 1 < K < N, where N is the number of amino acids that have synonymous codons that can be exchanged by a single nucleotide substitution, e.g. K = 2 in Rahman et al.
Genetic algorithm (GA)	$α (x, y) = α_{⟨ i ⟩}, i = 1.. K$	K−1, 1 < K < N. Same as above, except the allocation of specific (x,y) pairs to the ith rate class is not fixed a priori but inferred using a heuristic discrete space search.
SynREV	$α (x, y) = α_{A}$ A is the amino acid encoded by x and y	N−1 (18 for the Universal genetic code)
SynREVCodon	$α (x, y) = α_{x y} = α_{y x}$	P−1, where P is the number of synonymous codon pairs within one nucleotide substitution of one another (67 for the Universal genetic code)

Table 2

Open in new tab

The way that different MSS models model variation among synonymous substitution rates, $α (x, y)$

Model	Functional form	Parameters in addition to basic MG94
Standard codon	$α (x, y) := 1$	0
A priori	$α (x, y) = α_{i}, i = 1.. K$	K−1, 1 < K < N, where N is the number of amino acids that have synonymous codons that can be exchanged by a single nucleotide substitution, e.g. K = 2 in Rahman et al.
Genetic algorithm (GA)	$α (x, y) = α_{⟨ i ⟩}, i = 1.. K$	K−1, 1 < K < N. Same as above, except the allocation of specific (x,y) pairs to the ith rate class is not fixed a priori but inferred using a heuristic discrete space search.
SynREV	$α (x, y) = α_{A}$ A is the amino acid encoded by x and y	N−1 (18 for the Universal genetic code)
SynREVCodon	$α (x, y) = α_{x y} = α_{y x}$	P−1, where P is the number of synonymous codon pairs within one nucleotide substitution of one another (67 for the Universal genetic code)

Model	Functional form	Parameters in addition to basic MG94
Standard codon	$α (x, y) := 1$	0
A priori	$α (x, y) = α_{i}, i = 1.. K$	K−1, 1 < K < N, where N is the number of amino acids that have synonymous codons that can be exchanged by a single nucleotide substitution, e.g. K = 2 in Rahman et al.
Genetic algorithm (GA)	$α (x, y) = α_{⟨ i ⟩}, i = 1.. K$	K−1, 1 < K < N. Same as above, except the allocation of specific (x,y) pairs to the ith rate class is not fixed a priori but inferred using a heuristic discrete space search.
SynREV	$α (x, y) = α_{A}$ A is the amino acid encoded by x and y	N−1 (18 for the Universal genetic code)
SynREVCodon	$α (x, y) = α_{x y} = α_{y x}$	P−1, where P is the number of synonymous codon pairs within one nucleotide substitution of one another (67 for the Universal genetic code)

For the a priori models, one of the substitution rates or groups of rates is designated beforehand as the reference/neutral rate (⁠ $α_{0}$ ⁠) and constrained to 1, both to assist with interpretation and to ensure identifiability. For SynREV and SynREVCodon, the mean of all $α (x, y)$ is constrained to 1. In this paper, the nonsynonymous rate component is parameterized as $β (x, y) : = β$ ⁠, where β is the shared inferred nonsynonymous substitution rate that does not depend on the codons. Because $α (x, y)$ is constrained to have mean 1, β can be interpreted as the standard dN/dS (or ω) ratio. In addition to gene-wide model parameters, a separate branch length parameter is included for every branch in the phylogenetic tree. Unless stated otherwise, all model parameters are estimated by maximum likelihood, using direct optimization, as implemented in the HyPhy package (v2.5.45-v.2.5.62) (Kosakovsky Pond et al. 2020). We further implemented a “joint” model fitter in HyPhy, which estimates a single set of MSS parameters from G > 1 genes by maximum likelihood. The joint model optimizes α parameters (shared by all genes) for the MSS model by defining and maximizing the likelihood function on multiple genes, while holding gene-specific estimates of “nuisance” parameters, such as equilibrium codon frequencies, relative branch lengths (each gene receives a single multiplicative branch length scaler for the joint phase), nucleotide substitution biases, and ω at values obtained by fitting the standard MG94 model to each gene separately. Using gene level, “plug-in” estimates dramatically lowers estimation complexity and run times.

Forward Population Genetic Simulations

To test that the method accurately identifies codon pairs under selection, a forward Wright–Fisher population genetic simulator was written (Pivirotto 2024). To begin, for each simulation an ancestral gene sequence of 300 codons was generated with codon frequencies based on a gene alignment randomly selected from the 1,613 alignments used by Rahman et al. (2021). The ancestral gene sequence was copied to a population size of N = 500 diploid individuals to found an ancestral population. The population was allowed to evolve to mutation-selection-drift equilibrium with 2 burn-in periods, with each generation including mutation and selection (without dominance) following a standard Wright–Fisher model (Ewens 2004). The first burn-in period proceeded for 200 × N generations, a value that was determined empirically to be more than sufficient to reach a point when population mean fitness was stable for a wide range of mutation and selection parameters. The second burn-in period proceeded until all gene copies had a single common ancestor in the generation when the first burn-in ended. Following the burn-in period, the simulation continued for a preset number of generations with 10 population splitting events at set time points (supplementary fig. S1, Supplementary Material online). Finally at the end of the simulation, an alignment was created by sampling a single, random gene copy from each of the 11 populations. Nonsynonymous mutations were all given a deleterious selection coefficient of 2Ns = −10 (i.e. $s = - 0.01$ ⁠). The total depth and length of the tree was 10,000 and 64,000 generations, respectively. The mutation rate per base position was $1.6 \times 10^{- 5}$ per generation, which yielded ∼1,850,000 neutral synonymous mutations and 230 neutral synonymous substitutions over the entire tree, for about 1.7 substitutions per effective synonymous site.

Synonymous selection was modeled by partitioning the codons for each amino acid into 1 or 2 neutral classes. In cases of 1 neutral class, codons within the neutral class were all neutral with respect to each other. In cases of 2 neutral classes, the codons in 1 class, selected arbitrarily, were favored over those in the other class, and changes within each class were treated as neutral. Mutations in the favored direction had a selection coefficient of s, while those in the reverse direction had a selection coefficient of −s. Sets of 100 simulated alignments were generated for values of 2Ns: 0, 0.5, 1.0, 1.5, 2.0, and 5.0. The assignment, selected or neutral, for each synonymous codon pair in each set of 100 genes, was inferred using a genetic algorithm search procedure implemented in HyPhy (see below). The data-generating (true) model included 25 selected codon pairs and 42 neutral pairs (supplementary table S1, Supplementary Material online).

A Genetic Algorithm for Learning Synonymous Substitutions From Sequence Alignments

Given a collection of S ≥ 1 training alignments, and the desired number of synonymous rate classes (1 < M < N) we can infer (learn) how to partition all synonymous substitutions into M groups to maximize model fit. Because such an optimization involves exploring a large discrete space of models without natural ordering, we applied the CHC heuristic aggressive genetic algorithm approach (Eshelman 1991), which has shown to work well for evolutionary model selection (Kosakovsky Pond et al. 2007; Delport et al. 2010a), and discrete optimization (Kosakovsky Pond et al. 2006a, 2006b) previously. An evolutionary model is uniquely encoded by a D-dimensional vector (D = number of unique synonymous rates), where entry i is an integer in [0, M), indicating that rate i belongs to the corresponding rate class. To avoid model specification redundancy, we enforce class-sortedness, i.e. among all vectors that encode the same partition of rate classes, we choose the one, where the first occurrence of class 0 precedes the first occurrence of class 1, etc. For example, (0,1,1,0,1) and (1,0,0,1,0) encode the same partition, but the former is selected as the “sorted” representative. The logic of the MSS-CHC algorithm is as follows, and key model parameters are enumerated in Table 3.

Step 1. Fit the standard MG94-REV model to each of the S input alignments to obtain branch lengths and the BIC score for the baseline or null model (all synonymous rates are equal). For computational efficiency, we fix branch lengths and nucleotide rates θ at their maximum likelihood estimates (MLEs) for the rest of the algorithm.
Step 2. Initialize P individuals (models) randomly, i.e. generate P unique D-dimensional vectors encoding MSS models. For each model, m, evaluate its fitness $f (m)$ ⁠. This fitness is defined as the Bayesian information criterion (BIC) score for the fit of the model to all S training alignments. The sample size, needed for BIC computation, is defined as the total number of characters (sequences × sites) in all alignments. Record the best current fitness F score (lowest BIC), and the corresponding most fit individual Fm
Step 3. Create the next generation through recombination and elitist selection. To do so, generate P offspring individuals, by randomly selecting 2 parents from the current generation, and performing free recombination on them: each entry of the child vector has a 50-50 probability of coming from either parent. The parent and offspring populations are combined and top P most fit individuals are retained.
Step 4. Decide if the population is too inbred and it needs to be mutagenized. This will occur if at least 1 of the following 2 conditions is met. First, the relative fitness range (relative difference between the most and the least fit model) is too low (⁠ $f_{rel}$ ⁠). Second, the number of consecutive generations during which Step 3 failed to introduce any new models is too high $f_{stagnant}$ ⁠. To mutagenize the population, each individual, except for the most fit one, is randomly mutated with probability μ of changing each element of the model vector.
Step 5. Check the termination criterion: does the number of consecutive generations without a sufficient (⁠ $δ_{BIC}$ ⁠) BIC improvement exceed $G_{conv}$ ? If the termination criterion is met exit, otherwise go to Step 3.

Table 3

Open in new tab

Key parameters of the CHC genetic algorithm for model selection

MSS-CHC parameter	Description	Value used
P	Population size or the number of models considered at each step.	32
$f_{rel}$	Minimum relative diversity needed to trigger mutation.	$10^{- 6}$
$f_{stagnant}$	The number of consecutive generations without new models entering the population	25
μ	Mutation rate	0.20
$δ_{BIC}$	Minimum generation-to-generation BIC improvement to reset “no improvement” termination criterion.	0.01
$G_{conv}$	The number of consecutive generations without BIC improvement to terminate the algorithm.	100

MSS-CHC parameter	Description	Value used
P	Population size or the number of models considered at each step.	32
$f_{rel}$	Minimum relative diversity needed to trigger mutation.	$10^{- 6}$
$f_{stagnant}$	The number of consecutive generations without new models entering the population	25
μ	Mutation rate	0.20
$δ_{BIC}$	Minimum generation-to-generation BIC improvement to reset “no improvement” termination criterion.	0.01
$G_{conv}$	The number of consecutive generations without BIC improvement to terminate the algorithm.	100

Table 3

Open in new tab

Key parameters of the CHC genetic algorithm for model selection

MSS-CHC parameter	Description	Value used
P	Population size or the number of models considered at each step.	32
$f_{rel}$	Minimum relative diversity needed to trigger mutation.	$10^{- 6}$
$f_{stagnant}$	The number of consecutive generations without new models entering the population	25
μ	Mutation rate	0.20
$δ_{BIC}$	Minimum generation-to-generation BIC improvement to reset “no improvement” termination criterion.	0.01
$G_{conv}$	The number of consecutive generations without BIC improvement to terminate the algorithm.	100

MSS-CHC parameter	Description	Value used
P	Population size or the number of models considered at each step.	32
$f_{rel}$	Minimum relative diversity needed to trigger mutation.	$10^{- 6}$
$f_{stagnant}$	The number of consecutive generations without new models entering the population	25
μ	Mutation rate	0.20
$δ_{BIC}$	Minimum generation-to-generation BIC improvement to reset “no improvement” termination criterion.	0.01
$G_{conv}$	The number of consecutive generations without BIC improvement to terminate the algorithm.	100

This search procedure is implemented in the HyPhy script MSS-GA.bf. The output of an MSS-CHC model search is a JSON file, which records every model considered by the algorithm (i.e. its D-dimensional encoding), the corresponding fitness (BIC score), and MLEs for synonymous rate parameters (⁠ $α_{i}$ ⁠). The final model is derived from the MSS-CHC run using model averaging (Kosakovsky Pond et al. 2006a, 2006b). Specifically, each model is given an Akaike weight between 0 and 1, by normalizing its BIC score as follows: $w (m) = \exp ((mi n_{models} BIC (m) - BIC (m)) / 2) / K$ with K chosen so that $w (m)$ add up to 1. Then, the rate estimates from each model are ordered from smallest (“selected”) to largest (“neutral”), with intermediate values labeled as “INT 1,” “INT 2,” etc, based on their rank, and a categorical vector is derived. For example, if the model vector was (0,1,1,2,0,2) with $α_{0} = 1, α_{1} = 0.8, α_{2} = 1.4$ ⁠, then the resulting categorical model vector would be (Int 1, selected, selected, neutral, int 1, and Neutral). Finally, for each of the D elements (rates) of the final model we compute cumulative Akaike weights that accrued to each category for that rate. A rate assignment is considered unambiguous, if a particular category (SELECTED, NEUTRAL, etc.) garners 0.90 or more of the total weight, otherwise, it is considered ambiguous.

Empirical Data Collection and Preparation

We used gene alignments from 12 Drosophila species, including Drosophila melanogaster, for which there is existing strong evidence of natural selection on synonymous sites (Shields et al. 1988; Akashi 1994; Hey and Kliman 2002). The other species included in our analysis (D. erecta, D. rhopaloa, D. elegans, D. santomea, D. simulans, D. mauritiana, D. subpulchrella, D. sechellia, D. suzukii, D. takahashii, and D. yakuba) also have available genome assemblies and similar patterns of codon usage as D. melanogaster (Supplemental File D12_codon_usage.xlsx). Soft-masked genomes and the D. melanogaster reference genome were downloaded from the genomes database of the National Center for Biotechnology Information (Supplemental File Drosophila_genomes_aligned.txt). A whole genome alignment with respect to the D melanogaster genome was generated using cactus (Armstrong et al. 2020) and the hal2mafMP.py script, and the phylogenetic tree of Suvorov et al (Suvorov et al. 2022).

To generate gene alignments, we mapped the location of D. melanogaster gene coding sequences (CDSs) in the reference genome sequence, and then built gene alignments from the whole genome alignment. A species' gene sequence was rejected from the alignment if any sequence fragments were not in the same order as the reference, or if any exons were not in the same order as the reference or not on the same strand, or if the codon frame of an exon did not match that of the reference, or if there were any stop codons. Gene alignments were rejected if their nucleotide length was not a multiple of 3 and if there were fewer than 4 species represented. These criteria yielded alignments for 11,844 genes.

Drosophila Polymorphism Analysis

If the inferred substitution rate of a codon pair reflects the extent of (purifying) selection occurring, then codon pairs with higher relative substitution rates should be closer to selective neutrality. A straightforward prediction of this hypothesis is that polymorphism densities taken from population genomic data for each of the 67 synonymous codon pairs should be positively correlated with substitution rates. In other words, codon pairs that have higher substitution rates should also have higher counts of segregating single nucleotide polymorphisms (SNPs) in a natural population. To test this prediction, we counted synonymous SNPs in a D. melanogaster population genomic sample of 197 genomes from Zambia (Lack et al. 2015) for each of the 67 synonymous codon pairs. Then for each gene, we estimated the density of each codon pair, by dividing the SNP count for a pair by the count of the corresponding amino acids.

Quantifying tRNA Abundance

We obtained tRNA abundance data for D. melanogaster as raw transcript counts per tRNA gene from Behrens et al. (2021) (GEO accession number GSE152621). We normalized the reads to counts per million, but did not also normalize by gene length because tRNA genes are all roughly the same length (76 to 90 bp) (Sharp et al. 1985). From these data, we had the tRNA abundances for 45 out of 61 codons. Some tRNA modifications, specifically Inosine and Queuosine, expand the decoding ability of tRNA isoacceptors.

When the anticodon wobble base A is biochemically modified to inosine (a Guanine analog) (Torres et al. 2015), isoacceptor recognition expands to include codons ending in U, C, or A. In eukaryotes, Inosine modifications are found on the wobble bases of the anticodons AGC (Ala), AGG (Pro), ACG (Arg), AGA (Ser), AAG (Leu), AAC (Val), AAT (Ile), and AGT (Thr). We added the abundance of each modified tRNA to the pool of tRNAs that could decode synonymous codons ending in A or C. Because the anticodon wobble base is A, its abundance already counts toward the cognate U-ending codon pools.

When the anticodon wobble base G for tRNAs that decode His, Asp, Asn, and Tyr (Chiari et al. 2010) is modified to a Queuosine, the isoacceptors' decoding capacity expands to include synonymous U-ending codons in addition to the cognate C-ending codons. We therefore added the abundance of each modified tRNA to the pool of synonymous tRNAs that could decode cognate U-ending codons.

After accounting for the expanded decoding capabilities of Queuosine- and Inosine-modified tRNAs, there were 6 codons where with multiple possible tRNA decoders: UUU (Phe), UGU (Cys), CGG (Arg), AGU (Ser), GGU (Gly), and GGG (Gly). For these codons, we assigned them the abundance of the tRNA most likely to decode them based on Crick's wobble rules (Crick 1966). Specifically, we assumed that 2-fold-redundant amino acids decode both of their codons using a single tRNA isoacceptor that recognizes U- and C-ending codons. We gave UUU (Phe) an equal tRNA abundance to UUC (Phe), gave AGU (Ser) an equal tRNA abundance to AGC (Ser), and gave UGU (Cys) an equal tRNA abundance to UGC (Cys). We deduced that the remaining amino acids (Arg and Gly) were not decoded by one tRNA that recognizes U, C, or A in the wobble position and one tRNA that only recognizes G, as neither amino acid had any tRNA abundance data for G-ending codons. We also deduced that Arg codons beginning with C could not be served by the same tRNA as Arg codons beginning with A, since the first codon position does not wobble. Therefore, we assumed that CGG (Arg) must be decoded by a single tRNA isoacceptor that recognizes G- and A-ending codons, and gave CGG (Arg) and equal tRNA abundance to CGA (Arg). We also assumed that Glycine had one tRNA isoacceptor that recognizes G- and A-ending codons (giving GGG (Gly) the same abundance as GGA (Gly)), and another isoacceptor that recognizes U- and C-ending codons (giving GGU (Gly) the same abundance as GGC (Gly)).

We made the necessary simplifying assumptions that (i) all tRNAs capable of being modified are in fact modified, (ii) modification is present in all tissues across all life stages, and (iii) modified tRNAs are equally capable of binding to all codons they can recognize. The true biology is undoubtedly more complex (for example, the level of queuosine modification is dependent on the availability of dietary queuine; Zhang et al. 2022), but teasing apart the exact tRNA pools across different conditions in a living animal is an open research question beyond the scope of this manuscript.

Results

Testing MSS by Forward Simulation

Like the MG94 model (Muse and Gaut 1994) it is based on, MSS models (including SynREVCodon) implement a model of codon substitutions as a reversible time-homogeneous Markov process with independent sites. MSS models method estimate rate parameters for synonymous codon pairs, with the implicit assumption that for a given codon pair the substitution rate in each direction is the same. However, if 1 codon is favored over the other then the mean waiting time to the favored codon will be shorter on average than the mean waiting time to the disfavored codon, violating a key MSS modeling assumption. We assessed MSS model performance under a more biologically realistic “misspecified” model using a forward population genetic simulation without the reversibility assumption, and with directly parameterized selective coefficients on a predefined set of codon pairs.

Forward population genetic simulations model the evolution of a population(s) over time by simulating the reproduction of individuals or genes as a function of their fitness. To generate data suitable for testing MSS models, a run of our simulator begins with a population that evolves to the point of mutation-drift-selection equilibrium, and then continues to evolve with periodic population splitting events to generate a final series of 11 populations from which individual sequences were then sampled.

To simulate data with a single selection coefficient on a subset of synonymous codons, we specified pairs of one-step reachable codons that were neutral with respect to each other within each redundantly encoded amino acid, and then assigned a selection coefficient to apply to pairs of codons that fell within separate neutral pair sets. We executed 100 simulations for varying levels of selection, parameterized through 2Ns (Table 4), each under a model with 25 selected codon pairs and 42 neutral pairs; these pairs were selected a priori. We next ran the genetic algorithm search procedure to infer neural and selected codon pair sets from sequence data. This classifier partitions all 67 codon pairs into nonoverlapping sets and assigns a numerical confidence of set membership (between 0 and 1) to each pair.

Table 4

Open in new tab

The results of genetic algorithm model selection applied to N = 100 alignments obtained using forward simulations for different strengths of selection (0: neutral, −5: strong purifying selection)

Selection coefficient 2Ns	ΔBIC	MCC	Fisher exact test P-value	Inferred codon classes	Simulated (true) codon classes		Ambiguously assigned pairs
Selection coefficient 2Ns	ΔBIC	MCC	Fisher exact test P-value	Inferred codon classes	Selected (25)	Neutral (42)	Ambiguously assigned pairs
0	152.8	−0.25	0.06	Selected (45)	12	32	10
0	152.8	−0.25	0.06	Neutral (22)	12	10	10
−0.5	192.6	0.093	0.54	Selected (53)	21	32	7
−0.5	192.6	0.093	0.54	Neutral (14)	4	10	7
−1	282.5	0.51	4.02e-05	Selected (29)	19	10	6
−1	282.5	0.51	4.02e-05	Neutral (38)	6	32	6
−1.5	675.0	0.65	1.7e-7	Selected (26)	20	6	4
−1.5	675.0	0.65	1.7e-7	Neutral (41)	5	36	4
−2	1630.4	0.81	2.07e-11	Selected (23)	21	2	3
−2	1630.4	0.81	2.07e-11	Neutral (44)	4	40	3
−5	17,974.7	0.97	2.6e-17	Selected (24)	24	0	0
−5	17,974.7	0.97	2.6e-17	Neutral (43)	1	42	0

Selection coefficient 2Ns	ΔBIC	MCC	Fisher exact test P-value	Inferred codon classes	Simulated (true) codon classes		Ambiguously assigned pairs
Selection coefficient 2Ns	ΔBIC	MCC	Fisher exact test P-value	Inferred codon classes	Selected (25)	Neutral (42)	Ambiguously assigned pairs
0	152.8	−0.25	0.06	Selected (45)	12	32	10
0	152.8	−0.25	0.06	Neutral (22)	12	10	10
−0.5	192.6	0.093	0.54	Selected (53)	21	32	7
−0.5	192.6	0.093	0.54	Neutral (14)	4	10	7
−1	282.5	0.51	4.02e-05	Selected (29)	19	10	6
−1	282.5	0.51	4.02e-05	Neutral (38)	6	32	6
−1.5	675.0	0.65	1.7e-7	Selected (26)	20	6	4
−1.5	675.0	0.65	1.7e-7	Neutral (41)	5	36	4
−2	1630.4	0.81	2.07e-11	Selected (23)	21	2	3
−2	1630.4	0.81	2.07e-11	Neutral (44)	4	40	3
−5	17,974.7	0.97	2.6e-17	Selected (24)	24	0	0
−5	17,974.7	0.97	2.6e-17	Neutral (43)	1	42	0

ΔBIC reports the improvement by the best model found by the genetic algorithm search compared to the baseline MG94 model. The 2 × 2 contingency table of simulated/inferred classifications for each codon pair, the corresponding P-value for association, and the MCC are provided. Bolded values in the 2 × 2 contingency table highlight the number of simulated codon classes that were correctly classified by the genetic algorithm. For classification purposes, we place each codon pair in the selected/neutral class based on which received more model averaged support. To quantify the degree of assignment uncertainty, we also tabulate how many of the 67 rates were “ambiguous,” i.e. had <0.9 support for the class assignment.

Table 4

Open in new tab

The results of genetic algorithm model selection applied to N = 100 alignments obtained using forward simulations for different strengths of selection (0: neutral, −5: strong purifying selection)

Selection coefficient 2Ns	ΔBIC	MCC	Fisher exact test P-value	Inferred codon classes	Simulated (true) codon classes		Ambiguously assigned pairs
Selection coefficient 2Ns	ΔBIC	MCC	Fisher exact test P-value	Inferred codon classes	Selected (25)	Neutral (42)	Ambiguously assigned pairs
0	152.8	−0.25	0.06	Selected (45)	12	32	10
0	152.8	−0.25	0.06	Neutral (22)	12	10	10
−0.5	192.6	0.093	0.54	Selected (53)	21	32	7
−0.5	192.6	0.093	0.54	Neutral (14)	4	10	7
−1	282.5	0.51	4.02e-05	Selected (29)	19	10	6
−1	282.5	0.51	4.02e-05	Neutral (38)	6	32	6
−1.5	675.0	0.65	1.7e-7	Selected (26)	20	6	4
−1.5	675.0	0.65	1.7e-7	Neutral (41)	5	36	4
−2	1630.4	0.81	2.07e-11	Selected (23)	21	2	3
−2	1630.4	0.81	2.07e-11	Neutral (44)	4	40	3
−5	17,974.7	0.97	2.6e-17	Selected (24)	24	0	0
−5	17,974.7	0.97	2.6e-17	Neutral (43)	1	42	0

Selection coefficient 2Ns	ΔBIC	MCC	Fisher exact test P-value	Inferred codon classes	Simulated (true) codon classes		Ambiguously assigned pairs
Selection coefficient 2Ns	ΔBIC	MCC	Fisher exact test P-value	Inferred codon classes	Selected (25)	Neutral (42)	Ambiguously assigned pairs
0	152.8	−0.25	0.06	Selected (45)	12	32	10
0	152.8	−0.25	0.06	Neutral (22)	12	10	10
−0.5	192.6	0.093	0.54	Selected (53)	21	32	7
−0.5	192.6	0.093	0.54	Neutral (14)	4	10	7
−1	282.5	0.51	4.02e-05	Selected (29)	19	10	6
−1	282.5	0.51	4.02e-05	Neutral (38)	6	32	6
−1.5	675.0	0.65	1.7e-7	Selected (26)	20	6	4
−1.5	675.0	0.65	1.7e-7	Neutral (41)	5	36	4
−2	1630.4	0.81	2.07e-11	Selected (23)	21	2	3
−2	1630.4	0.81	2.07e-11	Neutral (44)	4	40	3
−5	17,974.7	0.97	2.6e-17	Selected (24)	24	0	0
−5	17,974.7	0.97	2.6e-17	Neutral (43)	1	42	0

The model selection procedure does progressively better as the strength of negative selection is increased. This is evidenced by the progressively improving model fit (ΔBIC), better agreement with the ground truth (MCC, Fisher's exact test P-value), and the contingency table itself. For strong selective regimes, the inference procedure is highly accurate, and starting at a relatively weak selective level (2Ns = −1), there is significant predictive value in the inferred codon classes. Thus, despite being provided with data simulated under a cardinally different model, MSS Markov models are able to infer the set of neutral codon pairs with high accuracy, assuming that selection acting on the non-neutral codon pairs is sufficiently strong. The genetic algorithm procedure remains sensitive to model misspecification, since it still finds statistically significant improvement for an MSS model compared to the no selection (Ns = 0) model, based on ΔBIC. Because forward-simulated data is generated under a model that is different from the time-reversible Markov model assumed by the MSS, apparent improvement in fit may be due to phenomenological loading (Jones et al. 2018). As a separate exercise we also ran GA model selection on parametrically simulated under the properly specified model (MG94). When null data (100 alignments per run, 25 replicates) are simulated parametrically under the expected MG94 model, the genetic algorithm procedure correctly selected the null (no synonymous rate variation) model in 22/25 simulated cases, and in the remaining 3 cases had ΔBIC ≤ 2 (detailed data not shown, see Supplementary Information, Supplementary Material online for references).

Synonymous Substitution Rates in Drosophila Genes

Description and Information Content of the Data

We analyzed 11,844 gene alignments from Drosophila containing a median of 12 (range 4 to 23) sequences with a median of 411 (range 21 to 8933) codons, and median tree length of 0.48 (range 0.017 to 2.0) substitutions per nucleotide site (supplementary fig. S1, Supplementary Material online). To gain some intuition of data content in these genes that is relevant to MSS inference, we mapped codon substitution to branches using joint maximum likelihood ancestral reconstruction methods implemented in SLAC (Kosakovsky Pond and Frost 2005) and counted approximately how many single-nucleotide synonymous substitutions (SNSS) occurred in Drosophila genes. The entire collection of genes was estimated to include ∼4.12 M SNSS, and this can be viewed as an “order-of-magnitude” approximation of the sample size available for MSS inference. Per gene, there was a median of 254 (range 0 to 5762) SNSS, and as expected, this number was positively correlated with gene length and divergence (supplementary fig. S2, Supplementary Material online). Among 67 types of SNSS available in the universal genetic code (e.g. AAA↔AAG, a substitution between Lysine codons), there was an order of magnitude difference between how frequent individual substitutions were (Fig. 1, supplementary table S2, Supplementary Material online), implying that some rates might be easier to infer than others, and some rate estimates might be less noisy. Furthermore, for a fixed amino acid with more than 1 SNSS, there was generally a significant difference between the least common and the most common pair of synonymous codons being substituted. For example, there was a ∼6-fold difference in raw counts between the most (CTG↔TTG, ∼160 K) and least (CTA↔TTA, ∼25 K) common SNSS among Leucine codons, even though the same pair of nucleotides are involved (C,T). This disparity is relevant in the context of codon-level MSS models, i.e. when individual codon pairs coding for the same amino acid may have unequal relative substitution rates. Relatively rare substitutions, i.e. those present at small count numbers (e.g. TCA↔TCT for Serine, having a mean of 1.21 per alignment), or those found only in a minority of datasets (e.g. TCA↔TCT is found in 5808/11,844 alignments) can have larger sampling noise, and may be inferred less accurately. SNSS count distributions have heavy tails (Fig. 1), with a significant number of genes where substitution counts for a pair significantly exceed the mean. This pattern will help explain why rate estimates under the MSS models may also be expected to have heavy tails, with large estimates for individual genes.

Fig. 1.

Left: the average (per alignment) number of SNSS for each of the 67 codon pairs in the universal genetic code (gray dots) inferred from the Drosophila gene alignments. Plotted horizontal lines demarcate averages over individual codon pairs for each amino acid. Right (top): the distribution of the mean number of SNSS codon pairs per alignments versus the excess kurtosis shows that all of the MSS rates have a large number of observations in the tail end of the rate distributions. Right (bottom): the distribution of mean SNSS codon pairs per alignment versus the number of alignments where >0 of substitutions for the specific codon-pair are present.

Open in new tab Download slide

Estimating Synonymous Substitution Rates Using Maximum Likelihood

The most general model in the MSS class is SynREVCodon which directly estimates the relative substitution rate for each of the 67 synonymous codon pairs, $α_{i}$ ⁠. SynREVCodon has 66 identifiable relative synonymous rate (RSR) parameters. Because the mean of the synonymous substitution rates is confounded with branch lengths (only products of α and t can be estimated), we impose the constraint $Σ α_{i} / 67 = 1$ ⁠. We also consider a simpler model, SynREV, which is nested in SynREVCodon and sets all the pairwise rates for codons encoding the same amino acid to be the same. This results in 18 amino acid level codon parameters (methionine and tryptophan each have a single corresponding codon).

We fit MSS models to the collection of Drosophila alignments in 2 different ways. The first is the fully independent model (FIM), where each gene has its own set of synonymous substitution rates $α_{i}$ ⁠, along with other parameters (branch lengths, nucleotide rate biases, codon frequencies, and the nonsynonymous rates). The second is the fully shared model (FSM), where the same set of α parameters is inferred jointly from all genes, while all other parameters are estimated independently for each gene. More precisely, we estimate branch lengths, nucleotide rate biases, codon frequencies, and the nonsynonymous rates using the standard MG94xREV model for each gene, and then hold those at their point estimates while fitting the MSS rates jointly, and also estimating an overall tree length scaling parameter for each gene (proportional branch lengths are maintained from gene level fits). These approximations make the multigene fit computationally feasible.

The FIM model is expected to produce noisy gene-level estimates because parameter-rich models are fitted to relatively small alignments, but it is useful for quantifying the extent of gene-level variability in rates. The FSM model is expected, on the other hand, to yield “average” estimates with low variance, but has no capacity to account for gene-to-gene heterogeneity. Because the standard MG94 model (⁠ $α_{i} \equiv 1$ ⁠) is nested within the MSS (SynREV and SynREVCodon) model, a standard likelihood ratio test (LRT) with 17 (SynREV) or 67 (SynREVCodon) degrees of freedom can be used to test the hypothesis that some synonymous substitution rates are different, gene-by-gene, and genome-wide.

We fitted the nested hierarchy of models MG94xREV ⊂ SynREV ⊂ SynREVCodon to each of the 11,844 gene alignments and compared them to each other via LRTs (Table 5). Despite the relatively small sizes of individual alignments, the standard codon model can be rejected in favor of the parameter-rich MSS models for a substantial fraction (SynREV) or an outright majority (SynREVCodon) of the alignments, e.g. based on false discovery rates. The family wise error rate correction (Holm–Bonferroni) is much more conservative. Among the 2 MSS models, SynREV can be rejected in favor of SynREVCodon on the majority of the tested alignments.

Table 5

Open in new tab

Model comparison results using LRTs on individual Drosophila gene alignments

Null	Alternative	Rejected at P = 0.05	Rejected at P = 0.05 (Holm–Bonferroni)	Rejected at q = 0.10 (FDR, Benjamini–Hochberg)
MG94xREV	SynREV	5,237 (44.2%)	257 (2.2%)	4,920 (41.5%)
MG94xREV	SynREVCodon	8,086 (68.3%)	1,622 (13.7%)	8,546 (72.2%)
SynREV	SynREVCodon	6,793 (57.4%)	903 (7.5%)	7,065 (59.7%)

Null	Alternative	Rejected at P = 0.05	Rejected at P = 0.05 (Holm–Bonferroni)	Rejected at q = 0.10 (FDR, Benjamini–Hochberg)
MG94xREV	SynREV	5,237 (44.2%)	257 (2.2%)	4,920 (41.5%)
MG94xREV	SynREVCodon	8,086 (68.3%)	1,622 (13.7%)	8,546 (72.2%)
SynREV	SynREVCodon	6,793 (57.4%)	903 (7.5%)	7,065 (59.7%)

Table 5

Open in new tab

Model comparison results using LRTs on individual Drosophila gene alignments

Null	Alternative	Rejected at P = 0.05	Rejected at P = 0.05 (Holm–Bonferroni)	Rejected at q = 0.10 (FDR, Benjamini–Hochberg)
MG94xREV	SynREV	5,237 (44.2%)	257 (2.2%)	4,920 (41.5%)
MG94xREV	SynREVCodon	8,086 (68.3%)	1,622 (13.7%)	8,546 (72.2%)
SynREV	SynREVCodon	6,793 (57.4%)	903 (7.5%)	7,065 (59.7%)

Null	Alternative	Rejected at P = 0.05	Rejected at P = 0.05 (Holm–Bonferroni)	Rejected at q = 0.10 (FDR, Benjamini–Hochberg)
MG94xREV	SynREV	5,237 (44.2%)	257 (2.2%)	4,920 (41.5%)
MG94xREV	SynREVCodon	8,086 (68.3%)	1,622 (13.7%)	8,546 (72.2%)
SynREV	SynREVCodon	6,793 (57.4%)	903 (7.5%)	7,065 (59.7%)

Ranking the rates inferred by SynREVCodon by their median (across genes) estimates reveals a clear gradient (Fig. 2), with TTC:TTT (phenylalanine) as the fastest pair, and GGC:GGG (glycine) as the slowest pair. Because MSS models are quite parameter rich, and individual alignments are not large, many rate estimates are exactly 0 (e.g. no corresponding substitutions have occurred), and many other relative rate estimates are large, resulting in bimodal distributions. Similarly, there is a gradient of median relative amino acid rates inferred by the SynREV model (Fig. 3), with alanine being the slowest, and phenylalanine—the fastest. In the following sections, we will show that some of this variation is attributable to cognate tRNA abundance, and is significantly correlated with independent measures of selection, namely population-level polymorphism data.

Fig. 2.

Maximum likelihood point estimates for relative rates of each of the 67 pairs of synonymous substitutions in the SynREVCodon model. The rates are sorted by the median of the corresponding distribution (black tick). Rates that are exactly zero or those that are ≥3 are shown as circles, whose size and color reflect the number of alignments contributing to the corresponding estimates. The heatmap for each pair is generated using only the rates that are in (0,3).

Open in new tab Download slide

Fig. 3.

Maximum likelihood point estimates for relative rates of each of the 18 amino acid level synonymous substitutions in the SynREV model. The rates are sorted by the median of the corresponding distribution (black tick). Rates that are exactly zero or those that are ≥3 are shown as circles, whose size and color reflect the number of alignments contributing to the corresponding estimates. The heatmap for each pair is generated using only the rates that are in (0,3).

Open in new tab Download slide

For the 9 amino acids that have more than a single synonymous codon pair exchangeable by a single nucleotide substitution (Fig. 4), the rates for individual codon pairs are often quite different and not well represented by a single (amino acid level) rate. For example, for Leucine there are 9 synonymous codon pairs exchangeable by a single nucleotide substitution. The medians of the corresponding rate estimates range from 0.34 (CTC:CTG conserved) to 1.29 (CTA:CTT neutral), while the median amino acid level rate is 0.92. This observation, combined with the high rate of rejection of SynREV in favor of SynREVCodon (Table 5), provides compelling evidence that model variation in relative substitution rates should be modeled on the level of individual codon pairs.

Fig. 4.

Median (over genes) estimates of relative substitution rates from SynREV (red ticks) and SynREVCodon (blue dots) models. For amino acids with multiple pairs of synonymous substitutions, individual codon pairs are dispersed around the corresponding amino acid level estimates.

Open in new tab Download slide

Codon Pair Substitution Rates Show Strong Correlation With Polymorphism Levels

For each pair of synonymous codons that differ by a single nucleotide, SynREVCodon provides an estimate of the corresponding relative rate of substitution. It does so while accounting for some mutational biases that give rise to uneven base composition, as well as the single-step rates from those 2 codons to other synonymous and nonsynonymous states. Natural selection operating on synonymous codon pairs is a plausible biological process contributing to inferred rate variation, although the complexity of the actual biological process and the unavoidable crudeness of substitution models make direct attribution unadvisable (Jones et al. 2018). Nonetheless, neutral codon pairs are expected to exhibit the highest relative substitution rates. For pairs where 1 codon is favored over another, the substitution process would alternate between relatively rapid substitution, from unfavored to favored, followed by a much slower rate of reversion. However, the average rate of selected synonymous codon pairs is expected to be lower than for neutral codon pairs, because the waiting time for the unfavored mutations is much longer than for the favored or neutral mutations (Kimura 1962; Rahman et al. 2021).

To ascertain how well the variation in codon pair rates reflects differences in selection, we turned to a corollary prediction from population genetics: alleles under directional selection will tend to be lost or fixed relatively rapidly relative to the neutral alleles. In other words, codon pairs that experience selection should have lower rates of polymorphism within a natural population, than should neutral codon pairs.

We estimated mean SNP density across genes for each of the codon pairs, as observed in a collection of 197 genomes from Zambia, and compared them against the median of the corresponding substitution rate estimated from individual genes by SynREVCodon model (Fig. 5). The linear correlation coefficient between the 2 quantities is 0.52 (corrected R² of 0.26). Spearman rank-based correlation is similarly high (ρ=0.48) and significant (P = 0.002).

Fig. 5.

There is a positive correlation (Spearman's ρ = 0.48) between the median inferred synonymous substitution rates and the mean SNP density observed in a collection of 197 Drosophila genomes from Zambia. SNP density is an independent measure of selection pressure, as selection pressure to maintain a preferred codon decreases the SNP density for that codon. Correspondingly, the synonymous substitution rate for that codon will also be lower.

Open in new tab Download slide

As the 2 data sets, SNP density in a natural population and species alignments used for SynREVCodon, are quite independent of each other, the finding of a strong correlation supports the claim that codon pairs with high estimated substitution rates are under less selection than those with lower estimated substitution rates.

Codon Pair Substitution Rates Correlate With tRNA Abundance

Researchers have long hypothesized that codon usage is under selection to conform with tRNA abundance, as a means of ensuring translational efficiency (Ikemura 1981; Kanaya et al. 2001). Compelling evidence for this hypothesis comes from the observation that codon usage bias is inversely correlated with gene expression: highly expressed genes almost exclusively use synonymous codons served by the most abundant tRNAs. We therefore assume that substitutions between a codon served by a highly abundant tRNA and a codon served by a less abundant tRNA would be disfavored, and that purifying selection would reduce the rate of such substitutions. To examine our assumption, we obtained and normalized tRNA transcript counts from Behrens et al. (2021). We then examined tRNA modifications that increase the decoding capacity of individual tRNA isoacceptors. For each tRNA that was modified to recognize an additional codon, we added the modified tRNA's abundance to the pool of tRNAs capable of decoding that codon. For codons where the decoder tRNA was ambiguous, we used Crick's wobble pairing rules to predict which tRNA decoded which codon (Crick 1966).

We calculated the ratios in tRNA abundance between codons in a pair. Though SynREVCodon rates are symmetric (e.g. the rate of the Phenylalanine substitution TTC > TTT is equivalent to the rate TTT > TTC), the abundance ratio is sensitive to the order in which abundances are compared (e.g. if 1 tRNA pool is 4 times as abundant as the other, the abundance ratio may be 4/1 or ¼). To enforce consistency among the abundance ratios, we chose to always measure the abundance of the more common tRNA decoder pool compared to the less common one (e.g. an abundance ratio of 4/1). Doing so allowed us to maintain a consistent interpretation of the relationship between substitution rates and abundance ratio magnitudes. We found a significant negative correlation between the median synonymous substitution rates inferred by SynREVCodon and the tRNA abundance ratios (rank-based Spearman's ρ = −0.41, P = 0.005) (Fig. 6). The negative correlation is expected, because we expect higher evolutionary rates to correspond to codon pairs with similar tRNA abundances, and thus lower abundance ratio.

Fig. 6.

There is a negative correlation (rank-based Spearman's ρ = −0.41, P = 0.005) between the median inferred synonymous substitution rates (SynREVCodon), and the tRNA abundance ratio for the corresponding codon pair.

Open in new tab Download slide

Estimates of dN/dS (ω) Ratios are Decreased Under MSS Models

As demonstrated by Rahman et al. (Rahman et al. 2021) for simpler MSS models with a priori amino acid partitioning, the estimates of dN/dS ratios are expected to decrease when selection on synonymous substitutions is included in the model. For simpler models, where a subset of codons is designated as neutral and only those substitutions contribute toward dS, the intuition is simple: neutral substitutions occur at faster rates than selected substitutions, and including only those in dS will increase the value of the denominator and thus lower the ratio. Because SynREV and SynREVCodon models use a different parameterization, with dN/dS defined relative to multiple synonymous rates, it is not immediately clear what effect this would have on dN/dS estimates. We compared ω (dN/dS) estimates obtained under the standard MG94xREV model, the SynREV model, and the SynREVCodon model (Fig. 7), and found that there is a highly significant (P<<0.001) linear correlation between the estimates obtained under the standard MG94xREV model, and the 2 MSS models. In both cases, there was a mean reduction ω estimates, by a factor of 0.86 (SynREV) and 0.68 (SynREVCodon).

Fig. 7.

The relationship between ω (dN/dS) under MG94xREV, SynREV, and SynREVCodon models. Linear regression with the 0 intercept was run using the lm (y∼x + 0) function in R.

Open in new tab Download slide

We also compared ω (dN/dS) estimates obtained under the standard MG94xREV model, the SynREV model, and the SynREVCodon model for a set of 14,371 primate gene alignments, containing a median of 25 (range 8 to 26) sequences with a median of 466 (range 44 to 5137) codons, and median tree length of 0.40 (range 0.045 to 1.9) substitutions per nucleotide site. We found that ω estimates were reduced by a factor of 0.895 under the SynREV model, and by a factor of 0.82 under the SynREVCodon model (Fig. 8), suggesting that selection on synonymous codon usage is present across taxa. Selection on synonymous codon usage in primates has been reported in the past, even though it may be less pronounced than in some other clades (Chamary et al. 2006; Drummond and Wilke 2008; Dhindsa et al. 2020). As discussed in Rahman et al. (2021), potentially biased dN/dS estimates with similar extent of bias will affect many downstream analyses which rely on them.

Fig. 8.

The relationship between ω under MG94xREV, SynREV, and SynREVCodon models for primate gene alignments. Linear regression with the 0 intercept was run using the lm (y∼x + θ) function in R.

Open in new tab Download slide

Joint Estimation of MSS Models

Because genes are relatively short, and the number of sequences is small, we expect these rate estimates to be noisy. Indeed, as can be seen in Fig. 2, for many genes individual codon pair rate estimates lie in the extreme tails of the distribution: exactly 0 (which is not biologically plausible: surely any codon pair can be exchanged at some, albeit low, rate), or a large value (possible saturation). If we assume that a single MSS model can adequately represent substitution patterns in a large fraction of the genes, then we could simply infer such a model from a large number of genes jointly. This is done while allowing each alignment to use a separate set of parameters such as base frequencies, nucleotide substitution biases, dN/dS ratios, etc. Of course, if no such single model exists, i.e. sets of genes have discordant substitution preferences, and then a single “average” model may be a poor fit to any individual gene.

From all N = 11,844 available alignments, we selected random subsets of K = 4, 8, 16, 32, 64, 128, 256, 512, 1024, and 2,048 genes (5 independent draws for each K), and jointly inferred MSS models from them. This procedure is useful in establishing how stable rate inferences are based on the number of genes, and between replicates, and also in providing a means for statistically uncontaminated testing of model fit. Indeed, we may use a model inferred from a set of K genes, on the independent (hold-out) set of N-K remaining alignments for validation.

Starting with as few as 8 alignments, MSS models offer a statistically significant improvement over baseline (MG94 models) based on LRT comparisons, and with 16 alignments, they also show a consistent improvement in AIC scores (Table 6). As the number of alignments increase, the improvement in fit (AIC points) per alignment is around 3.7 for SynREV and 10 for SynREVCodon. Thus, using a “global” MSS model on multiple alignments captures the general substitution patterns of a large fraction of genes.

Table 6

Open in new tab

Fit improvements when comparing MG94xREV to SynREV and SynREVCodon models fitted jointly to subsets of multiple alignments

# Genes	LRT P-value versus MG94 P-value Mean (range)		ΔAIC/alignment Mean (range)
# Genes	SynREV	SynREVCodon	SynREV	SynREVCodon
4	0.121 (0.000, 0.599)	0.023 (0.000, 0.114)	0.369 (−4.762, 2.415)	−3.973 (−13.203, −0.518)
8	0.012 (0.000, 0.038)	0.000 (0.000, 0.002)	0.561 (−0.672, 1.683)	0.138 (−3.722, 4.881)
16	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	2.550 (1.261, 4.521)	5.198 (2.331, 10.588)
32	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.169 (1.576, 4.170)	8.249 (5.252, 12.248)
64	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.469 (2.615, 3.882)	9.435 (7.614, 10.026)
128	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.819 (3.427, 4.333)	10.060 (9.031, 10.680)
256	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.843 (3.628, 4.181)	10.092 (9.369, 11.070)
512	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.608 (3.338, 3.852)	10.100 (9.285, 10.624)
1024	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.799 (3.678, 3.907)	10.413 (10.239, 10.735)
2048	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.718 (3.617, 3.771)	10.261 (10.075, 10.380)

# Genes	LRT P-value versus MG94 P-value Mean (range)		ΔAIC/alignment Mean (range)
# Genes	SynREV	SynREVCodon	SynREV	SynREVCodon
4	0.121 (0.000, 0.599)	0.023 (0.000, 0.114)	0.369 (−4.762, 2.415)	−3.973 (−13.203, −0.518)
8	0.012 (0.000, 0.038)	0.000 (0.000, 0.002)	0.561 (−0.672, 1.683)	0.138 (−3.722, 4.881)
16	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	2.550 (1.261, 4.521)	5.198 (2.331, 10.588)
32	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.169 (1.576, 4.170)	8.249 (5.252, 12.248)
64	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.469 (2.615, 3.882)	9.435 (7.614, 10.026)
128	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.819 (3.427, 4.333)	10.060 (9.031, 10.680)
256	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.843 (3.628, 4.181)	10.092 (9.369, 11.070)
512	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.608 (3.338, 3.852)	10.100 (9.285, 10.624)
1024	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.799 (3.678, 3.907)	10.413 (10.239, 10.735)
2048	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.718 (3.617, 3.771)	10.261 (10.075, 10.380)

Means and ranges refer to the 5 random subsets of genes for a given batch size.

Table 6

Open in new tab

Fit improvements when comparing MG94xREV to SynREV and SynREVCodon models fitted jointly to subsets of multiple alignments

# Genes	LRT P-value versus MG94 P-value Mean (range)		ΔAIC/alignment Mean (range)
# Genes	SynREV	SynREVCodon	SynREV	SynREVCodon
4	0.121 (0.000, 0.599)	0.023 (0.000, 0.114)	0.369 (−4.762, 2.415)	−3.973 (−13.203, −0.518)
8	0.012 (0.000, 0.038)	0.000 (0.000, 0.002)	0.561 (−0.672, 1.683)	0.138 (−3.722, 4.881)
16	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	2.550 (1.261, 4.521)	5.198 (2.331, 10.588)
32	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.169 (1.576, 4.170)	8.249 (5.252, 12.248)
64	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.469 (2.615, 3.882)	9.435 (7.614, 10.026)
128	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.819 (3.427, 4.333)	10.060 (9.031, 10.680)
256	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.843 (3.628, 4.181)	10.092 (9.369, 11.070)
512	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.608 (3.338, 3.852)	10.100 (9.285, 10.624)
1024	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.799 (3.678, 3.907)	10.413 (10.239, 10.735)
2048	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.718 (3.617, 3.771)	10.261 (10.075, 10.380)

# Genes	LRT P-value versus MG94 P-value Mean (range)		ΔAIC/alignment Mean (range)
# Genes	SynREV	SynREVCodon	SynREV	SynREVCodon
4	0.121 (0.000, 0.599)	0.023 (0.000, 0.114)	0.369 (−4.762, 2.415)	−3.973 (−13.203, −0.518)
8	0.012 (0.000, 0.038)	0.000 (0.000, 0.002)	0.561 (−0.672, 1.683)	0.138 (−3.722, 4.881)
16	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	2.550 (1.261, 4.521)	5.198 (2.331, 10.588)
32	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.169 (1.576, 4.170)	8.249 (5.252, 12.248)
64	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.469 (2.615, 3.882)	9.435 (7.614, 10.026)
128	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.819 (3.427, 4.333)	10.060 (9.031, 10.680)
256	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.843 (3.628, 4.181)	10.092 (9.369, 11.070)
512	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.608 (3.338, 3.852)	10.100 (9.285, 10.624)
1024	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.799 (3.678, 3.907)	10.413 (10.239, 10.735)
2048	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)	3.718 (3.617, 3.771)	10.261 (10.075, 10.380)

Means and ranges refer to the 5 random subsets of genes for a given batch size.

A More Detailed Examination of How Model Choice Impacts dN/dS Estimation

Our MG94-based models define dN/dS indirectly (Table 1), by estimating the nonsynonymous substitution rate β and the synonymous substitution rate ɑ separately. The ratio of β/ɑ defines ω. One can also define dS (dN) as the expected numbers of synonymous (nonsynonymous) substitutions per unit time normalized by the expected number of synonymous and nonsynonymous sites. The latter 2 quantities are nontrivial to define and how this is done may affect dS and dN estimates substantially (see Muse 1996). For the standard MG94 model ω = β, since ɑ:= 1. For the MSS and GA-based models (Table 2), because one of the ɑ rates is identified as “neutral,” i.e. constrained to be 1, ω will be defined relative to that rate (referred to as dSn in Rahman et al. 2021). For SynREV and SynREVCodon models, only the mean of ɑ is constrained to be 1, and there is no analogous set of neutral synonymous substitutions.

Let us consider how the estimates of ω vary under different models for the barren Drosophila gene (FBgn0014127) in Table 7. In addition to the previously defined models, we also define 2 new models for illustration purposes.

Joint codon empirical. This model takes the numerical estimates for ɑ rates from the SynREVCodon model inferred jointly from 1,024 alignments (not including FBgn0014127) and then treats them as constant. Such a model is similar to protein empirical models (e.g. Dayhoff et al. 1978; Whelan and Goldman 2001) and the ECM model of Kosiol et al. (Kosiol et al. 2007)
Joint codon discretized. Using the same joint model as above, the estimates for ɑ are ranked, and the top 15% (by numerical value of the estimates) are taken to be “neutral,” thereby generating an a priori model, with a “neutral” class, and the corresponding dSn rate.

Table 7

Open in new tab

Estimates of ω under different models for the alignment of 10 sequences for the barren gene (FBgn0014127)

Model	ω	log(L)	LRT P-value	SD [ɑ]
MG94	0.085	−7040.8	n/a	0.000
SynREV	0.074	−7023.43	0.007	0.268
SynREVCodon	0.063	−6982.27	0.0002	0.744
Joint codon empirical	0.085	−7026.40	<10⁻⁶	0.176
Joint codon discretized	0.041	−7024.58	<10⁻⁶	0.200

Model	ω	log(L)	LRT P-value	SD [ɑ]
MG94	0.085	−7040.8	n/a	0.000
SynREV	0.074	−7023.43	0.007	0.268
SynREVCodon	0.063	−6982.27	0.0002	0.744
Joint codon empirical	0.085	−7026.40	<10⁻⁶	0.176
Joint codon discretized	0.041	−7024.58	<10⁻⁶	0.200

LRT P-value is computed using MG94 as null, with the appropriate number of degrees of freedom for each model (17, 67, 0, 1). SD [ɑ] is computed directly on the distribution of 68 codon-pair rates for each model.

Table 7

Open in new tab

Estimates of ω under different models for the alignment of 10 sequences for the barren gene (FBgn0014127)

Model	ω	log(L)	LRT P-value	SD [ɑ]
MG94	0.085	−7040.8	n/a	0.000
SynREV	0.074	−7023.43	0.007	0.268
SynREVCodon	0.063	−6982.27	0.0002	0.744
Joint codon empirical	0.085	−7026.40	<10⁻⁶	0.176
Joint codon discretized	0.041	−7024.58	<10⁻⁶	0.200

Model	ω	log(L)	LRT P-value	SD [ɑ]
MG94	0.085	−7040.8	n/a	0.000
SynREV	0.074	−7023.43	0.007	0.268
SynREVCodon	0.063	−6982.27	0.0002	0.744
Joint codon empirical	0.085	−7026.40	<10⁻⁶	0.176
Joint codon discretized	0.041	−7024.58	<10⁻⁶	0.200

Both of these models can be viewed as practical alternatives to the rather costly and noisy gene-by-gene or joint inference. Because these models are precomputed, they can be applied to genes from the same genus to obtain a first-order correction for variable ɑ rates with a relatively minimal parametric or computational overhead.

The estimate of ω is sensitive to the choice of model and is no greater than the MG94 (baseline) estimate for any of the MSS models. Biological datasets appear to support a significant but not extreme reduction (0.6 to 0.8×) in ω estimates compared to the MG94 model.

To establish a statistical performance baseline, we ran an ω bias simulation experiment. We randomly selected 100 Drosophila alignments, and then used the MG94, SynREV, and SynREVCodon model fits for each of these alignments to simulate 100 replicates per model (300 total per alignment). We then fitted MG94, SynREV, and SynREVCodon to each of the 30,000 replicates to estimate ω and assess the degree of ω estimation bias (Table 8).

Estimates obtained under the MG94 model are unbiased when the generating model is also MG94, and have a strong upward bias when the generating model is an MSS model. This bias is often severe.
Estimates obtained under the SynREVCodon model are biased downwards even when the generating model is also SynREVCodon; they are quite noisy (high variance), as can be expected for such a parameter rich model. Relative bias is less extreme than the corresponding values for the MG94 model.
The behavior of estimates obtained under the SynREV model is somewhere in between; a small downward bias when the true model is MG94, a significant upward bias but less extreme than MG94 on the data from SynREVCodon, and a slight downward bias on the data generated under the SynREV model itself.

Table 8

Open in new tab

Relative bias, defined as $(E [\hat{ω}] - ω_{true}) / ω_{true}$ ⁠, for different generating and fitted model pairs

Generating Mode	Fitted Model	ω estimate relative bias
Generating Mode	Fitted Model	Mean	Variance	Rate of severe* bias
MG94	MG94	0.001	0.015	0.00%
MG94	SYNREV	−0.022	0.049	0.00%
MG94	SYNREVCODON	−0.135	0.126	0.00%
SYNREV	MG94	0.175	0.575	9.09%
SYNREV	SYNREV	−0.023	0.058	0.00%
SYNREV	SYNREVCODON	−0.049	0.187	2.02%
SYNREVCODON	MG94	0.665	1.225	61.62%
SYNREVCODON	SYNREV	0.351	0.575	25.25%
SYNREVCODON	SYNREVCODON	−0.107	0.137	0.00%

Generating Mode	Fitted Model	ω estimate relative bias
Generating Mode	Fitted Model	Mean	Variance	Rate of severe* bias
MG94	MG94	0.001	0.015	0.00%
MG94	SYNREV	−0.022	0.049	0.00%
MG94	SYNREVCODON	−0.135	0.126	0.00%
SYNREV	MG94	0.175	0.575	9.09%
SYNREV	SYNREV	−0.023	0.058	0.00%
SYNREV	SYNREVCODON	−0.049	0.187	2.02%
SYNREVCODON	MG94	0.665	1.225	61.62%
SYNREVCODON	SYNREV	0.351	0.575	25.25%
SYNREVCODON	SYNREVCODON	−0.107	0.137	0.00%

Rate of severe bias is defined as the proportion of simulation scenarios where the true value of ω was outside the [2.5% to 97.5%] quantile distribution for ω estimates, i.e. the true value is in the tails of the sampling distribution.

Table 8

Open in new tab

Relative bias, defined as $(E [\hat{ω}] - ω_{true}) / ω_{true}$ ⁠, for different generating and fitted model pairs

Generating Mode	Fitted Model	ω estimate relative bias
Generating Mode	Fitted Model	Mean	Variance	Rate of severe* bias
MG94	MG94	0.001	0.015	0.00%
MG94	SYNREV	−0.022	0.049	0.00%
MG94	SYNREVCODON	−0.135	0.126	0.00%
SYNREV	MG94	0.175	0.575	9.09%
SYNREV	SYNREV	−0.023	0.058	0.00%
SYNREV	SYNREVCODON	−0.049	0.187	2.02%
SYNREVCODON	MG94	0.665	1.225	61.62%
SYNREVCODON	SYNREV	0.351	0.575	25.25%
SYNREVCODON	SYNREVCODON	−0.107	0.137	0.00%

Generating Mode	Fitted Model	ω estimate relative bias
Generating Mode	Fitted Model	Mean	Variance	Rate of severe* bias
MG94	MG94	0.001	0.015	0.00%
MG94	SYNREV	−0.022	0.049	0.00%
MG94	SYNREVCODON	−0.135	0.126	0.00%
SYNREV	MG94	0.175	0.575	9.09%
SYNREV	SYNREV	−0.023	0.058	0.00%
SYNREV	SYNREVCODON	−0.049	0.187	2.02%
SYNREVCODON	MG94	0.665	1.225	61.62%
SYNREVCODON	SYNREV	0.351	0.575	25.25%
SYNREVCODON	SYNREVCODON	−0.107	0.137	0.00%

Fitting complex models to small alignments introduces statistical inefficiencies and biases, even though the overall pattern is clear. MG94 severely overestimates dN/dS when the true model is an MSS model, and although both MSS models show downward estimator bias, it is much less extreme than the MG94 upward bias.

What is clear, however, is that even when estimated as an alignment-wide parameter, ω, is rather sensitive to how one chooses to model the variation in ɑ rates. While we do not examine it further here, previous work in Rahman et al. (2021) shows that ω estimates are even more sensitive in models which allow ω to vary across sites and branches under the a priori MSS models.

Discussion

We developed a broad class of phylogenetic codon-substitution models to infer relative synonymous substitution rates from CDS alignments. These MSS models expand upon our prior work (Rahman et al. 2021), by inferring substitution rate classes directly from the data, instead of relying on predefined classes of “neutral” or “selected” substitutions. When applied to gene alignments from 12 closely related Drosophila species, the MSS models reveal a striking degree of heterogeneity in synonymous substitution rates across the genes. MSS models provide a significantly better fit to the data compared to standard models where all synonymous substitutions share the same rate. This is true even for the fine-grained SynREVCodon model that endows individual synonymous codon pairs with separate rates, an indicator that the standard modeling assumptions describe the data quite poorly and there is ample room for improvement. Goodness-of-fit improvements over standard substitution models are easy to obtain and speak more to the gross inadequacy of existing models than to the biological veracity of the newer ones (Delport et al. 2010b; Jones et al. 2018). We corroborated the inferences made by the MSS models using forward population genetic simulations (an entirely different modeling framework), using an external dataset of within-population SNP variation (an independent source of a different type of variation data, with within-species and between-species correlation expected), and by confirming that the ranking of inferred rates is in accordance with relative tRNA abundance (a key mechanistic factor).

MSS models reveal a spectrum of substitution rates between synonymous codons across Drosophila genes, with a significant (several-fold) difference between the slowest and the fastest rates. The inferred distribution of rates shows, as expected, that substitutions between codons whose corresponding tRNAs occur at significantly different abundances tend to have lower rates, likely attributable to selection for translational efficiency. However, many other factors, both genome-wide, and gene-level likely contribute to this variation. Further elucidation of these factors requires substantial additional work and likely relies on data that are not readily available.

MSS models, like all other models, are approximations of the true biological process and are misspecified. For example, MSS models do not account for time-variable synonymous substitution rates, which can confound interpretations of model fit in mutation-selection models (Jones et al. 2018). Additionally, the interaction of context-dependent substitution processes and emergent non-Markovian dynamics presents further challenges that are not explicitly addressed in our model. While these factors are undoubtedly important, incorporating them into a comprehensive framework would significantly increase the complexity of the model, and lead to loss of computational feasibility and statistical performance. Accounting for additional biological phenomena would undoubtedly improve the model's fit to the data and the overall realism of the model, but the feasibility of such an endeavor is the subject of future work. Nevertheless, we believe that the MSS modeling framework provides a useful and practical tool for capturing the key drivers of synonymous substitution and offers valuable insights into the evolutionary processes shaping codon usage.

The removal of a long-held modeling assumption regarding the constancy of synonymous substitution rates across codon pairs will undoubtedly influence inferences of other evolutionary parameters. We showed that the introduction of MSSs reduces the estimates of gene-wide dN/dS, similar to what was reported in Rahman et al. (2021) for simpler a priori models, thereby affecting downstream analyses which rely on dN/dS or its derivatives. Additional analyses will be required to evaluate the impact that MSS models may have on other common tasks, such as phylogenetic inference, or time tree estimation. Our work here contributes to the recent thread in evolutionary modeling literature which re-examines historically accepted assumptions (Wisotsky et al. 2020; Lucaci et al. 2021), and the influence of model choice on different types of analyses, which require the selection of a substitution model.

We envision that the MSS framework presented here will have both short-term and longer-term influence on comparative evolutionary analyses. In the short-term, MSS models can be applied to genomes from different organisms and clades to infer and catalog the distributions of substitution rates among synonymous codons. Identifying a robust subset of synonymous substitutions, which evolve in a neutral fashion directly from the data, will improve inference that requires a neutral reference, and permit the examination of what factors inform the composition of the neutral set and partition the rates into different evolutionary classes. We also anticipate that a single MSS model will not describe all the genes equally well. The MSS framework can be used to select the genes which deviate from genome-wide averages and then investigate the attributes of these outlier genes. In the longer-term, natural extensions to the MSS framework and incorporation of other processes known to be important can give rise to more biologically informative tools. Specifically, MSS models can be combined with models which allow nonsynonymous rates to depend on which codons are being exchanged (Wong et al. 2006). With the exception of a few models (Kosiol et al. 2007; Yang and Nielsen 2008), codon-pair level variation in synonymous and nonsynonymous rates have not been studied jointly, and they are likely entangled, and correlated in nontrivial ways. MSS models can be extended to allow different types of site-to-site and branch-to-branch variation to accommodate heterotachy and other complex processes.

Our work contributes to the growing collection of studies revealing that synonymous substitution rates are neither fully neutral, nor accrue at constant rates over sites and branches (Sharp and Li 1987a; Shields et al. 1988; Iriarte et al. 2013; Trotta 2013; Wisotsky et al. 2020). The genome-wide selective force that powers the MSS models is that of weak selection, and though its effects are small when compared to the effects of nonsynonymous substitutions (Rahman et al. 2021), it is nonetheless readily detectable and informative. We hope the collection of tools and motivating examples provided here will entice additional research into the scope, drivers, and effect of non-neutral processes affecting synonymous mutations and substitutions.

Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments

We would like to thank Avery Selberg, Steven Weaver, Stephen Shank, and all other members of the ACME lab for their insightful feedback and discussions.

Funding

This work was supported by the U.S. National Institute of General Medical Sciences [NIGMS] (R01 GM151683 and GM144468) and by the U.S. National Institute of Allergy and Infectious Diseases [NIAID] (U24 AI183870).

Data Availability

The MSS model is included with the full set of Hyphy analyses, which can be downloaded at https://github.com/veg/hyphy-analyses/tree/master. A SynREVCodon analysis of a single gene can be run as follows:

Open in new tab

hyphy hyphy-analyses/FitModel/FitModel.bf \

–model MSS
–alignment gene.nex
–mss-type SynREVCodon

The joint model fitter script is included in the base Hyphy package. The script requires a text list of newline-separated file paths, where each file contains a codon-aware alignment and the corresponding gene tree. It is run as follows:

Open in new tab

hyphy hyphy/res/TemplateBatchFiles/MSS-joint-fitter.bf \

–model SynREVCodon \
–filelist files.txt \
–output outfile.json

The MSS-selector-2 package is also included in the base Hyphy package. Like the joint model fitter script, MSS-selector-2 requires a text list of newline-separated file paths, where each file contains a codon-aware alignment and the corresponding gene tree. It also requires that users specify the number of classes into which codons shall be partitioned. The script is run as follows:

Open in new tab

hyphy hyphy/res/TemplateBatchFiles/MSS-GA.bf \

–classes 2
–filelist files.txt \
–output outfile.json

Code to run forward simulations can be found at https://github.com/ampivirotto/mss_forward_sim. Simulated alignments used in the forward simulation analysis can be found at https://github.com/hverdonk/MSS-supplemental/tree/main/forward_simulations, and the simulated population tree is in supplementary fig. S1, Supplementary Material online. Scripts and data used in all other analyses can also be found at https://github.com/hverdonk/MSS-supplemental/.

References

Akashi

Synonymous codon usage in Drosophila melanogaster: natural selection and translational accuracy

Genetics

1994

136

(

927

–

935

10.1093/genetics/136.3.927

Armstrong

Hickey

Diekhans

Fiddes

Novak

Deran

Fang

Xie

Feng

Stiller

, et al.

Progressive cactus is a multiple-genome aligner for the thousand-genome era

Nature

2020

587

(

7833

246

–

251

10.1038/s41586-020-2871-y

Behrens

Rodschinka

Nedialkova

High-resolution quantitative profiling of tRNA abundance and modification status in eukaryotes by mim-tRNAseq

Mol Cell

2021

(

1802

–

1815.e7

10.1016/j.molcel.2021.01.028

Chamary

Parmley

Hurst

Hearing silence: non-neutral evolution at synonymous sites in mammals

Nat Rev Genet

2006

(

–

108

Chiari

Dion

Colborn

Parmakelis

Powell

On the possible role of tRNA base modifications in the evolution of codon usage: queuosine and Drosophila

J Mol Evol

2010

(

339

–

345

10.1007/s00239-010-9329-z

Crick

FHC

Codon—anticodon pairing: the wobble hypothesis

J Mol Biol

1966

(

548

–

555

10.1016/S0022-2836(66)80022-0

Dayhoff

Schwartz

Orcutt

. A model of evolutionary change in proteins. In:

Dayhoff

others

, editors.

Atlas of protein sequence and structure. Vol. 5

Washington, DC

Biomedical Research Foundation

;

1978

. p.

345

–

352

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Delport

Scheffler

Botha

Gravenor

Muse

Pond

SLK

CodonTest: modeling amino acid substitution preferences in coding sequences

PLoS Comput Biol

2010a

(

e1000885

10.1371/journal.pcbi.1000885

Delport

Scheffler

Gravenor

Muse

Pond

Benchmarking multi-rate Codon models

PLoS One

2010b

(

e11587

10.1371/journal.pone.0011587

Dhindsa

Copeland

Mustoe

Goldstein

Natural selection shapes Codon usage in the human genome

Am J Hum Genet

2020

107

(

–

10.1016/j.ajhg.2020.05.011

Drummond

Wilke

Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution

Cell

2008

134

(

341

–

352

10.1016/j.cell.2008.05.042

Eshelman

. The CHC adaptive search algorithm: how to have safe search when engaging in nontraditional genetic recombination. In:

Rawlins

GJE

, editors.

Foundations of genetic algorithms. Vol. 1

San Mateo, CA

Morgan Kaufmann Publishers, Inc.

;

1991

. p.

265

–

283

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Ewens

Mathematical population genetics [electronic resource]: 1. Theoretical introduction

New York

Springer

;

2004

Felsenstein

Evolutionary trees from DNA sequences: a maximum likelihood approach

J Mol Evol

1981

368

–

376

Goetz

Fuglsang

Correlation of codon bias measures with mRNA levels: analysis of transcriptome data from Escherichia coli

Biochem Biophys Res Commun

2005

327

–

10.1016/j.bbrc.2004.11.134

Hey

Kliman

Interactions between natural selection, recombination and gene density in the genes of Drosophila

Genetics

2002

160

595

–

608

10.1093/genetics/160.2.595

Ikemura

Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system

J Mol Biol

1981

151

389

–

409

10.1016/0022-2836(81)90003-6

Iriarte

Baraibar

Romero

Castro-Sowinski

Musto

Evolution of optimal codon choices in the family Enterobacteriaceae

Microbiology

2013

159

555

–

564

10.1099/mic.0.061952-0

Jones

Youssef

Susko

Bielawski

Phenomenological load on model parameters can lead to false biological conclusions

Mol Biol Evol

2018

1473

–

1488

10.1093/molbev/msy049

Kanaya

Yamada

Kinouchi

Kudo

Ikemura

Codon usage and tRNA genes in eukaryotes: correlation of Codon usage diversity with translation efficiency and with CG-dinucleotide usage as assessed by multivariate analysis

J Mol Evol

2001

290

–

298

10.1007/s002390010219

Kimura

On the probability of fixation of mutant genes in a population

Genetics

1962

713

–

719

10.1093/genetics/47.6.713

Kosakovsky Pond

Poon

AFY

Velazquez

Weaver

Hepler

Murrell

Shank

Magalis

Bouvier

Nekrutenko

, et al.

HyPhy 2.5—A customizable platform for evolutionary hypothesis testing using phylogenies

Mol Biol Evol

2020

(

295

–

299

10.1093/molbev/msz197

Kosakovsky Pond

Frost

SDW

Not so different after all: a comparison of methods for detecting amino acid sites under selection

Mol Biol Evol

2005

1208

–

1222

10.1093/molbev/msi105

Kosakovsky Pond

Mannino

Gravenor

Muse

Frost

SDW

Evolutionary model selection with a genetic algorithm: a case study using stem RNA

Mol Biol Evol

2007

159

–

170

10.1093/molbev/msl144

Kosakovsky Pond

Posada

Gravenor

Woelk

Frost Simon

Automated phylogenetic detection of recombination using a genetic algorithm

Mol Biol Evol

2006a

1891

–

1901

10.1093/molbev/msl051

Kosakovsky Pond

Posada

Gravenor

Woelk

Frost Simon

GARD: a genetic algorithm for recombination detection

Bioinformatics

2006b

3096

–

3098

10.1093/bioinformatics/btl474

Kosiol

Holmes

Goldman

An empirical Codon model for protein sequence evolution

Mol Biol Evol

2007

1464

–

1479

10.1093/molbev/msm064

Lack

Cardeno

Crepeau

Taylor

Corbett-Detig

Stevens

Langley

Pool

The Drosophila genome nexus: a population genomic resource of 623 Drosophila melanogaster genomes, including 197 from a single ancestral range population

Genetics

2015

199

1229

–

1241

10.1534/genetics.115.174664

Lucaci

Wisotsky

Shank

Weaver

Pond

SLK

Extra base hits: widespread empirical support for instantaneous multiple-nucleotide changes

PLoS One

2021

e0248337

10.1371/journal.pone.0248337

Marais

Domazet-Lošo

Tautz

Charlesworth

Correlated evolution of synonymous and nonsynonymous sites in Drosophila

J Mol Evol

2004

771

–

779

10.1007/s00239-004-2671-2

Marais

Duret

Synonymous Codon usage, accuracy of translation, and gene length in Caenorhabditis elegans

J Mol Evol

2001

(

275

–

280

10.1007/s002390010155

Moriyama

Powell

Codon usage bias and tRNA abundance in Drosophila

J Mol Evol

1997

(

514

–

523

Muse

Estimating synonymous and nonsynonymous substitution rates

Mol Biol Evol

1996

(

105

–

114

10.1093/oxfordjournals.molbev.a025549

Muse

Gaut

A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome

Mol Biol Evol

1994

(

715

–

724

10.1093/oxfordjournals.molbev.a040152

Pivirotto

. 2024. ampivirotto/mss_forward_sim. Available from: https://github.com/ampivirotto/mss_forward_sim.

Pond

Delport

Muse

Scheffler

Correcting the bias of empirical frequency parameter estimators in Codon models

PLoS One

2010

(

e11230

10.1371/journal.pone.0011230

Rahman

Kosakovsky Pond

Webb

Hey

Weak selection on synonymous codons substantially inflates dN/dS estimates in bacteria

Proc Natl Acad Sci U S A

2021

118

(

e2023575118

10.1073/pnas.2023575118

Sharp

The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias

Mol Biol Evol

1987a

(

222

–

230

10.1093/oxfordjournals.molbev.a040443

Sharp

W-H

The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications

Nucleic Acids Res

1987b

(

1281

–

1295

10.1093/nar/15.3.1281

Sharp

Schaack

Cooley

Burke

Soil

Structure and transcription of eukaryotic tRNA gene

Crit Rev Biochem

1985

(

107

–

144

10.3109/10409238509082541

Google Scholar

Crossref

WorldCat

Shields

Sharp

Higgins

Wright

“Silent” sites in Drosophila genes are not neutral: evidence of selection among synonymous codons

Mol Biol Evol

1988

(

704

–

716

10.1093/oxfordjournals.molbev.a040525

Suvorov

Kim

Wang

Armstrong

Peede

D’Agostino

ERR

Price

Waddell

Lang

Courtier-Orgogozo

, et al.

Widespread introgression across a phylogeny of 155 Drosophila genomes

Curr Biol

2022

(

111

–

123.e5

10.1016/j.cub.2021.10.052

Torres

Piñeyro

Rodríguez-Escribà

Camacho

Reina

Saint-Léger

Filonava

Batlle

de Ribas Pouplana

Inosine modifications in human tRNAs are incorporated at the precursor tRNA level

Nucleic Acids Res

2015

(

5145

–

5157

Trotta

Selection on codon bias in yeast: a transcriptional hypothesis

Nucleic Acids Res

2013

(

9382

–

9395

Whelan

Goldman

A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach

Mol Biol Evol

2001

(

691

–

699

10.1093/oxfordjournals.molbev.a003851

Wisotsky

Kosakovsky Pond

Shank

Muse

Synonymous site-to-site substitution rate variation dramatically inflates false positive rates of selection analyses: ignore at your own peril

Mol Biol Evol

2020

(

2430

–

2439

10.1093/molbev/msaa037

Wong

Sainudiin

Nielsen

Identification of physicochemical selective pressure on protein encoding nucleotide sequences

BMC Bioinformatics

2006

(

148

10.1186/1471-2105-7-148

Yang

Nielsen

Mutation-selection models of Codon substitution and their use to estimate selective strengths on Codon usage

Mol Biol Evol

2008

(

568

–

579

10.1093/molbev/msm284

Zhang

Foo

Eren

Pan

tRNA modification dynamics from individual organisms to metaepitranscriptomics of microbiomes

Mol Cell

2022

(

891

–

906

10.1016/j.molcel.2021.12.007

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Associate Editor:

Download all slides

Month:	Total Views:
March 2025	62
April 2025	383
May 2025	40

Article Contents

A New Comparative Framework for Estimating Selection on Synonymous Substitutions

Abstract

Introduction

Materials and Methods

Multiclass Synonymous Substitution Models

Forward Population Genetic Simulations

A Genetic Algorithm for Learning Synonymous Substitutions From Sequence Alignments

Empirical Data Collection and Preparation

Drosophila Polymorphism Analysis

Quantifying tRNA Abundance

Results

Testing MSS by Forward Simulation

Synonymous Substitution Rates in Drosophila Genes

Description and Information Content of the Data

Estimating Synonymous Substitution Rates Using Maximum Likelihood

Codon Pair Substitution Rates Show Strong Correlation With Polymorphism Levels

Codon Pair Substitution Rates Correlate With tRNA Abundance

Estimates of dN/dS (ω) Ratios are Decreased Under MSS Models

Joint Estimation of MSS Models

A More Detailed Examination of How Model Choice Impacts dN/dS Estimation

Discussion

Supplementary Material

Acknowledgments

Funding

Data Availability

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

A New Comparative Framework for Estimating Selection on Synonymous Substitutions

Abstract

Introduction

Materials and Methods

Multiclass Synonymous Substitution Models

Forward Population Genetic Simulations

A Genetic Algorithm for Learning Synonymous Substitutions From Sequence Alignments

Empirical Data Collection and Preparation

Drosophila Polymorphism Analysis

Quantifying tRNA Abundance

Results

Testing MSS by Forward Simulation

Synonymous Substitution Rates in Drosophila Genes

Description and Information Content of the Data

Estimating Synonymous Substitution Rates Using Maximum Likelihood

Codon Pair Substitution Rates Show Strong Correlation With Polymorphism Levels

Codon Pair Substitution Rates Correlate With tRNA Abundance

Estimates of dN/dS (ω) Ratios are Decreased Under MSS Models

Joint Estimation of MSS Models

A More Detailed Examination of How Model Choice Impacts dN/dS Estimation

Discussion

Supplementary Material

Acknowledgments

Funding

Data Availability

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only