
Highlights:

  • Well-conducted randomized controlled trials (RCTs) are considered the strongest method of evaluating a program’s effectiveness. But, for some programs, RCT evidence may not be available. In these cases, policy officials often look to “quasi-experiments”—studies that compare program participants to a group of nonparticipants selected through methods other than randomization—to gauge program effectiveness.
  • Unfortunately, most quasi-experiments do not produce credible evidence. Even the more rigorous quasi-experimental study designs have an important weakness in practice: They are highly vulnerable to researcher bias in a way that well-conducted RCTs are not.
  • Specifically, a large RCT uses random assignment—essentially, a coin toss—to (i) determine the membership of the program and control groups, and (ii) ensure that the two groups are highly similar in key characteristics. By contrast, in a typical quasi-experiment, the researchers largely determine membership of the two groups and how to “equate” them using various statistical methods, and can consciously or unconsciously select approaches that produce a hoped-for result.
  • Different quasi-experimental approaches can yield widely varying results; thus, policy officials can easily end up expanding an ineffective or harmful program in the mistaken belief, based on quasi-experimental findings, that it is effective.
  • We propose changes to current quasi-experimental practice to increase the credibility of study findings, including full pre-specification, at the study’s inception, of all parameters to be used in the main analysis.

As regular readers of our Straight Talk newsletter are aware, we focus our reports primarily on findings of randomized controlled trials (RCTs). The reason is that RCTs, when well-conducted, are widely considered the strongest method of evaluating a program’s effectiveness.[1] Their unique advantage is the ability to ensure—through randomly assigning a sufficiently large sample of people to a program group or a control group—that there are no systematic differences between the two groups in either observable characteristics (e.g., income, education, age) or unobservable characteristics (e.g., motivation, psychological resilience, family support). Thus, any difference in outcomes between the two groups can confidently be attributed to the program and not to other factors.
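To make this balancing property concrete, here is a small, purely illustrative Python simulation (the sample size, variable names, and numbers are invented for illustration, not drawn from any study discussed here). It randomly assigns a large simulated sample to program and control groups and then checks that both an observed characteristic and an unobserved one end up closely balanced across the two groups.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # a sufficiently large sample

# Two hypothetical characteristics: one observable (income) and one
# unobservable (motivation). Neither is used in the assignment below.
income = rng.normal(40_000, 12_000, n)
motivation = rng.normal(0, 1, n)

# Random assignment -- essentially a coin toss for each person
program = rng.integers(0, 2, n).astype(bool)

# With a sample this large, both characteristics are closely balanced
# (standardized differences near zero), so any later difference in outcomes
# can be attributed to the program rather than to pre-existing differences.
for name, x in [("income", income), ("motivation", motivation)]:
    std_diff = (x[program].mean() - x[~program].mean()) / x.std()
    print(f"{name}: standardized program-vs-control difference = {std_diff:.3f}")
```

The balance holds for unobserved characteristics just as it does for observed ones, precisely because the coin toss does not depend on either.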

However, for some programs, an RCT evaluation may not be feasible or, even if feasible, may not yet have been conducted. In these cases, policy officials often look to “quasi-experiments”—that is, studies that compare program participants to a group of nonparticipants selected through methods other than randomization—to gauge program effectiveness. Indeed, quasi-experimental program evaluations are far more common than RCTs.

But are such studies reliable—that is, do their findings reasonably approximate the true program impacts? We believe the answer is usually “no,” but that two key reforms to current practice could increase the reliability of these studies. To first summarize the problem with current practice—

  • Methodological studies that have tested various quasi-experimental designs against large, well-conducted (“benchmark”) RCTs have found that many common quasi-experimental designs do not reliably reproduce RCT results.[2] On the other hand, they have also identified a few quasi-experimental methods that—across various simulations—do reproduce the results of benchmark RCTs with at least a moderate degree of consistency (as summarized here).
  • And yet, even these more rigorous quasi-experimental studies—while they may work reasonably well in simulations—have an important weakness in practice: They are highly vulnerable to researcher bias in a way that well-conducted RCTs are not.[3]

There are straightforward ways to address this vulnerability to researcher bias, as we outline later, but we start by discussing why it is such a serious problem for current quasi-experimental practice.

To be clear, we believe that most researchers and program providers are well-intentioned and do not consciously seek to report inaccurate findings. But they often face strong incentives to report evaluation findings in the most favorable possible light (e.g., to get published in a top journal and/or obtain future funding); and to not report findings of weak or no program effects, as doing so may jeopardize funding, chances of publication, and/or career advancement. They may also feel strongly invested in a program’s success, having put years of time and energy into developing or studying the program. In this context, as Nobel laureate Richard Feynman observed, it is easy for researchers to talk themselves into study methods and interpretations of findings that confirm their hopes. Such factors may explain why exaggerated claims of effectiveness are pervasive in reported evaluation findings—a phenomenon we have discussed in numerous Straight Talk reports and which has also been systematically documented in the reporting of medical research.[4]

Well-conducted RCTs provide considerable protection against such potential biases. As an illustrative example, our organization is funding a large RCT of Bottom Line—a program that provides one-on-one guidance to help low-income students get into and graduate from college. The study, led by researchers Ben Castleman and Andrew Barr, has randomly assigned 2,422 students in three cities to either a program group that was offered the program or a control group that received usual services. The study will measure rates of college degree attainment (the study’s pre-registered primary outcome) for both groups over a seven-year period using data from the National Student Clearinghouse.

The beauty of this study design is that the die is already cast and the results will come out where they come out. Specifically, the study uses random assignment—essentially, a coin toss—to (i) determine which students are in the program versus control group, and (ii) ensure (since the sample is large) that the two groups are highly similar in key characteristics. At study completion, the rate of degree attainment for the program group will either be higher than that of the control group or it won’t, regardless of any hopes or expectations of the researchers or program provider or others. (The study’s interim report confirms that randomization did indeed produce two groups that are equivalent in observable characteristics, and presents encouraging early findings.)

If the study deviates from the above design—for example, by losing a sizable part of the randomized sample to attrition, or switching its pre-registered primary outcome—careful readers of the study will likely be able to detect and call out the departure (as we have in various Straight Talk reports).

By contrast, in a typical quasi-experiment—even one with a rigorous design—the die is not cast at the study’s inception; instead, the research team usually starts with all the data needed to carry out the study, and has great discretion over which data to use and how to analyze it. Most notably, instead of using the mechanics of random assignment to determine program and comparison group membership and ensure equivalence of the two groups, in a typical quasi-experiment—

  • Researcher discretion plays a central role in determining which individuals will constitute the study’s program and comparison groups. For instance, a research team evaluating a job training program in Tennessee might select a program group from among people who participated in the program in Nashville, Memphis, or Chattanooga; who entered in 2009, 2010, or 2011; or who were in the 16-20 or 20-24 age group. And they might select a comparison group of nonparticipants from the same Tennessee city, another city, or another state.
  • Researcher discretion plays a central role in “equating” the two groups. Specifically, the research team typically assumes the two groups differ in key characteristics and selects from a variety of statistical methods to adjust for such differences (e.g., analysis of covariance, propensity score matching, difference-in-differences[5]). A minimal illustration of how much this choice can matter appears in the sketch after this list.
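
To illustrate the point, the sketch below is a small, entirely hypothetical Python simulation: it applies two of the adjustment methods named above (analysis of covariance and a simple difference-in-differences comparison) to the same simulated earnings data for a program group and a nonequivalent comparison group. All numbers are invented, and the true program effect in the simulation is zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical earnings data with a true program effect of zero. Each
# person's earnings fluctuate year to year around a stable "latent" level,
# and the comparison group's latent level is about $3,000 lower.
latent_p = rng.normal(20_000, 4_000, n)   # program group
latent_c = rng.normal(17_000, 4_000, n)   # nonequivalent comparison group
pre_p  = latent_p + rng.normal(0, 4_000, n)
post_p = latent_p + rng.normal(0, 4_000, n)
pre_c  = latent_c + rng.normal(0, 4_000, n)
post_c = latent_c + rng.normal(0, 4_000, n)

# Adjustment 1: analysis of covariance -- regress post-program earnings on a
# treatment indicator while controlling for pre-program earnings.
y     = np.concatenate([post_p, post_c])
treat = np.concatenate([np.ones(n), np.zeros(n)])
pre   = np.concatenate([pre_p, pre_c])
X = np.column_stack([np.ones(2 * n), treat, pre])
ancova_estimate = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Adjustment 2: difference-in-differences -- compare each group's change
# in earnings from the pre-program year to the post-program year.
did_estimate = (post_p.mean() - pre_p.mean()) - (post_c.mean() - pre_c.mean())

print(f"ANCOVA estimate of program effect:                ${ancova_estimate:,.0f}")
print(f"Difference-in-differences estimate of the effect: ${did_estimate:,.0f}")
# Although the true effect is zero, the first adjustment yields roughly
# +$1,500 while the second yields roughly $0 -- the researcher's choice of
# method, not the program, drives the apparent result.
```

Neither method is "wrong" in the abstract; each rests on an assumption about how the nonequivalent groups would have evolved absent the program, and the data alone cannot say which assumption is correct.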

Importantly, even in the more rigorous quasi-experimental studies, researchers’ choice of these parameters can lead to markedly different study findings. As illustrative examples:

  • Quasi-experimental method: Comparative Interrupted Time Series (CITS, described here). In evaluating an Indiana school-level reform program, St. Clair, Cook, and Hallberg 2014 found that, under most parameter choices, the CITS findings reproduced the benchmark RCT finding of no significant effect on students’ reading scores.[6] However, under one reasonable parameter choice (using six years of pre-program data, as opposed to two), the CITS design showed a large, statistically significant adverse effect on reading—0.34 standard deviations, or roughly one full grade-level of learning among students of that age.
  • Quasi-experimental method: Comparison group of individuals in the same state participating in the same public benefits program. In evaluating a welfare-to-work program for single-parent welfare recipients in Grand Rapids, Michigan, Bloom et al. 2002 found that, under some reasonable parameter choices (comparison to a group of such welfare recipients in Detroit, with no statistical adjustments), findings from the comparison-group design reproduced the benchmark RCT finding of no significant effect on earnings 3-5 years after program entry.[7] However, under another reasonable parameter choice (same Detroit comparison group, difference-in-differences estimator), the comparison-group design showed a sizable positive effect—a statistically significant 29 percent gain in annual earnings.
  • Quasi-experimental method: Regression Discontinuity Design (RDD, described here). In evaluating the Teach for America education program, Gleason, Resch, and Berk 2012 found that, on average, the findings of 12 different RDD methods—using different samples/bandwidths and analytical methods—closely approximated the benchmark RCT finding of a modest positive effect on students’ math scores (0.12 standard deviations, or roughly 13 percent of a grade-level of learning for students of that age).[8] However, under one of those RDD methods, the effect on math was nearly twice as large (0.22 standard deviations) while under another it was almost exactly zero and not statistically significant. 

In short, even in these more rigorous quasi-experimental designs, different reasonable choices of study parameters can yield very different study conclusions. As a result, researchers using these designs can try out a range of parameters and select those that produce a hoped-for result. Readers of the study usually have no way of knowing if this occurred—specifically, how many parameters were tried and discarded. As a result, policy officials could easily end up expanding an ineffective or harmful program in the mistaken belief, based on a quasi-experimental evaluation, that it is effective. (We recognize this problem can also affect RCTs, though to a lesser extent.[9])
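The following toy Monte Carlo sketch (in Python, with invented parameter names and numbers) illustrates the problem: it evaluates a hypothetical program whose true effect is zero under every combination of a few seemingly reasonable parameter choices (which comparison city to use, how many pre-program years of data to include, and whether to apply a difference-in-differences adjustment) and reports the spread of the resulting impact estimates.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 1_000          # people per group
true_effect = 0    # the hypothetical program truly does nothing

def make_group(level, trend, years=7):
    """Hypothetical annual earnings: six pre-program years plus one post year."""
    base = rng.normal(level, 3_000, n)[:, None]
    drift = trend * np.arange(years)             # city-specific earnings trend
    return base + drift + rng.normal(0, 2_000, (n, years))

program = make_group(level=20_000, trend=500)

# Three hypothetical comparison cities with different levels and trends
cities = {
    "City A": make_group(19_000, trend=300),
    "City B": make_group(21_000, trend=800),
    "City C": make_group(20_000, trend=1_200),
}

estimates = []
for (city, comp), pre_years, use_did in itertools.product(
        cities.items(), [2, 6], [False, True]):
    pre_p = program[:, -1 - pre_years:-1].mean()
    pre_c = comp[:, -1 - pre_years:-1].mean()
    post_diff = program[:, -1].mean() - comp[:, -1].mean()
    est = post_diff - (pre_p - pre_c) if use_did else post_diff
    estimates.append(est)

print(f"True program effect: ${true_effect:,.0f}")
print(f"Estimates across {len(estimates)} parameter combinations range "
      f"from ${min(estimates):,.0f} to ${max(estimates):,.0f}")
```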

We therefore propose two changes to current quasi-experimental evaluation practice to increase the reliability of study findings.[10] First, we propose that researchers seeking to credibly determine a program’s impact in cases where an RCT is not feasible select from the subset of quasi-experimental designs that simulations have found to approximate benchmark RCT findings with some degree of consistency.

Second, we propose that researchers conducting these studies fully pre-specify and publicly register all study parameters to be used in their primary analysis before examining any outcome data for the treatment versus comparison group. By “fully” pre-specify, we mean specify the parameters with sufficient detail to leave no room for post-hoc discretion or adjustments in the study’s primary analysis. The idea is that, just as in a well-conducted RCT, the die will be cast at the start of the study, and the results will come out where they come out. The researchers could still, of course, conduct exploratory, post-hoc analyses, but these would be clearly labeled as such and their results presented as only suggestive.

Our team at Arnold Ventures funds RCT evaluations, and we ask each grantee to fully pre-specify and publicly register their study design and primary analysis at the study’s inception. Full pre-specification is even more important in quasi-experimental designs, as such studies typically allow much greater leeway for post-hoc researcher discretion than an RCT. Currently, full pre-specification is not common practice in quasi-experimental program evaluations and, until it is, policy officials should not put much stock in their findings.


References:

[1] For example, see Institute of Education Sciences and National Science Foundation, Common Guidelines for Education Research and Development, August 2013, linked here; and Food and Drug Administration standard for assessing the effectiveness of pharmaceutical drugs and medical devices, 21 C.F.R. §314.126, linked here.

[2] These studies test quasi-experimental designs against benchmark RCTs as follows. For a particular program being evaluated, they first compare program participants’ outcomes to those of a randomly-assigned control group, in order to estimate the program’s impact in the RCT. The studies then compare the same program participants with a comparison group selected through methods other than randomization, in order to estimate the program’s impact in a quasi-experimental design. The studies can thereby determine whether the quasi-experimental impact estimate replicates the benchmark estimate from the RCT.

[3] As an exception to our general skepticism about the reliability of such quasi-experiments—if a program or treatment’s effects are exceptionally large (think: penicillin for the treatment of bacterial infections, or schooling of children as compared to no schooling), a rigorous quasi-experiment will likely be able to detect the effect. This is because the program or treatment’s large effect size will likely swamp possible confounding factors (e.g., differences in characteristics between the program and comparison groups that could differentially affect their outcomes, or—as discussed in the main text—potential researcher biases). The study will thus show a large positive effect regardless of these potential confounders.

[4] Examples include: F. E. Vera-Badillo, R. Shapiro, A. Ocana, E. Amir & I. F. Tannock, “Bias in reporting of end points of efficacy and toxicity in randomized, clinical trials for women with breast cancer,” Annals of Oncology, vol. 24, no. 5, 2013, pp. 1238-1244, linked here. Nasim A. Khan, Manisha Singh, Horace J. Spencer, and Karina D. Torralba, “Randomized controlled trials of rheumatoid arthritis registered at ClinicalTrials.gov: what gets published and when,” Arthritis & Rheumatism, vol. 66, no. 10, October 2014, pp. 2664-74, linked here. John P. A. Ioannidis, “Why Most Published Research Findings Are False,” PLoS Medicine, vol. 2, no. 8, August 2005, p. e124, linked here.

[5] Some methods, such as propensity score matching, may include dropping certain individuals from the initial program and comparison groups.

[6] Travis St. Clair, Thomas D. Cook, and Kelly Hallberg, “Examining the Internal Validity and Statistical Precision of the Comparative Interrupted Time Series Design by Comparison With a Randomized Experiment,” American Journal of Evaluation, vol. 35, no. 3, 2014, pp. 311-327, linked here.

[7] Howard S. Bloom, Charles Michalopoulos, Carolyn J. Hill, and Ying Lei, Can Nonexperimental Comparison Group Methods Match the Findings from a Random Assignment Evaluation of Mandatory Welfare-to-Work Programs?, MDRC Working Papers on Research Methodology, June 2002, linked here.

[8] Philip M. Gleason, Alexandra M. Resch, and Jillian A. Berk, “Replicating Experimental Impact Estimates Using a Regression Discontinuity Approach,” National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education, NCEE Reference Report 2012-4025, 2012, linked here.

[9] RCTs, as often conducted, also allow the researchers a degree of post-hoc discretion, albeit usually on a more limited set of parameters (e.g., regression methods used to control for any pre-program differences between the program and control group, choice of outcome measures, methods to test for statistical significance of any impact findings). This is the reason why, as described in the last paragraph of this report, we ask each RCT grantee that we fund, at the study’s inception, to fully pre-specify and publicly register all such study parameters.

[10] Our proposal applies to quasi-experimental program evaluations. We recognize that quasi-experimental methods are also used in other types of studies (e.g., studies seeking to identify risk factors predisposing youth to adverse outcomes such as crime or substance abuse). We do not intend our proposal to apply to such studies.