How to solve U.S. social problems when most rigorous program evaluations find disappointing effects (part one in a series)

Highlights:

The ultimate goal of evidence-based policy is to improve people’s lives.
This report discusses what we see as the key challenge in achieving that goal: When rigorously evaluated, most social programs and practices are found not to produce the hoped-for improvements in people’s lives, compared to usual services. A similar pattern occurs in other fields where rigorous evaluations are conducted, such as medicine and business.
Our next report—the second in this two-part series—will offer concrete suggestions for making progress in solving the nation’s social problems in light of this key challenge.

Our team reads studies for a living and for a hobby, and has been doing so for many years. We mainly read randomized controlled trials (RCTs) because RCTs, when well-conducted, are widely considered the strongest method of evaluating a program’s effectiveness.[i] Our main goal is to identify social programs and practices (“interventions”) that are backed by credible evidence of sizable, sustained effects on important life outcomes so that such interventions can be expanded in order to improve people’s lives on a larger scale. We summarize these interventions on the Social Programs That Work website and share the findings with government and foundation officials to help inform policy decisions. For example, we cited the evidence supporting the Nurse-Family Partnership as Exhibit A in our successful work with federal officials in 2007-2010 to establish the federal evidence-based home visiting program.[ii] Since joining the Laura and John Arnold Foundation (LJAF) in 2015, we have provided funding support for high-quality implementation of evidence-based interventions such as Career Academies and Accelerated Study in Associate Programs (ASAP), along with replication RCTs to determine whether the large effects found in earlier studies can be reproduced in new settings.

Reviewing thousands of evaluation studies over the years has also given us a profound appreciation of how challenging it is to find interventions, such as those above, that produce a real improvement in people’s lives. It is a lesson that Peter Rossi, a prominent sociologist who was a leading expert on program evaluation from the 1960s through the early 2000s, articulated as metallic “laws” in his classic 1987 paper. Here are the first two laws:

“The Iron Law of Evaluation: The expected value of any net impact assessment of any large scale social program is zero.

“The Iron Law arises from the experience that few impact assessments of large scale[iii] social programs have found that the programs in question had any net impact. The law also means that, based on the evaluation efforts of the last twenty years, the best a priori estimate of the net impact assessment of any program is zero, i.e., that the program will have no effect.

“The Stainless Steel Law of Evaluation: The better designed the impact assessment of a social program, the more likely is the resulting estimate of net impact to be zero.

“This law means that the more technically rigorous the net impact assessment, the more likely are its results to be zero – or no effect. Specifically, this law implies that estimating net impacts through randomized controlled experiments, the avowedly best approach to estimating net impacts, is more likely to show zero effects than other less rigorous approaches.”

Rossi moderated his comments in later years, citing important positive effects found in some rigorous evaluations while still noting that the majority end up with findings of no effect or substantively marginal effects. His observations have received corroboration in federally-funded RCTs of major social programs such as Head Start, Job Corps, and Upward Bound from the 1990s through the present, most of which have found weak or no positive effects compared to existing services in the community.

It is important to recognize, however, that Rossi’s “laws” apply not just to social policy, but to other fields where rigorous evaluations are conducted, such as business and medicine. Examples from various disciplines include:

Business: Of 13,000 RCTs conducted by Google and Microsoft to evaluate new products or strategies in recent years, 80 to 90 percent have reportedly found no significant effects.[iv]

Medicine: Reviews in different fields of medicine have found that 50 to 80 percent of positive results in initial clinical studies are overturned in subsequent, more definitive RCTs.[v] Thus, even in cases where initial studies—such as comparison-group designs or small RCTs—show promise, the findings usually do not hold up in more rigorous testing.

Education: Of the 90 educational interventions evaluated in RCTs commissioned by the Institute of Education Sciences and reporting findings between 2002 and 2013, close to 90 percent were found to produce weak or no positive effects.[vi]

Employment/training: In Department of Labor-commissioned RCTs that reported findings between 1992 and 2013, about 75 percent of tested interventions were found to have weak or no positive effects.[vii]

In other words, across a range of human endeavors, most new interventions when rigorously evaluated are found to produce outcomes that are not meaningfully better than those of a control group, which receives (in most studies) usual services—such as standard medical treatment (in a medical RCT), usual business practice (in an RCT of a business strategy), or existing school practices (in an education RCT).

Someone reading academic journals or looking at web-based clearinghouses of evidence-based interventions might get the impression that there are actually a large number of social interventions shown effective through rigorous evaluations. As we have discussed in previous Straight Talk reports, however, most of these “evidence-based” interventions are backed by only preliminary or flawed findings, which, based on the long history of rigorous evaluations, are often reversed when a more definitive evaluation is subsequently carried out.
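As a rough illustration of why preliminary positive findings so often fail to hold up, the short Python sketch below works through the arithmetic. It is our own back-of-the-envelope calculation, not an analysis drawn from any of the studies cited here, and every input is an assumption: a 10 percent share of truly effective interventions (loosely in line with the education figure above), and hypothetical values for how often each kind of study detects a real effect and how often it reports a positive result for an ineffective intervention.

```python
def share_of_positive_findings_that_are_real(prior_effective, detection_rate, false_positive_rate):
    """Fraction of positive study findings that reflect a truly effective intervention.

    All three inputs are hypothetical proportions between 0 and 1.
    """
    true_positives = prior_effective * detection_rate
    false_positives = (1.0 - prior_effective) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Assumption: 1 in 10 tested interventions truly improves on usual services.
PRIOR_EFFECTIVE = 0.10

# Hypothetical preliminary study (small sample or comparison-group design):
# detects a real effect half the time and, because of chance and bias,
# reports a positive result for 20 percent of ineffective interventions.
preliminary = share_of_positive_findings_that_are_real(PRIOR_EFFECTIVE, 0.50, 0.20)

# Hypothetical large, well-conducted RCT: detects a real effect 80 percent
# of the time and reports a false positive 5 percent of the time.
rigorous = share_of_positive_findings_that_are_real(PRIOR_EFFECTIVE, 0.80, 0.05)

print(f"Preliminary study: {preliminary:.0%} of positive findings are real")
print(f"Rigorous RCT:      {rigorous:.0%} of positive findings are real")
```

Under these assumed inputs, only about one in five positive findings from the preliminary studies reflects a real effect, versus roughly two in three for the rigorous RCTs. Different assumptions would change the exact figures, but the gap between the two kinds of study is what matters for the argument.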

The bottom line is that it is harder to make progress than commonly appreciated. While judgments may differ about the credibility of a particular study or the policy importance of a particular impact finding, the pattern of disappointing effects for most rigorously-evaluated programs—along with findings of important positive effects for a few—is compelling and transcends multiple fields. It needs to be taken seriously.

What, then, is the path to progress in addressing the nation’s social problems? The field of medicine (where, per Rossi’s law, most rigorously-tested interventions fail) has nevertheless made enormous progress in improving human health over the past 50 years based largely on interventions that have been found effective in well-conducted RCTs (e.g., treatments for hypertension, high cholesterol, and childhood leukemia). Our next Straight Talk report will discuss how social policy can achieve similar success by focusing program and evaluation funding on two core goals: (i) building the body of social interventions with strong, replicated evidence of important improvements in people’s lives; and (ii) expanding the delivery of interventions meeting that high evidence standard. The report will offer concrete suggestions for achieving these goals given the steep challenge in finding interventions that improve on current practice.

References:

[i] Institute of Education Sciences and National Science Foundation, Common Guidelines for Education Research and Development, August 2013, linked here. National Research Council and Institute of Medicine, Preventing Mental, Emotional, and Behavioral Disorders Among Young People: Progress and Possibilities, Mary Ellen O’Connell, Thomas Boat, and Kenneth E. Warner, Editors (Washington DC: National Academies Press, 2009), recommendation 12-4, p. 371, linked here. CBO’s Use of Evidence in Analysis of Budget and Economic Policies, Jeffrey R. Kling, Associate Director for Economic Analysis, November 3, 2011, page 31, linked here. U.S. Preventive Services Task Force, “Current Methods of the U.S. Preventive Services Task Force: A Review of the Process,” American Journal of Preventive Medicine, vol. 20, no. 3 (supplement), April 2001, pp. 21-35. The Food and Drug Administration’s standard for assessing the effectiveness of pharmaceutical drugs and medical devices, at 21 C.F.R. §314.126, linked here. Every Student Succeeds Act, Section 8002 definition of “evidence-based,” Public Law 114-95, December 10, 2015.

[ii] Ron Haskins and Jon Baron, Building the Connection between Policy and Evidence: The Obama Evidence-Based Initiatives, U.K. National Endowment for Science, Technology and the Arts (NESTA), September 2011, linked here.

[iii] [This footnote is in Rossi’s paper.] Note that the law emphasizes that it applied primarily to large scale social programs, primarily those that are implemented by an established governmental agency covering a region or the nation as a whole. It does not apply to small scale demonstrations or to programs run by their designers.

[iv] Jim Manzi, Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society, Perseus Books Group, New York, 2012, pp. 128 and 142. Jim Manzi, Science, Knowledge, and Freedom, presentation at Harvard University’s Program on Constitutional Government, December 2012, linked here.

[v] John P. A. Ioannidis, “Contradicted and Initially Stronger Effects in Highly Cited Clinical Research,” Journal of the American Medical Association, vol. 294, no. 2, July 13, 2005, pp. 218-228. Mohammad I. Zia, Lillian L. Siu, Greg R. Pond, and Eric X. Chen, “Comparison of Outcomes of Phase II Studies and Subsequent Randomized Control Studies Using Identical Chemotherapeutic Regimens,” Journal of Clinical Oncology, vol. 23, no. 28, October 1, 2005, pp. 6982-6991. John K. Chan et al., “Analysis of Phase II Studies on Targeted Agents and Subsequent Phase III Trials: What Are the Predictors for Success,” Journal of Clinical Oncology, vol. 26, no. 9, March 20, 2008. Michael L. Maitland, Christine Hudoba, Kelly L. Snider, and Mark J. Ratain, “Analysis of the Yield of Phase II Combination Therapy Trials in Medical Oncology,” Clinical Cancer Research, vol. 16, no. 21, November 2010, pp. 5296-5302. Jens Minnerup, Heike Wersching, Matthias Schilling, and Wolf Rüdiger Schäbitz, “Analysis of early phase and subsequent phase III stroke studies of neuroprotectants: outcomes and predictors for success,” Experimental & Translational Stroke Medicine, vol. 6, no. 2, 2014.

[vi] Coalition for Evidence-Based Policy, Randomized Controlled Trials Commissioned by the Institute of Education Sciences Since 2002: How Many Found Positive Versus Weak or No Effects, July 2013, linked here.

[vii] This is based on our count of results from the Department of Labor RCTs, as identified through the Department’s research database (link).
