- We review a randomized controlled trial (RCT) of Honest Opportunity Probation with Enforcement (HOPE), a program for probationers that provides swift and certain sanctions—such as brief jail time—for any probation violation.
- This was a replication RCT, which sought to determine whether the large effects on recidivism found in a prior RCT of HOPE in Hawaii could be reproduced in four new sites (Arkansas, Massachusetts, Oregon, and Texas).
- Based in part on the original Hawaii findings, HOPE has expanded to 28 states.
- Such wide expansion may have been premature. Our review found the new study to be well conducted; it was unable to reproduce the large effects found in the Hawaii study.
- Unsuccessful replications such as this are not unusual in various fields (including medicine) and underscore the importance of replication RCTs prior to large-scale program expansion.
No, that’s not a typo in the title of this post. We didn’t mean to say, “if at first you don’t succeed …” One of the fundamental principles of scientific research is replication: retesting apparently successful interventions to make sure the initial success was not a fluke. The principle is widely accepted and practiced in the physical and biological sciences, but less so in the social sciences. Today’s case in point, a replication of the Hawaii Opportunity Probation with Enforcement (HOPE) program, makes clear why replication of positive findings is also essential in the social sciences before policymakers can be confident that a social program is truly effective.
HOPE is a high-intensity supervision program for probationers that provides swift and certain sanctions, such as brief jail time, for any probation violation. An initial, well-conducted randomized controlled trial (RCT) of HOPE in Hawaii that was published in 2009 found very large effects on crime-related outcomes—including reductions of more than 50 percent in the rate of rearrests and probation revocations during the 12 months after random assignment,[i] as we summarize here. Based in part on these blockbuster findings, many other jurisdictions began implementing HOPE-related programs. As of January 2015, HOPE or similar swift, certain, and fair (SCF) programs were being used in 28 states, one Indian nation, and one Canadian province.[ii]
Unfortunately, these jurisdictions might have acted prematurely. In 2011, the National Institute of Justice, to its credit, funded a large replication RCT of HOPE under the name Honest Opportunity Probation with Enforcement. This new study had a larger sample (1,504 probationers versus 503 in the original RCT) and was conducted in multiple states (Arkansas, Massachusetts, Oregon, and Texas, versus just Hawaii in the original RCT). The study measured crime-related outcomes over the 21-month period following random assignment. In other words, this was a large, rigorous replication RCT of HOPE as delivered on a sizable scale in multiple jurisdictions.
Researchers published the results of the study last year.[iii] The study found that the four jurisdictions were largely successful at implementing HOPE in adherence to the program model. Yet the effects found in the replication RCT (which we summarize here) were much weaker and less encouraging than those of the original study. Here’s a side-by-side comparison for two of the key targeted outcomes:
|Outcome|Original Hawaii RCT|Four-state replication RCT|
|---|---|---|
|Percent of probationers rearrested|21% of HOPE group vs. 47% of control group**|40% of HOPE group vs. 44% of control group (a)|
|Percent of probationers whose probation was revoked|7% of HOPE group vs. 15% of control group**|26% of HOPE group vs. 22% of control group, i.e., a slightly higher rate for the HOPE group (b)|

** Statistically significant at the 0.01 level; (a) close to statistically significant (p=0.11); (b) statistically significant at the 0.10 level
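To make the gap between the two trials concrete, the percentages in the table above can be converted into relative changes (the HOPE group's rate compared with the control group's rate). The short Python sketch below does only that arithmetic; the function and variable names are illustrative and do not come from either study.

```python
def relative_change(hope_pct: float, control_pct: float) -> float:
    """Percent change in the HOPE group's rate relative to the control group's rate.

    Negative values mean the HOPE group had a lower rate than the control group.
    """
    return 100.0 * (hope_pct - control_pct) / control_pct


# (HOPE group %, control group %) as reported in the comparison table.
reported_rates = {
    ("Hawaii RCT", "rearrested"): (21, 47),
    ("Hawaii RCT", "probation revoked"): (7, 15),
    ("Four-state RCT", "rearrested"): (40, 44),
    ("Four-state RCT", "probation revoked"): (26, 22),
}

for (study, outcome), (hope_pct, control_pct) in reported_rates.items():
    change = relative_change(hope_pct, control_pct)
    print(f"{study:15s} {outcome:18s} {change:+.1f}%")
```

Under this calculation, both Hawaii outcomes show relative reductions of more than 50 percent, while the four-state replication shows a reduction of roughly 9 percent in rearrests and an increase of roughly 18 percent in revocations.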
Overall, this cannot be considered a successful replication of the program. However, unsuccessful replications such as this are not that unusual. Based on the experience of medical RCTs, they may in fact be more common than successful replications. For example, one analysis of the results of 43 large-scale randomized (“phase III”) clinical trials in medicine found that 72 percent of those trials failed to replicate the positive results of the smaller (“phase II”) studies on which they were based, even though they used identical therapeutic regimens.[iv] Other analyses in different areas of medicine have found a similar pattern, with replication failure rates ranging between 50 and 80 percent.[v] While this kind of systematic evidence is not available for evaluations of social programs, it seems likely that the pattern would be similar.
The key message for policymakers: One relatively small study—even a well-conducted RCT with blockbuster findings such as Hawaii HOPE—generally does not provide a sufficient basis for widespread program implementation; there is a good chance that the program would not produce the hoped-for effects if implemented in new jurisdictions.
We believe the logical next step based on a positive finding such as Hawaii HOPE is a modest expansion of the program to new jurisdictions, coupled with an RCT to determine if the initial findings can be reproduced—which was exactly the course of action taken by the National Institute of Justice. By contrast, the expansion of HOPE to 28 states based on the original Hawaii RCT was well-intentioned but premature.
The frequency of unsuccessful replications is precisely the reason that the Food and Drug Administration (FDA) has, since the 1960s, generally required at least two well-conducted RCTs showing effects on important health outcomes before it will allow a pharmaceutical drug to be licensed for market. As the FDA explains:
“The usual requirement for more than one adequate and well-controlled investigation reflects the need for independent substantiation of experimental results. A single clinical experimental finding of efficacy, unsupported by other independent evidence, has not usually been considered adequate scientific support for a conclusion of effectiveness.”[vi]
What could account for the different findings in the Hawaii and replication trials? As noted, we believe that both studies were well executed and that HOPE was implemented in adherence to the program model in both cases. Thus, we rule out faulty implementation of the study and/or program, which is a frequent reason for failure to replicate. In fact, both studies could have accurately measured the effects of the program in the localities where it was tested. It may be that HOPE works well in Hawaii, but not so well in Arkansas, Massachusetts, Oregon, and Texas, perhaps because of differences between Hawaii and the other states in their study populations or criminal justice systems. If this is the case, then unfortunately the program does not produce the targeted effects across diverse sites and populations, which would have been the ideal outcome. It would still work in some communities, however, and further research would be needed to identify the types of communities in which it is effective.
It could also be that the original results were a statistical fluke, what statisticians call a Type I error, or “false positive.” The tests on which statistical significance is based generally accept a 5 percent risk of a false positive. In other words, even in a well-conducted RCT, one time in 20 the test will show a statistically significant effect for an intervention that is not effective, simply by chance.[vii] As Stanford professor John Ioannidis pointed out in his seminal article “Why Most Published Research Findings Are False,” if most programs and treatments are not effective (as has been found in rigorous RCTs across diverse fields such as medicine, business, and social policy[viii]), it can be shown mathematically that any given RCT finding of statistically significant effects has a much higher chance than 5 percent of being a false positive.[ix] The chances may be as high as 40 percent under reasonable assumptions.
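The arithmetic behind that claim is a direct application of Bayes' rule. The sketch below is a simplified illustration, not Ioannidis's full model (which also accounts for bias and multiple testing). The inputs are assumptions chosen for illustration only: that 10 percent of tested programs are truly effective and that a typical RCT has a 70 percent chance (statistical power) of detecting a true effect. Neither figure comes from the studies discussed here.

```python
def false_positive_share(prior_effective: float, power: float, alpha: float = 0.05) -> float:
    """Share of statistically significant findings that are false positives.

    prior_effective: assumed fraction of tested programs that truly work
    power:           assumed probability that an RCT detects a true effect
    alpha:           significance threshold (accepted false-positive rate per test)
    """
    true_positives = power * prior_effective            # effective and detected
    false_positives = alpha * (1.0 - prior_effective)   # ineffective but significant by chance
    return false_positives / (true_positives + false_positives)


# Illustrative assumptions: 10% of programs work, 70% power, 5% significance level.
print(f"{false_positive_share(0.10, 0.70):.0%}")

# Same assumptions with the stricter 1% threshold met by the Hawaii rearrest finding.
print(f"{false_positive_share(0.10, 0.70, alpha=0.01):.0%}")
```

With these illustrative inputs, about 39 percent of significant findings at the 5 percent level would be false positives, which is in line with the "as high as 40 percent" figure. Tightening the threshold to 1 percent lowers that share to about 11 percent, consistent with the point made in note [vii]. Different assumptions about the prior or about power would shift these numbers considerably.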
Our bottom line: Replication is essential if the results of social policy trials are to form a sound evidence base for policy. So… if at first you succeed, try again!
This post was written by guest author Larry Orr, Associate at the Institute for Health and Social Policy, Johns Hopkins Bloomberg School of Public Health.
Response provided by the lead author
We invited the lead study author, Pamela Lattimore, to provide written comments on our review. She provided informal input that we incorporated into the final version of this review and declined our offer to submit a separate written response.
[i] A recently released long-term follow-up of the original Hawaii RCT reported substantially smaller effects on crime outcomes during the follow-up period (76 months after random assignment) than had been found in the earlier, 12-month report. However, this new analysis may not have been a fair test of HOPE’s long-term effects because the Hawaii HOPE program was expanded substantially after the 12-month report and, as a result, approximately 35 percent of individuals originally assigned to the control group were later transferred to the HOPE supervision group. Such control group “crossovers” diminished the contrast in probation supervision between the HOPE group and control group during the long-term follow-up, plausibly causing the study to underestimate HOPE’s true effects. Source: Angela Hawken, Jonathan Kulick, Kelly Smith, Jie Mei, Yiwen Zhang, Sara Jarman, Travis Yu, Chris Carson, and Tifanie Vial, HOPE II: A Follow-up to Hawaiʻi’s HOPE Evaluation, report submitted to the National Institute of Justice, May 17, 2016.
[ii] Ibid., p. 24.
[iii] P.K. Lattimore, D.L. MacKenzie, G. Zajac, D. Dawes, E. Arsenault, and S. Tueller, “Outcome findings from the HOPE demonstration field experiment: Is swift, certain, and fair an effective supervision strategy?” Criminology & Public Policy, vol. 15, 2016, 1103-1141.
[iv] Mohammad I. Zia, Lillian L. Siu, Greg R. Pond, and Eric X. Chen, “Comparison of Outcomes of Phase II Studies and Subsequent Randomized Control Studies Using Identical Chemotherapeutic Regimens,” Journal of Clinical Oncology, vol. 23, no. 28, October 1, 2005, pp. 6982-6991.
[v] John P. A. Ioannidis, “Contradicted and Initially Stronger Effects in Highly Cited Clinical Research,” Journal of the American Medical Association, vol. 294, no. 2, July 13, 2005, pp. 218-228. John K. Chan et al., “Analysis of Phase II Studies on Targeted Agents and Subsequent Phase III Trials: What Are the Predictors for Success,” Journal of Clinical Oncology, vol. 26, no. 9, March 20, 2008. Michael L. Maitland, Christine Hudoba, Kelly L. Snider, and Mark J. Ratain, “Analysis of the Yield of Phase II Combination Therapy Trials in Medical Oncology,” Clinical Cancer Research, vol. 16, no. 21, November 2010, pp. 5296-5302. Jens Minnerup, Heike Wersching, Matthias Schilling, and Wolf Rüdiger Schäbitz, “Analysis of early phase and subsequent phase III stroke studies of neuroprotectants: outcomes and predictors for success,” Experimental & Translational Stroke Medicine, vol. 6, no. 2, 2014.
[vi] U.S. Department of Health and Human Services, Food and Drug Administration, Guidance for Industry: Providing Clinical Evidence of Effectiveness for Human Drugs and Biological Products, May 1998, p. 4, linked here.
[vii] In the case of the Hawaii HOPE study, the effect on rearrests was statistically significant at the 1 percent level, so the chance that the effect was a false positive is lower than in the usual situation where statistical significance is achieved at the 5 percent level.
[viii] Coalition for Evidence-Based Policy, Practical Evaluation Strategies for Building a Body of Proven-Effective Social Programs: Suggestions for Research and Program Funders, 2013, p. 1, linked here.
[ix] John P. A. Ioannidis, “Why Most Published Research Findings Are False,” PLoS Medicine, vol. 2, no. 8, August 2005, p. e124.