Highlights:

  • The Institute of Education Sciences’ (IES) What Works Clearinghouse (WWC) is a widely cited repository of evidence on “what works” in education. Since its launch in 2003, the WWC has played a central role in advancing rigorous evaluation methods in the field of education.
  • However, we believe a flaw in the WWC review process is that it equates small preliminary studies, which are often unreliable, with larger, more definitive evaluations, resulting in a WWC determination that certain programs are beneficial when the totality of the evidence shows the opposite. We illustrate the problem with the WWC’s review of Project CRISS, a program that aims to improve student literacy in grades three to 12.
  • WWC identified two studies of CRISS as meeting WWC evidence standards. The first was a preliminary randomized controlled trial (RCT) that found a large effect on reading achievement but had important limitations, such as a very small sample, use of researcher-designed outcome measures, and only an 18-week follow-up.
  • The second study was a major IES-sponsored RCT with a far bigger sample, and outcomes measured over a full school year with an independent, standardized reading test. It found no significant effects.
  • This example of positive findings from a small, preliminary study being overturned in a larger, more credible evaluation is not uncommon. It means that it would be unwise for schools to adopt CRISS to improve student reading.
  • The WWC, however, signals the opposite, reporting that CRISS has “potentially positive effects” that are quite large. The WWC reached this conclusion by simply averaging the effects found in the two studies as if they had equal evidentiary value and ignoring that the large IES-sponsored RCT was far more credible.
  • CRISS illustrates a more general problem: WWC’s over-weighting of preliminary studies unintentionally encourages schools to adopt programs that are unlikely to improve education in the mistaken belief that they are “evidence based.” We offer suggestions for improvement.

This is the second in a series of Straight Talk reports on how “official” evidence reviews can sometimes make ineffective programs appear effective. Our focus in this report is on reviews conducted by the Institute of Education Sciences’ (IES) What Works Clearinghouse (WWC).

As brief background, the WWC is a widely cited repository of evidence on “what works” in education. Since its launch in 2003, the WWC has played a central role in promoting rigorous evaluation methods in the field of education by setting standards for rigor that underscore the importance of random assignment, low sample attrition, and valid methods of analysis. Researchers in education and other fields often design and conduct studies with the explicit goal of meeting WWC standards.

We intend this Straight Talk report as constructive criticism, aimed at encouraging the IES leadership to correct what we believe is a flaw in the WWC’s evidence review process that partly undermines its mission. Our concern is that the WWC equates small preliminary studies, which are often unreliable, with larger, more definitive evaluations. In a number of cases, this results in the WWC determining that programs are beneficial when the evidence suggests exactly the opposite.

We illustrate the problem with the WWC’s review of Project CRISS, a teacher professional development program that aims to improve student literacy (although, as we note below, the WWC contains many similar examples). The WWC review of CRISS identifies two studies as meeting WWC evidence standards. The first, an unpublished randomized controlled trial (RCT),[i] found a positive effect on reading comprehension for students in fourth and sixth grade—an increase of 1.1 standard deviations. If true, that is a remarkably large impact: it means the program moves performance for the average student from the 50th to the 86th percentile, according to the WWC.
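The percentile claim can be checked directly. The WWC converts an effect size (expressed in standard deviations) to a percentile for the average treated student using the standard normal distribution. A minimal sketch of that conversion, assuming normally distributed outcomes (the function name is ours):

```python
from math import erf, sqrt

def percentile_from_effect_size(g: float) -> float:
    """Percentile rank of the average treated student, assuming
    normally distributed outcomes: 100 * Phi(g), where Phi is the
    standard normal CDF."""
    return 100 * 0.5 * (1 + erf(g / sqrt(2)))

# An effect size of 1.1 SD moves the average student from the
# 50th to roughly the 86th percentile, matching the WWC's figure.
print(round(percentile_from_effect_size(1.1)))  # 86
```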

Importantly, however, this study was only preliminary in nature. It had a small sample (12 fourth- and sixth-grade classrooms, containing a total of 231 students, were randomly assigned to a treatment versus control group); it measured reading comprehension with a test designed by the research team, rather than an independent, standardized test (researcher-designed tests typically yield much larger effect sizes [ii] than independent measures); and it measured outcomes over a period of just 18 weeks, so the study could not determine whether the effect endures long enough to constitute meaningful improvement.

Preliminary studies like this are a valuable part of the evidence-building process as they can help identify programs that warrant further evaluation in larger, more definitive studies. But they cannot by themselves produce reliable evidence because, as we’ve previously discussed, in most cases, attempts to replicate positive findings from such studies in subsequent, more rigorous evaluations are not successful—i.e., the preliminary results are overturned.

This pattern of reversal occurs in many fields, including medicine,[iii] and is exactly what happened in the case of CRISS. IES sponsored a much larger and more rigorous RCT of CRISS as implemented in fifth grade (this is the second CRISS study that the WWC review identifies as meeting WWC evidence standards).[iv] The new study had a far bigger sample (38 schools, containing 2,338 fifth graders, were randomly assigned); measured reading comprehension using an independent, well-validated test (the Group Reading Assessment and Diagnostic Evaluation, or “GRADE,” Passage Comprehension subtest); enlisted the developer of CRISS to provide training and ongoing support to teachers in the treatment group to foster high-quality implementation of the program; and measured student outcomes over a full school year.

The IES-sponsored RCT unfortunately found no significant effect on GRADE reading comprehension scores at the end of the year. In fact, the control group’s scores were marginally higher than the treatment group’s scores, although the differences were not statistically significant. (After the WWC report was published, the study continued for an additional year, measuring CRISS’ effect on a new cohort of incoming fifth graders and its effect after two years on the first cohort, and again it found no significant effects.)

This was, in short, a common case in which positive findings from a preliminary study were overturned in a larger, more definitive evaluation. This may have occurred because the initial finding of an effect on reading comprehension was erroneous, which is a plausible explanation given the study’s preliminary nature. Another possibility is that the initial finding was accurate—that is, the program did produce positive effects when implemented on a small scale (i.e., the six treatment group classrooms in the initial study)—but it did not produce the desired effects when delivered on a larger scale under more typical implementation conditions (where, for example, program developers could not tightly control the program’s delivery). Or, there may be other possible reasons for the reversal.

In any case, the clear implication of these findings is that it would be unwise for schools and districts to adopt Project CRISS because it is very unlikely to produce meaningful gains in student reading ability when implemented in typical school settings. For the program developer, it means there is work to do to revise and retest the program, or perhaps the developer should try a different approach altogether. If this were a pharmaceutical drug, the Food and Drug Administration (FDA) would not allow it to be marketed because of the disappointing findings in a large, highly-credible RCT (what the FDA calls a “phase III” study).

The WWC, however, sends the opposite message. It reports that “Project CRISS was found to have potentially positive effects on comprehension for adolescent learners” and gives CRISS an improvement index score of +20, which means that the program increases performance for the average student from the 50th to the 70th percentile—a very large effect indeed.

How did the WWC reach this conclusion? It simply averaged the effect found in the small, preliminary study with that found in the IES-sponsored evaluation as if the two studies had equal evidentiary value and ignored the fact that the IES-sponsored RCT was far more credible. It thereby portrays CRISS as potentially having blockbuster-sized effects when the totality of the evidence shows the opposite.
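To see how much the equal-weighting choice matters, here is an illustrative sketch. The 1.1 effect size and the student counts come from the two studies described above; treating the IES-sponsored RCT’s effect as exactly zero, and weighting by sample size as the alternative, are our simplifying assumptions for illustration (a full meta-analysis would typically weight by inverse variance, which would down-weight the small study even further):

```python
from math import erf, sqrt

def improvement_index(g: float) -> float:
    """WWC-style improvement index: percentile-point gain for the
    average student, assuming normal outcomes: 100 * (Phi(g) - 0.5)."""
    return 100 * (0.5 * (1 + erf(g / sqrt(2))) - 0.5)

# Effect size and student sample from each study; 0.0 for the
# IES-sponsored RCT is a simplifying assumption (it found no
# significant effect, with the control group marginally ahead).
effects = [1.1, 0.0]   # small preliminary RCT, large IES-sponsored RCT
samples = [231, 2338]

unweighted = sum(effects) / len(effects)
weighted = sum(g * n for g, n in zip(effects, samples)) / sum(samples)

print(round(improvement_index(unweighted)))  # 21 -- near the WWC's +20
print(round(improvement_index(weighted)))    # 4
```

Simply averaging the two effects reproduces a headline number close to the WWC’s +20, while even a crude sample-size weighting drives the combined estimate toward zero.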

Unfortunately, this is not an isolated example. The story is similar for Odyssey Math and Lessons in Character [v]—both programs portrayed positively by the WWC based on preliminary studies that were later found to have no significant effects in large IES-sponsored RCTs.

Even more problematic, the WWC routinely reports programs as having positive or potentially positive effects based solely on preliminary studies (i.e., when the WWC has not found a more rigorous evaluation of the program). Illustrative examples include Read Well, First Step to Success, and ClassWide Peer Tutoring. Although these programs may be good candidates for further evaluation based on the positive initial findings, most of these findings would likely be reversed in a more definitive evaluation, based on the long history of rigorous evaluations in education and other fields.

Why does this matter? Because the WWC is in effect encouraging schools and districts to adopt programs that are unlikely to improve student outcomes. Project CRISS’s website, for example, cites the WWC report on CRISS as evidence that the program is “research-validated,” and a school or district official who reads the WWC report would clearly get the impression that the evidence is at least highly promising. In addition, the WWC’s practice of identifying programs with only preliminary evidence as “positive” or “potentially positive” when the evaluation findings have been reversed or are at high risk of reversal makes it very difficult for schools and districts to identify the subset of programs that do have credible evidence of important effects (such as Summer Book Fairs in high-poverty elementary schools, and New York City’s Small Schools of Choice). Finally, the WWC approach gives programs like CRISS no incentive to revise and re-test their program with the aim of generating true evidence of impact.

For these reasons, we would encourage the IES leadership to change the WWC review process by reserving the labels “potentially positive” and “positive” for programs whose evidence provides moderate or high confidence of meaningful, sustained effects on important educational outcomes when the program is faithfully delivered in typical school settings. Left uncorrected, the current process will continue to encourage well-meaning school and district officials to adopt programs that do little or nothing to improve education under the mistaken belief that they are “evidence based.”


Response provided by the Institute of Education Sciences (IES)

We shared a draft of this report with Institute of Education Sciences (IES) officials and invited a written response. They were open to the feedback on the WWC and provided comments on our draft by phone, but as a matter of policy cannot provide written comments for publication on a non-governmental website. We revised parts of our report based on the conversation.


References

[i] S. Horsfall and C. Santa, Project CRISS: Validation report for the Program Effectiveness Panel, unpublished manuscript, 1994.

[ii] A. Cheung and R. Slavin, “How Methodological Features Affect Effect Sizes in Education,” Educational Researcher, vol. 45, no. 5, 2016, pp. 283-292.

[iii] John P. A. Ioannidis, “Contradicted and Initially Stronger Effects in Highly Cited Clinical Research,” Journal of the American Medical Association, vol. 294, no. 2, July 13, 2005, pp. 218-228. Mohammad I. Zia, Lillian L. Siu, Greg R. Pond, and Eric X. Chen, “Comparison of Outcomes of Phase II Studies and Subsequent Randomized Control Studies Using Identical Chemotherapeutic Regimens,” Journal of Clinical Oncology, vol. 23, no. 28, October 1, 2005, pp. 6982-6991. John K. Chan et al., “Analysis of Phase II Studies on Targeted Agents and Subsequent Phase III Trials: What Are the Predictors for Success,” Journal of Clinical Oncology, vol. 26, no. 9, March 20, 2008. Michael L. Maitland, Christine Hudoba, Kelly L. Snider, and Mark J. Ratain, “Analysis of the Yield of Phase II Combination Therapy Trials in Medical Oncology,” Clinical Cancer Research, vol. 16, no. 21, November 2010, pp. 5296-5302. Jens Minnerup, Heike Wersching, Matthias Schilling, and Wolf Rüdiger Schäbitz, “Analysis of early phase and subsequent phase III stroke studies of neuroprotectants: outcomes and predictors for success,” Experimental & Translational Stroke Medicine, vol. 6, no. 2, 2014. Vinayak K. Prasad and Adam S. Cifu, Ending Medical Reversal: Improving Outcomes, Saving Lives, Johns Hopkins University Press, 2015.

[iv] S. James-Burdumy, J. Deke, J. Lugo-Gil, N. Carey, A. Hershey, R. Gersten, R. Newman-Gonchar, J. Dimino, and K. Haymond, Effectiveness of Selected Supplemental Reading Comprehension Interventions: Findings from Two Student Cohorts, National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education, NCEE 2010-4015, 2010. The WWC reports only the first-year findings from this study, probably because the two-year findings had not yet been published at the time of the WWC review.

[v] The WWC report on Lessons in Character includes two preliminary studies but not the subsequent IES-sponsored RCT that found no significant effects, because the IES-sponsored study was published in 2012, several years after the WWC review. The IES-sponsored study is T. Hanson, B. Dietsch, and H. Zheng, Lessons in Character Impact Evaluation, National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education, NCEE 2012-4004, 2012.