Highlights:

  • This report, like the previous one, offers constructive criticism of the What Works Clearinghouse (WWC), a widely cited repository of evidence on “what works” in education. The WWC has successfully advanced rigorous evidence standards but also has flaws that we believe partly undermine its mission.
  • The problem we highlight is that WWC reviews often give too much credence to effects found on outcomes that are either unimportant from a policy or practical standpoint or are preliminary in nature. In some cases, this leads to favorable WWC ratings for programs that are not effective.
  • We illustrate the problem with the WWC’s review of New Chance—a comprehensive education and training program for young, low-income mothers that was found in a large randomized controlled trial (RCT) to produce no significant positive effects on any important educational or life outcomes (e.g., reading proficiency, receipt of a high school diploma, employment, earnings, welfare receipt, childbearing, emotional well-being, children’s preschool readiness or behavior).
  • The WWC review of New Chance mentions none of these disappointing findings, aside from a brief reference to the diploma finding. Instead, the WWC gives New Chance a favorable rating based solely on a finding of minor importance—it increased participants’ receipt of a General Educational Development (GED) certificate. Studies have found that GED receipt is not a good predictor of important downstream outcomes (e.g., college attainment, workforce success), and the New Chance RCT findings themselves illustrate the GED’s limitations.
  • The WWC’s favorable review of New Chance, in effect, encourages policy officials to adopt a program strategy that is costly and ineffective.
  • We discuss other related examples, such as the WWC review of the federal Head Start program’s effect on children’s reading achievement, and we offer suggestions for improvement.

This is the third in a three-part Straight Talk series on how “official” evidence reviews sometimes make ineffective programs appear effective. Our first report discussed a Department of Health and Human Services (HHS) review of the evidence on home visiting programs for families with young children. We sought to illustrate how selective reporting in the HHS document—specifically, its omission of disappointing findings from high-quality randomized controlled trials (RCTs), and inclusion of findings from much weaker studies that are likely to be overstated—led to erroneous positive conclusions about the long-term effects of several home visiting program models.

The second report focused on the Institute of Education Sciences’ What Works Clearinghouse (WWC), and sought to illustrate (i) how the WWC evidence review process assigns too much weight to findings from small, preliminary studies, ranking them equal to findings from larger, more definitive evaluations; and (ii) how this has led to a WWC determination that certain education programs are beneficial when the totality of the evidence shows the opposite.

In this third report, we discuss a problem that we believe affects the WWC and many other government and nonprofit evidence review efforts: They often give too much credence to effects found on outcomes that are either unimportant from a policy or practical standpoint or are only preliminary in nature. As with our last report, we offer these comments as constructive criticism of the WWC, which has played a central role in advancing rigorous evidence standards in education over the past 15 years, but which we believe has important (though fixable) flaws that partly undermine its mission.

To illustrate the WWC’s over-weighting of unimportant or preliminary outcomes, and how it can cause ineffective programs to appear beneficial, we first discuss the WWC’s review of New Chance—a program for young, low-income mothers that provides comprehensive educational, job training, and other services. The WWC rates New Chance as having “potentially positive effects” on completing school, reporting a sizable impact on that outcome.[i] Most readers, seeing this rating on the WWC’s summary page (shown here), would probably assume that the program increased participants’ receipt of a high school diploma, but that’s not what happened. The WWC rating is based entirely on the finding from an RCT that the program produced a sizable effect on participants’ receipt of a General Educational Development (GED) certificate. Specifically, the program increased GED receipt by a statistically significant 12 percentage points compared to the control group.

This effect would be valuable if GED receipt—like receipt of a regular high school diploma or college degree—were a good predictor of important downstream outcomes such as college attainment and workforce success, but unfortunately, it is not. Studies have found little correlation between GED receipt and such downstream outcomes, and in fact, the full set of findings from the New Chance RCT itself illustrates the GED’s limitations. The RCT found that New Chance produced no significant positive effects on any of the ultimate outcomes that the study measured over the three and a half years after random assignment. The MDRC researchers who conducted the study summarize the key findings in their final report as follows:

The findings indicate that while experimental and control group members both advanced in many ways, experimental group members did not advance further than control group members in most respects. New Chance did boost participants’ levels of GED receipt above those of the control group. The added services provided by the program, however, did not help participants secure skills training credentials, get and maintain employment, or reduce their rates of welfare receipt or subsequent childbearing relative to outcomes for control group members. The program did not improve their children’s preschool readiness scores, and it had unexpected small but negative effects on participants’ emotional well-being and their ratings of their children’s behavior.[ii]

The WWC’s review mentions none of these disappointing findings on the ultimate outcomes, even those that are education-related (in addition to those described above, the study found no significant effect on participants’ reading proficiency at the 18-month follow-up). Furthermore, the WWC review only briefly mentions (in its full review, not on the summary page) another disturbing finding of the RCT—the program produced a statistically significant adverse effect of 3.5 percentage points on receipt of a regular high school diploma, as the program apparently caused some participants to pursue GED receipt at the expense of regular high school completion.

In short, the totality of the study’s findings shows no improvement in any important educational or life outcome for the disadvantaged young women who participated, or for their children. To give this program a favorable rating, as the WWC does, essentially encourages policy and program officials to adopt a program strategy that is costly and not effective.

Unfortunately, New Chance is not an isolated example. The WWC rates Head Start—a federal program that provides preschool and other services to low-income children aimed at boosting school readiness—as having “potentially positive effects on general reading achievement” for three- and four-year-old children. The WWC rating, shown here, reports a fairly large effect size—an improvement index score of 13, which means that the program moves the reading performance of the average student from the 50th to the 63rd percentile. (The WWC review also reports no discernible effects on children’s math achievement or social-emotional development.)
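For readers who want to see the arithmetic behind these percentile figures: the WWC’s improvement index is the expected change in percentile rank for an average control-group student, computed from the standardized effect size via the normal distribution. The sketch below is our own illustration, not WWC code; the effect sizes of roughly 0.33 and 0.20 are approximate values we back-solved from the reported index scores of 13 and 8, not figures stated in the reviews.

```python
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def improvement_index(effect_size: float) -> float:
    """WWC-style improvement index: the expected percentile-rank change
    for an average (50th-percentile) control-group student, given a
    standardized effect size such as Hedges' g."""
    return normal_cdf(effect_size) * 100.0 - 50.0

def resulting_percentile(effect_size: float) -> float:
    """Percentile the average control-group student would reach."""
    return 50.0 + improvement_index(effect_size)

# Approximate effect sizes back-solved from the reported index scores
# (our assumption, for illustration only):
print(round(improvement_index(0.33)))   # index of about 13 (Head Start rating)
print(round(improvement_index(0.20)))   # index of about 8 (New Chance rating)
print(resulting_percentile(0.33))       # roughly the 63rd percentile
```

Note that the conversion says nothing about which outcome the effect size was measured on—an index of 13 on a parent-reported letter-recognition scale and an index of 13 on a reading comprehension test look identical in this summary statistic, which is central to the concern raised in this report.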

The WWC rating, however, is based on the finding from a nationwide RCT of Head Start that the program produced a sizable impact not on children’s actual reading ability, but on a preliminary outcome—parents’ rating of their child’s ability to recognize letters of the alphabet, count to 20, write his or her first name, and identify primary colors at the end of the preschool year.[iii] Meanwhile, later follow-ups of the Head Start RCT at the end of first grade[iv] and end of third grade[v] found that the program produced no significant effects on any reading outcome, including, notably, comprehension, which is widely considered the ultimate measure of actual reading ability.[vi]

Thus, we believe the WWC’s favorable review of Head Start’s effect on general reading ability is not an accurate portrayal. (However, we think it would be a mistake to conclude, based on the Head Start RCT findings, that all preschool programs or all U.S. Head Start centers are ineffective in improving reading and other outcomes, and we encourage research to identify programs/centers that do produce meaningful effects, as discussed here.)

New Chance and Head Start are cases where the WWC rated the program favorably based on an unimportant or preliminary outcome when high-quality RCTs found no significant effects on more important and final educational outcomes. A broader version of the problem is that the WWC also rates programs favorably based on unimportant or preliminary outcomes when studies have not yet measured important, final outcomes at all. Illustrative examples include Too Good for Drugs and Violence (rated favorably for its effects on “knowledge, attitudes, and values” regarding violence, drugs, and similar outcomes, when effects on actual student violence, drug use, and academic achievement were not measured); and Earobics (rated favorably for its effects on preliminary literacy outcomes including alphabetics and fluency, when effects on actual reading including comprehension were not measured).

Such preliminary findings can be a valuable signal to researchers that further study is warranted to measure important, final outcomes. But we believe the WWC should make clear to education officials that such findings do not yet provide confidence that the program will produce improvements in ultimate outcomes of practical and policy importance. As New Chance, Head Start, and many other examples in education, social policy, and medicine illustrate, such improvements often do not materialize when studies measuring important, final outcomes are eventually carried out.

What is a solution? As discussed in our last report, we would encourage the leadership of the Institute of Education Sciences to change the WWC review process by reserving the labels “potentially positive” and “positive” for programs whose evidence provides moderate or high confidence of meaningful, sustained effects on outcomes that are of practical and policy importance when the program is faithfully delivered in typical school settings. The key concept in this context is important outcomes. The WWC could also report findings on preliminary outcomes (as a signal that further research may be warranted) while making clear that such findings do not yet constitute reliable evidence of effectiveness. Our concern is that the current WWC process, if uncorrected, will continue to encourage education officials to adopt programs that are unlikely to produce effects of practical or policy importance under a mistaken belief that the evidence is stronger than it is.


Response provided by the Institute of Education Sciences (IES)

We shared a draft of this report with Institute of Education Sciences (IES) officials and invited a written response. They were receptive to the feedback on the WWC and discussed the draft with us, but as a matter of policy cannot provide written comments for publication on a non-governmental website.


References

[i] The WWC review gives New Chance an improvement index score of 8, which means that the program moves the performance of the average student from the 50th to the 58th percentile.

[ii] Janet C. Quint, Johannes M. Bos, and Denise F. Polit, New Chance: Final Report on a Comprehensive Program for Young Mothers in Poverty and Their Children, MDRC, October 1997, p. ES-3, linked here.

[iii] Specifically, the WWC rating is based on the Parent Emergent Literacy Scale (PELS), which is a parent report on five literacy items: child can recognize most/all of the letters of the alphabet; child can count to 20; child pretended to write his/her name in the last month; child can write his/her first name; and child can identify the primary colors.

[iv] U.S. Department of Health and Human Services, Administration for Children and Families, Head Start Impact Study: Final Report, January 2010, linked here.

[v] Mike Puma, Stephen Bell, Ronna Cook, Camilla Heid, Pam Broene, Frank Jenkins, Andrew Mashburn, and Jason Downer, Third Grade Follow-up to the Head Start Impact Study Final Report, U.S. Department of Health and Human Services, Administration for Children and Families, Office of Planning, Research and Evaluation, 2012, linked here.

[vi] The effect sizes on language and literacy measures at the end of first grade and the end of third grade were generally close to zero. Regarding reading comprehension specifically, the study found no significant effect at the end of first grade on comprehension as measured with the Woodcock-Johnson III Passage Comprehension for either the children enrolled in the study at age three, who were offered two years of Head Start participation, or children enrolled in the study at age four, who were offered one year of Head Start participation. (The non-significant effect sizes were 0.03 standard deviations for the three-year-old cohort and 0.01 for the four-year-old cohort.) The study also found no significant effect at the end of third grade on reading comprehension as measured with the Early Childhood Longitudinal Study-Kindergarten (ECLS-K) Reading Assessment for either the three-year-old cohort or the four-year-old cohort. (The non-significant effect sizes were -0.01 standard deviations for the three-year-old cohort and 0.11 for the four-year-old cohort.)