
Highlights:

  • We discuss recent findings of a randomized controlled trial (RCT) of the federal Assets for Independence (AFI) program, designed to help low-income families increase savings so as to build asset ownership (e.g., ownership of a home or business, or postsecondary education or training).
  • We believe this was a quality RCT whose Overview and Executive Summary, unfortunately, include strong claims about AFI’s effectiveness that are not supported by the actual results.
  • The Overview states that AFI produced positive effects on home and business ownership for key subgroups of AFI participants. However, a review of the main report reveals that these subgroup effects are only suggestive; that is, they may not be true effects and could have appeared by chance.
  • A more accurate overview would state that the study found statistically significant effects on participant savings at the one-year mark but that, contrary to the study’s central hypothesis, this did not lead to a finding of significant effects on any of the primary targeted outcomes (i.e., asset ownership for the full sample) at the three-year mark.
  • This misreporting of findings – a pervasive problem in the evaluation literature – could easily lead policy officials to expand this program approach in the mistaken belief it has been shown effective.
  • A response from the study authors, and our rejoinder, follow the main report.

In this report, we discuss recent findings of a federally sponsored RCT of the U.S. Department of Health and Human Services’ (HHS) Assets for Independence program. As detailed below, we believe this study illustrates an all-too-common problem in program evaluations: In summarizing its main findings, the study makes strong positive claims about the program’s effectiveness that are not supported by the actual results. Such misreporting could easily lead policy officials to expand this program approach in the mistaken belief it has been demonstrated effective.

This study is an RCT of Assets for Independence (AFI) – an HHS demonstration program designed to help low-income families increase their savings for specific investments, such as a first home, business capitalization, or postsecondary education and training. The program provides funding to match families’ personal savings for these investments through an individual development account (IDA).

The RCT was conducted by the Urban Institute with a sample of 807 low-income individuals. Based on our review of the latest study report (November 2019), which presents findings three years after individuals entered the study, we believe the study was generally well-conducted.[1]

The study Overview and Executive Summary state that AFI produced a number of positive effects, including increases in homeownership and business ownership for key subgroups. We focus on the study’s Overview and Executive Summary because they are designed to summarize the study’s main findings and, for the many readers who do not have time to review the full 74-page report, these sections may be all they ever read about the study results. Here is how the Overview section summarizes the study’s findings on its primary targeted outcomes:

This evaluation shows that among the full sample of study participants, AFI did not increase homeownership, business ownership, or postsecondary education or training three years after study enrollment. But our findings show that AFI participation had effects on certain subgroups. Specifically, AFI increased homeownership among renters at study enrollment and increased business ownership among non-business owners at study enrollment.

The Overview goes on to describe beneficial effects on several other (secondary) outcomes, and concludes as follows:

Although AFI did not result in greater asset ownership among the full sample, these third-year impact findings – that AFI participation results in greater asset ownership among renters and non–business owners, less material hardship, lower use of alternative (nonbank) check-cashing services, and more future-oriented time preferences – provide empirical evidence that savings and wealth-building opportunities can promote economic well-being and personal responsibility. This evaluation provides rigorous results that can inform the next stage of incentivized savings programs that benefit low-income earners. By encouraging low-income families to build assets, AFI eased economic hardship and increased asset ownership while providing a foundation for long-term upward mobility.

The study’s Executive Summary describes the main findings in a similar manner, as does a short brief that HHS and the Urban Institute published on the study results in April.

However, the study actually found no statistically significant effects on any of the primary hypothesized outcomes, if analyzed correctly. The study measured 13 effects of the program on the study’s “primary” outcomes – that is, outcomes that test the study’s central hypotheses and are the main basis for determining the program’s effectiveness.[2] According to the full study report, of these 13 primary effects, two were statistically significant at the study’s three-year mark: (i) the rate of homeownership among those who rented at study enrollment, and (ii) the rate of business ownership among those who were not business owners at enrollment. The report notes that these effects were statistically significant only at the 10 percent level, which is less stringent than the conventional 5 percent benchmark. The report appropriately states that the researchers “interpret significance at the 10 percent level as providing only suggestive evidence of AFI impacts” (page 27).

Importantly, however, this caveat that the primary effect findings are “only suggestive” appears in the full 74-page report, but not in the study Overview or Executive Summary. These sections instead make the unqualified claim that “AFI participation results in greater asset ownership among renters and non–business owners” and describe the findings as “rigorous” rather than “only suggestive” (as shown in the Overview text, above).

Furthermore, neither of these effects would reach statistical significance even at the 10 percent level had the study followed its own protocol for analyzing these effects. Per accepted scientific standards, that protocol was designed to guard against the possibility of false-positive findings that can occur by chance due to a study’s measurement of numerous program effects, as described in the full report:

When examining several outcomes, some impact estimates may achieve statistical significance by chance. Our analysis also tests for statistical significance considering that some domains [i.e., outcome areas] include multiple outcomes. We adjust the statistical significance in the six domains that have multiple outcomes using a procedure developed by Benjamini and Hochberg (1995). In chapter 6, we note outcomes that lose statistical significance…when we adjust for multiple outcomes. [page 27]

The study applies this adjustment in some of its outcome domains but, without explanation, not in the primary outcome domains. When we independently applied the adjustment in the primary domains, the two effects were no longer statistically significant even at the less-stringent 10 percent level.
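To make the adjustment concrete, here is a minimal sketch of the Benjamini-Hochberg (1995) step-up procedure the report describes. The p-values below are hypothetical, chosen only to illustrate how an effect that is significant at the 10 percent level unadjusted can lose significance once the adjustment is applied; the report does not publish the exact p-values we used here.

```python
# Illustrative sketch of the Benjamini-Hochberg (1995) step-up procedure.
# The p-values used below are hypothetical, for illustration only.

def benjamini_hochberg(p_values, alpha=0.10):
    """Return a list of booleans: True where the hypothesis is rejected
    after controlling the false discovery rate at level alpha."""
    m = len(p_values)
    # Sort p-values while remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * alpha:
            max_k = rank
    # Reject the hypotheses with the max_k smallest p-values.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected

# A hypothetical domain with four outcomes: p = 0.08 is "significant" at
# the 10 percent level on its own, but not after the BH adjustment.
print(benjamini_hochberg([0.08, 0.45, 0.30, 0.62], alpha=0.10))
# -> [False, False, False, False]
```

The key point is that the BH threshold for the smallest p-value in a four-outcome domain is (1/4) × 0.10 = 0.025, so an unadjusted p-value of 0.08 no longer qualifies once the adjustment accounts for the other outcomes in the domain.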

In short, when analyzed correctly according to the stated protocol, the study found no statistically significant effects (at either the 10 percent or 5 percent level) on any of the primary outcomes. The study’s central hypotheses, in other words, were not supported by the findings.

In the absence of effects on any primary outcomes, the study’s estimates of effects on other, non-primary outcomes are not reliable under established scientific standards. The study measured the program’s effects on numerous other outcomes, and found several that were statistically significant (as referenced in the Overview, above). However, per guidance of respected scientific bodies such as the Institute of Education Sciences (IES) and Food and Drug Administration (FDA), such findings on non-primary outcomes are considered only preliminary – rather than rigorous – and are best viewed as a source of hypotheses for testing in future rigorous evaluations (see IES guidance, pp. 4-6, and FDA draft guidance, pp. 9-21).[3]

The reasoning behind these standards, as summarized in the IES and FDA guidance and in our endnote,[4] is that in the absence of effects on primary outcomes, measurement of a program’s effects on additional outcomes has a high risk of generating false-positive findings.

This misreporting of key findings – an all-too-common problem in evaluation – could easily lead policy officials to expand this program approach in the mistaken belief it has been demonstrated effective.

Such misreporting is a pervasive problem in research and evaluation, and we have discussed other high-profile examples in prior Straight Talk reports. Because it has great potential to distort policy, practice, and future research, we believe – consistent with Nobel laureate Richard Feynman’s entreaty to graduating Caltech students in 1974 – that researchers have a special responsibility to present study findings in an impartial manner.

Thus, while we commend HHS and the research team for this high-quality RCT, we would urge them to summarize the findings in a balanced and objective manner. Such a summary would forthrightly state that the study found no statistically significant effects on the primary targeted outcomes, and that the AFI program’s effectiveness has not been rigorously demonstrated. It could then note that the study identified several preliminary positive effects that may warrant examination in future rigorous evaluations, to see if they can be confirmed. This approach would best serve the interest of science, policy, and – most importantly – low-income families in need of effective programs that can meaningfully improve their lives.


Response provided by the study authors

We thank the reviewers for the opportunity to respond to concerns regarding how we reported main AFI evaluation findings. We are pleased the reviewers believe the study “was generally well-conducted” and that this is a “high-quality RCT.” Of the reviewers’ three key concerns, we agree with one and disagree with the other two.

The AFI evaluation study is presented in three publications: a first-year impacts report, a third-year impacts report, and a summary brief. While the reviewers focus on the third-year report and mention the summary brief, the first- and third-year reports are two parts of one evaluation and should be considered together in a complete assessment of the program. We have a single conceptual model that discusses hypotheses for the short term (first year) and medium term (third year), and the third-year report summarizes the first-year results. We detail our response below.

We disagree with the first concern: that there are “no statistically significant effects on any of the primary hypothesized outcomes, if analyzed correctly.” Starting with the first-year findings, we hypothesize and find that AFI statistically significantly (p<0.01) increased liquid assets (savings) in the short term. Additionally, we do not hypothesize that AFI will increase liquid assets in the medium term (i.e., at the three-year follow-up). Rather, we hypothesize in the medium term that AFI will increase asset ownership associated with allowable AFI asset purchases. These hypotheses are clearly stated in the conceptual framework of the third-year report.

The reviewers highlight multiplicity adjustments. We did not conduct multiplicity adjustments for the subgroup analyses, as these analyses were exploratory (Schochet 2008). We examined subgroups because AFI cannot increase homeownership among homeowners or business ownership among business owners (e.g., a homeowner cannot assume the status they already possess). While an evaluation of AFI under ideal conditions would randomize participants within asset ownership status (e.g., renters versus homeowners), such randomization was not feasible when designing the study. Because we did not conduct random assignment within strata defined by subgroups, the subgroup analysis was exploratory by nature.

We agree with the reviewers’ second concern that, given the exploratory nature of the subgroup analysis and significance at the 10 percent level, we should have referred to evidence for the subgroups as “suggestive” in the study Overview and Executive Summary. However, we clearly and forthrightly state in the three major components of the third-year report—Overview, Executive Summary, and main report—that among the full sample, we do not find evidence that AFI increased homeownership, business ownership, or postsecondary education three years after study enrollment.

The reviewers’ third concern – that “In the absence of effects on any primary outcomes, the study’s estimates of effects on other, non-primary outcomes are not reliable under established scientific standards” – is not supported. In addition to the suggestive evidence of primary hypothesized effects in year three, the AFI evaluation finds primary hypothesized effects in the first year (increased savings). These primary effects support the non-primary outcomes. Our conceptual model shows that many of the non-primary outcomes result from an increase in savings.


Rejoinder by Arnold Ventures’ Evidence-Based Policy team

We appreciate the study authors’ thoughtful response, and believe we are in general agreement on most key issues. To summarize what we see as the areas of agreement:

  • This was a high-quality RCT. To the authors’ credit, the study Overview and Executive Summary forthrightly state that – for the full sample – the study did not find evidence to support the primary targeted effects on homeownership, business ownership, or education or training at the three-year follow-up.
  • Per our report and the authors’ response, the subgroup effects highlighted in the Overview and Executive Summary – that “AFI participation results in greater asset ownership among renters and non–business owners” – are only suggestive. In other words, they may not be true effects and could have appeared by chance; future rigorous evaluations may be warranted to see if these effects can be confirmed.
  • The study found that the program produced a significant increase in participants’ liquid assets (i.e., savings) at the one-year follow-up, which was the primary targeted short-term effect. Contrary to the study’s central hypothesis, however, this short-term effect did not lead to a finding of significant effects on any of the primary targeted outcomes (i.e., asset ownership for the full sample) at the three-year follow-up, as noted above.

As to the question of whether the positive effects found on other, non-primary outcomes for the full sample are reliable, we note that, at the three-year follow-up, the study measured a total of 37 effects on pre-specified targeted outcomes (primary and non-primary) for the full sample and found only 3 that were statistically significant (p<0.05). This is roughly the pattern one might expect if the program produced no true effects on any of these outcomes, since approximately 1 out of 20 statistical significance tests can be expected to produce a false finding. For this reason, we believe the non-primary effects – like the subgroup effects – are best viewed as only suggestive until they can be reproduced in a subsequent RCT.
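The arithmetic behind this point can be checked directly. The sketch below computes the expected number of false positives among 37 tests at the 5 percent level, and the chance of seeing three or more “significant” results when no true effects exist, treating the tests as independent (a simplifying assumption, since outcomes within one study are typically correlated).

```python
# Back-of-the-envelope check: if none of the 37 pre-specified effects were
# real, how many tests at p < 0.05 would come up positive by chance alone?
from math import comb

n_tests, alpha = 37, 0.05

# Expected number of false positives under the null.
expected_false_positives = n_tests * alpha
print(round(expected_false_positives, 2))  # -> 1.85

# Probability of 3 or more "significant" results under the null, assuming
# independent tests (a simplification; study outcomes are often correlated).
p_three_or_more = sum(
    comb(n_tests, k) * alpha**k * (1 - alpha)**(n_tests - k)
    for k in range(3, n_tests + 1)
)
print(round(p_three_or_more, 2))  # -> 0.28
```

In other words, under the null hypothesis of no true effects, roughly two false positives are expected and three or more would occur in over a quarter of such studies, so a result of 3 significant effects out of 37 tests is well within what chance alone could produce.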

We thank the authors for engaging in this colloquy, and hope it is helpful to readers.


References:

[1] For example, the study had successful random assignment (as evidenced by similar treatment and control groups), modest sample attrition, and valid analyses. One possible weakness is the study’s limited sample size (807 individuals), which enabled it to detect sizable but not modest program effects if such effects existed, as described on pages 62-64 of the study report. Thus, the fact that the study found no statistically significant effects on its primary outcomes does not rule out the possibility that the program produced modest positive effects.

[2] The study measured the program’s effects on nine primary outcomes. Four of these outcomes were measured for both the full sample and for targeted subgroups, resulting in measurement of 13 primary effects in all.

[3] The IES guidance states that such findings “do not provide rigorous evidence of a [program’s] overall effectiveness.” The FDA guidance provides that “Positive results on secondary [outcomes] can be interpreted only if there is first a demonstration of a treatment effect on the primary [outcomes]” and that, in the absence of a primary effect, “secondary [outcomes] cannot be assessed statistically.”

[4] For each effect that a study measures, there is roughly a 1 in 20 chance that the test for statistical significance at the 5 percent level will produce a false-positive result when the program’s true effect is zero. So if a study examines numerous outcomes, it becomes a near certainty that the study will produce some false-positive findings. To minimize the chances of this, IES, FDA, and other respected scientific authorities recommend that studies pre-specify a relatively small number of primary outcomes against which the program’s effectiveness will be judged (IES refers to these as “confirmatory,” rather than “primary,” outcomes). If the study then goes on to measure additional outcomes, the resulting findings are considered only preliminary because of the high chance that they are erroneous.
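The “near certainty” point in this endnote can be illustrated with a one-line calculation: the chance of at least one false positive at the 5 percent level is 1 − 0.95^k for k independent tests (independence is a simplifying assumption). The counts 13 and 37 below are taken from the report discussion above (13 primary effects; 37 targeted effects for the full sample).

```python
# Chance of at least one false positive at the 5 percent level, as a
# function of the number of independent tests (a simplifying assumption).
# k = 13 and k = 37 match the effect counts discussed in the report above.
for k in (1, 13, 37):
    p_at_least_one = 1 - 0.95 ** k
    print(k, round(p_at_least_one, 2))
# -> 1 0.05
# -> 13 0.49
# -> 37 0.85
```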