Overstated findings, published in Science, on long-term health effects of a well-known early childhood program

Highlights:

We review an article, published in a leading scientific journal (Science), that reported on long-term health effects of the Abecedarian early childhood program as measured in a randomized trial.
The article reported that the program, which provided low-income children with educational childcare and preschool, substantially boosted health at age 35 for male sample members.
The main problem: About 40 percent of the males in the sample were lost to follow-up, and the loss was much more severe in the control group than the treatment group. As a result, the final control group on which the results are based included just 9 to 12 men, depending on the outcome. Such sample loss undermines the original randomized study design and places the findings at high risk of error.
Remarkably, the size of the final sample and the amount of sample loss were not reported in the study abstract or main paper as is customary—only in the 100-page online supplement.
The response from study authors James J. Heckman and Rodrigo Pinto, and our brief rejoinder, follow the main report.

By guest author Perry Wilson, MD, MSCE

I was once told that the most important question in epidemiology is, “What is the question?” The meaning behind this maxim is that you really need to know what you are asking of the data before you ask it. Then you determine whether your data can answer that question.

In this paper,[i] published in Science in 2014, researchers had a great question: Would an intensive, early-childhood intervention focusing on providing education, medical care, and nutrition lead to better health outcomes later in life?

The data they used to answer this question might appear promising at first, but looking under the surface, one can see that the dataset can’t handle what is being asked of it. This is not a recipe for a successful study, and the researchers’ best course of action might have been to move on to a new dataset or a new question.

What the authors of this Science paper did instead was to torture the poor data until it gave them an answer.

Let’s take a closer look.

The researchers used data from a randomized controlled trial called the Abecedarian study. The study randomized 122 children to control (usual care) or a three-pronged early childhood intervention (comprising an educational, a health care, and a nutritional component).

In their mid-30s, some of these children visited a physician and had a set of measurements performed (blood pressure, weight, cholesterol, etc.).

The question: Was randomization to the intensive childhood intervention associated with better health in terms of these measurements? Per the researchers: “We find that disadvantaged children randomly assigned to treatment have significantly lower prevalence of risk factors for cardiovascular and metabolic diseases in their mid-30’s.”

Per the data: “Aaaaaahhhh! Please stop!!!”

Here are two very red flags:

Red Flag 1: The study does not report the sample size.

I couldn’t believe this when I first read the paper. The introduction states that 57 children were assigned to the intervention and 54 to control, and then notes that there was substantial attrition between enrollment and age 35 (as you might expect). Yet all the statistical tests were done on the age-35 data. I had to go deep into the supplemental files to find out that, for example, they had lab data on only 12 of the 23 males in the control group and 20 of the 29 males in the treatment group. That’s a very large loss to follow-up. It’s also a differential loss to follow-up, meaning more people were lost in one group (the controls in this case) than in the other (treatment). If the loss occurred for different reasons in the two groups (it likely did), you lose the benefit of randomizing in the first place.
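The attrition implied by these counts can be checked with a few lines of arithmetic. The sketch below uses only the lab-data counts for males quoted above; the function and variable names are illustrative and do not come from the paper.

```python
# Male participants with lab data at the mid-30s visit, as quoted above.
randomized = {"treatment": 29, "control": 23}
followed_up = {"treatment": 20, "control": 12}


def attrition_rate(n_randomized: int, n_followed: int) -> float:
    """Share of randomized participants with no outcome data."""
    return 1 - n_followed / n_randomized


# Attrition within each arm, overall, and the gap between the arms.
rates = {arm: attrition_rate(randomized[arm], followed_up[arm]) for arm in randomized}
overall = attrition_rate(sum(randomized.values()), sum(followed_up.values()))
differential = rates["control"] - rates["treatment"]

for arm, rate in rates.items():
    print(f"{arm}: {rate:.0%} lost to follow-up")
print(f"overall: {overall:.0%} lost to follow-up")
print(f"differential: {differential * 100:.0f} percentage points")
```

On these counts, roughly a third of the treatment males and nearly half of the control males have no lab data.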

The authors state that they accounted for this using inverse probability weighting. The idea here is that you create a model that predicts the chance that a participant will follow up in his mid-30s. Men with fewer siblings, for example, were more likely to follow up. Then you look at all the people who did follow up, and weight their data according to how unlikely it was that they would have followed up. The “surprise” patients get extra weight, because they now need to represent all those people who didn’t show up. This might sound good in theory, but it depends entirely on how well your model predicts who will follow up. And, as you might expect, predicting who will show up for a visit 30 years after the fact is a tall order. Without a good model, the inverse probability weighting doesn’t help at all.
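To make the mechanics concrete, here is a minimal sketch of inverse probability weighting on made-up numbers. Everything in it is hypothetical: the blood pressure values, the follow-up probabilities, and the helper name `ipw_mean` are assumptions for illustration, not the authors’ model or data.

```python
def ipw_mean(outcomes, follow_up_probs):
    """Weighted mean in which each respondent counts 1 / P(follow-up)."""
    weights = [1.0 / p for p in follow_up_probs]
    return sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)


# Hypothetical cohort of 20 men: 10 with few siblings (blood pressure 120)
# and 10 with many siblings (blood pressure 140), so the full-cohort mean is 130.
# Assumed follow-up: 8 of the first group (p = 0.8), 4 of the second (p = 0.4).
respondents = [(120.0, 0.8)] * 8 + [(140.0, 0.4)] * 4
outcomes = [y for y, _ in respondents]
true_probs = [p for _, p in respondents]

naive = sum(outcomes) / len(outcomes)

# With the correct follow-up model, the weights restore the full-cohort mean.
well_modeled = ipw_mean(outcomes, true_probs)

# With an uninformative model (same probability for everyone), nothing changes.
poorly_modeled = ipw_mean(outcomes, [0.6] * len(outcomes))

print(f"naive mean of respondents: {naive:.1f}")
print(f"IPW with a good model:     {well_modeled:.1f}")
print(f"IPW with a poor model:     {poorly_modeled:.1f}")
```

In this toy case the weighting recovers the full-cohort mean only because the follow-up probabilities are known exactly. With an uninformative model, the weighted estimate is no better than the naive one.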

In the end, the people who showed up to this visit self-selected. The results may have been entirely different if the 40 percent or so of individuals who were lost to follow-up had been included.

Red Flag 2: Multiple comparisons accounted for! (Not Really)

Referring to challenges with this type of analysis, the authors write in their introduction:

“Numerous treatment effects are analyzed. This creates an opportunity for ‘cherry picking’—finding spurious treatment effects merely by chance if conventional one-hypothesis-at-a-time approaches to testing are used. We account for the multiplicity of the hypotheses being tested using recently developed stepdown procedures.”

Translation: We are testing a lot of things. False positives are an issue. We’ll fix the issue using this new statistical technique.

The stepdown procedure they refer to does indeed account for multiple comparisons. But only if you use it on, well, all of your comparisons. The authors did not do this and instead divided their many comparisons into “blocks,” most of which have only two or three variables. Vitamin D deficiency, for example, stands all alone in its block—its p-value gets adjusted from 0.021 to 0.021. In other words, no adjustment at all is made for the fact that it is one of many things being tested. Correcting your findings for multiple comparisons only works if you account for all the comparisons.
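The effect of the block structure can be illustrated with Holm’s stepdown adjustment, used here as a simple stand-in and not necessarily the procedure the authors applied. Only the vitamin D p-value of 0.021 comes from the paper; the other 19 p-values are invented for illustration, and `holm_adjust` is a hypothetical helper.

```python
def holm_adjust(p_values):
    """Holm stepdown adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Adjusted p-values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


vitamin_d = 0.021

# Alone in its own block: the adjustment multiplies by 1, so nothing changes.
alone = holm_adjust([vitamin_d])[0]

# Pooled with 19 other outcomes (invented p-values, all larger than 0.021).
others = [0.05 + 0.04 * k for k in range(19)]
pooled = holm_adjust([vitamin_d] + others)[0]

print(f"adjusted within a one-variable block: {alone:.3f}")
print(f"adjusted across 20 outcomes:          {pooled:.3f}")
```

Under these assumptions the same raw p-value goes from 0.021 to about 0.42 once it is adjusted alongside the other outcomes.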

Where are we left after all this? Right where we started. With a really interesting question and no firm answer. Do early childhood interventions lead to better health later in life? Maybe. I can’t tell from this study. And that’s sad because if the hypothesis is true, it’s really important.

Response provided by study authors James J. Heckman and Rodrigo Pinto

Dr. Perry Wilson’s “Straight Talk” dismisses our study—the first to study the benefits of an early childhood program on adult health—as a statistical artifact, where we “torture the poor data” to get findings we liked. His accusation that we tortured data is false. Our paper, especially our detailed 100-page appendix, documents our extensive sensitivity and robustness analyses and contradicts his claims. (For the benefit of dispassionate readers, we give links to our paper and appendix.) We would also direct interested readers to the over 100 peer-reviewed publications on the impact of the Carolina Abecedarian Project on participants across the lifecycle, many of which are available from the project’s website. We respond to his two “Red Flags” in detail below.

Self-selection

Our paper accurately reports our effective sample size and addresses the issue of self-selection and attrition. We report two sources of data loss: (a) attrition in our longitudinal sample (from age 8 weeks to age 34), and (b) nonparticipation in the health survey. We have baseline data for all participants (at age 8 weeks). We carefully document the differences, if any, between the baseline characteristics of sampled and non-sampled (at age 34) respondents. Wilson correctly notes that we lose more controls than treatments. He claims that this, by itself, automatically creates selection bias. That is simply false. What creates selection bias is differential values of characteristics that affect outcomes. Using the experimental longitudinal data, we document these values and report any systematic differences, adjusting for these differences using the standard method of inverse probability weighting. Wilson further claims that our model for estimated probabilities of attrition and nonresponse is somehow flawed, ignoring the extensive discussion of model selection and sensitivity analysis in the appendix. He suggests there may be important variables we did not use in constructing our probability models. This point is trivial. We may have missed some important variable and omitting it may bias our estimates. This is true of any observational study and is even true of many experimental studies with good substitutes for controls that are adjusted using observational methods. He offers no discussion of what, if anything, we omitted or the resulting bias, if any. In regard to our small sample size, we used both exact and asymptotic inference and found general agreement with tests based on each approach.

Multiple Comparisons

Wilson faults us for not reporting step-down adjustments over all categories of variables, ignoring his own maxim. Uncritical application of multiple hypothesis testing across all categories tests an uninterpretable hypothesis. Our goal was to examine treatment effects by interpretable categories (hypertension, weight, metabolic syndrome, etc.), not to make some overall assessment of whether the program “worked” for any category. That is a meaningless question. Elsewhere, we aggregate the estimates of ABC across all categories into a single economically interpretable parameter: the economic rate of return (Garcia et al., 2016). We report a high, statistically significant rate of return of 13.7 percent per annum.

Rejoinder by Perry Wilson

I appreciate the attention to detail the authors put into the analysis of the data; however, my concerns remain. To recap: The study reports significant effects on health outcomes for males (not for females), yet the findings for males suffer from high sample attrition at age 35—about 40 percent—resulting in a final sample of between 28 and 32 men, depending on the outcome measure. Moreover, the attrition rates differ markedly between the treatment and control groups, as follows:

Outcome measure	Treatment Group Attrition (Males)	Control Group Attrition (Males)
Lab tests	31%	48%
Physical exam measures	34%	61%

Remarkably, this basic information on sample size and attrition, which bears directly on the validity of the key findings, was not reported in the study abstract or main paper as is customary—only in the 100-page online supplement.

According to the Institute of Education Sciences’ What Works Clearinghouse, which (based on simulation studies in education randomized trials) has published standards for acceptable attrition, the above rates of overall and differential attrition place the study clearly in the zone of “unacceptable threat of bias under both optimistic and cautious assumptions” (see page 9 of the standards).

This classification reflects the fact that such high and differential attrition could easily cause differences between the treatment and control groups in their measured characteristics (such as percent born prematurely) and unmeasured characteristics (such as level of family support or psychological resilience). Although the authors adjust statistically for differences in a number of measured characteristics, there are many unmeasured characteristics they cannot adjust for.

For this reason, I agree with the authors’ statement in their response: “We may have missed some important variable and omitting it may bias our estimates [emphasis theirs]. This is true of any observational study and is even true of many experimental studies…” This highlights that the study, as presented, is observational. There has been enough attrition that the benefits of randomization in terms of ensuring that the treatment and control groups are balanced in unmeasured baseline characteristics are lost. As such, this should be interpreted as an observational study and strong causal language—such as the title “Early Childhood Investments Substantially Boost Adult Health”—is inappropriate.

Regarding multiple comparisons: The established method of correcting for a study’s measurement of numerous outcomes—described for example in the latest FDA draft guidance—is to conduct a statistical adjustment across all of the study’s primary outcomes. The authors instead took the unorthodox step of dividing the 20 main health outcomes measured for males into 10 categories, and applying an adjustment within each category. As a result, no adjustment was made for outcomes such as vitamin D deficiency that stood alone in a category. This can’t be a valid approach because the study’s measurement of numerous outcomes[ii] clearly raises the risk that the vitamin D and other findings are false positives.
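A rough calculation shows why the number of outcomes matters. The sketch below assumes independent tests, each run at a 0.05 significance level with no true effects. Health outcomes are correlated, so it is an illustration and not an estimate for this study, and the function name is hypothetical.

```python
def chance_of_any_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across independent tests with no true effects."""
    return 1 - (1 - alpha) ** n_tests


# 1 test, the 20 main male outcomes, and the 40 age-35 outcomes noted in the references.
for n in (1, 20, 40):
    print(f"{n:>2} outcomes: {chance_of_any_false_positive(n):.0%}")
```

Under these simplifying assumptions, 20 unadjusted tests would produce at least one spurious “significant” finding more often than not.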

In short, while the hypothesis that this early intervention program benefits adult health may be true, the data do not provide convincing evidence one way or the other.

References

[i] Frances Campbell, Gabriella Conti, James J. Heckman, Seong Hyeok Moon, Rodrigo Pinto, Elizabeth Pungello, and Yi Pan, “Early Childhood Investments Substantially Boost Adult Health,” Science, vol. 343, March 28, 2014, pp. 1478-1485.

[ii] The study actually reported a total of 40 health outcomes at age 35 for males and females, and 30 additional health-related outcomes at earlier ages.
