At the end of an intervention (or at least of the evaluation period for the intervention), endline data must be collected to measure final outcomes. Assuming the integrity of the random assignment was maintained and data collection was well administered, it is time to analyze the data. The simplest method is to measure the average outcome of the treatment group and compare it to the average outcome of the control group; the difference represents the program’s impact. To determine whether this impact is statistically significant, one can test the equality of the two means using a simple t-test. One of the many benefits of randomized evaluations is that the impact can be measured without advanced statistical techniques. More complicated analyses can be performed; for example, regressions controlling for other characteristics can be run to add precision. However, as the complexity of the analysis mounts, so does the number of potential missteps. Therefore, the evaluator must be knowledgeable and careful when performing such analyses.
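As a minimal sketch of this simplest analysis, the snippet below compares group means and computes a t-statistic on simulated data. The outcome variable (days ill in the past month), the sample sizes, and the assumed treatment effect of -1.0 are illustrative assumptions, not figures from the text:

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical endline outcomes (days ill last month) for a randomized
# evaluation; the true treatment effect of -1.0 is assumed for illustration.
control = [random.gauss(5.0, 2.0) for _ in range(500)]
treatment = [random.gauss(4.0, 2.0) for _ in range(500)]

# Program impact: simple difference in group means.
impact = statistics.fmean(treatment) - statistics.fmean(control)

# Welch's t-statistic for testing equality of the two means.
se = math.sqrt(statistics.variance(treatment) / len(treatment)
               + statistics.variance(control) / len(control))
t_stat = impact / se

print(f"estimated impact: {impact:.2f}, t = {t_stat:.2f}")
```

In practice one would use a statistics package rather than computing the test by hand, but the calculation itself is exactly this simple: two means, their difference, and a standard error.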
It is worth noting that when a result is obtained, we have not uncovered the truth with 100 percent certainty. We have produced an estimate that is close to the truth with a certain degree of probability. The larger our sample size, the smaller our standard errors and the more certain we can be. But we can never be 100 percent certain.
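The link between sample size and standard errors can be seen in a short simulation: drawing many samples of each size and measuring how much the sample mean varies. The outcome distribution and sample sizes below are arbitrary assumptions chosen for illustration:

```python
import random
import statistics

random.seed(1)

def se_of_mean(n, reps=2000):
    """Empirical standard error of the sample mean for sample size n,
    estimated by drawing `reps` samples and taking the spread of their means."""
    means = [statistics.fmean(random.gauss(0.0, 1.0) for _ in range(n))
             for _ in range(reps)]
    return statistics.stdev(means)

# Quadrupling the sample size roughly halves the standard error.
for n in (25, 100, 400):
    print(n, round(se_of_mean(n), 3))
```

The printed standard errors shrink roughly in proportion to the square root of the sample size, which is why larger samples buy certainty, but never certainty of 100 percent.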
This fact leads to two very common pitfalls in analysis:
1) Multiple Outcomes: Randomization does not ensure the estimated impact is perfectly accurate. The measured impact is unbiased, but it is still an estimate, and random chance leaves some margin of error around the truth. Quite frequently the estimate will be extremely close to the truth. Occasionally, the estimate will deviate slightly more. Rarely, it will deviate significantly. If we look at one outcome measure, there is some chance that it has deviated significantly from the truth, but this is highly unlikely. If we look at many outcome measures, most will be close, but some will deviate. The more indicators we look at, the more likely it is that one or more will deviate significantly. For example, suppose the chlorine pills we distributed to fight waterborne illness in our water purification program were faulty or never used, so the true impact is zero. If twenty outcome measures are compared, it is in fact quite likely that one comparison will suggest a significant improvement in health due to our program and another a significant decline. So if we look at enough outcome measures, eventually we will stumble upon one that is significantly different between the treatment and control groups. This is not a problem, per se. The problem arises when the evaluator “data mines,” looking at outcomes until she finds a significant impact, reports this one result, and fails to report the other insignificant results that were discovered in the search.
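A simulation makes the multiple-outcomes problem concrete. Below, a program with zero true impact is tested against twenty independent outcome measures; by chance alone, each test has roughly a 5 percent probability of appearing significant at the conventional threshold. The sample sizes and outcome distributions are assumptions for illustration:

```python
import math
import random
import statistics

random.seed(2)

def welch_t(a, b):
    """Welch's t-statistic for the difference in means of two samples."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se

# A program with zero true impact (think faulty or unused chlorine pills),
# evaluated against 20 independent outcome measures.
n = 200
significant = 0
for outcome in range(20):
    treat = [random.gauss(0.0, 1.0) for _ in range(n)]
    ctrl = [random.gauss(0.0, 1.0) for _ in range(n)]
    if abs(welch_t(treat, ctrl)) > 1.96:  # ~5% false-positive rate per test
        significant += 1

print(f"{significant} of 20 outcomes look 'significant' by chance alone")
```

With twenty tests at the 5 percent level, the expected number of spuriously significant outcomes is about one, which is exactly the false positive a data-mining evaluator would report.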
2) Sub-group analysis: Similarly, just as an evaluator can data mine by looking at many different outcome measures, she can also dig out a significant result by looking at different sub-groups in isolation. For example, it might be that the chlorine has no apparent impact on household health as a whole. It may be reasonable to look at whether it has an impact on children within the household, or on girls in particular. But we may be tempted to compare boys and girls of different age groups, in households of different compositions, in different combinations. We may discover significantly better health in the treatment group for the subgroup of boys between the ages of 6 and 8 who happen to have one sister and one grandparent living in the household, and where the household owns a TV and livestock. We could even concoct a plausible story for why this subgroup would be affected and other subgroups not. But if we stumbled upon this one positive impact after finding a series of insignificant impacts for other subgroups, it is likely that the difference is due simply to random chance, not our program.
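Subgroup fishing can be sketched the same way. Below, households in a program with zero true impact are sliced by a few hypothetical covariates (sex, age band, TV ownership, all invented labels), and every subgroup is tested separately; such a search often turns up at least one spuriously significant cell:

```python
import math
import random
import statistics

random.seed(3)

def welch_t(a, b):
    """Welch's t-statistic for the difference in means of two samples."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se

# Simulated households: the outcome is unrelated to treatment by construction.
households = [{
    "treated": random.random() < 0.5,
    "sex": random.choice(["boy", "girl"]),
    "age": random.choice(["0-5", "6-8", "9-12"]),
    "tv": random.random() < 0.5,
    "outcome": random.gauss(0.0, 1.0),
} for _ in range(4000)]

# Fish through every sex x age x tv cell for a "significant" difference.
found = []
for sex in ("boy", "girl"):
    for age in ("0-5", "6-8", "9-12"):
        for tv in (True, False):
            grp = [h for h in households
                   if h["sex"] == sex and h["age"] == age and h["tv"] == tv]
            t = [h["outcome"] for h in grp if h["treated"]]
            c = [h["outcome"] for h in grp if not h["treated"]]
            if abs(welch_t(t, c)) > 1.96:
                found.append((sex, age, tv))

print("spuriously significant subgroups:", found)
```

With twelve cells tested at the 5 percent level, the chance of at least one false positive is close to one in two, even though the program does nothing; pre-specifying subgroups before looking at the data is the standard guard against this.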