How to Implement an Evaluation

Once an evaluation design has been finalized, the evaluator must remain involved to monitor data collection as well as the implementation of the intervention being evaluated. If respondents drop out during the data collection phase, the results are susceptible to attrition bias, which compromises their validity. Attrition is covered in this section. Other threats in the data collection phase, such as poor measurement instruments and reporting bias, are equally important but are not covered here. For best practices on data collection, see:

  • Deaton, A. (1997): The Analysis of Household Surveys. World Bank, International Bank for Reconstruction and Development

In the implementation of the intervention, the integrity of the randomization should remain intact. Unless intentionally incorporated into the study’s design, spillovers and crossovers should be minimized, or at the very least, thoroughly documented. (See Threats to the design for background.)

1.   Threats to Data Collection

a)    Attrition

Attrition occurs when evaluators fail to collect data on individuals who were selected as part of the original sample. Recall that, through random assignment, the treatment and control groups are constructed to be statistically identical at the outset; the control group is meant to resemble the counterfactual, that is, what would have happened to the treatment group had the treatment not been offered. (See: Why Randomize?) If the individuals who drop out of the study are “identical” in both the treatment and control groups, meaning the depleted control group still represents a valid counterfactual to the depleted treatment group, then attrition reduces our sample size and may truncate the target population to which our results can be generalized, but it does not compromise the “truth” of the results (at least as applied to the restricted population).

For example, suppose our study area is rural, and that many household members spend significant portions of the year working in urban areas. Further suppose we created our sample and collected baseline data when migrant household members were home during the harvests and incidentally available for our study. If we collect our endline data during off-peak season, the migrant family members will have returned to their city jobs and will be unavailable for our survey. Assuming these are the same people in both the treatment and control groups, our study will now be restricted to only non-migrants. If the non-migrant population in the control group represents a good counterfactual to the non-migrant population in the treatment group, our impact estimates will be perfectly valid—but only applicable to the non-migrant population.

However, if attrition takes a different shape in the two groups, so that the remaining control group no longer serves as a good counterfactual, our results can be biased. Using our example of waterborne illness, suppose that in the control group more children and mothers are ill. As a result, the young men who typically migrate to the cities during the off-peak season stay home to help their families, so households assigned to the control group contain more migrants at endline. The demographics of the treatment and control groups are now different, whereas originally they were balanced. It is entirely plausible that these migrants, being of peak working age, are healthier than average. Now, even though our treatment succeeded in producing healthier children and mothers on average, our control group contains more healthy migrant workers, on average. When we measure the incidence of diarrhea, the outcomes of the healthy migrants in the control group can offset those of their sicker family members. Comparing the treatment and control groups, we could then see no impact at all and wrongly conclude that the treatment was ineffective.
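The mechanics of this example can be sketched in a short simulation. All numbers below are illustrative assumptions (illness rates, migrant share, attrition behavior), not figures from any actual study; the point is only that differential attrition shifts the estimated impact even though the true effect is unchanged.

```python
import random

N = 100_000  # hypothetical households per simulation

def simulate(differential_attrition):
    """Estimate the program's impact on illness under two attrition
    scenarios. Illustrative assumptions: treatment cuts illness risk
    from 40% to 25%, 30% of households contain a migrant, and
    migrants are healthier than average."""
    treat, control = [], []
    for i in range(N):
        treated = i % 2 == 0
        migrant = random.random() < 0.30
        p_ill = 0.25 if treated else 0.40
        if migrant:
            p_ill *= 0.5                      # migrants are healthier
        ill = random.random() < p_ill
        # Attrition: migrants are away at endline and go unsurveyed...
        surveyed = not migrant
        # ...unless illness at home keeps some control-group migrants back.
        if differential_attrition and not treated and migrant:
            surveyed = random.random() < 0.5
        if surveyed:
            (treat if treated else control).append(ill)
    return sum(control) / len(control) - sum(treat) / len(treat)

random.seed(0)
print(f"balanced attrition:     estimated reduction = {simulate(False):.3f}")
random.seed(0)
print(f"differential attrition: estimated reduction = {simulate(True):.3f}")
```

With balanced attrition the estimate recovers the true 15-point illness reduction among non-migrants; with differential attrition, the healthier migrants retained in the control group pull the control group's illness rate down and attenuate the estimated impact.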

In this simplified example, we could forcibly restore balance between the treatment and control groups by removing all migrants from our sample. Frequently, however, the characteristics that would reliably identify both actual and would-be attritors (those who disappear) have not been measured, or are impossible to observe. Predicting attrition can be as difficult as predicting participation in non-randomized studies, and attrition bias can be as damaging to causal inference as selection bias.
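Before turning to corrections, a common first step is simply to diagnose the problem: compare attrition rates across the two arms, and re-check baseline balance among the households that remain. A minimal sketch, assuming a hypothetical record layout of (treated, responded, baseline covariate) tuples:

```python
import math

def attrition_checks(records):
    """Two quick attrition diagnostics.
    `records` is a list of (treated, responded, baseline_covariate)
    tuples, one per unit in the original sample (a hypothetical
    layout, not a fixed standard). Returns the treatment and control
    attrition rates, a two-proportion z-statistic for equal rates,
    and the baseline-covariate difference among those who remain."""
    t = [r for r in records if r[0]]
    c = [r for r in records if not r[0]]
    rate_t = sum(1 for r in t if not r[1]) / len(t)
    rate_c = sum(1 for r in c if not r[1]) / len(c)
    # Two-proportion z-test: is attrition differential across arms?
    pooled = (rate_t * len(t) + rate_c * len(c)) / (len(t) + len(c))
    se = math.sqrt(pooled * (1 - pooled) * (1 / len(t) + 1 / len(c)))
    z = (rate_t - rate_c) / se if se > 0 else 0.0
    # Baseline balance restricted to units that remain in the sample.
    stay_t = [r[2] for r in t if r[1]]
    stay_c = [r[2] for r in c if r[1]]
    diff = sum(stay_t) / len(stay_t) - sum(stay_c) / len(stay_c)
    return rate_t, rate_c, z, diff

# Example: 10% attrition in treatment vs 30% in control gives a large |z|.
records = ([(True, i >= 10, 1.0) for i in range(100)]
           + [(False, i >= 30, 1.0) for i in range(100)])
print(attrition_checks(records))
```

A large z-statistic or a sizable baseline difference among the remaining units is a warning that the depleted control group may no longer be a valid counterfactual.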

2.   Spillovers and Crossovers

Spillovers occur when individuals in the control group are somehow affected by the treatment. For example, if certain children are in the control group of a chlorine-dispensing study but play with children who are in the treatment group, they now have friends who are less likely to be sick, and are therefore less likely to become sick themselves. In this case, they are indirectly impacted by the program even though they were assigned to the control group. Individuals who “cross over” are control-group members who find a way to be directly treated. For example, if the mother of a control-group child sends her child to drink from the water supply of a treatment-group household, she is finding her way into the treatment group. Partial compliance is a broader term that encompasses crossovers as well as treatment-group individuals who deliberately choose not to participate (or not to chlorinate their water, in this example).
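A short simulation can illustrate why spillovers matter for measurement. The numbers are illustrative assumptions, not results from any study: suppose the treatment cuts illness risk from 40% to 25%, and control-group children with treated playmates receive half of that protection.

```python
import random

N = 50_000  # hypothetical children per arm

def measured_impact(spillover_share):
    """Measured illness reduction when a share of control-group
    children gain partial protection through treated playmates.
    Illustrative rates: treatment cuts illness risk from 40% to 25%;
    spillovers confer half that protection (risk 32.5%)."""
    ill_t = sum(random.random() < 0.25 for _ in range(N))
    ill_c = 0
    for _ in range(N):
        protected = random.random() < spillover_share
        ill_c += random.random() < (0.325 if protected else 0.40)
    return (ill_c - ill_t) / N

random.seed(0)
print(f"no spillovers:  measured reduction = {measured_impact(0.0):.3f}")
random.seed(0)
print(f"30% spillovers: measured reduction = {measured_impact(0.3):.3f}")
```

Because part of the control group is partially protected, the treatment-control comparison understates the true effect of the program.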

When a study suffers from spillovers and crossovers, in many circumstances it is still possible to produce valid results using statistical techniques. But these techniques come with certain assumptions, many of which we were trying to avoid by turning to randomization in the first place. For example, if spillovers can be predicted using observed variables, they can be controlled for. With partial compliance, if we assume that those who did not comply were unaffected by the intervention and, by the same token, that the individuals who crossed over were affected in the same way as those in the treatment group who were treated, we can infer the impact of our program. But as discussed in the Why Randomize section, the more assumptions we make, the less firm the ground we stand on when claiming the intervention caused any measured outcomes.
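The compliance adjustment described here, assuming noncompliers are unaffected and crossovers are affected in the same way as the treated, amounts to the standard Wald (instrumental variables) estimator: divide the intention-to-treat difference in outcomes by the difference in actual take-up rates. A minimal sketch with hypothetical numbers:

```python
def wald_estimate(y_treat, y_control, takeup_treat, takeup_control):
    """Adjust the intention-to-treat (ITT) effect for partial
    compliance. Under the stated assumptions, the effect of actually
    receiving treatment is the ITT effect scaled by the net change
    in take-up that random assignment produced."""
    itt = y_treat - y_control                    # intention-to-treat effect
    first_stage = takeup_treat - takeup_control  # net change in take-up
    return itt / first_stage

# Hypothetical chlorination study: assignment raises actual take-up
# from 10% (crossovers in the control group) to 80%, and illness
# falls by 7 percentage points.
late = wald_estimate(y_treat=0.25, y_control=0.32,
                     takeup_treat=0.80, takeup_control=0.10)
print(f"effect of actually taking up treatment: {late:.3f}")  # -0.100
```

The 7-point ITT effect is diluted by the 70-point take-up gap, so the implied effect on those who actually changed behavior because of assignment is a 10-point reduction. The estimate is only as good as the assumptions behind it.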