How to Design an Evaluation
An evaluation design requires a considerable amount of thought. First come the conceptual pieces: what do we plan to learn from this evaluation? What are the relevant questions? What outcomes are expected? How can they be measured?
(See Planning an Evaluation.) Next come the design questions:
- What is the appropriate level or unit of randomization?
- What is the appropriate method of randomization?
- Beyond the political, administrative and ethical constraints, what technical issues could compromise the integrity of our study, and how can we mitigate these threats in the design?
- How would we implement the randomization?
- What is the necessary sample size to answer our questions? (How many people do we need to include in the study, both as participants and as survey respondents?)
1. Unit of Randomization
In designing our evaluation we must decide at what level we will randomize: what unit will be subject to random assignment? Will it be individuals or groupings of individuals, such as households, villages, districts, schools, clinics, church groups, firms, and credit associations? (When we randomize groups of individuals—even though we care about and measure individual outcomes—this is referred to as a cluster randomized trial.) For example, if we managed to secure enough chlorine pills for one thousand households to treat contaminated water (out of, say, ten thousand households who use the same contaminated source of drinking water), do we expect to randomly assign households to the treatment and control groups? This means that some households will be given chlorine pills, but some of their immediate neighbors will be denied chlorine pills. Is that feasible? Ethical?
For this type of program, it probably wouldn’t be feasible to randomize at an even smaller unit than the household, for example the individual level. It would imply that some children within a household are given chlorine pills and some of their siblings are not. If all household members drink from the same treated tank of water, individual randomization would be physically impossible, regardless of the ethical considerations.
Perhaps the appropriate unit of randomization is the community, where some communities will receive chlorine, other communities will not, but within a “treatment” community all households (implying all neighbors) are eligible to receive the pills. There are many things to consider when determining the appropriate level of randomization, of which ethics and feasibility are only two. Seven considerations are listed below.
- What unit does the program target for treatment?
- What is the unit of analysis?
- Is the evaluation design fair?
- Is a randomized evaluation politically feasible?
- Is a randomized evaluation logistically feasible?
- What spillovers and other effects will need to be taken into account?
- What sample size and power do we require to detect effects of the program?
1. What unit does the program target for treatment: If chlorine tablets are meant to be dissolved in water storage tanks that in our region all households typically already own, then some households could be selected to receive chlorine, and others not. In this case, the unit of randomization would be at the household level. However, if the storage tank is typically located outside and used by a cluster of households, then it would be impossible to randomly assign some households in that cluster to the control group—they all drink the same (treated) water as the treatment households. Then, the most natural unit of randomization may be the “clusters of households” that use a common water tank.
2. What is the unit of analysis: If the evaluation is concerned with community level effects then the most natural level of randomization is probably the community. For example, imagine our outcome measure is incidence of “hospitalization” due to diarrhea, and it is most economical to measure this using administrative records at community clinics, and furthermore, those records remain anonymous. We would not be able to distinguish whether people who were hospitalized were from treatment households or control households. However, if the entire community is in the treatment group, we could compare the records from clinics in treatment communities against those of control communities.
3. Fairness: The program should be perceived as fair. If I’ve been denied chlorine pills, but my immediate neighbors receive them, I might be angry with my neighbors, angry with the NGO, and I might be less willing to fill out some questionnaire on chlorine usage when surveyors knock at my door. And the NGO might not be enthusiastic about upsetting its community members. On the other hand, if my entire community didn’t get it, but a neighboring community did, I might never hear of their program, and so have nothing to complain about; or I could think that this was just a village-level choice and my village chose not to invest. Of course, people may be equally upset about a community-level design. We could try to expand the unit of randomization, or think of other strategies to mitigate people’s dissatisfaction. The fact that not everyone is helped may itself be unfair. (See ethical issues.) But given that we cannot help everyone (usually due to capacity constraints), and given our desire to improve and evaluate, how can we allocate in a way that simultaneously allows us to create an equivalent control group and is seen as fair by the people we’re trying to help?
4. Political Feasibility: It may not be feasible politically to randomize at the household level. For example, a community may demand that all needy people receive assistance, making it impossible to randomize at the individual or household level. In some cases, a leader may require that all members of her community receive assistance. Or she may be more comfortable having a randomly selected half be treated (with certainty) than risk having no one treated (were her village assigned to the control group). In one case she may comply with the study and in another, she may not.
5. Logistical Feasibility: Sometimes it is logistically impossible to ensure that some households remain “control households”. For example, if chlorine distribution requires hiring a merchant within each village and setting up a stall where village members pick up their pills, it may be inefficient to ask the distribution agent to screen out control households. It could add bureaucracy, waste time, and distort what a real program should actually look like. Or even if the merchant could easily screen, households may simply share the pills with their “control group neighbors”. Then the control group would be impacted by the program, and no longer serve as a good comparison group. (Remember, the comparison group is meant to represent life without the program. See: What is… an impact evaluation?) In this case, it would make sense to randomize at the village level, and then simply hire merchants in treatment villages and not in control villages.
6. Containing spillovers and other effects: Even if it is feasible to randomize at the household level—to give some households chlorine tablets and not others—it may not be feasible to contain the impact within just the treatment households. If control group individuals are affected by the presence of the program—they benefit from fewer sick neighbors (spillover effects), or drink the water from treatment neighbors (don’t comply with the random assignment, and cross over to the treatment group), they no longer represent a good comparison group. (See: What is… an impact evaluation?) (For more details on spillover and crossover effects, see: Threats to the design)
7. Sample size and power: The ability to detect real effects depends on the sample size. When more people are sampled from a larger population, statistically, they better represent the population (see Sample Selection and sample size). For example, if we survey two thousand households, and randomize at the household level (one thousand treatment, one thousand control), we effectively have a sample size of two thousand households. But if we randomized at the village level, and each village has one hundred households, then we would have only ten treatment villages and ten control villages. In this case, we may be measuring diarrhea at the household level, but because we randomized at the village level, it is possible we have an effective sample size closer to twenty, the number of villages (even though we are surveying two thousand households)! In truth, the effective sample size could be anywhere from twenty to two thousand, depending on how similar households within villages are to their fellow villagers. (See: sample size.) With an effective sample size of twenty, we may not be able to detect real effects. This may influence our choice as to the appropriate level of randomization.
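The range of effective sample sizes described in the last consideration can be illustrated with a short calculation. The sketch below uses the common design-effect approximation for cluster randomized trials, DEFF = 1 + (m - 1) × ICC, where m is the cluster size and ICC is the intracluster correlation. That formula is not given in the text; it is a standard textbook approximation assumed here, and the function name and ICC values are purely illustrative.

```python
def effective_sample_size(n_clusters: int, cluster_size: int, icc: float) -> float:
    """Approximate effective sample size under cluster randomization.

    Assumes equal cluster sizes and the design-effect approximation
    DEFF = 1 + (cluster_size - 1) * icc (an assumption, not from the text).
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must be between 0 and 1")
    total_units = n_clusters * cluster_size
    design_effect = 1 + (cluster_size - 1) * icc
    return total_units / design_effect


# Twenty villages of one hundred households each, as in the example above.
for icc in (0.0, 0.05, 1.0):  # illustrative ICC values, not estimates
    print(icc, round(effective_sample_size(20, 100, icc), 1))
```

When households within a village are unrelated (ICC of 0), the full two thousand households count. When they are identical (ICC of 1), the effective sample size collapses to the number of villages.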
There are many considerations when determining the appropriate level of randomization. Evaluators cannot simply sit at a computer, press a button, produce a list, and impose an evaluation design on an organization from thousands of miles away. Evaluators must have a deep and broad understanding of the implementing organization, its program, and the context, and must work in partnership with the organization to determine the appropriate level of randomization given the particular circumstances.
2. Different Methods of Randomization
If my organization can secure one thousand chlorine pills per day, so I can treat one thousand out of an eligible two thousand households per day, I could choose to treat the same one thousand households in perpetuity. Alternatively, I could rotate households daily, so that each household drinks clean water every other day. I may feel that the latter option makes no sense: if everyone is drinking dirty water half the days, I may expect zero impact on anyone. So I may choose one thousand households that will receive the pills daily. If randomizing, I may perform a simple “lottery” to determine which thousand households get the pills: I write each of the two thousand names onto a small piece of paper, put those pieces of paper into a basket, shake the basket up, close my eyes and pull one thousand pieces of paper out. Intuitively, this would be called a lottery design.
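The basket draw can be mimicked in a few lines of code. This is a minimal sketch, assuming household names are unique; the household identifiers, the seed, and the function name are hypothetical.

```python
import random


def lottery(households, n_treated, seed=None):
    """Draw n_treated households without replacement; the rest are controls."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    winners = set(rng.sample(households, n_treated))
    treatment = [h for h in households if h in winners]
    control = [h for h in households if h not in winners]
    return treatment, control


households = [f"household_{i:04d}" for i in range(2000)]  # hypothetical IDs
treatment, control = lottery(households, n_treated=1000, seed=42)
print(len(treatment), len(control))
```

Recording the seed alongside the resulting list is one way to let others verify the draw later.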
Alternatively, I could rotate households every year instead of every day, and randomly assign the order in which they get treated. Then, in one of the two years a household would be part of the treatment group, and in the other year it would be part of the control group. If I were to measure outcomes at the end of each year, this would be called a rotation design.
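A rotation over two years could be sketched as follows. The function name and the even split into two halves are assumptions made for illustration.

```python
import random


def rotation_schedule(households, seed=None):
    """Randomly split households into two halves that swap roles each year."""
    rng = random.Random(seed)
    shuffled = list(households)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    first, second = shuffled[:half], shuffled[half:]
    # Year 1: the first half is treated. Year 2: the two groups swap.
    return {
        1: {"treatment": first, "control": second},
        2: {"treatment": second, "control": first},
    }


schedule = rotation_schedule(["a", "b", "c", "d"], seed=3)  # toy household list
```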
Say I can secure five hundred pills per day this year, but next year I expect to secure one thousand per day, and the following year, two thousand per day. I could randomly choose five hundred households to get the pill in the first year, another five hundred to be added in the second year, and the remaining thousand to get it in the third year. This would be called a phase-in design.
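The phase-in described above amounts to putting households in a random order and admitting them in waves. A minimal sketch, with hypothetical household identifiers and function name:

```python
import random


def phase_in(households, new_per_year, seed=None):
    """Randomly order households, then admit them in successive waves.

    new_per_year lists how many households are newly treated each year.
    Returns a dict mapping each household to the year its treatment starts.
    """
    if sum(new_per_year) != len(households):
        raise ValueError("wave sizes must add up to the number of households")
    rng = random.Random(seed)
    order = list(households)
    rng.shuffle(order)
    start_year = {}
    position = 0
    for year, n_new in enumerate(new_per_year, start=1):
        for household in order[position:position + n_new]:
            start_year[household] = year
        position += n_new
    return start_year


households = [f"household_{i:04d}" for i in range(2000)]  # hypothetical IDs
start_year = phase_in(households, new_per_year=[500, 500, 1000], seed=7)
```

In any given year, households whose start year has not yet arrived serve as the comparison group.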
Several randomization designs are possible: the lottery design, phase-in design, rotation design, encouragement design, varying levels of treatment design, and two-stage randomization. These designs are not necessarily mutually exclusive. Their advantages and disadvantages are summarized in a separate table comparing strategies used to create randomized comparison groups.
3. Threats to the Design
A spillover effect occurs when a program intended to help targeted participants unintentionally impacts the comparison group as well (either positively or negatively). The comparison group is supposed to represent outcomes had the program not been implemented (see counterfactual). If this comparison group has been touched by the program, its role in mimicking the counterfactual is now compromised, and the ensuing impact measure may be biased. There are ways of mitigating spillover effects, for example, changing the level of randomization.
For example, one source of sickness may be drinking contaminated water. But another source is playing with neighboring children who are themselves sick. If I am in the control group, and the program treats my neighbors, and those neighbors are no longer sick, that reduces my chance of getting sick. And even though I may be in the control group, I have now been affected by the treatment of my neighbors. I would no longer represent a good comparison group. This is known as a spillover effect, in particular a positive spillover. To mitigate this, we could randomize at the community level. Doing so would mean that if our community were assigned to the control group, I and all of my neighbors would share the same status. I would be less likely to play with children from a different community and therefore less likely to be impacted by the intervention. Or if assigned to the treatment group, we would not positively impact others.
(Of course, we may actually want to know how these spillovers occur, and design accordingly. See methods of randomization.)
Another possibility is that my household has been assigned to the control group, but my neighbor is in the treatment group, and so my mother knows their water is clean, and she sends me to their house to drink. In a sense, I am finding my way into the treatment group, even though I was assigned to the control group. When people defy their treatment designation (knowingly or unknowingly), and as a result outcomes are altered, this is considered a crossover effect. As with spillovers, by crossing over I no longer represent a good comparison group, since I have clearly been affected by the existence of the program. As before, changing the level of randomization could mitigate crossover effects.
Once the unit and method of randomization have been determined, it is time to randomly assign individuals, households, communities, or any unit to either the treatment or control group.
a) Simple Lottery
Generally, to start with, we need a list of (individual, household head, or village) names. Next, there are several ways to proceed. We could write each name onto a small piece of paper, put those pieces of paper into a basket, shake the basket up, close our eyes and pull one thousand pieces of paper out. Those will make up the treatment group and the remainder could be the control group (or vice versa). We may do this as part of a public lottery. Similarly, we could go down the list one by one and flip a coin to determine treatment status. However, we don’t always divide the study population exactly in half. We may wish to include 30 percent in the treatment group and 70 percent in the control group. Or if we had a phase-in method with three periods, we may want to divide the population into three groups. Very commonly, we will also test multiple treatments at the same time, which likewise requires several groups. In these more sophisticated evaluation designs, a coin flip will not suffice.
Typically, we will write a computer program to randomly assign names to groups.
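Such a program can be quite short. The sketch below is one possible way to do it, not a prescribed procedure: it shuffles the list and cuts it into groups of the requested shares, which covers unequal splits such as 30/70 as well as three or more groups. The function name, the rounding rule, and the example names are assumptions.

```python
import random


def assign_groups(names, shares, seed=None):
    """Randomly assign names to groups with the given shares.

    shares maps a group label to its fraction of the list; fractions must
    sum to 1. Group sizes are rounded to the nearest integer and the last
    group takes whatever remains (one simple rounding rule among several).
    """
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    rng = random.Random(seed)  # record the seed so the draw can be replicated
    order = list(names)
    rng.shuffle(order)
    labels = list(shares)
    assignment = {}
    start = 0
    for i, label in enumerate(labels):
        is_last = i == len(labels) - 1
        end = len(order) if is_last else start + round(shares[label] * len(order))
        for name in order[start:end]:
            assignment[name] = label
        start = end
    return assignment


names = [f"person_{i}" for i in range(1000)]  # hypothetical list of names
groups = assign_groups(names, {"treatment": 0.3, "control": 0.7}, seed=2024)
```

The same function handles a three-period phase-in or multiple treatment arms by passing more labels, for example `{"arm_a": 0.25, "arm_b": 0.25, "control": 0.5}`.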
Sometimes we do not have a list beforehand. For example, if individuals enter a clinic with symptoms of tuberculosis, the decision of whether to administer the World Health Organization’s standard “DOTS” treatment or an enhanced alternative must be made on the spot. The treatment could be determined by the nurse at the clinic using the flip of a coin. But we may be concerned that the nurse would ignore the random assignment if she has an opinion about which treatment is better and which patients are more “deserving” than others. Alternatives could include computerized or cell-phone based randomization.
c) Stratified Randomization
Frequently, the target population is divided into subgroups before randomizing. For example, a group of individuals can be divided into smaller groups based on gender, ethnicity, or age. Or villages could be divided into geographic regions. This division into subgroups before randomization is called stratification. Then the randomization exercise takes place within each of the strata (subgroups). This is done to ensure that each subgroup is represented in balanced proportions in the treatment and control groups. With a small sample and no stratification, we could by chance end up with a noticeably higher share of females in our treatment group than in our control group. The primary purpose of stratification is statistical and relates to sample size. The decision to stratify has no bearing on whether the results are biased.
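Randomizing within strata can be sketched as below. The roster, the choice of gender as the stratifying variable, and the half-and-half split within each stratum are illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_assignment(units, stratum_of, seed=None):
    """Randomize to treatment or control separately within each stratum.

    units: list of unit identifiers (must be hashable).
    stratum_of: function mapping a unit to its stratum label.
    Within each stratum, half of the units (rounded down) are treated.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    assignment = {}
    for label in sorted(strata):  # fixed order keeps the draw reproducible
        members = strata[label]
        rng.shuffle(members)
        n_treated = len(members) // 2
        for i, unit in enumerate(members):
            assignment[unit] = "treatment" if i < n_treated else "control"
    return assignment


# Hypothetical roster of (name, gender) pairs: 40 females and 20 males.
people = [(f"person_{i}", "female" if i % 3 else "male") for i in range(60)]
assignment = stratified_assignment(people, stratum_of=lambda p: p[1], seed=1)
```

By construction, the treatment and control groups here each contain 20 females and 10 males, whatever the seed.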
5. Sample Selection and Sample Size
An experiment must be sensitive enough to detect outcome differences between the treatment and the comparison groups. The sensitivity of a design is measured by statistical power, which, among other factors, depends on the sample size – that is, the number of units randomly assigned and the number of units surveyed.
Once again, let’s take our example of waterborne illness in a community. And let us assume that we have chosen to distribute chlorine tablets to households to test their impact on the incidence of diarrhea. But let us also assume that we only have a very limited budget for our test phase, and so we would like to minimize the number of households that are included in the survey while still ensuring that we can know whether any change in incidence is due to the chlorine tablets and not to random chance. How many households should receive the tablets, and how many should be surveyed? Are five households enough? 100? 200? How many households should be in the control group? Tests for statistical power help us answer these questions.
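As a rough illustration of such a calculation, the sketch below uses the standard normal-approximation formula for comparing two proportions. It assumes randomization at the household level, equal group sizes, no attrition, and no clustering. The incidence rates, significance level, and power are hypothetical inputs chosen for illustration, not figures from this text.

```python
import math
from statistics import NormalDist


def households_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate households needed in each group to detect a difference
    between two proportions (two-sided test, normal approximation)."""
    if p_control == p_treatment:
        raise ValueError("the two proportions must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical scenario: diarrhea incidence falls from 30% to 20%.
n = households_per_group(p_control=0.30, p_treatment=0.20)
print(n, "households per group")
```

Under these assumed inputs the answer is roughly 290 households per group, far more than five. Smaller expected effects, or randomization at the village level, would push the requirement higher.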
For more information on how to estimate the required sample size, see:
- Duflo, Esther, Glennerster, Rachel, and Kremer, Michael, "Using Randomization in Development Economics Research: A Toolkit" (2006). MIT Department of Economics Working Paper No. 06-36.
- Bloom, H.S. (1995): "Minimum Detectable Effects: A simple way to report the statistical power of experimental designs," Evaluation Review 19, 547-56.