The elements of a randomized evaluation
This resource presents a high-level overview of the steps of a randomized evaluation and showcases a selection of the teaching and learning tools created for our online and in-person capacity-building activities. Throughout, the reader will find written resources, video lectures, and case studies on different aspects of a randomized evaluation, drawn from our professional education programs and, in particular, the course Evaluating Social Programs. For more of these materials, see our teaching resources.
1. Theory of change
The first step in an evaluation is to revisit the program’s goals and how we expect those goals to be achieved. A logical framework or theory of change model can help in this process, potentially building on other tools from the monitoring and evaluation framework, such as a needs assessment (which provides a systematic approach to identifying the nature and scope of a social problem) or a program theory assessment (which builds a theory of the working of the program and describes the logical steps by which the proposed policy fills the identified need).
Next, the researchers need to determine the precise research question they want to ask. Can their question be answered by testing a causal relationship or hypothesis? Which components can be randomized (e.g., who gets the program, who delivers the program, or when the program is delivered)? How do the program and theory of change map onto the research question? What data can the researchers collect that will answer their research question?
For more information on theory of change, see our What is Evaluation lecture and case studies.
2. Measurement
Measurement in a randomized evaluation typically occurs at two key stages: baseline measurement, often used to gather descriptive statistics on the study sample (such as the average age, household income, or gender composition), and endline measurement, used to estimate the effect of the intervention. Some randomized evaluations also collect data partway through the intervention. These midline data are typically used to monitor the implementation of the intervention.
Many randomized evaluations rely on surveys to measure outcomes. Social science researchers often collect innovative and highly specific measurements to capture the exact outcomes of interest in the impact evaluation. For example, researchers can measure the strength of individual preferences (say, for taking on risk or for reciprocating a generous gesture); detailed location or GPS data; information on social networks and financial relationships in a village; quality and quantity of goods purchased and consumed; women’s and girls’ empowerment; and more.
In addition to survey data, the use of administrative data has become increasingly common. Administrative data has enabled researchers conducting randomized evaluations to answer questions about a host of subjects at relatively low cost. For instance, J-PAL researchers have answered many important questions about health access and health financing for low-income households in the US using Medicare and Medicaid data. Administrative data is especially useful for, among other things, conducting long-term follow-ups of the study sample. For more information on how J-PAL uses administrative data, see our Innovations in Data and Experiments in Action (IDEA) Initiative.
J-PAL affiliated professor Kelsey Jack explores these concepts in our 2016 Evaluating Social Programs course (video; slides). In addition, J-PAL has technical resources on measurement and survey design. Some evaluations by J-PAL affiliated researchers that use administrative data to measure outcomes can be found below:
3. Deciding on a research design that is ethical, feasible, and scientifically sound
a. Ethics
In some cases, there may be ethical concerns about conducting a randomized evaluation, such as when the intervention is an entitlement or when the implementing partner has the resources to treat the full study sample. Design modifications may help to address these concerns. It is possible to conduct a randomized evaluation without restricting access to the intervention; for example, as described in Introductions to randomized evaluations, we could randomly select people to receive encouragement to enroll in a program, such as reminder emails or phone calls, without denying any interested participants access. If the implementing partner has the resources to treat the full sample but does not yet know whether the intervention is effective, researchers can use a randomized phase-in design to initially treat part of the sample and learn the intervention effects before extending the treatment to the full sample.
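As a minimal sketch of the randomized phase-in idea, the Python snippet below randomly orders units and deals them into rollout waves, so every unit is eventually treated and only the timing is random. The function name `assign_phase_in`, the village identifiers, the number of waves, and the seed are all hypothetical and chosen for illustration; this is not J-PAL's sample code.

```python
import random

def assign_phase_in(unit_ids, n_waves, seed):
    """Randomly assign each unit to a rollout wave (wave 1 is treated first)."""
    rng = random.Random(seed)    # fixed seed so the assignment can be reproduced
    shuffled = sorted(unit_ids)  # sort first so the input order does not matter
    rng.shuffle(shuffled)
    # Deal the shuffled units into waves in turn; wave sizes differ by at most one.
    return {unit: (i % n_waves) + 1 for i, unit in enumerate(shuffled)}

# Hypothetical study sample: 12 villages rolled out in 3 waves.
villages = [f"village_{i:02d}" for i in range(1, 13)]
waves = assign_phase_in(villages, n_waves=3, seed=20200701)

for wave in (1, 2, 3):
    members = sorted(v for v, w in waves.items() if w == wave)
    print(f"wave {wave}: {members}")
```

Until the later waves are treated, they can serve as the comparison group for the earlier ones.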
For more details on how research design can address challenges to randomization, see J-PAL North America’s guide to Real-World Challenges to Randomization, our Ethics resource, or our How to Randomize lecture. For information on how to write institutional review board (IRB) applications for a research study and navigate the legal and institutional requirements of working with an IRB, see our Institutional review board proposals resource. Information on how J-PAL research projects comply with IRB ethical guidelines can be found in J-PAL’s Research Protocols.
b. Randomization design
In the most basic randomization design, researchers randomize who does and does not receive the program. If that is infeasible, for instance because of the ethical considerations described above, researchers can modify the design, such as by randomizing when groups receive a treatment or who delivers the treatment. These modifications may change the generalizability of the research findings (see below), but the randomization still preserves the internal validity of the findings. Note, too, that the cost and effort of conducting surveys and implementing a program will have implications for the design and sample size of a randomized evaluation. See J-PAL North America’s guide to Real-World Challenges to Randomization and their Solutions, our Randomization resource, or our How to Randomize lecture for more information.
c. Choosing the sample size
Part of ensuring an evaluation is statistically sound is ensuring it has adequate statistical power. The statistical power of an evaluation is the probability of detecting a change in an outcome of interest if the program truly has an impact. Researchers conduct power analyses to inform decisions such as how many units to include in the study, the proportion of units allocated to each group, and at which level to randomize (e.g., student versus class of students versus entire school). Power calculations can be conducted to ensure researchers have the necessary sample size to detect the main effect of a program, as well as to detect whether the program effect differs between treatment arms or for different subgroups of the population. See then-Executive Director Rachel Glennerster discuss power calculations below (slides here). In addition, J-PAL has a resource, exercise, and guide to conducting power calculations.
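To make the sample size logic concrete, here is a rough Python sketch of a textbook power calculation. It assumes a two-arm design with equal allocation, a continuous outcome with a known standard deviation, and a two-sided test using the normal approximation. The function names and the example inputs (effect size, cluster size, intracluster correlation) are hypothetical, and real power calculations usually involve more detail than this.

```python
import math
from statistics import NormalDist

def n_per_arm(mde, sd, alpha=0.05, power=0.80):
    """Units needed in EACH arm to detect a difference of `mde` (same units as `sd`).

    Uses n = 2 * (z_{1-alpha/2} + z_{power})^2 * sd^2 / mde^2, which assumes
    individual-level randomization and equal-sized arms.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sd ** 2 / mde ** 2)

def design_effect(cluster_size, icc):
    """Inflation factor when randomizing clusters of equal size: 1 + (m - 1) * icc."""
    return 1 + (cluster_size - 1) * icc

# Hypothetical inputs: detect a 0.2 standard deviation effect.
n_individual = n_per_arm(mde=0.2, sd=1.0)

# If whole classes of 30 students are randomized and outcomes within a class
# are correlated (assumed intracluster correlation of 0.05), more students are needed.
n_clustered = math.ceil(n_individual * design_effect(cluster_size=30, icc=0.05))

print(n_individual, n_clustered)
```

The second calculation illustrates why the level of randomization matters: randomizing classes rather than students raises the required sample whenever outcomes are correlated within a class.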
d. Threats to internal validity
Like all evaluation methodologies, randomized evaluations face threats to their internal validity, including spillovers, attrition, and partial non-compliance. To ensure the study remains statistically sound, researchers can design the intervention and data collection procedures to minimize or measure such threats. For example, researchers who are concerned with spillovers can randomize at a higher level (e.g., the school level versus student level) to avoid spillovers to the control group and measure the full effect in the treatment group, or vary the proportion treated within a village and measure spillovers on those not treated. For a guided exercise in dealing with threats and analysis, see the Threats section of our Threats and Analysis lecture, our Threats and Analysis case study (embedded below), or J-PAL North America’s Real-World Challenges to Randomization guide.
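One of the responses to spillovers mentioned above, randomizing at a higher level, can be sketched as follows. In this illustrative Python example, schools rather than students are assigned to treatment, and every student inherits the status of their school. The function name, the student and school identifiers, and the seed are hypothetical.

```python
import random

def randomize_clusters(unit_to_cluster, seed):
    """Assign whole clusters to treatment (1) or comparison (0).

    `unit_to_cluster` maps each unit (e.g., a student) to its cluster (e.g., a school).
    Half of the clusters, rounded down, are treated.
    """
    rng = random.Random(seed)
    clusters = sorted(set(unit_to_cluster.values()))  # stable order before shuffling
    rng.shuffle(clusters)
    treated = set(clusters[: len(clusters) // 2])
    # Every unit takes the treatment status of its cluster.
    return {unit: int(cluster in treated) for unit, cluster in unit_to_cluster.items()}

# Hypothetical sample: 60 students spread evenly across 6 schools.
students = {f"student_{i:03d}": f"school_{i % 6}" for i in range(60)}
assignment = randomize_clusters(students, seed=42)

treated_schools = sorted({students[s] for s, t in assignment.items() if t == 1})
print(treated_schools)
```

Because no school contains both treated and comparison students, treated students cannot pass the intervention on to comparison classmates within the same school.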
4. Randomization
A key aspect of randomized evaluations, as the name implies, is the random assignment of units into the treatment and comparison groups. Researchers can choose from a range of randomization schemes, from simple randomization, where all units have an equal probability of receiving treatment, to the more complex stratified randomization, where units are partitioned into groups (called blocks or strata) and then randomized within each stratum (for example, a study stratifying on gender would randomize within the men in the sample and, separately, within the women in the sample). An introduction to the different types of randomization can be found in our How to Randomize lecture. For more information on randomization, including when and why stratified randomization is necessary and sample code to conduct randomization, see our Randomization resource.
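The stratified scheme described above can be sketched in a few lines of Python. This is an illustrative example rather than the sample code from the Randomization resource. The function name, the unit identifiers, and the seed are hypothetical, and in strata with an odd number of units the extra unit is placed in the comparison group here.

```python
import random
from collections import defaultdict

def stratified_assignment(stratum_of, seed):
    """Randomize to treatment (1) or comparison (0) separately within each stratum.

    `stratum_of` maps each unit to its stratum (e.g., "women" or "men").
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    strata = defaultdict(list)
    for unit in sorted(stratum_of):  # stable order before shuffling
        strata[stratum_of[unit]].append(unit)

    assignment = {}
    for stratum in sorted(strata):
        members = strata[stratum]
        rng.shuffle(members)
        n_treat = len(members) // 2  # half of the stratum, rounded down
        for i, unit in enumerate(members):
            assignment[unit] = 1 if i < n_treat else 0
    return assignment

# Hypothetical sample: 10 women and 8 men, stratified on gender.
sample = {f"woman_{i}": "women" for i in range(10)}
sample.update({f"man_{i}": "men" for i in range(8)})
assignment = stratified_assignment(sample, seed=2020)

for stratum in ("women", "men"):
    n_treated = sum(t for u, t in assignment.items() if sample[u] == stratum)
    print(stratum, n_treated)
```

Randomizing within each stratum guarantees that the treatment and comparison groups have the same gender composition, which simple randomization only achieves on average.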
5. Data analysis
At its simplest, the analysis of a randomized evaluation uses endline data to compare the average outcome of the treatment group to the average outcome of the comparison group after the intervention. This difference is an estimate of the program’s impact. To determine whether the estimated impact is statistically significant, researchers can test the equality of means using a simple t-test. One of the many benefits of randomized evaluations is that the impact can be measured without advanced statistical techniques. However, more complicated analyses can also be performed, such as regressions that increase precision by controlling for characteristics of the study population that might be correlated with outcomes.
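The simple comparison of means can be illustrated with a short Python sketch. The outcome values below are made up for illustration, and the function name is hypothetical. The test statistic uses the unequal-variance (Welch) standard error, and the p-value relies on a normal approximation because the Python standard library has no t distribution; with samples this small, a statistical package's t-test would be the better choice.

```python
import math
from statistics import NormalDist, mean, variance

def difference_in_means(treatment, comparison):
    """Return the estimated impact, its standard error, the test statistic, and a p-value."""
    diff = mean(treatment) - mean(comparison)
    # Standard error of the difference, allowing different variances in each group.
    se = math.sqrt(variance(treatment) / len(treatment)
                   + variance(comparison) / len(comparison))
    t_stat = diff / se
    # Two-sided p-value from the normal approximation (an assumption; see above).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return diff, se, t_stat, p_value

# Made-up endline test scores for illustration only.
treatment_scores = [72, 68, 75, 80, 66, 77, 74, 71]
comparison_scores = [65, 70, 62, 68, 64, 66, 69, 63]

diff, se, t_stat, p_value = difference_in_means(treatment_scores, comparison_scores)
print(f"estimated impact: {diff:.2f} (SE {se:.2f}), t = {t_stat:.2f}, p = {p_value:.4f}")
```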
An introduction to analysis, including analysis in the presence of threats to the internal validity, can be found in the Analysis section of our Threats and Analysis lecture (see also slides and case study). A more technical resource covering topics such as subgroup analysis, multiple hypothesis testing, and the different types of treatment effects, can be found in our Data analysis resource, which includes example code, like that seen below:
tempfile tempfilename
save `tempfilename'
/* this can then be used as any other dataset, e.g., can be merged, appended, etc. */
6. Cost-effectiveness analysis
Calculating the cost-effectiveness of a program—for instance, dollars spent per additional day of student attendance at school—can offer insights into which programs are likely to provide the greatest value for money. Cost-effectiveness analysis (CEA) summarizes complex programs in terms of a simple ratio of costs to impacts and allows us to use this common measure to compare programs across time and location. CEA may not, by itself, provide sufficient information to inform all policy or investment decisions, but it can be a useful starting point for choosing between different programs that aim to achieve the same outcome. To conduct CEA, you need two pieces of data: an estimate of the program’s impact and the cost of the program. In order to help other organizations conduct this type of analysis, J-PAL has developed costing guidelines and templates.
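The ratio at the heart of CEA is simple arithmetic, as the following Python sketch shows. The cost, impact, and enrollment figures are invented for illustration, the function name is hypothetical, and J-PAL's costing guidelines cover many considerations (such as which costs to include) that this sketch ignores.

```python
def cost_per_unit_of_impact(total_cost, impact_per_participant, n_participants):
    """Cost-effectiveness ratio: total program cost divided by total impact."""
    total_impact = impact_per_participant * n_participants
    if total_impact <= 0:
        # A ratio is not meaningful without a positive estimated impact.
        raise ValueError("total impact must be positive to compute a cost-effectiveness ratio")
    return total_cost / total_impact

# Hypothetical program: costs $5,000 in total, reaches 1,000 students, and
# is estimated to raise attendance by 0.5 days per student.
ratio = cost_per_unit_of_impact(total_cost=5000, impact_per_participant=0.5,
                                n_participants=1000)
print(f"${ratio:.2f} per additional day of attendance")
```

Expressing different programs in the same units, here dollars per additional day of attendance, is what makes them comparable.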
7. Generalizability
Unlike internal validity, which a well-conducted randomized evaluation can provide, external validity, or generalizability, is more difficult to obtain. J-PAL has developed a practical framework for addressing the generalizability puzzle, which breaks down the big question of “will this program work here?” into a series of smaller questions rooted in the theory behind a program. For more information, see our lecture (slides) or article on the generalizability framework.
Recent work by J-PAL affiliated professors explores the challenges in drawing conclusions from a localized randomized evaluation through a case study of the Teaching at the Right Level (TaRL) program. This program has been tested in different contexts, with different populations, under different implementation models, and through different implementing partners in order to test mechanisms and begin to answer questions on when results generalize. The size of the study population may influence the generalizability of results (Muralidharan & Niehaus 2017). See J-PAL’s Evidence to Policy page for an exploration of how evidence from one context can inform policies in other contexts.
Researchers can shed further light on questions about the external validity and generalizability of results by publishing data from their evaluations. Over the last decade, the number of funders, journals, and research organizations that have adopted data-sharing policies has increased considerably. When the American Economic Association adopted its first policy in 2005, its journals were among the first in the social sciences to require the publication of data alongside the research paper. Today, many top academic journals in economics and the social sciences require data to be published. Similarly, many foundations and government institutions, such as the Bill and Melinda Gates Foundation, the National Science Foundation, and the National Institutes of Health, have data publication policies. J-PAL, as both a funder and an organization that conducts research, adopted a data publication policy in 2015 that applies to all research projects that we fund or implement. To facilitate data publication, J-PAL created guides to data de-identification and publication to help research teams think about the steps involved in publishing research data.
Last updated July 2020.
We thank Rohit Naimpally for helpful comments and Evan Williams for copy-editing this resource. All errors are our own.
Muralidharan, Karthik, and Paul Niehaus. 2017. "Experimentation at Scale." NBER Working Paper 23957. https://www.nber.org/papers/w23957.pdf