Test, learn, adapt: Impact evaluations are only as strong as their assumptions

On the left side, a woman speaks to an audience. On the right, there is no speaker.
In a randomized evaluation of this training program, individuals (or groups) are randomly assigned to be offered the training (the treatment group) or not (the comparison group). For example, one group may hear from a speaker while another does not. Design credit: Lucy Nguyen

This blog is part of a series called “Test, learn, adapt,” in which impact evaluation experts explore research methods and answer common questions that come up in conversations with practitioners.

When engaging with partners and training participants, I often hear a perception that some impact evaluation methods are universally more “rigorous” than others. Specifically, randomized controlled trials (RCTs) are often referred to as the “most rigorous,” with quasi-experimental methods perceived as “almost as good.” 

What this language and perception miss is that an impact evaluation is only ever as good as its assumptions. Instead of choosing an impact evaluation method based on perceived rigor, the more relevant question to ask is: Are the assumptions that make each method credible likely to be met in my context?

What do I mean by assumptions?

Impact evaluations aim to answer a fundamental question: What is the causal impact of a program on an outcome of interest? To answer this, we need to try to understand what would have happened to program participants in the absence of the program—a scenario we can never observe directly unless we find a way to access parallel universes. 

All impact evaluation methods are, at their core, attempts to approximate this unobservable “counterfactual” as credibly as possible, and they do so by creating a comparison group—be it an actual group or an artificially created one—to approximate as closely as possible what would have happened to program participants in the absence of the program. 

Given that all of this happens in a world of potentials, all impact evaluations come with a set of assumptions that have to hold for the comparison group, whether randomized or not, to be a good representation of this unobserved counterfactual. 

In this blog post, I will outline some of the assumptions required in different evaluation designs and discuss how to critically assess these in order to determine what might be the best evaluation design for your specific program and context.

What does this look like in practice?

Imagine that you want to evaluate the success of a soft skills training program designed to help young people succeed in the labor market. We might see that participants’ employment outcomes improve after the training, but how do we know that this is because of the program, and not because they would have found jobs anyway?

To answer this, we need a valid comparison group that represents what would have happened without the program. Let’s look at three frequently used designs and discuss how to consider whether the assumptions are likely to hold in each of them.

Randomized evaluations

In a randomized evaluation of this training program, individuals (or groups) are randomly assigned to be offered the training (the treatment group) or not (the comparison group). Due to the randomization, the two groups will be similar on average before the program is implemented, and the program’s impact can be estimated by comparing the employment outcomes of those in the treatment group to those in the comparison group.
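To make the comparison concrete, here is a minimal Python sketch of random assignment followed by a simple difference in means. The function names, the seed, and the employment data are all made up for illustration; a real evaluation would also report uncertainty around the estimate.

```python
import random
from statistics import mean

def randomly_assign(participant_ids, seed):
    """Shuffle the IDs and split them in half: (treatment, comparison)."""
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def difference_in_means(treatment_outcomes, comparison_outcomes):
    """Impact estimate: mean outcome in treatment minus mean in comparison."""
    return mean(treatment_outcomes) - mean(comparison_outcomes)

# Hypothetical example: 20 applicants, half are offered the training.
treatment_ids, comparison_ids = randomly_assign(range(1, 21), seed=2024)

# Made-up follow-up data: 1 = employed, 0 = not employed.
treatment_employed = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
comparison_employed = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]

estimate = difference_in_means(treatment_employed, comparison_employed)
print(f"Estimated impact on the employment rate: {estimate:.2f}")
```

With these made-up numbers, 70 percent of the treatment group and 40 percent of the comparison group are employed, so the estimate is 0.30.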

Several assumptions need to hold for this comparison to credibly estimate program impact. One is that the allocation is actually random. The random allocation can break down if there is a miscommunication in who should receive the program, or if the implementers or the people in the evaluation have discretion over who ultimately gets access to the program. Another assumption is that the people in the comparison group don’t change their behavior simply because they are part of the evaluation, e.g., because they are being observed or because they expect to get access to the program in the future.

When these (or other) assumptions fail, the comparison group might no longer be a good estimate of what the treatment group would look like without the program, and differences in employment outcomes may no longer only reflect the effect of the training.

Regression discontinuity designs (RDD)

In a regression discontinuity design for this training program, eligibility might be based on a test score cutoff. For example, participants who score above 70 out of 100 on a financial literacy test might qualify for the training, while those who score 70 or below do not. If the likelihood of scoring either just below or just above the threshold (e.g., 69 vs. 71) is close to random, the people just below the threshold are likely to be similar to those just above it, and program impact can be estimated by comparing the employment outcomes of those just above the cutoff (who received the training) to those just below it (who did not).

One example of an assumption required for this design to be credible is that people cannot manipulate their test scores around the cutoff. This assumption can fail if participants can retake the test, influence their scores, or find other ways to get access to the training without meeting the threshold. In those cases, individuals near the cutoff may no longer be comparable, and the resulting estimate may be biased.
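One rough way to look for signs of manipulation is to count how many people scored just below versus just above the cutoff. The sketch below uses made-up scores and a hypothetical window of 2 points; it is a heuristic, not a substitute for a formal density test or for knowing how the test was administered.

```python
def counts_around_cutoff(scores, cutoff=70, window=2):
    """Return (number just below or at the cutoff, number just above it)."""
    just_below = sum(1 for s in scores if cutoff - window < s <= cutoff)
    just_above = sum(1 for s in scores if cutoff < s <= cutoff + window)
    return just_below, just_above

# Made-up test scores with suspicious bunching right above 70.
scores = [64, 66, 67, 68, 69, 70, 71, 71, 71, 71, 72, 72, 72, 74, 77]

below_count, above_count = counts_around_cutoff(scores)
print(f"Just below: {below_count}, just above: {above_count}")
```

A large imbalance like this one (2 versus 7) would be a reason to ask whether people could retake the test or otherwise influence their scores.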

Difference‑in‑differences (DiD)

A difference‑in‑differences evaluation design for this training program would compare employment trends over time between participants and non-participants. If it seems plausible that the participants and non-participants would have developed similarly over time without the program, impact can be estimated by comparing the change in employment among training participants to the change among non-participants.
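The arithmetic behind this comparison is short enough to write out. In the sketch below, the employment rates are hypothetical and expressed in percentage points; the function name is also just for illustration.

```python
def difference_in_differences(treated_before, treated_after,
                              comparison_before, comparison_after):
    """Change among participants minus change among non-participants."""
    treated_change = treated_after - treated_before
    comparison_change = comparison_after - comparison_before
    return treated_change - comparison_change

# Hypothetical employment rates, in percent.
did_estimate = difference_in_differences(
    treated_before=40, treated_after=65,        # participants: +25 points
    comparison_before=35, comparison_after=45,  # non-participants: +10 points
)
print(f"Difference-in-differences estimate: {did_estimate} percentage points")
```

Here the participants improved by 25 points and the non-participants by 10, so the estimate is 15 percentage points. The two groups are allowed to start at different levels (40 versus 35); what matters is how they change over time.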

A central assumption is that, without the training program, both groups would have followed a similar—or parallel—trend in employment. This assumption can fail if, for example, the group receiving the program also lives in an area that experiences a localized economic shock, or if another organization delivers a similar training in areas where your comparison group lives. In those cases, differences in trends may reflect these external factors rather than the effect of your program.
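If data from before the program exist, one common way to probe this assumption is to compare how the two groups changed in those earlier periods. The sketch below uses made-up pre-program employment rates and a hypothetical helper function. Similar pre-program trends make the assumption more plausible, but they cannot prove that the groups would have kept moving in parallel.

```python
def pre_program_trend_gaps(treated_series, comparison_series):
    """For each pre-program period, return the treated group's change
    minus the comparison group's change. Values near zero suggest
    the two groups were on similar trends before the program."""
    gaps = []
    for t in range(1, len(treated_series)):
        treated_change = treated_series[t] - treated_series[t - 1]
        comparison_change = comparison_series[t] - comparison_series[t - 1]
        gaps.append(treated_change - comparison_change)
    return gaps

# Made-up employment rates (percent) in four periods before the program.
treated_pre = [30, 33, 36, 40]
comparison_pre = [28, 31, 34, 37]

gaps = pre_program_trend_gaps(treated_pre, comparison_pre)
print(f"Pre-program trend gaps: {gaps}")
```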

Failed assumptions → challenging interpretations

Regardless of the impact evaluation method, if the assumptions are not met, the causal estimates of the program that are inferred from the evaluation should be interpreted with caution. And I can share from personal experience that it is incredibly frustrating—and an unfortunate use of resources—to have conducted an impact evaluation only to learn that you cannot fully trust the outcome.

You will never know with 100% certainty whether assumptions are met, but the more data you have, the more confident you can be

A natural next question: How do we know whether the assumptions for a given design are met? Answering this question often involves a mix of data, logic, and deep contextual knowledge. For example, for an RCT, if you have data on training participation, you can check whether anyone in the comparison group attended the training, and if you interview the comparison group, you might be able to learn whether (and in that case how) they got access to the training materials. 
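The participation check for an RCT can be as simple as cross-referencing two lists. This sketch assumes you have an assignment record and an attendance record keyed by the same participant ID; the IDs and the function name are hypothetical.

```python
def comparison_group_attendees(assignment, attended):
    """Return the IDs of comparison-group members who attended the training."""
    return sorted(
        pid for pid, group in assignment.items()
        if group == "comparison" and pid in attended
    )

# Hypothetical assignment record and attendance list.
assignment = {
    "p01": "treatment", "p02": "comparison", "p03": "treatment",
    "p04": "comparison", "p05": "comparison", "p06": "treatment",
}
attended = {"p01", "p03", "p05"}

crossovers = comparison_group_attendees(assignment, attended)
print(f"Comparison-group members who attended: {crossovers}")
```

In this made-up example, one person assigned to the comparison group (p05) shows up in the attendance data, which is the kind of finding worth following up with interviews.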

For an RDD, you can statistically compare any available demographic data for those who scored just below and just above the threshold to test that the two groups are indeed similar on characteristics that are unlikely to be affected by the program (like gender, age, prior education). You can also assess whether the training was indeed only offered to those who scored above the threshold.
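A first pass at this comparison might look like the sketch below, which computes the difference in average characteristics between people just above and just below the cutoff. The covariate names, the bandwidth, and the records are assumptions for illustration, and the sketch only compares means; a fuller check would also test whether the differences are statistically significant.

```python
from statistics import mean

def balance_differences(records, covariates, cutoff=70, bandwidth=5):
    """Mean of each covariate just above the cutoff minus just below it."""
    below = [r for r in records if cutoff - bandwidth <= r["score"] <= cutoff]
    above = [r for r in records if cutoff < r["score"] <= cutoff + bandwidth]
    return {
        c: mean(r[c] for r in above) - mean(r[c] for r in below)
        for c in covariates
    }

# Made-up records: test score, age, and gender (1 = female, 0 = male).
records = [
    {"score": 66, "age": 22, "female": 1},
    {"score": 68, "age": 24, "female": 0},
    {"score": 70, "age": 23, "female": 1},
    {"score": 71, "age": 23, "female": 0},
    {"score": 73, "age": 22, "female": 1},
    {"score": 75, "age": 27, "female": 1},
    {"score": 90, "age": 30, "female": 0},  # outside the band, ignored
]

balance = balance_differences(records, covariates=("age", "female"))
print(balance)
```

In this toy data, those just above the cutoff are one year older on average, and the share of women is the same on both sides.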

However, data might also be imperfect, so while data can get you closer to determining the validity of the comparison, inference from impact evaluations always comes with a degree of uncertainty, which is why the result will always be called an impact estimate. This is also why academic papers in economics typically spend pages upon pages arguing for why the assumptions are met and dispelling alternative interpretations through robustness checks.

The benefit of randomized evaluations: more control over assumptions

As mentioned above, randomized evaluations are not universally “better” than other methods because the quality of each study’s conclusions depends on whether the assumptions are met and on the available data. However, one of the main benefits of randomized evaluations is that because the evaluation is designed before the program is implemented, you have a degree of control over the assumptions during the design phase, which you will not have in an evaluation designed after the program has already been implemented.

So while randomization doesn’t eliminate the need for making assumptions, any prospective evaluation gives researchers more control over them and often reduces the number or complexity of assumptions compared to other impact evaluation designs.

In summary

Every impact evaluation design, whether a randomized evaluation, regression discontinuity, difference-in-differences, or another approach, relies on assumptions to estimate what would have happened in the absence of a program. The credibility of an impact estimate depends not only on whether the method is considered experimental, quasi-experimental, or non-experimental, but also on whether the study’s assumptions are likely to hold in a given setting. This often gives an advantage to prospective evaluations designed before the program is implemented, but no method is appropriate or inherently rigorous in every context.

In practice, I recommend shifting the central question. Instead of asking, “Which method is the most rigorous?” a more useful set of questions is: “Given our program and data, which evaluation method is feasible and can provide us with credible results? And how can we test or strengthen the required assumptions?”

Framing method choice this way helps ensure that evaluation designs are not only feasible but truly informative, enabling researchers and partners to generate evidence they can trust and use.

At J-PAL, our work is grounded in randomized evaluations. But our approach to rigor is rooted in critical thinking, not categories or labels. A thoughtful assessment of assumptions, context, and implementation is essential—no matter the method.