This guide provides an overview of data analysis for randomized evaluations in order to estimate causal impact. It is intended as a starting point to orient readers who are not familiar with all the nuances of the literature; it does not aim to provide a comprehensive or “authoritative” treatment of these topics. We instead link to useful resources for further reading and provide sample Stata and R code for each topic.
Overview of resources
CSAE Coder’s Corner provides sample code (mostly Stata, though some R and Matlab) for procedures such as bootstrapping, adjusting standard errors for spatial correlation, and random forest.
EGAP’s methods guides contain sample R code.
Theory and intuition
Duflo et al.’s (2007) Using Randomization in Development Economics Research: A Toolkit is an accessible guide to using randomization in development economics. Includes some technical discussion and equations but can be understood by most readers.
Athey & Imbens (2017) The Econometrics of Randomized Experiments provides a more recent and more technical treatment of topics such as stratification and randomization inference. Best for readers with some graduate-level econometrics.
Mostly Harmless Econometrics (MHE) is not RCT-specific but is an accessible applied econometrics text that covers the math and intuition behind decisions made in econometric analysis. The text also includes sample code for certain examples. The MIT course 14.387 “Applied econometrics: Mostly harmless big data” roughly follows MHE, with slides from the fall 2014 course run posted on MIT OCW.
EGAP methods guides include topics ranging from reading regression tables to causal inference, heterogeneous treatment effects, treatment effects, local average treatment effects, and covariate adjustment. The guides do not provide a deep, comprehensive discussion of each topic but are good overviews and include sample R code.
The World Bank has a series of methods blog posts on a wide range of topics. These posts typically summarize and then link to important papers on these topics.
MIT’s OpenCourseWare includes lecture notes for many economics courses, including:
Josh Angrist’s undergraduate econometrics
Victor Chernozhukov’s graduate econometrics
Angrist & Chernozhukov’s graduate-level applied econometrics
We are interested in estimating the true effect of the treatment on the population from which the sample was drawn. As we only observe the study sample, not the full underlying population, we form estimates of the true treatment effect. Due to sampling variation, any given estimate of the treatment effect is unlikely to be exactly equal to the true effect, but if we were to repeat the study many times on different samples drawn from the same population, the average estimate would equal the true treatment effect. This idea of repeated sampling forms the basis for conducting power calculations.
Moreover, in any given sample, changes in outcomes resulting from the treatment are likely to vary between individuals or groups, i.e., there are likely to be heterogeneous treatment effects. For example, a program offering free prenatal care to eligible women may have a larger effect on birth outcomes for women at the bottom of the income distribution than those at the top, who may have access to other resources.
Formally, we can write the outcome of each unit in either the treatment or control group, using notation from Deaton & Cartwright (2018), as:
$$Y_i = \beta_i T_i + \sum_{j=1}^{J} \gamma_j x_{ij}$$
where:
$Y_i$ is the outcome for unit $i$
$\beta_i$ is the treatment effect (which may be unit-specific)
$T_i$ is a treatment dummy
$x_{ij}$ are $j = 1, \dots, J$ observable and unobservable, unit-specific factors that may affect outcome $Y$
$\gamma_j$ indicates the effect of $x_j$ on $Y$ and may be positive or negative
There are then two main approaches to estimating treatment effects. In the simplest regression framework, described below, treatment effects are assumed to be homogeneous, so that $\beta_i = \beta$. We follow this “conventional” approach for the majority of this guide. A second approach, randomization inference, is gaining popularity in the analysis of experimental data in economics. With randomization inference, we can directly test the sharp null hypothesis that $\beta_i = 0$ for all $i$. As this approach is conceptually somewhat distinct from the “conventional” approach, we discuss randomization inference towards the end of this guide.
Average treatment effects (ATE)
Researchers are typically interested in the average treatment effect (ATE), which is the average causal effect of a variable (here, an intervention or program) on an outcome variable for the entire study population. In the classical experimental design (i.e., the simplest experimental design, with one treatment group, one control group, and perfect compliance, described further below), the ATE can be estimated as the difference in means of the outcome between the treatment and comparison groups.
Formally, with a given experimental sample of size N and a given treatment group assignment, we can take the difference between the average outcome of the treatment group (T=1) and that of the comparison group (T=0) to estimate the ATE:
$$\bar{Y}_1 - \bar{Y}_0 = \bar{\beta}_1 + \sum_{j=1}^{J} \gamma_j \left(\bar{x}_{1j} - \bar{x}_{0j}\right)$$
Here $\bar{x}_{1j}$ and $\bar{x}_{0j}$ are the means of factor $x_j$ in the treatment and comparison groups. $\bar{\beta}_1$ is the average treatment effect; the subscript indicates that this term is the average of the treatment effects in the treatment group. The second term is the “error term” of the ATE estimate: the average difference between treatment and control group that is unrelated to treatment (from observable and unobservable differences).
Perfect compliance is when all individuals in the treatment group take up treatment and no control individuals receive treatment. Under perfect compliance, treatment assignment (which is determined randomly by the research design) and treatment status (or the participation variable, which is determined by the individual or group) are identical. Then, a simple difference in means provides an unbiased estimate of the ATE in randomized studies, provided there is also no attrition or unaccounted-for spillovers. In this case, a linear model regressing the observed outcome on a treatment indicator can be used to estimate the effect of the treatment. Linear regression models can accommodate more flexible estimation techniques that rely on ordinary least squares (OLS). Examples include clustering of standard errors to account for within-group correlations (which is particularly important in a clustered design), or the inclusion of covariates to increase precision.
In Stata, the ATE can be estimated with the following code:
reg y treatment, robust
where y is the outcome of interest and treatment is a binary variable indicating whether treatment was assigned. The robust option gives heteroskedasticity-robust standard errors, described further below and in White (1980), Angrist & Pischke (2009), and Wooldridge (2013) (see also the Stata code accompanying chapter 8 of Wooldridge).
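As an illustration of why the regression and the difference in means coincide under perfect compliance, here is a minimal sketch on simulated data. The sample size, seed, and effect size of 0.5 are assumptions made purely for the example; only the names y and treatment come from the text above.

```stata
* Simulated data: all parameter values are illustrative assumptions
clear
set seed 20240101
set obs 1000
generate treatment = runiform() < 0.5        // random assignment, perfect compliance
generate y = 1 + 0.5*treatment + rnormal()   // effect of 0.5 built in by construction

* Difference in means computed by hand
quietly summarize y if treatment == 1
scalar mean_t = r(mean)
quietly summarize y if treatment == 0
scalar mean_c = r(mean)
scalar diff_means = mean_t - mean_c
display "Difference in means: " diff_means

* The OLS coefficient on the treatment dummy is the same number
regress y treatment, robust
display "OLS coefficient:     " _b[treatment]
```

The regression form is usually preferred in practice because it extends directly to robust or clustered standard errors and to covariates.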
Intention to treat effects (ITT)
Imperfect compliance is when individuals do not follow their treatment assignment. Compliers are people who take up the treatment only because they were assigned to receive it (and do not take it if they are not assigned treatment). Non-compliers comprise three groups:
Always-takers: always take the treatment even if assigned to the control group
Never-takers: always refuse treatment, even if assigned to the treatment group
Defiers: do the opposite of their treatment assignment
Non-compliance can be one- or two-sided. One-sided non-compliance is when either individuals assigned to the treatment group refuse treatment or individuals assigned to the control group take up treatment, but not both. Two-sided non-compliance is when both of these occur.
In many cases, researchers and policymakers care about identifying the impact of the offer of the program on the population that was offered it, even if some of them did not take it up, as this resembles what is likely to happen if the program is rolled out. The intention to treat (ITT) is an estimate of the effect of the program on those assigned to treatment, regardless of their take-up. That is, the ITT is obtained from regressing the outcome on treatment assignment for the whole sample.
It will often (though not always) provide a lower bound on the ATE, as it includes in the treatment group some individuals who did not receive the treatment (under the assumption that they would have benefited less from the treatment than those who took it up) and may include in the comparison group some individuals who did in fact receive the treatment.
In Stata, the ITT is obtained with the following code:
reg y assign_treatment
where assign_treatment=1 if the individual is assigned the treatment and 0 otherwise.
Local average treatment effects (LATE)
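The LATE provides an estimate of the treatment effect for compliers, i.e., those who are induced by their assignment to take up the treatment. Formally, it is given by:
$$\text{LATE} = \frac{E[Y_i \mid z_i = 1] - E[Y_i \mid z_i = 0]}{E[d_i \mid z_i = 1] - E[d_i \mid z_i = 0]}$$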
As above, Y denotes the outcome for individual (or group, depending on the unit of analysis) i. z denotes treatment assignment and is 1 if the treatment was assigned and 0 otherwise, and d denotes whether the treatment was received (and is 1 if it was, 0 otherwise). That is, (random) treatment assignment is used as an instrument for treatment status.
The LATE is limited in that it is only well-defined for compliers and is specific to the instrument used; it cannot uncover the effects for always-takers, never-takers, or defiers. In addition to the standard independence assumption that follows from randomization (i.e., the instrument, in this case treatment assignment, is as good as randomly assigned), and the assumption that there is a positive share of compliers, it relies on two key assumptions:
Monotonicity: Assignment to treatment does not make one less likely to be treated
The exclusion restriction: Individuals respond to the treatment itself, not treatment assignment, so that the outcome is the same for those who would not have taken up the treatment, regardless of treatment assignment
Just as the ITT typically provides a lower bound on the ATE under imperfect compliance, the LATE typically provides an upper bound (though, again, this is not always the case). This is because it estimates the effect of the treatment on those who took it up--who are often more likely to benefit from the treatment than those who did not take it up. The higher the compliance, the closer the LATE will generally be to the true ATE.
In Stata, the LATE is estimated with:
ivreg2 y (treated=assign_treatment), robust first
where treated=1 if the treatment was received and 0 otherwise, and assign_treatment is defined as above. Note that ivreg2 is a user-written command that must be installed first (ssc install ivreg2).
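To make the link between the ITT and the LATE concrete, the following sketch simulates one-sided non-compliance and compares the ratio of the ITT to the take-up difference with a two-stage least squares estimate. It uses Stata's built-in ivregress 2sls rather than ivreg2, and the take-up rate (60%), effect size (0.8), and the variable complier are assumptions made for the example.

```stata
* Simulated data: all parameter values are illustrative assumptions
clear
set seed 20240102
set obs 2000
generate assign_treatment = runiform() < 0.5
* One-sided non-compliance: about 60% of those assigned take up; nobody in control does
generate complier = runiform() < 0.6
generate treated = assign_treatment * complier
generate y = 1 + 0.8*treated + rnormal()     // effect of 0.8 for those who are treated

* Reduced form (the ITT) and first stage (difference in take-up)
regress y assign_treatment, robust
scalar itt = _b[assign_treatment]
regress treated assign_treatment, robust
scalar first_stage = _b[assign_treatment]
display "ITT / first stage: " itt / first_stage

* With one binary instrument, 2SLS gives the same point estimate
ivregress 2sls y (treated = assign_treatment), vce(robust)
```

Because take-up is below 100%, the ITT is smaller in magnitude than the IV estimate; dividing by the first stage scales it up to the effect on compliers.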
Treatment on the treated (ToT)
The treatment on the treated (ToT) is the treatment effect on those who actually take up the treatment. The counterfactual of the ToT is control group members who would have accepted the treatment if they had been offered it, which cannot be observed. The ToT can be estimated when no one in the control group is treated, so non-compliance is one-sided. This can be a result of research design--if the control group is prevented from receiving or taking up the treatment--or simply from the realization that no one from the control group took up the treatment, even though it was possible for them to do so.
The ToT relies on the same assumptions as the LATE and is estimated in the same way: using an instrumental variables (IV) approach; the only difference is that for the ToT none of the comparison group members received the treatment. Intuitively, it is a weighted average of the effects for always-takers and compliers who were assigned the treatment (Angrist 2014, slide 12).
Quantile treatment effects
RCTs provide information on the differences in means between treatment and comparison groups. However, we might also care about how the treatment affected the distribution of outcomes across treatment and control groups. It is possible to use quantile regression to estimate the treatment’s effect on a specified quantile of the outcome variable (e.g., median, 10th percentile, 90th, etc.). That is, we can estimate the difference in outcomes at a given quantile. Quantile regression is useful for understanding how the treatment affected various points in the distribution of outcomes. It cannot, however, generally be used to estimate the distribution of treatment effects. Quantile coefficients can also only tell us about effects on distributions and not on individuals.
Quantile regression relies on the assumption that the outcome is a continuously-distributed random variable with well-behaved density (no gaps or spikes). Rank preservation is required if one wants to determine whether individuals are better or worse off from the intervention, versus just finding an effect for the bottom decile, for example, without knowing whether people who were originally in the bottom decile are actually better or worse off than they would have been without the intervention (Angrist and Pischke, 2007).
Useful resources on quantile regression and treatment effects include:
Chapter 7 of Angrist & Pischke (2007) covers quantile regression in some detail, including how and when they can have a causal interpretation
Section 4.3 of Athey & Imbens (2017)
EGAP’s methods guide includes a brief discussion, sample R code, and an example of a case where the ATE is 0 but the treatment effect negative for low quantiles of the response and positive for high quantiles.
In Stata, the treatment effect at a given quantile can be estimated with:
qreg outcome assign_treatment, quantile(0.9) vce(robust) // for 90th percentile
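The case mentioned in the EGAP guide, where the ATE is roughly zero but effects differ in sign across quantiles, can be sketched with simulated data. Here the treatment is assumed to widen the outcome distribution without shifting its centre; the data-generating process is an assumption chosen only to produce that pattern.

```stata
* Simulated data: the treatment doubles the spread of the outcome (an assumption)
clear
set seed 20240103
set obs 2000
generate assign_treatment = runiform() < 0.5
generate outcome = rnormal(0, 1 + assign_treatment)

* Mean effect, for comparison with the quantile effects
regress outcome assign_treatment, robust

* Effects at the 10th, 50th, and 90th percentiles
foreach q in 0.1 0.5 0.9 {
    quietly qreg outcome assign_treatment, quantile(`q')
    display "Quantile `q': " %6.3f _b[assign_treatment]
}
```

These coefficients describe shifts in the outcome distribution at each quantile; they do not say which individuals gained or lost.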
An important potential advantage of RCTs is that it is not necessary for identification to include controls in the specification (since treatment assignment is by definition orthogonal to other covariates). Including covariates has both benefits and drawbacks. Covariates can increase precision and can also adjust for random imbalances between the treatment groups. However, unless the covariates are all indicators, partition the population, and are included as the full set of interactions, the estimate of the ATE will typically be biased in finite samples (though this bias decreases as the sample size increases) (Athey & Imbens 2017). If poorly chosen (not predictive of the outcome), covariates can also decrease precision.
Covariates can either be included additively or as a full set of interactions with the treatment variable. The former approach can increase precision, while including the full set of interactions with the treatment variable can allow testing for heterogeneous treatment effects (see more below). Moreover, there is some debate as to how to include imbalanced covariates in analysis: some researchers include them in the regression as is, while others recommend de-meaning the imbalanced covariates first (Imbens & Rubin 2015). You should select covariates that could not have been affected by the treatment (baseline controls are often used) but are likely to affect your outcome variable. If you do include covariates, it is generally advisable to show results with and without covariate adjustment.
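A minimal sketch of reporting results with and without covariate adjustment, using simulated data. The covariate baseline_y is a hypothetical baseline measure that is assumed to be strongly predictive of the outcome; all parameter values are illustrative.

```stata
* Simulated data: all parameter values are illustrative assumptions
clear
set seed 20240104
set obs 1000
generate treatment = runiform() < 0.5
generate baseline_y = rnormal()              // measured before treatment, so unaffected by it
generate y = 0.5*treatment + 0.8*baseline_y + rnormal()

* Without covariates
regress y treatment, robust
scalar se_unadjusted = _se[treatment]

* With the baseline covariate included additively
regress y treatment baseline_y, robust
scalar se_adjusted = _se[treatment]

display "SE without covariate: " se_unadjusted
display "SE with covariate:    " se_adjusted
```

When the covariate explains a large share of outcome variance, as assumed here, the adjusted standard error should be noticeably smaller; with a covariate unrelated to the outcome it would not be.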
Heterogeneity analysis and multiple hypothesis testing
As written above, it is likely that a treatment will affect individuals or groups of individuals in different ways. If you believe some type of group will respond differently to the treatment based on some observable characteristic or set of characteristics, you may want to test for heterogeneous treatment effects. Ideally, this potential heterogeneity is considered in advance so that the study can be designed with sufficient power to detect it if it exists. This involves including the relevant subgroups as strata in the research design, which allows for stratified randomization (described further below and in the resource on randomization). This approach combines the benefits of covariate inclusion (increased precision) without the associated drawbacks (biased ATE).
If instead, after randomization and implementation, you discover that one or more groups are responding differently to the treatment, you may want to test for heterogeneous treatment effects using groups that are determined ex post (i.e., after randomization). Duflo et al. 2007 (page 64) provide an overview of how to discuss results from subgroup analysis when the groups are determined ex ante versus ex post. EGAP has a more detailed and technical discussion, including sample R code to test for heterogeneous treatment effects, and Ben Jann has useful slides from a 2015 University of Michigan workshop.
Stratification may be employed to improve balance at baseline across treatment and control groups for the stratifying variables. It is useful when you are interested in testing for heterogeneous treatment effects for some variable (such as gender) and want to ensure you are sufficiently powered to do so. The decision to stratify is made at the design stage and should be incorporated into power calculations and described in the trial registry entry and, if applicable, pre-analysis plan. As mentioned above, including strata can also improve precision (if they are correlated with the outcome) without introducing bias, since treatment status is by construction random conditional on the strata. Here, we assume strata have already been created and that randomization was stratified--see the resource on randomization for more.
If the probability of treatment varies by stratum, then treatment assignment is conditional on the strata, and analysis of the stratified randomization should control for the stratification variables by including them as controls in the regression specification (Duflo et al. 2007); if the probability of treatment does not vary by stratum, strata indicators can be included but are not necessary for unbiasedness. If stratifying on multiple variables and including strata indicators, you should include all stratifying variables in your regression and construct the strata variable to match the categories within which you randomized. For example, if you stratified by gender and urban/rural location and randomized within four distinct strata (female/urban, female/rural, male/urban, male/rural), you would have a stratum variable taking on values 1-4, each corresponding to a distinct stratum, and include it as i.stratum in a Stata regression. To test for heterogeneous treatment effects, interact the strata indicators with the treatment indicator. Standard errors are typically not clustered but should be adjusted to account for multiple hypothesis testing (more below).
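The gender by urban/rural example can be sketched as follows. The data are simulated, the assignment rule (half of each stratum treated) and all effect sizes are assumptions, and the variable u exists only to randomize the order within strata.

```stata
* Simulated data: all parameter values are illustrative assumptions
clear
set seed 20240105
set obs 2000
generate female = runiform() < 0.5
generate urban  = runiform() < 0.5
egen stratum = group(female urban)           // takes values 1-4, one per stratum

* Stratified randomization: treat half of each stratum
generate u = runiform()
bysort stratum (u): generate treatment = _n <= _N/2
generate y = 0.3*treatment + 0.2*treatment*female + 0.5*urban + rnormal()

* Main specification with strata indicators
regress y treatment i.stratum, robust

* Heterogeneity: interact treatment with the strata indicators
regress y i.treatment##i.stratum, robust
testparm i.treatment#i.stratum               // joint test that effects are equal across strata
```

The joint test on the interaction terms is one way to summarize heterogeneity across strata; individual interaction coefficients would still need a multiple-testing adjustment.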
Multiple treatment arms and multiple comparisons
For randomized evaluations with multiple treatments, researchers may be interested in all pairwise comparisons across the research design groups, pairwise comparisons with the control group, or pairwise comparisons with the best of the other treatments. F-tests can then be used in a regression model with separate dummies for each treatment arm to test whether any of the treatments is on average better (or worse) than the others or to conduct pairwise comparisons among the different groups. In Stata, this can be done by coding test _b[t1]=_b[t2] if testing that the effect of treatment 1 is equal to the effect of treatment 2.
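A short sketch of the tests described above, for a hypothetical design with one control group and two treatment arms. The dummies t1 and t2 follow the naming in the text; the arm variable, sample size, and effect sizes are assumptions.

```stata
* Simulated data: all parameter values are illustrative assumptions
clear
set seed 20240106
set obs 3000
generate arm = floor(3*runiform())           // 0 = control, 1 and 2 = treatment arms
generate t1 = arm == 1
generate t2 = arm == 2
generate y = 0.3*t1 + 0.5*t2 + rnormal()

regress y t1 t2, robust
test _b[t1] = _b[t2]     // is treatment 1 as effective as treatment 2?
test t1 t2               // joint test that neither arm has any effect
```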
Multiple hypothesis testing
Many treatments may affect a number of outcomes (a tutoring intervention, for example, could affect graduation rates, test scores, motivation, etc.). It is important to note that as the number of hypotheses tested increases, so does the probability of a false rejection of the null (aka a type I error)--see EGAP’s related methods guide or McKenzie (2020) for a longer description of this problem. One option in such instances is to aggregate several related outcomes into a single index--for example, Bergman et al. (2019) use the Kirwan Child Opportunity Index, which aggregates education, health, and economic indicators, in their evaluation of the Moving to Opportunity program. However, this approach is limited in that it does not allow researchers to uncover the effect on individual indicators, including cases where the treatment moves indicators in opposite directions or by different magnitudes.
Other options adjust the p-values or significance levels directly for the number of tests performed:
Bonferroni correction (in Stata: test …, mtest(bonferroni)), which involves multiplying each p-value by the number of tests performed (or, equivalently, dividing α, the significance level, by the number of tests performed). This can be overly conservative and can yield calculated p-values greater than one.
Benjamini-Hochberg method (practical application), which involves ordering p-values from smallest to largest and comparing each one to the significance level α scaled by (rank of p-value)/(total number of tests). For example, with 5 tests, the threshold for the 2nd-smallest p-value would be ⅖ α. If the p-value for that test were less than ⅖ α, that hypothesis and all those ranked before it (i.e., those with smaller p-values) would be rejected.
Stata code: David McKenzie provides an overview of multiple hypothesis testing commands in Stata in a 2020 blog post
R code: See EGAP’s methods guide
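The Bonferroni and Benjamini-Hochberg adjustments in the list above can also be computed by hand, which helps show what the packaged commands are doing. In this sketch the five outcomes and their p-values are made up for illustration, and α is assumed to be 0.05.

```stata
* Hypothetical p-values from five tests (made up for illustration)
clear
input str12 outcome double pval
"graduation"  0.003
"test_score"  0.012
"motivation"  0.035
"attendance"  0.038
"behavior"    0.300
end

local alpha = 0.05
sort pval
generate rank = _n
generate double threshold = (rank / _N) * `alpha'   // Benjamini-Hochberg cutoff by rank
generate byte below = pval <= threshold

* Find the largest rank whose p-value is at or below its cutoff,
* then reject that hypothesis and every hypothesis with a smaller p-value
quietly summarize rank if below
local k = cond(r(N) > 0, r(max), 0)
generate byte reject_bh = rank <= `k'

* Bonferroni for comparison: multiply by the number of tests, capped at 1
generate double p_bonferroni = min(1, pval * _N)
generate byte reject_bonf = p_bonferroni <= `alpha'

list outcome pval threshold reject_bh p_bonferroni reject_bonf, noobs
```

In this made-up example the third p-value (0.035) exceeds its own cutoff (0.03) but is still rejected under Benjamini-Hochberg because the fourth (0.038) falls below its cutoff (0.04), whereas Bonferroni rejects only the smallest p-value.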
Statistical validity, which comprises both internal and external validity, refers to the degree to which conclusions about treatment effects can be considered to be reasonable. External validity is the degree to which study results can be generalized to other contexts, such as different populations, places, or time periods. Internal validity refers to the degree to which conclusions about causal relationships can be made (e.g., whether the estimated size of a treatment effect is correct), given a study’s research design and implementation.
As randomized controlled trials in the social sciences take place outside of tightly controlled laboratory settings, researchers often encounter potential threats to research design that may complicate or compromise the intervention. Common threats to the internal validity of randomized evaluations include spillovers and attrition.
Spillovers are violations of the “stable unit treatment value assumption” (SUTVA, or that the potential outcome of any individual does not depend on the treatment assignment of others). Spillovers can occur when those who do not receive the treatment are still affected by it, and can thus bias treatment effect estimates. Within the context of an RCT, potential spillovers are ideally considered in the research design by choosing the appropriate level at which to randomize--for example, an intervention that may have spillovers within schools could randomize at the school level instead of the student level. It is also possible to design the study to measure spillovers directly, such as by randomizing exposure to the treatment. This is discussed at greater length by Duflo et al. (2007) and Baird et al. (2014).
As with other design-related considerations, analysis of spillover effects will depend on the study design. If the study is designed to measure spillovers, the estimating equation should include a variable for exposure (e.g., whether the individual lives in the same neighborhood as other individuals who received the treatment). The direct effect of the intervention is captured by the treatment indicator, as above, while the indirect effect of exposure to the treatment is captured by the additional variable--an example of this is included in the accompanying sample code. Miguel & Kremer (2004) and Duflo & Saez (2003) are good examples of studies that measure treatment effects in the presence of spillovers.
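A minimal sketch of an estimating equation with an exposure variable, assuming a hypothetical two-stage design: neighborhoods are first randomized to be treated or pure control, and individuals are then randomized within treated neighborhoods. The design, the variable names (treated_nbhd, exposed), and the effect sizes are all assumptions for illustration rather than the accompanying sample code.

```stata
* Simulated data: design and parameter values are illustrative assumptions
clear
set seed 20240107
set obs 100
generate neighborhood = _n
generate treated_nbhd = runiform() < 0.5                  // stage 1: randomize neighborhoods
expand 20                                                 // 20 individuals per neighborhood
generate treatment = treated_nbhd * (runiform() < 0.5)    // stage 2: randomize within treated neighborhoods
* Exposed = untreated individuals living in a treated neighborhood
generate exposed = treated_nbhd == 1 & treatment == 0
generate y = 0.5*treatment + 0.2*exposed + rnormal()

* Direct effect on the treated and indirect effect on exposed neighbors,
* both relative to individuals in pure control neighborhoods
regress y treatment exposed, vce(cluster neighborhood)
```

Standard errors are clustered by neighborhood here because the first stage of assignment happens at that level.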
Attrition occurs when study group members drop out of the study or data on them cannot be recovered. If characteristics of attrits (drop-outs) are correlated with their treatment assignment or effect size, the correlation may indicate systematic differences between the remaining program and control group members. This could lead to biased estimates of program effects, with the risk of bias increasing with the attrition rate. If attrition is uncorrelated with treatment assignment and outcomes, it will decrease power as a result of the sample size decreasing but will not affect the treatment effect on average. Attrition in field experiments is widespread, as discussed at length in a systematic review by Ghanem et al. (2020). It can often have substantial negative effects on the statistical power of experiments.
Wherever possible, researchers should consider whether they can reduce or eliminate attrition at reasonable cost. When dealing with attrition, it is important to try to understand why some sample members left by examining how the attrits’ characteristics are related to their group status or their outcomes. First, researchers should examine the overall rate of attrition in the study. Next, they should check for differential attrition: are there systematic differences in attrition rates between those in the treatment vs control groups? Do the characteristics of the attrits vary by treatment assignment, by subgroup, or by another observable characteristic? This can be done by regressing attrition on treatment assignment, a set of observables, or observables interacted with treatment assignment, using the main specification. That is, if the main specification clusters standard errors and includes strata fixed effects, weights, and covariates, so should the attrition test.
Using the terminology of Ghanem et al. (2020), a selective attrition test examines whether, conditional on attrit vs. non-attrit status, baseline observable characteristics differ between the treatment and comparison groups. Approaches to selective attrition tests vary, but in all of these tests the main specification (i.e., with appropriate fixed effects, weights, covariates, treatment of standard errors, etc.) should be used. A common approach is a simple test for differences in observable characteristics between treatment and comparison groups, conditional on attrit status. A joint test examines characteristics of the treatment and comparison groups among attrits and non-attrits.
In Stata, this could look like:
* Test for differential attrition:
areg attrit treat, absorb(stratum) cluster(stratum)
* Simple test for selective attrition:
reg X treat if attrit==0
reg X treat if attrit==1
* Two options for a joint test for selective attrition:
reg X treat attrit treat#attrit
reg attrit treat X treat#X
One option for dealing with attrition is to bound the effects using non-parametric methods. Two common approaches in studies using random assignment are Horowitz-Manski bounds and Lee bounds. Both approaches use assumptions about who has left the sample but do not require that attrition be random.
Horowitz-Manski bounds try to bound the bias that comes from the fact that outcomes are correlated with attrition and assume that those with extreme outcomes attrit. The upper bound assumes that all attritors in the treatment group had the highest outcome, and all attritors in the comparison group had the lowest outcome. In practice, this means replacing the missing outcome values in the treatment group with the highest observed outcome and replacing the missing values in the comparison group with the lowest observed outcome. The lower bound assumes the opposite, i.e., that all attritors in the treatment group had the lowest outcome and all attritors in the comparison group had the highest outcome. This approach has the benefit of relying only on the assumption of a bounded support, but the bounds it provides can be large and thus not necessarily useful.
/* To create upper bounds: replace missing outcome values of treatment group with the highest observed outcome and missing outcome values of control group with lowest observed outcome values */
gen hm_upperbound = outcome
quietly sum hm_upperbound
replace hm_upperbound = r(max) ///
if hm_upperbound == . & treatment == 1
replace hm_upperbound = r(min) ///
if hm_upperbound == . & treatment == 0
/* To create lower bounds: replace missing outcome values of treatment group with lowest observed and replace missing control outcome values with highest observed outcome */
gen hm_lowerbound = outcome
quietly sum hm_lowerbound
replace hm_lowerbound = r(min) ///
if hm_lowerbound == . & treatment == 1
replace hm_lowerbound = r(max) ///
if hm_lowerbound == . & treatment == 0
Lee bounds require the stronger assumption of monotonicity, i.e., that treatment assignment can only affect attrition in one direction. Calculating Lee bounds involves trimming observations from the group that experienced less attrition. These bounds can additionally be tightened using a covariate that predicts attrition. Specifically, to calculate Lee bounds:
- Calculate the trimming fraction: p=(fraction remaining in less-attrited group - fraction remaining in more-attrited group)/fraction remaining in less-attrited group
- Drop the lowest p% of outcomes from the less-attrited group
- Re-calculate the mean outcome for the trimmed, less-attrited group and compare it to the mean outcome in the more-attrited group. This is one bound.
- Repeat the previous two steps, but this time dropping the highest p% of outcomes from the less-attrited group, to obtain the other bound.
In Stata, use the leebounds command. See the help file after installation for options to tighten bounds.
leebounds outcome treatment
where treatment denotes receipt of treatment.
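To see what the trimming procedure does, the steps can be reproduced by hand on simulated data. This is a rough sketch, not a substitute for leebounds: it produces point bounds only (no standard errors), and it assumes the treatment group is the less-attrited group. The attrition rates (10% and 20%) and the effect size are made up for the example.

```stata
* Simulated data: all parameter values are illustrative assumptions
clear
set seed 20240108
set obs 2000
generate treatment = runiform() < 0.5
generate outcome = 0.4*treatment + rnormal()
* Assumed attrition: 10% in the treatment group, 20% in the control group
replace outcome = . if runiform() < cond(treatment, 0.10, 0.20)
generate byte observed = !missing(outcome)

* Trimming fraction from the share remaining in each group
quietly summarize observed if treatment == 1
local share_t = r(mean)
quietly summarize observed if treatment == 0
local share_c = r(mean)
assert `share_t' > `share_c'     // treatment is the less-attrited group in this simulation
local p = (`share_t' - `share_c') / `share_t'

* Rank observed treatment-group outcomes and work out how many to trim
egen rank_t = rank(outcome) if treatment == 1 & observed, unique
quietly count if treatment == 1 & observed
local n_obs_t = r(N)
local n_trim = floor(`p' * `n_obs_t')

quietly summarize outcome if treatment == 0
local mean_c = r(mean)

* Upper bound: drop the lowest outcomes from the treatment group
quietly summarize outcome if treatment == 1 & observed & rank_t > `n_trim'
local bound_upper = r(mean) - `mean_c'
* Lower bound: drop the highest outcomes instead
quietly summarize outcome if treatment == 1 & observed & rank_t <= `n_obs_t' - `n_trim'
local bound_lower = r(mean) - `mean_c'

display "Lee bounds: [" %6.3f `bound_lower' ", " %6.3f `bound_upper' "]"
```

If the control group were the less-attrited group, the trimming would be applied to the control group instead and the roles of the two bounds would swap.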
Useful resources on bounds include:
- MIT 14.771 recitation notes and Harald Tauchmann’s slides and paper on the leebounds command
- Pages 25-30 of the online appendix to McKenzie (2017) provide an example of implementing several different bounds to examine sensitivity of results to attrition
- The original papers are Horowitz & Manski (2000) and Lee (2009)
Inverse probability weighting
Inverse probability weighting (IPW) relies on the assumption that, conditional on a set of observable factors X, attrition is independent of the outcome. This is a stronger assumption than those required for either Horowitz-Manski bounds or Lee bounds; as such, its use has declined in recent years. IPW scales up the estimate of the treatment effect of the non-missing individuals who have a covariate profile X. For example, if, conditional on gender, attrition is random, and 1/4 of women attrited, the outcomes of women in the treatment group would be scaled up by 1/(3/4)=4/3.
In Stata:
teffects ipw (y) (treatment X)
where treatment denotes receipt of treatment.
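The gender example above can also be sketched by constructing attrition weights manually: estimate the probability of remaining in the sample given X, then weight each remaining observation by the inverse of that probability. This is one possible implementation under the assumption that attrition depends only on gender; the attrition rates and effect sizes are made up.

```stata
* Simulated data: all parameter values are illustrative assumptions
clear
set seed 20240109
set obs 2000
generate treatment = runiform() < 0.5
generate female = runiform() < 0.5
generate y = 0.5*treatment + 0.3*female + rnormal()
* Assumed attrition: 25% of women and 10% of men are lost, independent of outcomes
generate byte observed = runiform() > cond(female, 0.25, 0.10)
replace y = . if !observed

* Probability of remaining in the sample, conditional on X (here, gender)
logit observed i.female
predict p_observed, pr
generate ipw = 1 / p_observed

* Women who remain receive a weight of roughly 1/(3/4) = 4/3
tabulate female, summarize(ipw)

* Weighted regression on the non-missing sample
regress y treatment [pweight = ipw], robust
```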
There are two broad sets of considerations to keep in mind when constructing standard errors for treatment effect estimates. The first relates to sampling method, namely, whether you conducted clustered random sampling and want to generalize your results to a population. The second is whether you used cluster random assignment to assign treatment. In the latter case, difficulties in estimation arise if the number of clusters is small (fewer than 25-30). In that case, researchers may want to bootstrap.
An alternative approach that is becoming increasingly widespread is to use randomization inference, described below, to test the sharp null hypothesis that the treatment effect is zero for every participant.
Robust standard errors and clustering
Regardless of your sampling or assignment procedure, it is generally recommended to always use heteroskedasticity-robust standard errors in your regression specification (White 1980; Angrist & Pischke 2009; Wooldridge 2013). Standard errors should additionally be clustered in two situations:
- When you have assigned treatment at another unit than the one at which you are measuring outcomes. Here, Abadie et al. (2017) recommend clustering at the level at which treatment was assigned. For example, if assigning treatment at the village level but measuring individual-level outcomes, standard errors would need to be clustered at the village level. Note that even without clustered random assignment, if you have repeated observations of the same unit (for example, if you have panel data), you will want to cluster standard errors by unit to account for correlations within the unit over time.
- When you have sampled units from a population using clustered sampling: Key considerations here are how the sample was selected, whether there are clusters in the population of interest that are not in the sample, and whether you want to say something about the population from which the sample was drawn. You may also want to use sample weights in order to generalize results to the population; see more below.
With too few clusters (fewer than 25-30), cluster-robust standard errors tend to underestimate the intra-cluster correlation. As a result, they are biased downward, leading to over-rejection of the null hypothesis. Alternatives include bootstrapping standard errors or using randomization inference, both discussed below.
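To make the difference between robust and cluster-robust standard errors concrete, the following is a minimal Python sketch (standard library only) rather than a replacement for the Stata or R routines. It regresses an outcome on a constant and a treatment dummy and computes both standard errors with the sandwich formula. The function name, the simulated village data, and the particular finite-sample corrections (N/(N-K) for the robust version, G/(G-1) × (N-1)/(N-K) for the clustered version) are assumptions made for illustration.

```python
import math
import random


def ols_treatment_se(y, t, cluster):
    """Regress y on a constant and t; return (coef, robust_se, cluster_se) for t."""
    n, k = len(y), 2
    # (X'X)^-1 for X = [1, t], inverted by hand since it is only 2x2.
    st, stt = sum(t), sum(ti * ti for ti in t)
    det = n * stt - st * st
    inv = [[stt / det, -st / det], [-st / det, n / det]]
    sy, sty = sum(y), sum(ti * yi for ti, yi in zip(t, y))
    a = inv[0][0] * sy + inv[0][1] * sty
    b = inv[1][0] * sy + inv[1][1] * sty
    resid = [yi - a - b * ti for yi, ti in zip(y, t)]

    def se_from_scores(scores, scale):
        # "Meat" of the sandwich: sum of outer products of the score vectors.
        m00 = sum(s[0] * s[0] for s in scores)
        m01 = sum(s[0] * s[1] for s in scores)
        m11 = sum(s[1] * s[1] for s in scores)
        # Entry [1][1] of inv * meat * inv: the variance of the t coefficient.
        v = (inv[1][0] * (m00 * inv[0][1] + m01 * inv[1][1])
             + inv[1][1] * (m01 * inv[0][1] + m11 * inv[1][1]))
        return math.sqrt(scale * v)

    # Heteroskedasticity-robust: one score vector per observation.
    obs_scores = [(u, ti * u) for u, ti in zip(resid, t)]
    robust_se = se_from_scores(obs_scores, n / (n - k))

    # Cluster-robust: scores are first summed within each cluster.
    sums = {}
    for u, ti, ci in zip(resid, t, cluster):
        s = sums.setdefault(ci, [0.0, 0.0])
        s[0] += u
        s[1] += ti * u
    g = len(sums)
    cluster_se = se_from_scores(list(sums.values()),
                                g / (g - 1) * (n - 1) / (n - k))
    return b, robust_se, cluster_se


# Hypothetical data: 20 villages of 10 people, treatment assigned by village,
# with a shared village-level shock that induces within-village correlation.
rng = random.Random(20200121)
y, t, village = [], [], []
for v in range(20):
    shock = rng.gauss(0, 1)
    for _ in range(10):
        y.append(1.0 + 0.5 * (v % 2) + shock + rng.gauss(0, 1))
        t.append(v % 2)
        village.append(v)

coef, robust_se, cluster_se = ols_treatment_se(y, t, village)
print(coef, robust_se, cluster_se)
```

When outcomes are positively correlated within villages, the clustered standard error will typically be noticeably larger than the unclustered one, which is why ignoring village-level assignment overstates precision.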
In addition to Abadie et al. (2017) and McKenzie (2017), useful resources on clustering include Blattman (2015); Cameron & Miller (2015) (also has a useful discussion of clustering standard errors vs fixed effects); and MIT 14.771 recitation notes.
Bootstrapping
Non-parametric (resampling) bootstrapping treats the original sample as a population from which to draw further random samples. In practice, units are randomly re-sampled from the original sample, typically with replacement. Each set of draws yields a new dataset in which some observations appear multiple times and others do not appear at all (the standard is to draw these pseudo-samples with the same number of observations N as the original sample). From the distribution of the estimates across these pseudo-samples, we can compute a valid standard error for the original estimate, as long as the original sample is reasonably representative of the population in terms of coverage, so that the empirical distribution in the sample can reasonably be treated as a nonparametric estimate of the distribution in the population (Cameron & Miller 2015). With clustering, one draws entire clusters with replacement instead of individual observations.
Note that, as with randomization, it is important to set the seed when writing your code because it allows others (or your future self) to replicate your results. The default number of bootstrap replications is sometimes set at 50 to minimize computation time, but this is typically too low for results in a paper. Cameron & Trivedi (2005) suggest 400 replications when the bootstrap is used to calculate standard errors, though for other uses (e.g., confidence intervals), more replications are typically needed. Very large numbers of replications are beneficial in theory, since the procedure's justification assumes an infinite number of observations and replications, but in practice the bootstrap estimates stabilize quickly as the number of replications grows, so a finite number of replications is sufficient.
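The resampling logic can be sketched in a few lines of Python (standard library only). This is an illustrative sketch, not a substitute for the built-in Stata or R routines: it bootstraps the standard error of a simple difference in means, resampling whole clusters with replacement and fixing the seed for replicability. The function names and the simulated data are hypothetical, and the sketch assumes a 0/1 treatment indicator; pseudo-samples that happen to contain only one arm are redrawn.

```python
import random
import statistics


def diff_in_means(y, t):
    """Mean outcome of treated units minus mean outcome of control units."""
    treated = [yi for yi, ti in zip(y, t) if ti == 1]
    control = [yi for yi, ti in zip(y, t) if ti == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


def cluster_bootstrap_se(y, t, cluster, reps=400, seed=20200121):
    """Bootstrap SE of the difference in means, resampling whole clusters."""
    if not 0 < sum(t) < len(t):
        raise ValueError("need both treated and control observations")
    rng = random.Random(seed)  # fixed seed so results can be replicated
    by_cluster = {}
    for yi, ti, ci in zip(y, t, cluster):
        by_cluster.setdefault(ci, []).append((yi, ti))
    ids = list(by_cluster)
    estimates = []
    while len(estimates) < reps:
        # Draw as many clusters as in the original sample, with replacement.
        draw = rng.choices(ids, k=len(ids))
        rows = [row for ci in draw for row in by_cluster[ci]]
        ys = [r[0] for r in rows]
        ts = [r[1] for r in rows]
        if 0 < sum(ts) < len(ts):  # redraw if one arm is missing
            estimates.append(diff_in_means(ys, ts))
    # The spread of the estimates across pseudo-samples is the bootstrap SE.
    return statistics.stdev(estimates)


# Hypothetical data: 20 villages of 10 people, treatment assigned by village.
data_rng = random.Random(1)
y, t, village = [], [], []
for v in range(20):
    shock = data_rng.gauss(0, 1)
    for _ in range(10):
        y.append(0.5 * (v % 2) + shock + data_rng.gauss(0, 1))
        t.append(v % 2)
        village.append(v)

print(diff_in_means(y, t), cluster_bootstrap_se(y, t, village))
```

Passing each observation's own index as the cluster identifier reduces this to the ordinary observation-level bootstrap.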
The bootstrap assumes approximate linearity and normal distribution of the estimator (Horowitz 2001). It also assumes independent observations (though the cluster option allows for dependence within clusters, provided the clusters are independent). If the errors are independent but not identically distributed, a wild bootstrap is more appropriate.
The wild bootstrap allows for heteroskedasticity by building pseudo-samples from the residuals: each pseudo-outcome is drawn as either βx+e or βx-e, i.e., the fitted value plus or minus the observation's own residual (the usual hats on β and e are omitted here) (Wu 1986). Wild bootstraps are also useful in settings where the size of clusters varies or where there are few clusters, and have been shown to behave well with as few as five clusters (Cameron et al. 2008 and Berk Ozler’s blog post). They may also be used in settings with few treated clusters or weak instruments. More generally, the wild bootstrap is especially useful when conventional inference methods are unreliable because large-sample assumptions do not hold.
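The construction of the wild bootstrap pseudo-samples can be illustrated with a short Python sketch (standard library only). It is a simplified, unclustered version under stated assumptions: a single regressor, residual signs flipped independently with probability one half (Rademacher weights), and the standard error taken as the standard deviation of the re-estimated slopes. It is not the wild cluster bootstrap studied by Cameron et al. (2008), which flips signs by cluster and typically bootstraps t-statistics. All names and data are hypothetical.

```python
import random
import statistics


def ols_intercept_slope(x, y):
    """Intercept and slope from a regression of y on a constant and x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def wild_bootstrap_se(x, y, reps=999, seed=20200121):
    """Wild bootstrap standard error of the slope, using random sign flips."""
    rng = random.Random(seed)
    a, b = ols_intercept_slope(x, y)
    fitted = [a + b * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(reps):
        # Each pseudo-outcome is the fitted value plus or minus the
        # observation's own residual, so heteroskedasticity is preserved.
        y_star = [fi + ei * rng.choice((-1.0, 1.0))
                  for fi, ei in zip(fitted, resid)]
        slopes.append(ols_intercept_slope(x, y_star)[1])
    return statistics.stdev(slopes)


# Hypothetical data: the outcome is noisier in the treated group.
data_rng = random.Random(3)
x = [0] * 30 + [1] * 30
y = [1.0 + 0.5 * xi + data_rng.gauss(0, 1 + 2 * xi) for xi in x]

print(ols_intercept_slope(x, y)[1], wild_bootstrap_se(x, y))
```

Because the regressors and fitted values are held fixed and only the residual signs vary, each observation keeps its own error variance across pseudo-samples.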
Useful resources on bootstraps include:
- The Stata help file, this post on bootstrap sampling and estimation on the Stata blog, and Guan (2003)
- Horowitz (2001)
- Cameron et al. (2008)
- Lecture notes: Orloff & Bloom (2014) for MIT 18.05 and part 2 of Orloff & Bloom (2014) for MIT 18.05 (both are fairly accessible); Chernozhukov & Fernandez-Val (2017) (rather technical)
- To bootstrap standard errors, use the vce option in the regress/xtreg commands:
regress y x1 x2 x3, vce(bootstrap, reps(50) seed(20200121))
/* reps(50) keeps computation time low; increase the number of replications for final results (see above) */
/* You can cluster the bootstrapped standard errors by adding the cluster option inside the vce brackets, as in vce(bootstrap, cluster(clustervar)), where clustervar denotes the clustering variable (e.g., school or village) */
- For non-estimation commands or user-written programs, use the bootstrap prefix, which also allows for clustering:
/* For non-estimation commands such as obtaining the mean, use e.g., */
bootstrap mean=r(mean), reps(200) seed(20200121): summarize x1

/* The bootstrap prefix can also be used with your own programs. Using the example from Cameron & Trivedi (2005), the point estimates and standard errors of a user-written program called poissrobust can be bootstrapped using: */
bootstrap _b _se, reps(100) seed(20200121): poissrobust
/* _b and _se collect the point estimates and standard errors that poissrobust returns in each replication; poissrobust is a user-written program (omitted here). A similar approach should be taken for two-step estimators (e.g., using a control function approach), where the standard errors for the two stages need to be bootstrapped together. */
Randomization inference
Randomization inference considers the study sample to make up the full universe of eligible units. This is in contrast with the conventional approach to calculating p-values for t-tests, where the study sample is considered to be drawn from a larger population.
With randomization inference, all variation in outcomes comes from treatment assignment. Since there is no sampling variation, observed differences in outcomes are exact (i.e., they are differences in the population). Randomization inference then tests the sharp null hypothesis that the treatment effect is zero for every participant, i.e., the null that the treatment is irrelevant (and had no effect on the mean, variance, etc.) (Young 2018).
In practice, exact p-values are calculated by holding covariates and outcomes fixed and randomly re-assigning treatment in the data. The exact p-value is then the fraction of all potential treatment assignments that yield a treatment effect estimate at least as large as the one under the actual assignment. This contrasts with the conventional approach, where the p-value gives the probability of observing a difference in outcome means at least as large as the one observed if there were actually no difference in outcome means in the sampling frames from which the groups were drawn.
As of the writing of this guide, randomization inference is increasingly used in place of the conventional approach, described below, to estimate standard errors and p-values. This is because t-tests are less reliable when group sizes are unequal or when the distribution of outcomes is skewed, as may be the case with RCT data (Young 2018, Gerber & Green 2012, Green n.d.).
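The re-assignment procedure can be sketched in Python (standard library only). The sketch below is illustrative and rests on assumptions that may not match a given study: treatment is taken to have been assigned by complete randomization at the individual level (a fixed number of treated units, no strata or clusters), the test statistic is the absolute difference in means, and the function names are hypothetical. In an actual analysis the re-assignments should mimic the original randomization procedure.

```python
import itertools
import random
import statistics


def diff_in_means(y, treated_idx):
    """Mean outcome of the units in treated_idx minus the mean of the rest."""
    treated = [y[i] for i in treated_idx]
    control = [yi for i, yi in enumerate(y) if i not in treated_idx]
    return statistics.fmean(treated) - statistics.fmean(control)


def ri_p_value(y, t, reps=None, seed=20200121):
    """Two-sided randomization-inference p-value for the difference in means.

    reps=None enumerates every assignment with the same number of treated
    units (exact p-value); otherwise `reps` random re-assignments are drawn.
    """
    n = len(y)
    actual = frozenset(i for i, ti in enumerate(t) if ti == 1)
    observed = abs(diff_in_means(y, actual))
    if reps is None:
        draws = (frozenset(c)
                 for c in itertools.combinations(range(n), len(actual)))
    else:
        rng = random.Random(seed)
        draws = (frozenset(rng.sample(range(n), len(actual)))
                 for _ in range(reps))
    hits = total = 0
    for idx in draws:
        total += 1
        # Outcomes stay fixed; only the treatment labels change.
        # The small tolerance makes sure ties with the observed value count.
        if abs(diff_in_means(y, idx)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical example with six units, three of them treated.
y = [1, 2, 3, 4, 5, 6]
t = [0, 0, 0, 1, 1, 1]
print(ri_p_value(y, t))             # exact: all 20 possible assignments
print(ri_p_value(y, t, reps=1000))  # approximation from random draws
```

Full enumeration is only feasible for small samples; with larger samples, a large random subset of re-assignments is used to approximate the exact p-value.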
Useful resources on randomization inference include:
- Alwyn Young’s 2018 paper is a useful overview of and motivation for randomization inference
- Athey & Imbens (2017) discuss issues with regression-based approaches to analyzing RCT data
- Don Green’s EGAP methods guide provides more on the uses, as well as sample R code
- Gerber & Green (2012) takes a randomization inference-based approach to p-values
- The World Bank DIME wiki provides an overview of randomization inference
In Stata: See McKenzie’s 2017 Development Impact blog post for a discussion of how to apply randomization inference procedures in Stata
In R: See sample code from Don Green’s EGAP methods guide
Using sample weights
If the study sample consists of a random sample drawn from a larger population, weights can be used to generalize results (either analytic results or summary statistics) to the larger population. This is particularly important when sampling was disproportionate, i.e., some groups were purposely over/undersampled so that the sample is not representative of the larger population without adjusting for groups’ probability of selection in the sample. Designing a sample to be representative of the larger population is expensive on a large scale, and smaller scale RCTs may not be designed as such. Examples of datasets that are designed for population-level inference are the World Bank’s LSMS and the US Census Bureau’s CPS. Stata has a full suite of survey commands for analyzing survey data, and this guide from UCLA has more information on how to conduct applied survey analysis in Stata.
Weights can also be used to adjust for nonresponse or attrition. If (survey) participation decisions are known and can be explained by observed variables, such differences can be overcome by reweighting. More commonly, however, (survey) participation may depend on unobserved variables. The US Department of Health and Human Services has a guide to nonresponse adjustments, and Reig (2017) covers steps to weight a sample, including constructing weights and sample R code.
In Stata: When conducting disproportionate stratified sampling, you can use pweight:
reg y x1 x2 [pweight=n]
/* where n (the weight) is equal to the inverse of the probability of being included in the sample due to the sampling design. */
/* note: when using pweight, standard errors are robust */
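The role of inverse-probability weights under disproportionate sampling can be seen in a small Python sketch (standard library only). The numbers are made up for illustration: two units come from a group sampled with probability 0.5 and two from a group sampled with probability 0.1, so each unit in the second group stands in for five times as many people in the population. The function name and data are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean: each value counts in proportion to its weight."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Hypothetical sample: (outcome, probability of inclusion in the sample).
sample = [(10.0, 0.5), (12.0, 0.5), (20.0, 0.1), (22.0, 0.1)]

values = [v for v, _ in sample]
# Sampling weight = inverse of the probability of being sampled.
weights = [1.0 / p for _, p in sample]

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)
print(unweighted, weighted)
```

The unweighted mean treats the oversampled group as if it made up half the population, while the weighted mean scales each observation by the number of population units it represents.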
Researchers frequently need to make decisions when conducting analyses, ranging from the construction of variables to the choice of covariates to include as controls to the use of sampling weights. These decisions can have considerable implications for the magnitude and significance (and sometimes even sign) of results. Some decisions may be foreseeable prior to data collection and can be specified in advance, though many will depend on unforeseeable circumstances of the research implementation in the field.
Sensitivity analysis is used to show how results change under different models or assumptions. This can allow you to assess the robustness of results and dispel concerns about specification searching. Common types of sensitivity analysis include analyzing the impact of outliers, non-compliance, attrition, non-response, changes in outcome definitions, different methods for accounting for clustering, and inclusion of covariates as controls.
There are emerging norms around demonstrating robustness of results to different assumptions. These include showing distributions of outcomes and other key variables as sanity checks. Authors also increasingly show several specifications alongside each other in the same table (for example, including vs. excluding fixed effects or covariates, or applying different bounds or assumptions regarding attriters) in order to show the extent to which treatment effect estimates change under different assumptions.
Useful tools for exporting tables and graphics include:
In Stata:
- estout package
- Overview of Stata to LaTeX output
- graph export for graphics
In R:
- write.table() and fwrite to export tables
- outreg package to make regression tables
- memisc package for survey data regression and summary statistic tables
- stargazer package for statistical output in text, HTML, or LaTeX: using stargazer
- texreg and xtable packages for statistical output in HTML or LaTeX
- Saving R graphics
Last updated July 2020.
These resources are a collaborative effort. If you notice a bug or have a suggestion for additional content, please fill out this form.
We thank Shawn Cole for helpful review and Jack Cavanagh for sample code and reviewing. All errors are our own.
Abadie, Alberto, Susan Athey, Guido W. Imbens, and Jeffrey Wooldridge. 2017. "When Should You Adjust Standard Errors for Clustering?" NBER Working Paper No. 24003.
Angrist, Joshua David, and Jörn-Steffen Pischke. 2009. Mostly harmless econometrics: an empiricist's companion.
Angrist, Joshua David. 2014. "Instrumental Variables (Take 2): Causal Effects in a Heterogeneous World." Delivered as part of MIT 14.387.
Athey, Susan and Guido Imbens. 2017. "The Econometrics of Randomized Experiments." Handbook of Economic Field Experiments, 73–140. doi:10.1016/bs.hefe.2016.10.003
Baird, Sarah, Aislinn Bohren, Craig McIntosh, and Berk Ozler. 2014. "Designing experiments to measure spillover effects," Policy Research Working Paper Series 6824, The World Bank.
Benjamini, Yoav, and Yosef Hochberg. 1995. "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing." Journal of the Royal Statistical Society.
Bettinger, Eric, Bridget Long, Philip Oreopoulos, and Lisa Sanbonmatsu. 2012. "The Role of Application Assistance and Information in College Decisions: Results from the H&R Block Fafsa Experiment." Quarterly Journal of Economics, 127(3), 1205-1242.
Bergman, Peter, Raj Chetty, Stefanie DeLuca, Nathaniel Hendren, Lawrence Katz, and Christopher Palmer. "Creating Moves to Opportunity: Experimental Evidence on Barriers to Neighborhood Choice." Working Paper.
Blattman, Chris. 2015. "Clusterjerk, the much anticipated sequel." Blog post. Last accessed June 20, 2020
Bruhn, Miriam, and David McKenzie. 2009. "In Pursuit of Balance: Randomization in Practice in Development Field Experiments." American Economic Journal: Applied Economics, 1 (4): 200-232.
Cameron, A. Colin and Douglas L. Miller. 2015. "A Practitioner’s Guide to Cluster-Robust Inference." Journal of Human Resources.
Cameron, A. Colin, and Pravin K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.
Cameron, A. Colin, Jonah B. Gelbach, and Douglas L. Miller. 2008. "Bootstrap-Based Improvements for Inference with Clustered Errors." The Review of Economics and Statistics.
Chernozhukov, Victor and Ivan Fernandez-Val. "L5. Bootstrapping" 14.382 Econometrics. Spring 2017. Massachusetts Institute of Technology: MIT OpenCourseWare.
Coppock, Alexander. n.d. "10 Things to Know About Multiple Comparisons." EGAP methods guides.
Deaton, Angus and Nancy Cartwright. 2018. "Understanding and Misunderstanding Randomized Controlled Trials." Social Science & Medicine 210: 2-21. https://doi.org/10.1016/j.socscimed.2017.12.005.
Dolan, Lindsay. n.d. "10 Things to Know About Covariate Adjustment." EGAP methods guides.
Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2008. “Using Randomization in Development Economics Research: A Toolkit.” T. Schultz and John Strauss, eds., Handbook of Development Economics. Vol. 4. Amsterdam and New York: North Holland.
Duflo, Esther, and Emmanuel Saez. 2003. "The Role of Information and Social Interactions in Retirement Plan Decisions: Evidence from a Randomized Experiment." The Quarterly Journal of Economics.
Fang, Albert. n.d. "10 Things to Know About Heterogeneous Treatment Effects." EGAP methods guides.
Freedman, David. 2008. "On regression adjustments to experimental data." Advances in Applied Mathematics.
Friedman, Jed. 2013. "Tools of the trade: when to use those sample weights." World Bank Development Impact Blog. Last accessed June 20, 2020.
Froelich, Markus and Blaise Melly. 2010. "Estimation of quantile treatment effects with Stata." The Stata Journal.
Gerber, Alan S., and Donald P. Green. 2012. Field experiments: design, analysis, and interpretation. New York: W.W. Norton.
Ghanem, Dalia, Sarojini Hirshleifer, and Karen Ortiz-Becerra. "Testing Attrition Bias in Field Experiments." CEGA Working paper.
Green, Donald P. and Peter Michael Aronow. 2011. "Analyzing Experimental Data Using Regression: When is Bias a Practical Concern?"
Green, Donald. n.d. "10 Things to Know About Randomization Inference." EGAP methods guides.
Horowitz, Joel and Charles Manski. 2000. "Nonparametric Analysis of Randomized Experiments with Missing Covariate and Outcome Data." Journal of the American Statistical Association.
Imbens, Guido W., and Donald B. Rubin. 2015. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge: Cambridge University Press.
Imbens, Guido W. and Jeffrey Wooldridge. 2007. "Instrumental Variables with Treatment Effect Heterogeneity: Local Average Treatment Effects," delivered as a lecture in the NBER's "What's New in Econometrics?" series.
Jann, Ben. 2015. "Heterogeneous Treatment Effect Analysis in Stata," delivered as a lecture in the "Heterogeneous Treatment Effects Project Workshop", University of Michigan.
Lee, David. 2009. "Training, Wages, and Sample Selection: Estimating Sharp Bounds on Treatment Effects." The Review of Economic Studies.
Loftus, Joshua. 2015. "Primer on multiple testing." Lecture.
McKenzie, David. 2012. "Tools of the Trade: A quick adjustment for multiple hypothesis testing." World Bank Development Impact Blog. Last accessed June 20, 2020.
McKenzie, David. 2017. "Identifying and Spurring High-Growth Entrepreneurship: Experimental Evidence from a Business Plan Competition." American Economic Review
McKenzie, David. 2017 (b). "When should you cluster standard errors? New wisdom from the econometrics oracle." World Bank Development Impact Blog. Last accessed June 20, 2020.
McKenzie, David. 2017 (c). "Finally, a way to do easy randomization inference in Stata!" World Bank Development Impact Blog. Last accessed June 20, 2020.
McKenzie, David. 2020. "An overview of multiple hypothesis testing commands in Stata." World Bank Development Impact Blog. Last accessed June 20, 2020.
Miguel, Edward, and Michael Kremer. 2004. "Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities." Econometrica
Özler, Berk. 2017. "Dealing with attrition in field experiments." World Bank Development Impact Blog. Last accessed June 20, 2020.
Özler, Berk. "Beware of studies with a small number of clusters." World Bank Development Impact Blog. Last accessed June 20, 2020.
Reig, Josep. 2017. "Step 1: Design Weights," in (Very) basic steps to weight a survey sample.
Schaner, Simone. 2008. "Regression Discontinuity, Attrition/Bounds, and Education." Recitation handout in MIT 14.771: Development Economics: Microeconomic Issues and Policy Models
Solon, Gary, Steven J. Haider, and Jeffrey Wooldridge. 2013. "What Are We Weighting For?" NBER Working Paper No. 18859.
Tauchmann, Harold. 2014. "Lee (2009) treatment-effect bounds for nonrandom sample selection." The Stata Journal.
UCLA Institute for Digital Research & Education. n.d. "Applied Survey Data Analysis in Stata 13."
U.S. Department of Health and Human Services. 2002. "Studies of Welfare Populations: Data Collection and Research Issues. Common Nonresponse Adjustment Measures in Surveys."
Van der Windt, Peter. n.d. "10 Things to Know About the Local Average Treatment Effect." EGAP methods guides.
White, Halbert. 1980. "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity." Econometrica 48, no. 4: 817-38.
Wooldridge, Jeffrey M. 2013. Introductory Econometrics: A Modern Approach. United Kingdom, Cengage Learning.
Young, Alwyn. 2019. "Channeling Fisher: Randomization Tests and the Statistical Insignificance of Seemingly Significant Experimental Results." The Quarterly Journal of Economics, Volume 134, Issue 2