# Data analysis

##### Summary

This guide provides an overview of data analysis for randomized evaluations in order to estimate causal impact. It is intended to provide something of a starting point and orient individuals not familiar with all nuances of the literature; it does not aim to provide a comprehensive or “authoritative” treatment of these topics. We instead link to useful resources for further reading and provide sample Stata and R code for each topic.

## Overview of resources

### Theory and intuition

• Duflo et al.’s (2007) Using Randomization in Development Economics Research: A Toolkit is an accessible guide to using randomization in development economics. Includes some technical discussion and equations but can be understood by most readers.

• Athey & Imbens (2017) The Econometrics of Randomized Experiments provides a more recent and more technical treatment of topics such as stratification and randomization inference. Best for readers with some graduate-level econometrics.

• Mostly Harmless Econometrics (MHE) is not RCT-specific but is an accessible applied econometrics text that covers the math and intuition behind decisions made in econometric analysis. The text also includes sample code for certain examples. The MIT course 14.387 “Applied econometrics: Mostly harmless big data” roughly follows MHE, with slides from the fall 2014 course run posted on MIT OCW.

• EGAP methods guides include topics ranging from reading regression tables to causal inference, heterogeneous treatment effects, treatment effects, local average treatment effects, and covariate adjustment. The guides do not provide a deep, comprehensive discussion of each topic but are good overviews and include sample R code.

• The World Bank has a series of methods blog posts on a wide range of topics. These posts typically summarize and then link to important papers on these topics.

## Treatment effects

We are interested in estimating the true effect of the treatment on the population from which the sample was drawn. As we only observe the study sample—not the full underlying population—we form estimates of the true treatment effect. Due to sampling variation, any given estimate of the treatment effect is unlikely to be exactly equal to the true effect, but if we were to repeat the study many times on different samples drawn from the same population, the average estimate would be equal to the treatment effect—this forms the basis for conducting power calculations.

Moreover, in any given sample, changes in outcomes resulting from the treatment are likely to vary between individuals or groups, i.e., there are likely to be heterogeneous treatment effects.1 For example, a program offering free prenatal care to eligible women may have a larger effect on birth outcomes for women at the bottom of the income distribution than those at the top, who may have access to other resources.

Formally, we can write the outcome of each unit in either the treatment or control group, using notation from Deaton & Cartwright (2018) as:

$$Y_i = \beta_iT_i +\sum_{j=1}^J \gamma_jx_{ij}$$

where

• βi is the treatment effect (which may be unit-specific)

• Ti is a treatment dummy

• xij are j=1,...,J observable and unobservable, unit-specific factors that may affect outcome Y

• γj indicates the effect of xj on Y and may be positive or negative

There are then two main approaches to estimating treatment effects. In the simplest regression framework, described below, treatment effects are assumed to be homogeneous, so that βi=β for all i. We follow this “conventional” approach for the majority of this guide. A second approach, randomization inference, is gaining popularity in the analysis of experimental data in economics. With randomization inference, we can directly test the exact null that βi=0 for all i. As this approach is somewhat distinct, conceptually, from the “conventional” approach, we discuss randomization inference towards the end of this guide.

### Average treatment effects (ATE)

Researchers are typically interested in the average treatment effect (ATE), which is the average causal effect of a variable (here, an intervention or program) on an outcome variable for the entire study population. In the classical experimental design (i.e., the simplest experimental design, with one treatment group, one control group, and perfect compliance, described further below), the ATE can be estimated as the difference in means of the outcome between the treatment and comparison groups.

Formally, with a given experimental sample of size N and given treatment group assignment, we can take the average of the treatment group (T=1) and the comparison group (T=0) to estimate the ATE:

$$\bar{Y}_1 - \bar{Y}_0 = \bar{\beta}_1+\sum_{j=1}^J \gamma_j(\bar{x}_{1j}-\bar{x}_{0j})$$

The first term on the right-hand side is the average treatment effect; the subscript indicates that it is the average of the treatment effects in the treatment group. The second term is the “error term” of the ATE estimate: the average difference between the treatment and control groups that is unrelated to treatment (from observable and unobservable differences), with each difference weighted by its effect γj on the outcome.

Perfect compliance is when all individuals in the treatment group take up treatment and no control individuals receive treatment. Under perfect compliance, treatment assignment (which is determined randomly by the research design) and treatment status (or the participation variable, which is determined by the individual or group) are identical. Then, a simple difference in means provides an unbiased estimate of the ATE in randomized studies, provided there is also no attrition or unaccounted-for spillovers. In this case, a linear model regressing the observed outcome on a treatment indicator can be used to estimate the effect of the treatment. Linear regression models can accommodate more flexible estimation techniques that rely on ordinary least squares (OLS). Examples include clustering of standard errors to account for within-group correlations (which is particularly important in a clustered design), or the inclusion of covariates to increase precision.

In Stata, the ATE can be estimated with the following code:

**** Stata Code ****

reg y treatment, robust

And in R:

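#### R Code ####

# coeftest() comes from the lmtest package, vcovHC() from the sandwich package
library(lmtest)
library(sandwich)

# Produces regression estimates: replace "mydata" with the name of your dataset
ATEestimate <- lm(y ~ treatment, data = mydata)

# Produces robust standard errors
ATERobust <- coeftest(ATEestimate, vcov = vcovHC(ATEestimate, type = "HC1"))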

where y is the outcome of interest and treatment a binary variable indicating whether treatment was assigned. robust gives heteroskedasticity-robust standard errors, described further below and in White (1980), Angrist & Pischke (2009), and Wooldridge (2013) (see also chapter 8 accompanying Stata code for Wooldridge).
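As a small illustration of the statement that the regression coefficient on the treatment indicator is the difference in means, the following base R sketch uses a made-up data frame (`toy` and its values are hypothetical, not from any study) and computes the estimate both ways.

```r
# Hypothetical toy data: 4 treated and 4 comparison units
toy <- data.frame(
  treatment = c(1, 1, 1, 1, 0, 0, 0, 0),
  y         = c(7, 9, 8, 10, 5, 6, 4, 7)
)

# Difference in mean outcomes between treatment and comparison groups
diff_means <- mean(toy$y[toy$treatment == 1]) - mean(toy$y[toy$treatment == 0])

# OLS coefficient on the treatment indicator
ols_coef <- unname(coef(lm(y ~ treatment, data = toy))["treatment"])

print(c(diff_means = diff_means, ols_coef = ols_coef))
```

Both numbers are the same estimate; the regression framing is mainly useful because it extends to robust or clustered standard errors and to covariates.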

### Intention to treat effects (ITT)

Imperfect compliance is when individuals do not follow their treatment assignment. Compliers are people who take up the treatment only because they were assigned to receive it (and do not take it if they are not assigned treatment). Non-compliers fall into three groups:

• Always-takers: always take the treatment even if assigned to the control group

• Never-takers: always refuse treatment, even if assigned to the treatment group

• Defiers: do the opposite of their treatment assignment

Non-compliance can be one- or two-sided. One-sided non-compliance is when individuals assigned to the treatment group refuse treatment OR individuals assigned to the control group take up treatment. Two-sided non-compliance is when both of these occur.

In many cases, researchers and policymakers care about identifying the impact of the offer of the program on the population that was offered it, even if some of them did not take it up, as this resembles what is likely to happen if the program is rolled out. The intention to treat (ITT) is an estimate of the effect of the program on those assigned to treatment, regardless of their take-up. That is, the ITT is obtained from regressing the outcome on treatment assignment for the whole sample.

The ITT will often (though not always) provide a lower bound on the ATE, as it includes in the treatment group some individuals who did not receive the treatment (under the assumption that they would have benefited less from the treatment than those who took it up) and may include in the comparison group some individuals who did in fact receive the treatment.

In Stata (top) and R (bottom), the ITT is obtained with the following code:

**** Stata Code ****

reg y assign_treatment, robust

#### R Code ####

library(lmtest)
library(sandwich)

# Produces regression estimates: replace "mydata" with the name of your dataset
ITTestimate <- lm(y ~ assign_treatment, data = mydata)

# Produces robust standard errors
ITTRobust <- coeftest(ITTestimate, vcov = vcovHC(ITTestimate, type = "HC1"))

where assign_treatment=1 if the individual is assigned the treatment and 0 otherwise.

### Local average treatment effects (LATE)

The LATE provides an estimate of the treatment effect for compliers, i.e., those who take up the treatment if and only if they are assigned to it. Formally, it is given by:

$$LATE = \frac{E(Y_i \mid z_i = 1) - E(Y_i \mid z_i = 0)}{E(d_i \mid z_i = 1) - E(d_i \mid z_i = 0)}$$

As above, Y denotes the outcome for individual (or group, depending on the unit of analysis) i. z denotes treatment assignment and is 1 if the treatment was assigned and 0 otherwise, and d denotes whether the treatment was received (and is 1 if it was, 0 otherwise). That is, (random) treatment assignment is used as an instrument for treatment status.
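The ratio above can be computed by hand. The sketch below does so in base R on a made-up data set (`toy` and all of its values are hypothetical), with columns named z, d, and y to match the notation in the formula. With a single binary instrument and no covariates, this ratio coincides with the two-stage least squares estimate.

```r
# Hypothetical toy data with two-sided non-compliance
toy <- data.frame(
  z = rep(c(1, 0), each = 10),          # treatment assignment
  d = c(rep(1, 6), rep(0, 4),           # assigned to treatment: 6 of 10 take it up
        rep(1, 1), rep(0, 9)),          # assigned to control: 1 of 10 takes it up
  y = c(10, 9, 11, 10, 9, 11, 5, 6, 4, 5,
        10, 6, 5, 7, 6, 5, 6, 5, 6, 4)
)

# Numerator: effect of assignment on the outcome (the ITT)
itt_y <- mean(toy$y[toy$z == 1]) - mean(toy$y[toy$z == 0])

# Denominator: effect of assignment on take-up (the first stage)
itt_d <- mean(toy$d[toy$z == 1]) - mean(toy$d[toy$z == 0])

# LATE: the ITT scaled by the difference in take-up rates
late <- itt_y / itt_d

print(c(ITT = itt_y, first_stage = itt_d, LATE = late))
```

Here assignment raises the mean outcome by 2 and the take-up rate by 0.5, so the implied effect for compliers is 4.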

The LATE is limited in that it is only well-defined for compliers and is specific to the instrument used; it cannot uncover the effects for always-takers, never-takers, or defiers. In addition to the standard independence assumption that follows from randomization (i.e., the instrument, in this case treatment assignment, is as good as randomly assigned), and the assumption that there is a positive share of compliers, it relies on two key assumptions:

1. Monotonicity: Assignment to treatment does not make one less likely to be treated

2. The exclusion restriction: Individuals respond to the treatment itself, not treatment assignment, so that the outcome is the same for those who would not have taken up the treatment, regardless of treatment assignment

Just as the ITT typically provides a lower bound on the ATE under imperfect compliance, the LATE typically provides an upper bound (though, again, this is not always the case). This is because it estimates the effect of the treatment on those who took it up--who are often more likely to benefit from the treatment than those who did not take it up. The higher the compliance, the closer the LATE will generally be to the true ATE.

Useful resources on the LATE include Imbens & Wooldridge (2007), Angrist (2014) lecture notes and Peter Hull’s corresponding recitation notes, and EGAP’s methods guide on the LATE.

In Stata (top) and R (bottom) the LATE is estimated with:

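**** Stata Code ****

* ivreg2 is a user-written command: ssc install ivreg2
ivreg2 y (treated=assign_treatment), robust first

#### R Code ####

# ivreg() is provided by the AER package
library(AER)
library(lmtest)
library(sandwich)

# Produces regression estimates: replace "mydata" with the name of your dataset
LATEestimate <- ivreg(y ~ treated | assign_treatment, data = mydata)

# Produces robust standard errors
LATERobust <- coeftest(LATEestimate, vcov = vcovHC(LATEestimate, type = "HC1"))

where treated=1 if the treatment was received and 0 otherwise and assign_treatment is defined as above.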

### Treatment on the treated (ToT)

The treatment on the treated (ToT) is the treatment effect on those who actually take up the treatment. The counterfactual of the ToT is control group members who would have accepted the treatment if they had been offered it, which cannot be observed. The ToT can be estimated when no one in the control group is treated, so non-compliance is one-sided. This can be a result of research design--if the control group is prevented from receiving or taking up the treatment--or simply from the realization that no one from the control group took up the treatment, even though it was possible for them to do so.

The ToT relies on the same assumptions as the LATE and is estimated in the same way: using an instrumental variables (IV) approach; the only difference is that for the ToT none of the comparison group members received the treatment. Intuitively, it is a weighted average of the effects for always-takers and compliers who were assigned the treatment (Angrist 2014, slide 12).
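As a worked illustration (the numbers are hypothetical): when no one in the comparison group is treated, E(d_i | z_i = 0) = 0, so the LATE formula from the previous section reduces to the ITT divided by the take-up rate in the treatment group.

$$ToT = \frac{E(Y_i \mid z_i = 1) - E(Y_i \mid z_i = 0)}{E(d_i \mid z_i = 1)}, \qquad \text{e.g.} \quad \frac{2}{0.4} = 5$$

That is, if assignment raises the average outcome by 2 units and 40% of those assigned take up the treatment, the implied effect on those who took it up is 5 units, provided the assumptions listed for the LATE hold.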

### Quantile treatment effects

RCTs provide information on the differences in means between treatment and comparison groups. However, we might also care about how the treatment affected the distribution of outcomes across treatment and control groups. It is possible to use quantile regression to estimate the treatment’s effect on a specified quantile of the outcome variable (e.g., median, 10th percentile, 90th, etc.). That is, we can estimate the difference in outcomes at a given quantile. Quantile regression is useful for understanding how the treatment affected various points in the distribution of outcomes. It cannot, however, generally be used to estimate the distribution of treatment effects. Quantile coefficients can also only tell us about effects on distributions and not on individuals.

Quantile regression relies on the assumption that the outcome is a continuously distributed random variable with a well-behaved density (no gaps or spikes). Rank preservation is required if one wants to determine whether individuals are better or worse off because of the intervention. Without it, one can find an effect for the bottom decile, for example, without knowing whether the people who were originally in the bottom decile are actually better or worse off than they would have been without the intervention (Angrist and Pischke, 2009).
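To see how effects on quantiles can differ from the effect on the mean, the base R sketch below compares sample quantiles of two made-up outcome vectors (`y_control` and `y_treated` are hypothetical). It computes simple differences in sample quantiles rather than running a quantile regression, so it is an illustration of the idea and not a substitute for the commands discussed in this section.

```r
# Hypothetical outcomes: same mean in both groups, but the treated
# outcomes are more spread out
y_control <- c(4, 5, 6, 7, 8, 9, 10, 11, 12)
y_treated <- c(0, 2, 4, 6, 8, 10, 12, 14, 16)

# Difference in means (the ATE estimate)
ate <- mean(y_treated) - mean(y_control)

# Differences at the 10th, 50th, and 90th percentiles
probs <- c(0.1, 0.5, 0.9)
qte <- quantile(y_treated, probs) - quantile(y_control, probs)

print(ate)
print(qte)
```

The difference in means is zero, while the difference is negative at the 10th percentile and positive at the 90th. Note that this describes how the two distributions differ; without rank preservation it says nothing about which individuals gained or lost.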

Useful resources on quantile regression and treatment effects include:

• Chapter 7 of Angrist & Pischke (2009) covers quantile regression in some detail, including how and when it can have a causal interpretation

• Section 4.3 of Athey & Imbens (2017)

• EGAP’s methods guide includes a brief discussion, sample R code, and an example of a case where the ATE is 0 but the treatment effect negative for low quantiles of the response and positive for high quantiles.

In Stata: Quantile regression is done using the qreg command, with an example below. See more in Froelich and Melly (2010), the Stata help file and a helpful guide from UCLA on interpretation.

In R: Quantile regression is done using the quantreg package. See the CRAN documentation for more information.

**** Stata Code ****

qreg outcome assign_treatment, quantile(0.9) vce(robust) // for 90th percentile 

#### R Code ####

library(quantreg)

# For 90th percentile: replace "mydata" with the name of your dataset
QTEest <- rq(outcome ~ assign_treatment, tau = 0.9, data = mydata)

# For producing different types of standard errors;
## The options for "se=" include (but aren't limited to) "rank," "iid," and "boot"
QTERob <- summary(QTEest, se = "boot")


An important potential advantage of RCTs is that controls do not need to be included in the specification for identification (since treatment assignment is by design orthogonal to other covariates in expectation). Including covariates has both benefits and drawbacks. Covariates can increase precision and can also adjust for random imbalances between the treatment groups. However, unless the covariates are all indicators, partition the population, and are included as the full set of interactions, the estimate of the ATE will typically be biased in finite samples (though this bias decreases as the sample size increases) (Athey & Imbens 2017).3 If poorly chosen (not predictive of the outcome), covariates can also decrease precision.

Covariates can either be included additively or as a full set of interactions with the treatment variable. The former approach can increase precision, while including the full set of interactions with the treatment variable can allow testing for heterogeneous treatment effects (see more below). Moreover, there is some debate as to how to include imbalanced covariates in analysis: some researchers include them in the regression as is, while others recommend de-meaning the imbalanced covariates first (Imbens & Rubin 2015). You should select covariates that could not have been affected by the treatment (baseline controls are often used) but that are likely to affect your outcome variable. If you do include covariates, it is generally advisable to show results with and without covariate adjustment.

For more on including covariates, see the papers listed above as well as EGAP’s methods guide on covariate adjustment (RCT-specific) and Athey & Imbens (2017).
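The three specifications discussed in this section (no covariates, an additive covariate, and a de-meaned covariate fully interacted with treatment) can be sketched in base R as follows. The data frame `toy`, its values, and the variable names are hypothetical and chosen only for illustration.

```r
# Hypothetical toy data: x is a baseline covariate
toy <- data.frame(
  treat = c(1, 0, 1, 0, 1, 0, 1, 0),
  x     = c(1, 2, 3, 4, 5, 6, 7, 8),
  y     = c(5, 3, 8, 4, 9, 7, 13, 8)
)

# De-mean the covariate so that, in the interacted model, the coefficient
# on treat is the estimated effect at the sample mean of x
toy$x_c <- toy$x - mean(toy$x)

fit_unadj <- lm(y ~ treat, data = toy)        # no covariates
fit_add   <- lm(y ~ treat + x, data = toy)    # covariate included additively
fit_int   <- lm(y ~ treat * x_c, data = toy)  # full interaction with treatment

# Coefficient on treat under each specification
treat_coefs <- sapply(
  list(unadjusted = fit_unadj, additive = fit_add, interacted = fit_int),
  function(fit) unname(coef(fit)["treat"])
)
print(treat_coefs)
```

Reporting the unadjusted estimate alongside the adjusted ones follows the advice above to show results with and without covariate adjustment.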

## Heterogeneity analysis and multiple hypothesis testing

### Heterogeneity analysis

As noted above, it is likely that a treatment will affect individuals or groups of individuals in different ways. If you believe some type of group will respond differently to the treatment based on some observable characteristic or set of characteristics, you may want to test for heterogeneous treatment effects. Ideally, this potential heterogeneity is considered in advance so that the study can be designed with sufficient power to detect it if it exists. This involves including the relevant subgroups as strata (defined below and in the resource on randomization) in the research design. Doing so allows for stratified randomization, described further below and in the resource on randomization. This approach combines the benefits of covariate inclusion (increased precision) without the associated drawbacks (biased ATE).

If instead you discover after randomization and implementation that one or more groups are responding differently to the treatment, you may want to test for heterogeneous treatment effects using groups that are determined ex post (i.e., after randomization). Duflo et al. 2007 (page 64) provide an overview of how to discuss results from subgroup analysis when the groups are determined ex ante versus ex post. EGAP has a more detailed and technical discussion, including sample R code to test for heterogeneous treatment effects, and Ben Jann has useful slides from a 2015 University of Michigan workshop.

### Stratification

Stratification may be employed to improve balance at baseline across treatment and control groups for the stratifying variables. It is useful when you are interested in testing for heterogeneous treatment effects for some variable (such as gender) and want to ensure you are sufficiently powered to do so. The decision to stratify is made at the design stage and should be incorporated into power calculations and described in the trial registry entry and, if applicable, pre-analysis plan. As mentioned above, including strata can also improve precision (if they are correlated with the outcome) without introducing bias, since treatment status is by construction random conditional on the strata. Here, we assume strata have already been created and that randomization was stratified--see the resource on randomization for more.

If the probability of treatment varies by stratum, then treatment assignment is random only conditional on the strata, and analysis of the stratified randomization should control for the stratification variables by including them as controls in the regression specification (Duflo et al. 2007); if the probability of treatment does not vary by stratum, strata indicators can be included but are not necessary for unbiasedness. If stratifying on multiple variables and including strata indicators, you should include all stratifying variables in your regression, constructed to match the categories you randomized within. For example, if you stratified by gender and urban/rural location and randomized within four distinct strata (female/urban, female/rural, male/urban, male/rural), you would have a stratum variable taking on values 1-4, each corresponding to a distinct stratum, and include it as i.stratum in a Stata regression. To test for heterogeneous treatment effects, interact the strata indicators with the treatment indicator. Standard errors are typically not clustered but should be adjusted to account for multiple hypothesis testing (more below).
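A minimal base R sketch of these two specifications follows, using a made-up, noise-free data set (`toy` is hypothetical) with four strata in which the treatment effect is 2 in strata 1 to 3 and 3 in stratum 4. `factor(stratum)` plays the role of i.stratum in Stata.

```r
# Hypothetical balanced design: 2 treated and 2 control units per stratum
toy <- expand.grid(treat = c(0, 1), stratum = 1:4, copy = 1:2)

# Outcome = stratum-specific level + treatment effect (2, or 3 in stratum 4)
toy$y <- 10 + c(0, 1, 3, 6)[toy$stratum] +
  toy$treat * ifelse(toy$stratum == 4, 3, 2)

# Strata indicators included as controls
fit_strata <- lm(y ~ treat + factor(stratum), data = toy)

# Strata indicators interacted with treatment, to examine heterogeneity
fit_het <- lm(y ~ treat * factor(stratum), data = toy)

print(coef(fit_strata)["treat"])
print(coef(fit_het))
```

In the first model the coefficient on treat is the average effect across strata (2.25 in this balanced example). In the second, treat is the effect in the omitted stratum, and each interaction term is the difference in effect between that stratum and the omitted one.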

### Multiple treatment arms and multiple comparisons

For randomized evaluations with multiple treatments, researchers may be interested in all pairwise comparisons across the research design groups, pairwise comparisons with the control group, or pairwise comparisons with the best of the other treatments. F-tests can then be used in a regression model with separate dummies for each treatment arm to test whether any of the treatments is on average better (or worse) than the others or to conduct pairwise comparisons among the different groups. In Stata, this can be done by coding test _b[t1]=_b[t2] if testing that the effect of treatment 1 is equal to the effect of treatment 2.
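One way to sketch the test of equal effects across two arms in base R is to compare a model with separate dummies for each arm against a restricted model that forces both arms to share one coefficient. The data frame `toy` and its values are hypothetical, t1 and t2 follow the naming in the Stata example, and this version assumes homoskedastic errors.

```r
# Hypothetical toy data: a control group and two treatment arms
toy <- data.frame(
  arm = rep(c("control", "t1", "t2"), each = 5),
  y   = c(4, 5, 6, 5, 5,   7, 8, 6, 7, 7,   7, 9, 8, 8, 8)
)
toy$t1 <- as.numeric(toy$arm == "t1")
toy$t2 <- as.numeric(toy$arm == "t2")

# Unrestricted model: separate effect for each arm
unrestricted <- lm(y ~ t1 + t2, data = toy)

# Restricted model: both arms constrained to have the same effect
restricted <- lm(y ~ I(t1 + t2), data = toy)

# F-test of the restriction that the two effects are equal
ftest <- anova(restricted, unrestricted)
print(coef(unrestricted))
print(ftest)
```

The F statistic in the second row of the table tests the null that the effect of treatment 1 equals the effect of treatment 2.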

##### Multiple hypothesis testing

Many treatments may affect a number of outcomes (a tutoring intervention, for example, could affect graduation rates, test scores, motivation, etc.). It is important to note that as the number of hypotheses tested increases, so does the probability of a false rejection of the null (aka a type I error); see EGAP’s related methods guide or McKenzie (2020) for a longer description of this problem. One option in such instances is to aggregate several related outcomes into a single index; for example, Bergman et al. (2019) use the Kirwan Child Opportunity Index, which aggregates education, health, and economic indicators, in their evaluation of the Moving to Opportunity program. However, this approach is limited in that it does not allow researchers to uncover the effect on individual indicators, including cases where the effects on different indicators have opposite signs or different magnitudes.

Another option is to follow one of several methods of controlling the family-wise error rate or false discovery rate by adjusting either standard errors or the rejection rule. These methods include:

• Bonferroni correction (in Stata: test …, mtest(bonferroni)), which involves multiplying each p-value by the number of tests performed (or, equivalently, dividing the significance level α by the number of tests performed). This can be overly conservative and can yield calculated p-values greater than one.

• Benjamini Hochberg method (practical application), which involves ordering the p-values from smallest to largest and comparing each one to the significance level α scaled by (rank of p-value)/(total number of tests). For example, with 5 tests, the threshold for the 2nd-smallest p-value would be ⅖ α. The procedure finds the largest-ranked p-value that falls below its threshold and rejects that hypothesis along with all those preceding it in the order (i.e., those with smaller p-values).

Stata code: David McKenzie provides an overview of multiple hypothesis testing commands in Stata in a 2020 blog post

R code: See EGAP’s methods guide
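As a brief sketch of the two corrections described above, base R's `p.adjust()` returns adjusted p-values that can be compared directly to α. The vector of p-values is made up for illustration, and the last lines reproduce the Benjamini Hochberg step-up rule by hand.

```r
# Hypothetical p-values from 5 tests
p <- c(0.001, 0.012, 0.021, 0.03, 0.20)
alpha <- 0.05

# Bonferroni: each p-value is multiplied by the number of tests (capped at 1)
p_bonf <- p.adjust(p, method = "bonferroni")

# Benjamini-Hochberg adjusted p-values
p_bh <- p.adjust(p, method = "BH")

# Benjamini-Hochberg by hand: find the largest rank k whose sorted p-value
# is at or below (k / m) * alpha, then reject the k smallest p-values.
# (If no p-value meets its threshold, which() is empty and nothing is rejected.)
m <- length(p)
p_sorted <- sort(p)
k <- max(which(p_sorted <= (seq_len(m) / m) * alpha))

print(rbind(p, p_bonf, p_bh))
print(c(rejected_bonferroni = sum(p_bonf <= alpha),
        rejected_bh = sum(p_bh <= alpha),
        k = k))
```

In this example Bonferroni rejects only the smallest p-value, while Benjamini Hochberg rejects the four smallest, which illustrates how conservative the Bonferroni correction can be.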

## Threats

Statistical validity, which comprises both internal and external validity, refers to the degree to which conclusions about treatment effects can be considered to be reasonable. External validity is the degree to which study results can be generalized to other contexts, such as different populations, places, or time periods. Internal validity refers to the degree to which conclusions about causal relationships can be made (e.g., whether the estimated size of a treatment effect is correct), given a study’s research design and implementation.

As randomized controlled trials in the social sciences take place outside of tightly controlled laboratory settings, researchers often encounter potential threats to research design that may complicate or compromise the intervention. Common threats to the internal validity of randomized evaluations include spillovers and attrition.

### Spillovers

Spillovers are violations of the “stable unit treatment value assumption” (SUTVA), which requires that the potential outcome of any individual does not depend on the treatment assignment of others. Spillovers can occur when those who do not receive the treatment are still affected by it, and can thus bias treatment effect estimates. Within the context of an RCT, potential spillovers are ideally considered in the research design by choosing the appropriate level at which to randomize. For example, an intervention that may have spillovers within schools could randomize at the school level instead of the student level. It is also possible to design the study to measure spillovers directly, such as by randomizing exposure to the treatment. This is discussed at greater length by Duflo et al. (2007) and Baird et al. (2014).

As with other design-related considerations, analysis of spillover effects will depend on the study design. If the study is designed to measure spillovers, the estimating equation should include a variable for exposure (e.g., whether the individual lives in the same neighborhood as other individuals who received the treatment). The direct effect of the intervention is captured by the treatment indicator, as above, while the indirect effect of exposure to the treatment is captured by the additional variable--an example of this is included in the accompanying sample code. Miguel & Kremer (2004) and Duflo & Saez (2003) are good examples of studies that measure treatment effects in the presence of spillovers.
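A minimal base R sketch of an estimating equation with an exposure variable is shown below. It is not the accompanying sample code referred to above: the data frame `toy`, the variable name `exposed` (an untreated unit living near treated units), and all values are hypothetical.

```r
# Hypothetical toy data: 3 treated units, 3 untreated but exposed units,
# and 3 untreated, unexposed ("pure control") units
toy <- data.frame(
  treat   = c(1, 1, 1, 0, 0, 0, 0, 0, 0),
  exposed = c(0, 0, 0, 1, 1, 1, 0, 0, 0)
)

# Outcome with a direct effect of 3, a spillover effect of 1, and
# idiosyncratic variation that averages to zero within each group
toy$y <- 10 + 3 * toy$treat + 1 * toy$exposed +
  c(-1, 0, 1, -1, 0, 1, -1, 0, 1)

# Naive comparison: treated units versus all untreated units
fit_naive <- lm(y ~ treat, data = toy)

# Specification with an exposure indicator: treat captures the direct
# effect and exposed the indirect effect, both relative to pure controls
fit_spill <- lm(y ~ treat + exposed, data = toy)

print(coef(fit_naive))
print(coef(fit_spill))
```

Because some untreated units benefit from exposure, the naive comparison understates the direct effect in this example (2.5 instead of 3).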

### Attrition

Attrition occurs when study group members drop out of the study or data on them cannot be recovered. If characteristics of attrits (drop-outs) are correlated with their treatment assignment or effect size, the remaining treatment and control group members may differ systematically, which could lead to biased estimates of program effects. If attrition is uncorrelated with treatment assignment and outcomes, it will decrease power because the sample size decreases, but it will not bias the estimated treatment effect. Attrition in field experiments is widespread, as discussed at length in a systematic review by Ghanem et al. (2020), and it can often have substantial negative effects on the statistical power of experiments.

Wherever possible, researchers should consider whether they can reduce or eliminate attrition at reasonable cost. When dealing with attrition, it is important to try to understand why some sample members left by examining how the attrits’ characteristics are related to their group status or their outcomes. First, researchers should examine the overall rate of attrition in the study. Next, they should check for differential attrition: are there systematic differences in attrition rates between those in the treatment vs control groups? Do the characteristics of the attrits vary by treatment assignment, by subgroup, or by another observable characteristic? This can be done by regressing attrition on treatment assignment, a set of observables, or observables interacted with treatment assignment, using the main specification. That is, if the main specification clusters standard errors and includes strata fixed effects, weights, and covariates, so should the attrition test.

Using the terminology of Ghanem et al. (2020), a selective attrition test examines whether, conditional on attrit vs. non-attrit status, baseline observable characteristics differ between the treatment and comparison groups. Approaches to selective attrition tests vary, but in all of these tests the main specification (i.e., with appropriate fixed effects, weights, covariates, treatment of standard errors, etc.) should be used. A common approach is a simple test for differences in observable characteristics between treatment and comparison groups, conditional on attrit status. A joint test examines characteristics of the treatment and comparison groups among attrits and non-attrits.4

Useful resources on attrition include: Ghanem et al. (2020), Berk Ozler’s Dealing with attrition in field experiments blog post, page 58 of Duflo et al. (2007), and chapter 7 of Gerber & Green.

In Stata (top) and R (bottom), attrition tests could look like:

**** Stata Code ****

* Test for differential attrition:
areg attrit treat, absorb(stratum) cluster(stratum)

* Simple test for selective attrition:
reg X treat if attrit==0
reg X treat if attrit==1

* Two options for a joint test for selective attrition
* (c.X treats X as continuous; use i.X if X is categorical):
reg X treat attrit treat#attrit
reg attrit treat X treat#c.X



#### R Code ####

library(lfe)
library(lmtest)
library(sandwich)

# In all calls below, replace "mydata" with the name of your dataset

# Testing for differential attrition using the "lfe" package
## regresses attrition on treatment, absorbs "stratum,"
## is not an IV specification ("0"), and uses "stratum"
## as the cluster for cluster robust standard errors
DiffattEst <- felm(attrit ~ treat | stratum | 0 | stratum, data = mydata)

# Simple test for selective attrition:
SimpAttrEst1 <- lm(x ~ treat, data = mydata, subset = (attrit == 0))
SimpAttrRobust1 <- coeftest(SimpAttrEst1, vcov = vcovHC(SimpAttrEst1, type = "HC1"))
SimpAttrEst2 <- lm(x ~ treat, data = mydata, subset = (attrit == 1))
SimpAttrRobust2 <- coeftest(SimpAttrEst2, vcov = vcovHC(SimpAttrEst2, type = "HC1"))

# Two options for a joint test for selective attrition:
JointAttrEst1 <- lm(x ~ treat + attrit + treat:attrit, data = mydata)
JointAttrRobust1 <- coeftest(JointAttrEst1, vcov = vcovHC(JointAttrEst1, type = "HC1"))

JointAttrEst2 <- lm(attrit ~ treat + x + treat:x, data = mydata)
JointAttrRobust2 <- coeftest(JointAttrEst2, vcov = vcovHC(JointAttrEst2, type = "HC1"))

##### Bounds

One option for dealing with attrition is to bound the effects using non-parametric methods. Two common approaches in studies using random assignment are Horowitz-Manski bounds and Lee bounds. Both approaches use assumptions about who has left the sample but do not require that attrition be random.

Horowitz-Manski bounds try to bound the bias that comes from the fact that outcomes are correlated with attrition and assume that those with extreme outcomes attrit. The upper bound assumes that all attritors in the treatment group had the highest outcome, and all attritors in the comparison group had the lowest outcome. In practice, this means replacing the missing outcome values in the treatment group with the highest observed outcome and replacing the missing values in the comparison group with the lowest observed outcome. The lower bound assumes the opposite, i.e., that all attritors in the treatment group had the lowest outcome and all attritors in the comparison group had the highest outcome. This approach has the benefit of relying only on the assumption of a bounded support, but the bounds it provides can be large and thus not necessarily useful.

In Stata (top) and R (bottom):

**** Stata Code ****

/* To create upper bounds: replace missing outcome values of
treatment group with the highest observed outcome and missing
outcome values of control group with lowest observed outcome values */

gen hm_upperbound = outcome
quietly sum hm_upperbound
replace hm_upperbound = r(max) ///
if hm_upperbound == . & treatment == 1
replace hm_upperbound = r(min) ///
if hm_upperbound == . & treatment == 0

/* To create lower bounds: replace missing outcome values of
treatment group with lowest observed and replace missing control
outcome values with highest observed outcome */

gen hm_lowerbound = outcome
quietly sum hm_lowerbound
replace hm_lowerbound = r(min) ///
if hm_lowerbound == . & treatment == 1
replace hm_lowerbound = r(max) ///
if hm_lowerbound == . & treatment == 0


#### R Code ####

library(dplyr)

# Highest and lowest observed outcomes (ignoring missing values);
# "Data" is the name of your dataset
y_max <- max(Data$outcome, na.rm = TRUE)
y_min <- min(Data$outcome, na.rm = TRUE)

# To create upper bounds: replace missing outcome values of
## treatment group with the highest observed outcome and missing
## outcome values of control group with lowest observed outcome values
Data <- Data %>%
  mutate(hm_upperbound = case_when(
    is.na(outcome) & treatment == 1 ~ y_max,
    is.na(outcome) & treatment == 0 ~ y_min,
    TRUE ~ outcome
  ))

# To create lower bounds: replace missing outcome values of
## treatment group with lowest observed and replace missing control
## outcome values with highest observed outcome
Data <- Data %>%
  mutate(hm_lowerbound = case_when(
    is.na(outcome) & treatment == 1 ~ y_min,
    is.na(outcome) & treatment == 0 ~ y_max,
    TRUE ~ outcome
  ))

Lee bounds require the stronger assumption of monotonicity, i.e., that treatment assignment can only affect attrition in one direction. Calculating Lee bounds involves trimming observations from the group that experienced less attrition. These bounds can additionally be tightened using a covariate that predicts attrition. Specifically, to calculate Lee bounds:

1. Calculate the trimming fraction: p = (fraction remaining in less-attrited group - fraction remaining in more-attrited group) / fraction remaining in less-attrited group
2. Drop the lowest fraction p of outcomes from the less-attrited group
3. Re-calculate the mean outcome for the trimmed, less-attrited group and compare it to the mean outcome in the more-attrited group. This is one bound.
4. Repeat steps 2 and 3, but this time dropping the highest fraction p of outcomes from the less-attrited group to obtain the other bound.
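
The four steps above can be sketched by hand in base R. This is a minimal illustration on simulated data, not a substitute for the leebounds implementations: the function name `lee_bounds`, the variable names, and the simulated attrition rates are all assumptions made for the example, and no covariate tightening or standard errors are computed.

```r
# Minimal sketch of Lee bounds on simulated data (all names are hypothetical)
set.seed(20200121)
n <- 2000
treatment <- rep(c(0, 1), each = n / 2)
y_full <- rnorm(n, mean = 1 + 0.5 * treatment)
# Simulated attrition: 10% in the treatment group, 25% in the comparison group
observed <- runif(n) > ifelse(treatment == 1, 0.10, 0.25)
outcome <- ifelse(observed, y_full, NA)

lee_bounds <- function(outcome, treatment) {
  observed <- !is.na(outcome)
  share_t <- mean(observed[treatment == 1])  # fraction remaining, treatment
  share_c <- mean(observed[treatment == 0])  # fraction remaining, comparison
  less_attrited <- if (share_t >= share_c) 1 else 0
  # Step 1: trimming fraction
  p <- abs(share_t - share_c) / max(share_t, share_c)
  y_less <- sort(outcome[treatment == less_attrited & observed])
  y_more <- outcome[treatment != less_attrited & observed]
  k <- floor(p * length(y_less))  # number of observations to trim
  # Steps 2-3: drop the lowest share p, then re-calculate the mean
  mean_drop_low <- mean(y_less[(k + 1):length(y_less)])
  # Step 4: drop the highest share p instead
  mean_drop_high <- mean(y_less[1:(length(y_less) - k)])
  # Express both bounds as treatment minus comparison
  sgn <- if (less_attrited == 1) 1 else -1
  bounds <- sgn * c(mean_drop_high - mean(y_more), mean_drop_low - mean(y_more))
  c(lower = min(bounds), upper = max(bounds), trim_fraction = p)
}

lb <- lee_bounds(outcome, treatment)
print(lb)
```

Because attrition differs by roughly 15 percentage points between the simulated groups, about one sixth of the less-attrited group is trimmed, which illustrates how quickly the bounds widen as differential attrition grows.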

In Stata, use the leebounds command. See the help file after installation for options to tighten bounds. In R, see Vira Semenova's leebounds package and documentation for implementation instructions.

**** Stata Code ****

leebounds outcome treatment
* where treatment denotes receipt of treatment

Useful resources on bounds include:

##### Inverse probability weighting

Inverse probability weighting (IPW) relies on the assumption that, conditional on a set of observable factors X, attrition is independent of the outcome. This is a stronger assumption than those required for either Horowitz-Manski bounds or Lee bounds; as such, its use has declined in recent years. IPW scales up the estimate of the treatment effect of the non-missing individuals who have a covariate profile X. For example, if, conditional on gender, attrition is random, and 1/4 of women attrited, the outcomes of women in the treatment group would be scaled up by 1/(3/4)=4/3.
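
To make the gender example concrete, the sketch below builds the weights by hand on simulated data. It assumes attrition depends only on gender (1/4 of women and 1/10 of men attrit); the variable names (`female`, `observed`, `ipw`) and the logit model for the probability of remaining in the sample are illustrative assumptions, not part of the packages referenced in this guide.

```r
# Minimal sketch of inverse probability weights for attrition (simulated data)
set.seed(20200121)
n <- 4000
female <- rbinom(n, 1, 0.5)
treatment <- rbinom(n, 1, 0.5)
y <- 1 + 0.5 * treatment + 1.0 * female + rnorm(n)
# Attrition depends only on gender: 1/4 of women and 1/10 of men attrit
attrited <- rbinom(n, 1, ifelse(female == 1, 0.25, 0.10))
df <- data.frame(y = ifelse(attrited == 1, NA, y),
                 treatment = treatment,
                 female = female,
                 observed = 1 - attrited)

# Estimate the probability of remaining in the sample conditional on X
resp_model <- glm(observed ~ female * treatment, family = binomial, data = df)
df$p_observed <- predict(resp_model, type = "response")
df$ipw <- 1 / df$p_observed  # e.g., about 1/(3/4) = 4/3 for women

# Weighted regression on the non-attrited sample
ipw_fit <- lm(y ~ treatment, data = df, weights = ipw, subset = observed == 1)
print(coef(ipw_fit)["treatment"])
```

The weights for women who remain in the sample come out close to 4/3, matching the arithmetic in the paragraph above.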

In Stata (top) and R (bottom):

**** Stata Code ****

teffects ipw (y) (treatment X)
* where treatment denotes receipt of treatment

#### R Code ####

# Uses package "causalweight" to estimate the ATE or LATE w/IPW;
## y is outcome variable, treatment is binary treatment indicator,
## and x is vector of confounders.
# No option for "data=", so if want individual variables have to call them in the function

TweightEst <- causalweight::treatweight(data$y, data$treatment, data$x)

## Standard errors

There are two broad sets of considerations to keep in mind when constructing standard errors for treatment effect estimates. The first relates to sampling method, namely, whether you conducted clustered random sampling and want to generalize your results to a population. The second is whether you used cluster random assignment to assign treatment. In the latter case, difficulties in estimation arise if the number of clusters is small (less than 25-30). In that case, researchers may want to bootstrap. An alternative approach that is becoming increasingly widespread is to use randomization inference, described below, to test the sharp null hypothesis that the treatment effect is zero for every participant.

### Robust standard errors and clustering

Regardless of your sampling or assignment procedure, it is generally recommended to always use heteroskedasticity-robust standard errors in your regression specification (White (1980), Angrist & Pischke (2009), and Wooldridge (2013)). As advised by Abadie et al. (2017) and summarized by David McKenzie, there are two reasons why you would want to cluster your standard errors:

1. When you have assigned treatment at another unit than the one at which you are measuring outcomes. Here, Abadie et al. (2017) recommend clustering at the level at which treatment was assigned. For example, if assigning treatment at the village level but measuring individual-level outcomes, standard errors would need to be clustered at the village level. Note that even without clustered random assignment, if you have repeated observations of the same unit (for example, if you have panel data), you will want to cluster standard errors by unit to account for correlations within the unit over time.
2. When you have sampled units from a population using clustered sampling: Key considerations here are how the sample was selected, whether there are clusters in the population of interest that are not in the sample, and whether you want to say something about the population from which the sample was drawn. You may also want to use sample weights in order to generalize results to the population; see more below.

With too few (<25-30) clusters, cluster-robust standard errors will underestimate the intra-class correlation. As a result, they will be biased down, leading to over-rejection of the null. Alternatives include bootstrapping standard errors or using randomization inference, discussed below.

In addition to Abadie et al. (2017) and McKenzie (2017), useful resources on clustering include Blattman (2015); Cameron & Miller (2015) (also has a useful discussion of clustering standard errors vs fixed effects); and MIT 14.771 recitation notes.

### Bootstrapping

Non-parametric or resampling bootstrapping involves treating the original sample as a population from which to draw more random samples. In practice, units are randomly re-sampled, typically with replacement, from the original sample. Each of these draws results in a new dataset in which some observations appear multiple times and others do not appear (the standard is to draw these pseudo-samples with the same number of observations N as the original sample). From the distribution of the estimates in all these pseudo-samples, we can compute a valid standard error for the original estimate, as long as the original sample is reasonably representative of the original population in terms of coverage (and thus the distribution of the parameter we wish to estimate can be reasonably assumed to be a nonparametric estimate of the distribution in the population) (Cameron & Miller 2015).
With clustering, instead of drawing the observation units with replacement, one draws the cluster units with replacement. Note that, as with randomization, it is important to set the seed when writing your code because it allows others (or your future self) to replicate your results.

The default number of bootstrap replications is sometimes set at 50 to minimize computation time, but this is typically too low for results in a paper. Cameron & Trivedi (2005) suggest 400 replications when the bootstrap is used to calculate standard errors, though for other uses (e.g., confidence intervals), more replications are typically needed. While very large numbers of replications are theoretically beneficial, since the procedure assumes an infinite number of observations and replications, in practice the bootstrap converges quickly in the number of replications, and a finite number of replications is sufficient.

The bootstrap assumes approximate linearity and normal distribution of the estimator (Horowitz 2001). It also assumes independent observations (though the cluster option allows for dependence within clusters, provided the clusters are independent). If the errors are independent but not identically distributed, a wild bootstrap is more appropriate. The wild bootstrap allows for heteroskedasticity based on the residuals by creating pseudo-samples of draws of the fitted βx+e and βx-e (where the usual hats on β and e are omitted due to website functionality) (Wu 1986). Wild bootstraps are also useful in settings where the size of clusters varies or where there are few clusters, and have been shown to behave well with as few as five clusters (Cameron et al. 2008 and Berk Özler’s blog post). They may also be used in settings with few treated clusters or weak instruments. The wild bootstrap is especially useful when conventional inference methods are unreliable because large-sample assumptions do not hold.
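
As an illustration of drawing cluster units rather than individual observations, the following base R sketch runs a pairs cluster bootstrap by hand on simulated data. The data-generating process, the variable names, and the choice of 400 replications (following the Cameron & Trivedi suggestion cited above) are assumptions made for the example; this is not a wild bootstrap.

```r
# Minimal sketch of a pairs cluster bootstrap (simulated data, hypothetical names)
set.seed(20200121)  # set the seed so that the results are replicable
n_clusters <- 40
cluster_size <- 25
cluster_id <- rep(seq_len(n_clusters), each = cluster_size)
treat_cluster <- rep(c(0, 1), length.out = n_clusters)  # cluster-level assignment
cluster_shock <- rnorm(n_clusters)                      # common shock within cluster
df <- data.frame(
  cluster = cluster_id,
  treatment = treat_cluster[cluster_id],
  y = 0.3 * treat_cluster[cluster_id] + cluster_shock[cluster_id] +
    rnorm(n_clusters * cluster_size)
)

reps <- 400
rows_by_cluster <- split(seq_len(nrow(df)), df$cluster)
boot_coefs <- replicate(reps, {
  # Draw whole clusters with replacement and keep all of their rows
  drawn <- sample(names(rows_by_cluster), replace = TRUE)
  boot_df <- df[unlist(rows_by_cluster[drawn], use.names = FALSE), ]
  coef(lm(y ~ treatment, data = boot_df))["treatment"]
})

cluster_boot_se <- sd(boot_coefs)
naive_se <- summary(lm(y ~ treatment, data = df))$coefficients["treatment", "Std. Error"]
print(c(cluster_bootstrap = cluster_boot_se, unclustered = naive_se))
```

With a sizeable within-cluster correlation, as simulated here, the cluster bootstrap standard error is several times larger than the standard error that ignores clustering.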
Useful resources on bootstraps include:

In Stata, CSAE Coder’s Corner has sample code for a three-way bootstrap. For wild bootstraps, see the user-written command boottest (and sample code from CSAE Coder’s Corner).

• To bootstrap coefficients in Stata, use the vce option in the regress/xtreg commands.
• In R, the boot package is a flexible tool for bootstrapping. See John Fox & Sanford Weisberg's "Bootstrapping Regression Models in R" (2018) for more information on bootstrapping coefficients, as well as the sample code below.

**** Stata Code ****

regress y x1 x2 x3, vce(bootstrap, reps(50) seed(20200121))

/* You can cluster the bootstrapped standard errors by adding the
cluster option inside the vce brackets, as in
vce(bootstrap, cluster(clustervar)), where clustervar denotes the
clustering variable (e.g., school or village). As noted above,
reps(50) is typically too low for results in a paper. */

#### R Code ####

# Uses package "boot"
library(boot)

# Sets up the bootstrap function:
# The first two arguments (data, indices) are necessary for the "boot" function,
## and then the rest of the function is customizable based on which
## statistic you would like to bootstrap
boot_function <- function(data, indices, formula){
d <- data[indices, ]
obj <- lm(formula, d) ##Sets up the object being bootstrapped
coefs <- summary(obj)$coefficients ##Pulls the coefficient table from the regression
coefs[, "Estimate"] ##Returns the point estimates; their spread across replications gives the bootstrapped standard errors
}
set.seed(20200121) ##Set the seed so that the code is replicable (any fixed integer works)
# Performs the bootstrap with "R" number of replications;
## printing seboot reports the bootstrapped standard errors in the "std. error" column
seboot <- boot(data, boot_function, R = 1000, formula = y ~ x1+x2+x3)

• For non-estimation commands or user-written programs in Stata, use the bootstrap prefix, which also allows for clustering.
• For non-estimation commands in R, modify the initial function passed into the boot command.
**** Stata Code ****

/* For non-estimation commands such as obtaining the mean, use e.g., */

bootstrap mean=r(mean), reps(200) seed(20200121): summarize x1

/* The bootstrap prefix can also be used with your own programs.
Using the example from Cameron & Trivedi (2005), the point estimates
and standard errors of a user-written program called poissrobust
can be bootstrapped using: */

bootstrap _b _se, reps(100) seed(20200121): poissrobust

/* _b returns the point estimate, _se the bootstrapped standard
error, and poissrobust is a user-written program (omitted here).
A similar approach should be taken for two-step estimators (e.g.,
using a control function approach), where the standard errors for the
two stages need to be bootstrapped together. */

#### R Code ####

# As in example above, first sets up a function with
## the necessary first two arguments
my.mean = function(data, indices) {
return(mean(data[indices])) ##Returns the mean
}
# Then bootstraps the mean using the "boot" function
Boot.mn <- boot(Treat$outcome, my.mean, R = 400)


## Randomization inference

Randomization inference considers the study sample to make up the full universe of eligible units. This is in contrast with the conventional approach to calculating p-values for t-tests, where the study sample is considered to be drawn from a larger population.

With randomization inference, variation in outcomes comes from treatment assignment. Since there is no sampling variation, differences in outcomes are exact (i.e., differences in the population). Randomization inference then tests the sharp null hypothesis that the treatment effect is zero for every participant, i.e., the null that the treatment is irrelevant (and had no effect on the mean, variance, etc.) (Young 2019).

In practice, exact p-values are calculated by holding covariates and outcomes fixed and randomly re-assigning treatment in the data. The exact p-value is then the fraction of all possible treatment assignments that yield a treatment effect estimate at least as large as the one under the actual assignment. By contrast, under the conventional approach the p-value gives the probability of observing a difference in outcome means at least as large as the one observed if there were actually no difference in outcome means in the sampling frames from which the groups were drawn.
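
The re-assignment procedure can be sketched in a few lines of base R. The example below uses simulated data with complete randomization and a difference in means as the test statistic; it approximates the exact p-value with 5,000 random re-assignments rather than enumerating every possible assignment, and all names are hypothetical. With a more complex design (e.g., stratified or clustered assignment), the re-assignment step would need to mimic the original randomization procedure.

```r
# Minimal sketch of randomization inference (simulated data, hypothetical names)
set.seed(20200121)
n <- 200
treatment <- sample(rep(c(0, 1), each = n / 2))  # complete randomization
outcome <- 0.5 * treatment + rnorm(n)

diff_in_means <- function(y, d) mean(y[d == 1]) - mean(y[d == 0])
observed_effect <- diff_in_means(outcome, treatment)

# Hold outcomes fixed and re-assign treatment many times under the sharp null
n_perm <- 5000
perm_effects <- replicate(n_perm, diff_in_means(outcome, sample(treatment)))

# Two-sided p-value: share of re-assignments with an estimate at least as
# large in absolute value as the one under the actual assignment
ri_p_value <- mean(abs(perm_effects) >= abs(observed_effect))
print(c(estimate = observed_effect, ri_p_value = ri_p_value))
```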

As of the writing of this guide, randomization inference is increasingly used in place of the conventional approach, described above, to estimate standard errors and p-values. This is because t-tests are less reliable when group sizes are unequal or when the distribution of outcomes is skewed, as may be the case with RCT data (Young 2019, Gerber & Green 2012, Green n.d.). However, a 2021 blog post concludes that randomization inference may not be superior to the conventional approach if one uses a more appropriate way to compute robust standard errors instead of the default setting in Stata (Simonsohn 2021).

Useful resources on randomization inference include:

In Stata: See McKenzie’s 2017 Development Impact blog post for a discussion of how to apply randomization inference procedures in Stata

In R: See sample code from Don Green’s EGAP methods guide

## Miscellaneous

### Using sample weights

When and how to use weights in analysis is a subject of great debate; one useful resource is the discussion by Solon et al. (2013), summarized in a blog post by Jed Friedman.

If the study sample consists of a random sample drawn from a larger population, weights can be used to generalize results (either analytic results or summary statistics) to the larger population. This is particularly important when sampling was disproportionate, i.e., some groups were purposely over/undersampled so that the sample is not representative of the larger population without adjusting for groups’ probability of selection in the sample. Designing a sample to be representative of the larger population is expensive on a large scale, and smaller scale RCTs may not be designed as such. Examples of datasets that are designed for population-level inference are the World Bank’s LSMS and the US Census Bureau’s CPS. Stata has a full suite of survey commands for analyzing survey data, and this guide from UCLA has more information on how to conduct applied survey analysis in Stata.

Weights can also be used to adjust for nonresponse or attrition. If (survey) participation decisions are known and can be explained by observed variables, such differences can be overcome by reweighting. More commonly, however, (survey) participation may depend on unobserved variables. The US Department of Health and Human Services has a guide to nonresponse adjustments, and Reig (2017) covers steps to weight a sample, including constructing weights and sample R code.

In Stata: When the sample was drawn using disproportionate stratified sampling, you can use pweight.

In R: When the sample was drawn using disproportionate stratified sampling, you can use the survey package.

**** Stata Code ****

reg y x1 x2 [pweight=n]

/* where n (the weight) is equal to the inverse of the probability
of being included in the sample due to the sampling design. */

/* note: when using pweight, standard errors are robust */

#### R Code ####

# Uses package "survey"

# First step sets up the survey design;
## ids= is for the cluster variable: if there is none put either ~1 or ~0.
## weights= is for the weighting variable, passed as a one-sided formula.
## data= is the data frame holding y, x1, x2, and n (called Data here).
SurvDesign <- survey::svydesign(ids = ~1, weights = ~n, data = Data)

# Second step is running the regression w/the appropriate survey design
SurvEst <- survey::svyglm(y ~ x1+x2, design = SurvDesign)


Researchers frequently need to make decisions when conducting analyses, ranging from the construction of variables, to whether to include covariates as controls, to the use of sampling weights. These decisions can have considerable implications for the magnitude and significance (and sometimes even sign) of results. Some decisions may be foreseeable prior to data collection and can be specified in advance, though many will depend on unforeseeable circumstances of the research implementation in the field.

Sensitivity analysis is used to show how results change under different models or assumptions. This can allow you to assess the robustness of results and dispel concerns about specification searching. Common types of sensitivity analysis include analyzing the impact of outliers, non-compliance, attrition, non-response, changes in outcome definitions, different methods for accounting for clustering, and inclusion of covariates as controls.

There are emerging norms around demonstrating robustness of results to different assumptions. This includes showing distributions of outcomes and other key variables as sanity checks. Authors also increasingly show several specifications alongside each other in the same table (for example, with and without fixed effects or covariates, or under different bounds or assumptions regarding attritors) in order to show the extent to which treatment effect estimates change under different assumptions.
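
One way to put several specifications side by side is to loop over a list of model formulas and collect the treatment coefficient from each. The sketch below does this in base R on simulated data; the four specifications, the 1st/99th percentile winsorization, and all variable names are illustrative assumptions rather than a recommended set of checks.

```r
# Minimal sketch of a side-by-side specification table (simulated data)
set.seed(20200121)
n <- 1000
df <- data.frame(
  treatment = rbinom(n, 1, 0.5),
  x1 = rnorm(n),
  stratum = factor(sample(1:4, n, replace = TRUE))
)
df$y <- 0.4 * df$treatment + 0.8 * df$x1 + rnorm(n)

# Outcome winsorized at the 1st and 99th percentiles to gauge the role of outliers
cutoffs <- quantile(df$y, c(0.01, 0.99))
df$y_wins <- pmin(pmax(df$y, cutoffs[1]), cutoffs[2])

specs <- list(
  "No controls"           = y ~ treatment,
  "Covariate"             = y ~ treatment + x1,
  "Covariate + strata FE" = y ~ treatment + x1 + stratum,
  "Winsorized outcome"    = y_wins ~ treatment + x1 + stratum
)

# One row per specification: treatment estimate and its standard error
results <- t(sapply(specs, function(f) {
  summary(lm(f, data = df))$coefficients["treatment", c("Estimate", "Std. Error")]
}))
print(round(results, 3))
```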

## Exporting results

### In R

Last updated August 2022.


Acknowledgments

We thank Shawn Cole and Ben Morse for helpful review. All errors our own.

1. More on heterogeneous treatment effects is discussed below and is covered in EGAP’s corresponding methods guide.
2. An encouragement design, in which an intervention is available to everyone but those in the treatment group receive some additional encouragement, is an example of a design in which two-sided noncompliance may occur. An example of such a design is given in Bettinger et al. (2012), where in one treatment arm H&R Block tax professionals walked participants through forms to file for federal student aid (FAFSA).
3. See also Freedman (2008) showing bias, Green & Aronow (2011) showing bias tends to be negligible in samples exceeding 20 units, and Lin’s blog posts (1, 2) discussing Freedman’s conclusions.
4. Note that testing for selective attrition on observables does not necessarily test for selective attrition on unobservables. A recent paper by Dutz et al. (2021) using administrative and survey data from Norway finds large nonresponse bias even after correcting for observable differences between participants and nonparticipants, with bias increasing with the participation rate.

Abadie, Alberto, Susan Athey, Guido W. Imbens, and Jeffrey Wooldridge. 2017. "When Should You Adjust Standard Errors for Clustering?" NBER Working Paper No. 24003.

Angrist, Joshua David, and Jörn-Steffen Pischke. 2009. Mostly harmless econometrics: an empiricist's companion.

Angrist, Joshua David. 2014. "Instrumental Variables (Take 2): Causal Effects in a Heterogeneous World." Delivered as part of MIT 14.387.

Athey, Susan and Guido Imbens. 2017. "The Econometrics of Randomized Experiments." Handbook of Economic Field Experiments, 73–140. doi:10.1016/bs.hefe.2016.10.003

Baird, Sarah, Aislinn Bohren, Craig McIntosh, and Berk Ozler. 2014. "Designing experiments to measure spillover effects," Policy Research Working Paper Series 6824, The World Bank.

Benjamini, Yoav, and Yosef Hochberg. 1995. "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing." Journal of the Royal Statistical Society.

Bettinger, Eric, Bridget Long, Philip Oreopoulos, and Lisa Sanbonmatsu. 2012. "The Role of Application Assistance and Information in College Decisions: Results from the H&R Block FAFSA Experiment." Quarterly Journal of Economics, 127(3), 1205-1242.

Bergman, Peter, Raj Chetty, Stefanie DeLuca, Nathaniel Hendren, Lawrence Katz, and Christopher Palmer. "Creating Moves to Opportunity: Experimental Evidence on Barriers to Neighborhood Choice." Working Paper.

Blattman, Chris. 2015. "Clusterjerk, the much anticipated sequel." Blog post. Last accessed June 20, 2020.

Bruhn, Miriam, and David McKenzie. 2009. "In Pursuit of Balance: Randomization in Practice in Development Field Experiments." American Economic Journal: Applied Economics, 1 (4): 200-232.

Cameron, A. Colin and Douglas L. Miller. 2015. "A Practitioner’s Guide to Cluster-Robust Inference." Journal of Human Resources.

Cameron, A. Colin, and Pravin K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.

Cameron, A. Colin, Jonah B. Gelbach, and Douglas L. Miller. 2008. "Bootstrap-Based Improvements for Inference with Clustered Errors." The Review of Economics and Statistics.

Chernozhukov, Victor and Ivan Fernandez-Val. "L5. Bootstrapping" 14.382 Econometrics. Spring 2017. Massachusetts Institute of Technology: MIT OpenCourseWare.

Coppock, Alexander. n.d. "10 Things to Know About Multiple Comparisons." EGAP methods guides.

Deaton, Angus and Nancy Cartwright. 2018. "Understanding and Misunderstanding Randomized Controlled Trials." Social Science & Medicine 210: 2-21. https://doi.org/10.1016/j.socscimed.2017.12.005.

Dolan, Lindsay. n.d. "10 Things to Know About Covariate Adjustment." EGAP methods guides.

Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2008. "Using Randomization in Development Economics Research: A Toolkit." T. Schultz and John Strauss, eds., Handbook of Development Economics. Vol. 4. Amsterdam and New York: North Holland.

Duflo, Esther, and Emmanuel Saez. 2003. "The Role of Information and Social Interactions in Retirement Plan Decisions: Evidence from a Randomized Experiment." The Quarterly Journal of Economics.

Dutz, Deniz, Ingrid Huitfeldt, Santiago Lacouture, Magne Mogstad, Alexander Torgovitsky, and Winnie van Dijk. 2021. "Selection in surveys." NBER Working Paper 29549. https://www.nber.org/papers/w29549.

Fang, Albert. n.d. "10 Things to Know About Heterogeneous Treatment Effects." EGAP methods guides.

Freedman, David. 2008. "On regression adjustments to experimental data." Advances in Applied Mathematics.

Friedman, Jed. 2013. "Tools of the trade: when to use those sample weights." World Bank Development Impact Blog. Last accessed June 20, 2020.

Froelich, Markus and Blaise Melly. 2010. "Estimation of quantile treatment effects with Stata." The Stata Journal.

Gerber, Alan S., and Donald P. Green. 2012. Field experiments: design, analysis, and interpretation. New York: W.W. Norton.

Ghanem, Dalia, Sarojini Hirshleifer, and Karen Ortiz-Becerra. "Testing Attrition Bias in Field Experiments." CEGA Working paper.

Green, Donald P. and Peter Michael Aronow. 2011. "Analyzing Experimental Data Using Regression: When is Bias a Practical Concern?"

Green, Donald. n.d. "10 Things to Know About Randomization Inference." EGAP methods guides.

Horowitz, Joel. 2001. "The Bootstrap." James J. Heckman and Edward Leamer, eds., Handbook of Econometrics. Vol. 5. Amsterdam: Elsevier. doi:10.1016/S1573-4412(01)05005-X.

Horowitz, Joel and Charles Manski. 2000. "Nonparametric Analysis of Randomized Experiments with Missing Covariate and Outcome Data." Journal of the American Statistical Association.

Imbens, Guido W., and Donald B. Rubin. 2015. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge: Cambridge University Press.

Imbens, Guido W. and Jeffrey Wooldridge. 2007. "Instrumental Variables with Treatment Effect Heterogeneity: Local Average Treatment Effects," delivered as a lecture in the NBER's "What's New in Econometrics?" series.

Jann, Ben. 2015. "Heterogeneous Treatment Effect Analysis in Stata," delivered as a lecture in the "Heterogeneous Treatment Effects Project Workshop", University of Michigan.

Lee, David. 2009. "Training, Wages, and Sample Selection: Estimating Sharp Bounds on Treatment Effects." The Review of Economic Studies.

Lin, Winston. 2012. "Regression adjustment in randomized experiments: Is the cure really worse than the disease? (Parts I and II)." World Bank Development Impact Blog.

Loftus, Joshua. 2015. "Primer on multiple testing." Lecture.

McKenzie, David. 2012. "Tools of the Trade: A quick adjustment for multiple hypothesis testing." World Bank Development Impact Blog. Last accessed June 20, 2020.

McKenzie, David. 2017. "Identifying and Spurring High-Growth Entrepreneurship: Experimental Evidence from a Business Plan Competition." American Economic Review.

McKenzie, David. 2017 (b). "When should you cluster standard errors? New wisdom from the econometrics oracle." World Bank Development Impact Blog. Last accessed June 20, 2020.

McKenzie, David. 2017 (c). "Finally, a way to do easy randomization inference in Stata!" World Bank Development Impact Blog. Last accessed June 20, 2020.

McKenzie, David. 2020. "An overview of multiple hypothesis testing commands in Stata." World Bank Development Impact Blog. Last accessed June 20, 2020.

Miguel, Edward, and Michael Kremer. 2004. "Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities." Econometrica.

Orloff, Jeremy and Jonathan Bloom. 2014. "Bootstrap confidence intervals" and "Bootstrap and Linear Regression." Lectures delivered for MIT 18.05.

Özler, Berk. 2017. "Dealing with attrition in field experiments." World Bank Development Impact Blog. Last accessed June 20, 2020.

Özler, Berk. "Beware of studies with a small number of clusters." World Bank Development Impact Blog. Last accessed June 20, 2020.

Reig, Josep. 2017. "Step 1: Design Weights," in (Very) basic steps to weight a survey sample.

Schaner, Simone. 2008. "Regression Discontinuity, Attrition/Bounds, and Education." Recitation handout in MIT 14.771: Development Economics: Microeconomic Issues and Policy Models.

Simonsohn, Uri. 2021. "Hyping Fisher: The most cited 2019 QJE paper relied on an outdated Stata default to conclude regression p-values are inadequate." Data Colada. Last accessed August 10, 2022.

Solon, Gary, Steven J. Haider, and Jeffrey Wooldridge. 2013. "What Are We Weighting For?" NBER Working Paper No. 18859.

Tauchmann, Harold. 2014. "Lee (2009) treatment-effect bounds for nonrandom sample selection." The Stata Journal.

UCLA Institute for Digital Research & Education. n.d. "Applied Survey Data Analysis in Stata 13."

U.S. Department of Health and Human Services. 2002. "Studies of Welfare Populations: Data Collection and Research Issues. Common Nonresponse Adjustment Measures in Surveys."

Van der Windt, Peter. n.d. "10 Things to Know About the Local Average Treatment Effect." EGAP methods guides.

White, Halbert. 1980. "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity." Econometrica 48, no. 4: 817-38.

Wooldridge, Jeffrey M. 2013. Introductory Econometrics: A Modern Approach. United Kingdom, Cengage Learning.

Young, Alwyn. 2019. "Channeling Fisher: Randomization Tests and the Statistical Insignificance of Seemingly Significant Experimental Results." The Quarterly Journal of Economics, Volume 134, Issue 2.