Research Resources

Randomization

Summary

Randomization for causal inference has a storied history. Controlled randomized experiments were invented by Charles Sanders Peirce and Joseph Jastrow in 1884. Jerzy Neyman introduced stratified sampling in 1934. Ronald A. Fisher expanded on and popularized the idea of randomized experiments and introduced hypothesis testing on the basis of randomization inference in 1935. The potential outcomes framework that formed the basis for the Rubin causal model originates in Neyman’s Master’s thesis from 1923.

In this section, we briefly sketch the conceptual basis for using randomization and stratified sampling before outlining the different randomization methods. We then provide code samples and commands to carry out more complex randomization procedures, such as stratified randomization with many treatment arms.

Assigning units in a sample to treatment groups

We begin with random treatment assignment. Assume for the moment that the unit of randomization is given (call it i). The unit might be individuals, groups of people, households, or areas; this is discussed more below under choosing the unit of randomization. Assume also that we are in the simplest case of just one treatment and that the goal is to estimate the average treatment effect (ATE) on some outcome Y.

Formally, we can write the outcome of each unit in either the treatment or control group (using notation from Deaton & Cartwright 2018) as:

$$ Y_i = \beta_iT_i +\sum_{j=1}^J \gamma_jx_{ij} $$

where

  • βi is the treatment effect (which may be unit-specific)1

  • Ti is a treatment dummy

  • xij are j=1,...,J observable and unobservable, unit-specific factors that may affect outcome Y

  • γj indicates the effect of xj on Y and may be positive or negative.

With a given experimental sample of size N and given treatment group assignment, we can take the average of the treatment group (T=1) and the comparison group (T=0) to estimate the ATE (see also the data analysis section):

$$ \bar{Y_1} - \bar{Y_0} = \bar{\beta_1}+\sum_{j=1}^J \gamma_j(\bar{x}_{1j}-\bar{x}_{0j}) $$

$\bar{\beta}_1$ is the average treatment effect; the bar and subscript indicate that this estimate is the average of the unit-level treatment effects in the treatment group. The second term is the “error term” of the ATE estimate: the average difference between treatment and control group that is unrelated to treatment (arising from observable and unobservable differences).

Randomization: Randomized assignment of treatment and control ensures that the xj are uncorrelated with the treatment assignment, and so the ATE estimate is ex ante unbiased: the error term is zero in expectation and $\bar{\beta}_1$ equals the true ATE in expectation. If we were to repeat the experiment with many N-sized samples, the average error term would be zero and the average of $\bar{\beta}_1$ would equal the ATE.

Assignment shares: In any given sample, the error term will likely not be zero and $\bar{\beta}_1$ will not equal the ATE. However, as the sample size N grows, the variance of both around their true means shrinks. As a result, a larger sample size increases the statistical power of tests about the ATE.

  • In the simplest possible regression framework, it is assumed that the treatment effects are homogeneous, so that $\beta_i=\beta$. We can then test the null hypothesis that $\beta=0$, using the fact that the sample ATE above is approximately normally distributed in large samples. Standard power calculation formulas apply.1
  • With randomization inference, we can directly test the exact null that $\beta_i=0$ for all i. To do so, we construct the distribution of the ATE by enumerating all possible treatment assignments in the sample and calculating the hypothetical ATE for each assignment using the realized sample outcomes. The exact p-value is then the probability of observing an ATE at least as large as the realized sample ATE, i.e., the fraction of hypothetical ATEs that are at least as large as the realized sample ATE.
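This enumeration is straightforward for small samples. The following Python sketch (the outcome numbers and assignment are purely hypothetical toy data) lists every way to assign 3 of 6 units to treatment and computes the exact p-value:

```python
from itertools import combinations

# Toy data: realized outcomes Y_i and the realized treatment group (indices)
outcomes = [5.0, 3.0, 4.5, 2.0, 6.0, 2.5]
treated = {0, 2, 4}

def ate(treat_idx):
    """Difference in means between treated and untreated units."""
    t = [outcomes[i] for i in treat_idx]
    c = [outcomes[i] for i in range(len(outcomes)) if i not in treat_idx]
    return sum(t) / len(t) - sum(c) / len(c)

realized = ate(treated)

# Enumerate every possible assignment of 3 of the 6 units to treatment
hypothetical = [ate(set(idx)) for idx in combinations(range(len(outcomes)), 3)]

# Exact p-value: fraction of hypothetical ATEs at least as large as the realized one
p_exact = sum(h >= realized for h in hypothetical) / len(hypothetical)
print(p_exact)  # prints 0.05: 1 of the 20 possible assignments
```

With N units and half treated, the number of assignments grows combinatorially, so in practice the distribution is often approximated by drawing a large random subset of permutations rather than enumerating all of them.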

In both cases, the greatest statistical power to reject the null is achieved when the sample is assigned in equal proportions to the treatment and control group.2 More generally, the optimal randomization design assigns a fixed share of the sample to each group, though the optimal shares need not be equal, for example when treatment arms differ in cost (see the power calculations section and section 4.1 of Duflo et al. (2007) for power calculations with cost considerations).

Improved balance: Any change to the randomization design that can reduce the variance in the error term, $ \sum_{j=1}^J \gamma_j(\bar{x}_{1j}-\bar{x}_{0j}) $, will improve precision and statistical power.

If all the effect sizes 𝛾j  were known and all xj  observed, the researcher could assign units to the treatment and control group so that the total error term is as close as possible to zero (see Kasy 2016). Note that this would actually mean not randomizing. However, in most settings, the elements of the error term are not known or completely observed.3 The researcher can, however, make sure that the units in the treatment group and the control group are balanced on the observable covariates that are likely to be correlated with Y. Balance means that the distribution of those xij  in the treatment group is the same as or similar to that in the control group. The main methods used for this are stratification, also called blocking, and re-randomization; both are described below.

Randomization methods

Conceptually, randomization simply means that every experimental unit has the same probability of being assigned to a given group (e.g., a 50% probability of treatment with two equally sized groups).

Simple randomization/basic lottery: One could literally implement simple randomization with a lottery. For example, one could set up an urn with half black and half red balls, designate red to be the treatment, and draw a ball from this 50/50 urn for each unit to determine treatment assignment. However, for any given finite sample of size N, random variation means that the share of units assigned to the treatment group may not be exactly 50% with this method. Simple randomization is sometimes necessary for selecting a sample (see below for assignment “on arrival”) or desirable to create transparency about the assignment process, but in most cases, permutation randomization (described next) is used.
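A quick Python illustration (entirely hypothetical, with a fixed seed for replicability): drawing an independent 50/50 “ball” for each unit shows that the realized treatment share fluctuates around, but generally does not equal, one half.

```python
import random

random.seed(42)  # fix the seed so the draws are replicable

N = 100
# Simple randomization: one independent 50/50 draw ("with replacement") per unit
assignment = [random.random() < 0.5 for _ in range(N)]

share_treated = sum(assignment) / N
print(share_treated)  # close to, but in general not exactly, 0.5
```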

Permutation randomization: If the targeted sample size N is known, one can improve on simple randomization by assigning exactly half of the experimental units to each group (assuming N is even; see the discussion below on misfits). This could be achieved with an urn with N balls like above, but for each unit the red or black ball is drawn “without replacement,” i.e., not returned to the urn (whereas for simple randomization the ball would be drawn “with replacement,” that is, returned into the urn). This is called permutation randomization, because any random assignment is just a permutation of the assignment of balls (treatment status) to experimental units.
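Drawing without replacement is equivalent to shuffling a list that contains exactly N/2 treatment labels. A minimal Python sketch (the unit names are hypothetical):

```python
import random

random.seed(123)  # fix the seed so the assignment is replicable

units = [f"unit_{i}" for i in range(10)]  # N = 10 experimental units

# Urn with exactly N/2 "treatment" and N/2 "control" balls, drawn without replacement
labels = ["treatment"] * (len(units) // 2) + ["control"] * (len(units) // 2)
random.shuffle(labels)

assignment = dict(zip(units, labels))
n_treated = sum(1 for g in assignment.values() if g == "treatment")
print(n_treated)  # exactly N/2 = 5, by construction
```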

Stratified randomization: Suppose we observe some covariate xj, and we know (or suspect) that the outcome varies with xj, that is, γj ≠ 0. Then any difference in the covariate between the treatment groups will lead to a difference in the average outcome that is unrelated to the actual treatment effect (the error term above). For example, if x is gender and Y is income, and men earn higher incomes than women on average (unfortunately still true in many settings), then a treatment group with a larger share of men than the control group will have higher average income, even if the treatment had no effect on income.

Ex ante, this issue can be prevented by balancing treatment assignment on these covariates. This approach is called block randomization or stratified randomization, and it simply means partitioning the sample into groups (strata or blocks) by values of xj and then carrying out permutation randomization within each stratum. For the gender example above, it would mean conducting the randomization separately within the sample of men and within the sample of women. With just one treatment, this implies that exactly half of all men and half of all women are assigned to the treatment group, so the treatment and comparison groups are balanced on the covariate, gender (for simplicity, this assumes the number of units in each block is even; more on this below in the discussion about misfits).
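Stratified randomization thus just repeats the permutation draw within each stratum. A Python sketch with a hypothetical gender stratification (sample and IDs are made up):

```python
import random

random.seed(2024)  # fix the seed so the assignment is replicable

# Hypothetical sample: unit ID and stratification covariate (gender)
sample = [("id_%02d" % i, "female" if i < 6 else "male") for i in range(12)]

def stratified_assign(units_in_stratum):
    """Permutation randomization within one stratum (even-sized for simplicity)."""
    labels = ["treatment"] * (len(units_in_stratum) // 2) \
           + ["control"] * (len(units_in_stratum) // 2)
    random.shuffle(labels)
    return dict(zip(units_in_stratum, labels))

assignment = {}
for stratum in ("female", "male"):
    members = [uid for uid, g in sample if g == stratum]
    assignment.update(stratified_assign(members))

# Exactly half of each stratum is treated, so gender is balanced by construction
treated_female = sum(1 for uid, g in sample
                     if g == "female" and assignment[uid] == "treatment")
print(treated_female)  # 3 of the 6 women
```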

There may be unobserved factors that affect outcome Y. In this case, we cannot form blocks based on these factors. However, if pre-trial observations of the outcome variable itself are available, and the outcome is reasonably stable, then the research design can simply stratify on pre-trial levels of the outcome variable.

In general, stratification increases the precision of the ATE estimate and therefore increases power, and stratification on pre-trial outcome levels is no different. But note that income stability in the example above is important here: in the extreme case where income is instead independent and identically distributed (iid) and changes from period to period, sorting people by income in a previous period would create strata whose incomes are identical in expectation, so the strata carry no information about the outcome. Because stratification reduces the degrees of freedom (in the data analysis, you would typically use stratum fixed effects), stratification when outcome differences between strata are small could even reduce statistical power.

Assigning units in a population to the sample

Randomization helps get an unbiased estimate of the ATE within the sample and thus supports internal validity, i.e., the estimated effect reflects the causal relationship between treatment and outcome.  External validity refers to the degree to which the estimated effect is generalizable, in other words, reflects a more general causal relationship “out of sample.”  So far we have taken the sample as given, but an RCT also requires selecting a sample from the population. External validity depends on how representative the sample is of the overall population and hence in part on how the sample for the study was selected.

Sometimes it is not necessary to select a sample, because the sample under study is the entire current population (for example, all households eligible for a specific social program at a given point in time). When this is not the case, the researcher should whenever possible attempt to determine the population of interest and then select the sample randomly from this underlying population (see below for some methods for determining the sampling frame, that is, the list of eligible units from which the sample is drawn).4 Up to random error, the sample will then be representative of the underlying population, and the ATE estimate for this population is not biased by systematic selection effects.

Once the sampling frame is determined, from a conceptual standpoint, assigning units from the sampling frame to the RCT sample works exactly the same way as assigning units to the treatment versus control group. Simple randomization is possible, but in most cases, permutation sampling is used, that is, a sample of fixed size (or a given share of the sampling frame) will be randomly selected.

In some other cases, random sampling from the population of interest is not practical or not even possible. So-called “convenience samples” may be selected for logistical, cost, or other external reasons, rather than because they are representative of an underlying population of interest. This could be, for example, all households in a given city or area, everyone who signed up for the experiment on MTurk, or a set of schools that were already scheduled for the next roll-out wave of a program. 

It is worth noting that “convenience” samples are by far the most common in the social sciences. For example, most if not all laboratory experiments in economics are conducted with convenience samples (often students at the university). However,  “convenience” can be a bit of a misnomer, as these designs are not always chosen purely for convenience or to reduce costs. While significant cost and information constraints on the part of the researcher or the partner organization can play a role, in field experiments, important ethical or legal considerations also often determine the sample, or at least restrict the population from which the sample can be selected. 

For example, there may be ethical reasons to not withhold a treatment from a comparison group who would otherwise be eligible to receive it (such as the set of schools already scheduled for the next roll-out wave of a program in the example above). In these instances, other randomization designs may be used; examples of research designs that address such constraints by changing the sampling frame so that the sample involves a population that is not (yet) eligible for the treatment include “randomization at the margin” or “phase-in” designs.5 Using one of these designs may be the only way to conduct a randomized experiment and can still be valuable for learning about the treatment effect, but such samples require greater care in drawing conclusions for the whole population of interest. The same holds true for the case where political constraints or the implementing organization’s priorities (e.g. regional focus) constrain the population from which the sample can be drawn. See below for some ways to improve external validity with convenience samples. See also How to Randomize and the discussions in Glennerster & Takavarasha (2013), Duflo et al. (2007), and  Heard et al. (2017).

Heterogeneous treatment effects and stratified sampling

The key reason for randomly sampling units from the full population of interest is that the treatment effect may be unit-specific (heterogeneous treatment effects6) and differ across different groups. Recall from above that the ATE estimate is just the average of the unit-specific treatment effects βi in the treatment group. Therefore, we generate an unbiased estimate of the ATE for the underlying population only if the treatment group is randomly selected from this population. The precision of this estimate increases with any measures that reduce the sampling variation. Proportionate stratified sampling (that is, selecting into the RCT sample a fixed share of each population stratum, where the share of the stratum in the sample is proportionate to its share in the population) on observed covariates that are suspected to affect the treatment effect reduces the variance of the sample relative to the underlying population. Essentially, stratification both at the sampling stage and the treatment group assignment stage creates treatment and control groups that are on average more similar to the underlying population (if stratified sampling is proportionate) as well as to each other.
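Proportionate stratified sampling can be sketched in a few lines of Python (the population, strata, and 20% sampling share are all hypothetical):

```python
import random

random.seed(7)  # fix the seed so the draw is replicable

# Hypothetical population: 40 urban and 60 rural units
population = [("u%03d" % i, "urban") for i in range(40)] \
           + [("r%03d" % i, "rural") for i in range(60)]

SAMPLE_SHARE = 0.2  # select 20% of each stratum

sample = []
for stratum in ("urban", "rural"):
    members = [uid for uid, s in population if s == stratum]
    k = round(SAMPLE_SHARE * len(members))    # 8 urban, 12 rural
    sample.extend(random.sample(members, k))  # permutation sampling within stratum

print(len(sample))  # 20 units; the 40%/60% stratum shares mirror the population
```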

In the context of heterogeneous treatment effects, there is a second reason to employ stratified sampling, namely when the objective is to detect differences in treatment effects between groups. As mentioned above, with stratified randomization, we typically estimate the treatment effect within strata; we may be additionally interested in comparing the ATE in different strata. This requires sufficient sample size within each stratum. Proportionate stratified sampling can ensure that individuals or groups are not inadvertently underrepresented in the sample due to random sampling, but if the stratum is overall small, this may not be sufficient (see also the power calculation section).  

If there are constraints on the total sample size, or if individuals/groups with the characteristics of interest occur with relatively low frequency in the population, the researcher may decide to use disproportionate stratified sampling (i.e., the frequency with which the types of interest appear in the sample is by design not proportionate to their appearance in the underlying population) and focus on the statistical power for estimating stratum-specific treatment effects. The statistical power to distinguish two treatment effects from each other is typically maximized if the number of units in each stratum is the same. 

Improving external validity with convenience samples

There are a few things the researcher can do to alleviate the constraints that arise when (full) random sampling is not possible, and by extension to improve external validity:

  • As much as possible, document criteria used to select the population. For example, if cost constraints mean that surveyors can visit no more than three villages within a day’s travel from the capital, describe how these villages were chosen.
  • Measure key characteristics of the sample population, such as wealth and income levels, demographics, and other covariates xj that are likely to influence outcome levels (see above). 
  • Within the constrained sampling population, use random sampling to select experimental units. For example, if the plan is to interview up to 600 mothers in the three villages above, these 600 mothers should be randomly sampled from all mothers in the three villages. 

In combination, these measures may facilitate generalizability later, for example by making it easier to combine convenience samples from different studies in a meta-analysis to construct population ATE estimates.

Cluster randomization and spillovers

So far, we have taken the unit of randomization i as given. The unit of randomization is often the same as the unit of observation: one individual, one household, or even one hospital or school, as long as data is collected at that same level. In some cases, however, the unit of randomization contains multiple units of observation. This is called cluster randomization. With cluster randomization, observational units are sorted into groups (clusters), and the randomization occurs at the cluster level rather than at the unit level.

A number of considerations related to the validity of the experiment may go into choosing a unit of randomization different from the unit of observation (see “implementation” below). Conceptually, the most important reason for cluster randomization is the potential for spillovers. Spillovers mean that the outcomes of untreated units are indirectly affected by the treatment given to others. Spillovers are often positive, but they may also be negative, for example if the beneficiaries of a job matching program fill all the available positions, putting untreated job seekers at a disadvantage. 

Formally, spillovers violate the stable unit treatment value assumption (SUTVA). SUTVA means that one unit’s treatment status has no effect on the outcomes of other units. A research design that does not appropriately account for spillovers yields biased ATE estimates. Consider an example where positive spillovers on food consumption from a cash benefit program arise because treated units use some of the cash they receive to increase the food consumption of others (e.g., by inviting them to meals or giving them gifts). There are a couple of ways in which spillovers could affect the ATE estimates, including:

  1. Unintended spillovers on the untreated control: If individuals in the control group are affected by the presence of the program, they no longer represent a good comparison. For example, if some of the cash given to treated households increases food consumption in the control group, we will underestimate the effect of the cash benefit.
  2. Missing spillover effects on the treated: Suppose the goal of the experiment is to estimate the treatment effect of a program in which all units are treated. When the full program is in effect, everyone receives their own cash benefit (and gives some of it away) but also receives gifts from others. An experiment would miss this second effect if the households that would, if treated, make gifts to the treatment group are not themselves treated.

If there is a high potential for spillovers at a given unit of randomization, the best solution can be to randomize treatment assignment at a higher level (i.e., use cluster randomization). As an example, spillovers may occur within a town, but not across towns, and so treatment can be clustered at the town level. In the cash benefit example, this could be the case if most households have their social network within their own geographic area. However, all else equal, clustered designs require a larger sample size to achieve the same level of statistical power (see the power calculations section). 

Cluster randomization methods

All methods used for unit randomization also apply to clusters: permutation randomization generates balance on the number of clusters in each treatment arm, and stratification can improve balance on other cluster-level characteristics. Note that with clusters of different sizes, it can be useful to also stratify on cluster size in order to have similarly-sized treatment groups. Cluster randomization does require adjustments in analysis (namely, clustering standard errors to account for within-cluster correlation) but from a randomization perspective is no more complicated than unit randomization.
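The point about stratifying on cluster size can be illustrated with a Python sketch (cluster names and household counts are hypothetical): clusters are split into above- and below-median size strata, and permutation randomization runs within each stratum, so each arm receives the same number of large and small clusters.

```python
import random

random.seed(99)  # fix the seed so the assignment is replicable

# Hypothetical clusters (e.g., towns) with their number of households
clusters = {"town_a": 120, "town_b": 80, "town_c": 300, "town_d": 40,
            "town_e": 250, "town_f": 60, "town_g": 180, "town_h": 90}

# Stratify on cluster size: above vs. below the median size
sizes = sorted(clusters.values())
median = (sizes[len(sizes) // 2 - 1] + sizes[len(sizes) // 2]) / 2

assignment = {}
for large in (True, False):
    stratum = [c for c, n in clusters.items() if (n > median) == large]
    labels = ["treatment"] * (len(stratum) // 2) + ["control"] * (len(stratum) // 2)
    random.shuffle(labels)
    assignment.update(zip(stratum, labels))

# Each arm gets two large and two small clusters, so arm sizes are similar
n_treated_hh = sum(clusters[c] for c, g in assignment.items() if g == "treatment")
print(n_treated_hh)
```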

Randomization designs to estimate spillover effects 

Under some assumptions, the randomization design can also help us measure spillover effects. This is done by randomizing at the cluster level and varying the “intensity” of treatment for the cluster, that is, the share of observational units in the cluster that receive the treatment, and then measuring treatment effects on both treated and untreated units within the cluster. For more information see Baird et al. (2014) or the discussion in 5.1 and 6.3 of Duflo et al. (2007).
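A two-stage design in the spirit of Baird et al. (2014) can be sketched as follows (a hypothetical Python illustration, with made-up cluster counts and intensity levels): stage 1 randomizes each cluster’s treatment intensity, and stage 2 randomizes units within each cluster at that intensity. Untreated units in partially treated clusters, compared with units in pure-control clusters, identify within-cluster spillovers.

```python
import random

random.seed(11)  # fix the seed so the assignment is replicable

# Hypothetical design: 6 clusters of 10 units each; cluster-level treatment
# intensities of 0%, 50%, or 100% (two clusters per intensity, by permutation)
clusters = {f"c{k}": [f"c{k}_u{i}" for i in range(10)] for k in range(6)}

intensities = [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
random.shuffle(intensities)  # stage 1: randomize intensity across clusters
cluster_intensity = dict(zip(clusters, intensities))

treated = set()
for cname, members in clusters.items():
    k = int(cluster_intensity[cname] * len(members))
    treated.update(random.sample(members, k))  # stage 2: randomize within cluster

# Untreated units in the 50% clusters identify within-cluster spillovers
# when compared against units in the pure-control (0%) clusters.
print(len(treated))  # 0 + 0 + 5 + 5 + 10 + 10 = 30 units treated
```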

Implementation

In this section we will deal with the issues above in reverse order: (1) choosing the unit of randomization, (2) choosing the sampling frame/selecting the sample, and (3) implementing the random assignment of units to treatment groups in practice while dealing with “misfits.” 

Choosing the unit of randomization

Normally, the most natural choice of randomization unit is the observational unit. The unit of observation may not always be the individual. For example, we may not be able to measure individual consumption, but only household consumption; we may only have data on hospital-level occurrence of complications in surgery, but no patient-level data, and so on. 

From a conceptual standpoint, the exception to this rule is spillovers (see above). With spillovers, the unit of randomization should be a cluster large enough to contain all spillovers. All units in the cluster must be treated in order to correctly estimate the full treatment effect (even if, say, not all units in the cluster are interviewed).7

Sometimes it is necessary to choose a unit of randomization other than the observational unit for other reasons besides spillovers.

Unit at which the treatment will be delivered: It is not always possible to assign the treatment at the observational-unit level. For example, if several households share water tanks, a study distributing water purification tablets can only randomize at the water tank level rather than at the household level. Note that the shared water tank creates a form of spillover of the treatment, even if from a conceptual standpoint there may not be any “true” spillovers (i.e., health outcomes for other households may not differ when only some households consume purified water).

Observer and experimental effects: Experimental effects may mean that there are “experimental spillovers” between units in the experiment. For example, the John Henry effect means that individuals in the control group react in some form to being in the experiment and in particular might emulate the treatment group. The difference with standard spillovers is that this is caused by individuals being in the experiment rather than being a property of the treatment itself.

The unit of randomization will affect the sample size needed to detect an effect. In general, moving up in level of randomization decreases the study’s effective sample size, meaning that more observational units are required to achieve a given level of power. For more information see the power calculations section and Choosing the Right Sample Size.
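The power cost of clustering is commonly summarized by the standard design effect, DEFF = 1 + (m - 1)ρ, where m is the number of units per cluster and ρ is the intracluster correlation; dividing the total number of observations by DEFF gives the effective sample size. A short Python sketch of this standard formula (the numbers are hypothetical):

```python
def effective_sample_size(n_clusters, m, rho):
    """Effective sample size under the standard design effect
    DEFF = 1 + (m - 1) * rho, with m units per cluster and
    intracluster correlation rho."""
    deff = 1 + (m - 1) * rho
    return n_clusters * m / deff

# Hypothetical example: 50 clusters of 20 units, intracluster correlation 0.05
n_eff = effective_sample_size(50, 20, 0.05)
print(round(n_eff, 1))  # 1,000 observations behave like ~512.8 independent ones
```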

The sampling frame

In order to create a random sample, we need a list of eligible units from which the sample will be drawn, the so-called sampling frame. 

Pre-existing lists

The ideal situation is one where the sampling frame is given by a pre-existing list, say, from the government or an NGO. If using one, be sure to understand how the list was created (to ensure that it is in fact representative of the larger target population) and when it was last updated. The definition of the observational or randomization unit, say “a business” or “a household,” is not always unambiguous. It is therefore important to clearly define the term and then follow up with a couple of households (or businesses, etc.) to check that the list matches the definition.

Sources of pre-existing lists include:

  • A list of respondents already created by the partner organization 
  • Administrative data (such as a list of hospital patients, students in local schools, etc.) 
  • Local government institutions such as the registry office, national agencies such as ministries, professional organizations, or agencies and institutions such as hospitals

Creating a sampling frame

Sometimes no list exists. This can be the case for populations such as undocumented immigrants or migrant workers, but also informal businesses or clients of a business or users of a specific service. Options for creating a sampling frame without a pre-existing list include: 

  • Following a standardized procedure, for example, the sampling frame could consist of every patient to visit the emergency room in a given time span, random digit dialing, or shoe leather sampling based on randomized distance from randomly chosen points8.
  • Conducting a community meeting to map all of the households or businesses in the area, etc. Note that this method may miss units and therefore may not be appropriate in all areas, for example in larger towns or when populations are mobile.
  • Door-to-door census listing: A census can be expensive and time-consuming but may be the only option. How long it will take will depend on how spread out the units are in the enumeration area, how much information is collected on each unit, and whether there are any administrative hurdles, e.g., permission from local leaders to work in the area. If doing a door-to-door listing exercise, keep the following points in mind:
    • Using the same survey team for the census and the actual data collection can make it easier for enumerators to find households later, but be sure to budget enough time to complete the listing exercise so that the survey start is not delayed. 
    • Collect sufficient contact information so that respondents can be found later. This includes high-quality phone numbers (e.g., the listing team can call the respondent on the spot to verify the number), names and potentially nicknames of the household head and spouse, etc. 
    • Collect all information needed to verify membership in the target population (e.g., eligibility for the tested program) as well as any variables needed for stratification. For example, a study that targets adult women will need to collect gender and age information on all household members and may also need to collect information on, say, household income in order to stratify on this variable.
    • It can be useful to draw a map of the area, with landmarks, and divide the area into zones. This can also help allocate households to surveyors for surveying.
    • Taking GPS readings during the household listing can help surveyors find respondents later but may result in unnecessary work to record readings of households that aren’t included in the final sample.

Multi-frame designs

Research teams may have access to multiple sources of eligible respondents (e.g., lists of customers from different mobile phone companies). If none of the lists is large enough on its own, one option is to pool the lists together to create one sampling frame. Two key advantages of this approach are 1) increasing the sample size (especially for target groups of interest), and 2) lowering the cost of sampling when particular frames are expensive to sample and can be partly replaced by frames that are less expensive to access (Lohr & Rao, 2006).

If pooling multiple frames, it is important to identify ex ante whether respondents appear on multiple lists. This is often done by including questions in the survey that identify all of the possible frames to which the respondent belongs (see guidance from the World Bank under the Guidelines on Sampling Design section for more information). Additionally, the correct calculation of sample weights for estimation and analysis can be complex and should be done with care (Wu, 2008). For a discussion of how to calculate weights in a multi-frame design, see Lohr & Rao (2006).9

Implementing random treatment assignment

Implementing random treatment assignment is easiest when the sample is known, i.e., there is a pre-existing list of experimental units. In this case, researchers typically perform (stratified) permutation randomization based on that list, using a software program such as Stata and a data file containing a unit ID, cluster ID, and stratification variables if applicable. This approach has the considerable benefit of verifiability and replicability, assuming certain steps (described below) are taken. Another option would be to perform a public lottery using an urn or by flipping a coin. Note that stratified randomization is still possible; one would simply assign treatment using an urn or a coin within each stratum. The advantage of this process is that it is very transparent; it may hence be preferable in instances where it is desirable or necessary to actually show participants that their treatment assignment is random. However, the main disadvantage is that it is not replicable. 

In some cases, the sample is not known at the time of random assignment, and a basic lottery for treatment assignment is the only option. For example, units could be assigned on arrival (such as when children enroll in school, patients arrive at the clinic, etc.), and the exact number to be enrolled may be unknown at the time of random assignment. Another example is random digit dialing in phone-based experiments, where it is unknown at the outset how many numbers called will actually be in service. The randomization might be built into the contact, consent, and enumeration protocol on arrival, for example using a coin flip or a draw from an urn, or using SurveyCTO’s built-in randomization engine.10 Note that in both cases stratified randomization is difficult as the final number of units per stratum is unknown at the time of random assignment.

Basic coding procedure for randomization

Regardless of method, the randomization procedure should be verifiable, replicable, and stable, and the randomization outcome saved in a secure file or folder away from other project files.

The basic procedure is always the same:

  1. Create a file that contains only one entry per randomization unit (e.g., one line per household, one line per cluster, etc.). This might mean creating a new file that temporarily drops all but one observational unit per cluster.
  2. Sort this file in a replicable and stable way. (Use stable sort in Stata, i.e., sort varlist, stable)
  3. Set the seed for the random number generator (in Stata, set seed). Make sure that the seed is:
    1. Preserved: Some operations (such as preserve/restore in Stata) erase the seed, and then any random number sequence following that is not determined by the seed anymore and therefore not replicable.
    2. Used only once across parallel operations: Every time the same seed is set, the exact same random number sequence is produced. If, for example, you are assigning daily N-sized batches of units to treatment arms and use the same seed for every batch, the units will be assigned in the same way every day. This could introduce unwanted patterns and imbalances.
  4. Randomly assign treatment groups to each randomization unit, then merge the random assignment back with the original file to obtain a list of all observational units with treatment assignments. 
  5. Save the list of observational units with treatment assignment in a secure location and program your routine so this list cannot be automatically overwritten.
  6. For any even slightly more complex randomization procedure, extensively test balance:
    1. In terms of sample size across treatment arms, within and across strata, to verify the correct handling of misfits (see below)
    2. In terms of covariates across treatment arms, to understand power and sample balance and make sure the stratification was done right (see also below)
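
The checks in step 6 can be run quickly in Stata. A minimal sketch, assuming the randomization file contains the variables treatment and stratum and a covariate such as baseline_income (variable names are illustrative):

```stata
* 6a: sample sizes by treatment arm, within and across strata
tabulate stratum treatment
tabulate treatment

* 6b: covariate balance across treatment arms, within strata
regress baseline_income i.treatment i.stratum
```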

Misfits

The above description of how to create balance using permutation randomization and stratification glossed over some important implementation details and in particular the problem of misfits. Misfits occur when the number of units in a given stratum is not a multiple of the number of treatments to assign (Bruhn & McKenzie 2011). In the simplest case of two groups--a single treatment and the control--and no strata, a misfit occurs when there is an odd number of units. In the two-group case, this can be resolved easily (by randomly determining the misfit’s status after other units have been assigned treatment/control), but it can become difficult to maintain balance within strata and globally as the numbers of treatments and strata, and therefore of misfits, grow.11

A simple example is with two treatment arms, to be assigned in ⅓ and ⅔ proportion, and three strata of size 10. Suppose first that units are assigned to achieve the best within-stratum balance, i.e., to preserve treatment allocation ratios within strata. The closest allocation to a 33.3% and 66.7% assignment within each stratum is 3 and 7 units, respectively. But for the total sample, this means that the assignment is 30% and 70%, so global balance is not as good as it could be.

             N           T1         T2
Stratum 1    10          3          7
Stratum 2    10          3          7
Stratum 3    10          3          7
All          30 (100%)   9 (30%)    21 (70%)

Alternatively, suppose misfits are assigned to achieve global balance, i.e., to preserve treatment allocation ratios globally.12 For at least one stratum, this results in poorer within-stratum balance:

             N           T1         T2
Stratum 1    10          3          7
Stratum 2    10          4          6
Stratum 3    10          3          7
All          30 (100%)   10 (33%)   20 (67%)

Note also that the solution from above (randomly determining the misfit’s status after other units have been assigned treatment/control) is only partially satisfactory. Suppose we randomize the assignment of the misfit in each stratum according to the underlying assignment strategy, i.e., return to the basic lottery for those units. This would mean using permutation sampling to assign 9 units in each stratum, 3 into T1 and 6 into T2, and then drawing the assignment of the 10th unit according to the ⅓ vs. ⅔ probabilities. With many strata, the final total allocation is likely to be close to balanced. However, with few strata, we may well end up with the first allocation above; or worse, we may get an allocation in which all three strata realize a 4:6 assignment, so that we are off the targeted assignment shares both within strata and globally.

It will almost never be the case that each stratum has a number of units that is exactly a multiple of the number of treatment arms, not least because this is at odds with proportionate stratified sampling, so any randomization procedure needs to deal with this issue. 
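
The number of misfits per stratum can be computed up front. A minimal sketch, assuming a stratum variable and treatment arms in equal ratios (here, 3 arms; adjust the modulus to the number of equal-sized assignment cells in your design):

```stata
* number of units per stratum that do not fit evenly into 3 arms
bysort stratum: gen n_misfits = mod(_N, 3)
tabulate stratum n_misfits
```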

Solutions/programming

Manual randomization

As written above, the basic procedure for randomization involves assigning a random order to slots and treatment arms to those slots. In the simplest example, with two treatment arms (one treatment and one control) assigned in equal proportions, the procedure in Stata is as follows:
 

clear all
* Set the version for upward compatibility 
* (e.g., Stata 15 users can use Stata 14) 
version 14.2 
use "randomizationdata.dta", clear
isid uniqueid // check for duplicates in the list of individuals
sort uniqueid, stable // sort on uniqueid for replicability
set seed 20200520 

/* generate a random (pseudo-random) variable consisting of 
draws from the uniform distribution */
generate random = runiform() 
bysort stratum: egen temp = rank(random), unique
bysort stratum: gen size=_N

* assign treatments to observations that fit evenly into treatment ratios:
gen treatment = temp>size/2+.5 

* randomly assign the misfit:
replace treatment = round(random) if temp==size/2+.5
 

As noted above, the procedure for assigning the misfits becomes increasingly tedious as the number of strata and treatments (and, hence, the potential number of misfits) grows, or as treatment allocations change. For example, in the above case there will be at most one misfit per stratum. If the treatment allocation changed--1/3 assigned to treatment and 2/3 to control--there could be two misfits per stratum. The first misfit can be randomly assigned a treatment condition, but the assignment for the second misfit (if there is one) will depend on that of the first to preserve balance. An extreme example of this, with six treatments and 72 strata, is given by Bruhn & McKenzie (2011). When there will be a large number of misfits, an alternative is to use the randtreat command described below.
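
As a sketch of how the manual approach extends to a ⅓ treatment / ⅔ control split, continuing from the variables stratum, random, temp (the within-stratum rank), and size created in the block above, one simple option is to assign each misfit rank by an independent ⅓ draw; note this preserves assignment probabilities but not exact counts:

```stata
* within each stratum, 3*floor(size/3) units fit evenly into a 1:2 split
gen n_fit = 3*floor(size/3)

* first third of the fitting ranks -> treatment, the rest -> control
gen treatment = temp <= n_fit/3

* misfits (ranks above n_fit): treatment with probability 1/3
* (runiform() continues the seeded sequence, so this stays replicable)
replace treatment = (runiform() < 1/3) if temp > n_fit
```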


Sampling (sample command)

In Stata, the sample command randomly selects units, without replacement, from the sampling frame. The default is that units are chosen with an equal probability of selection, but the command can accommodate stratification and different probabilities of selection. Clustered sampling and stratified sampling can be implemented using this command as follows:

  • Clustered sampling: The cluster is the sampling unit. The procedure involves separating the population into clusters/groups (e.g., villages or schools), randomly drawing a sample of clusters, and sampling units in the cluster (either all or a random sample of them). For example, in Stata, using  school_id as the cluster variable:
* treat each cluster as a single observation and drop duplicates 
duplicates drop school_id, force

sample x // x denotes % of clusters in the population to select
sort school_id

* merge sampled clusters back with original list, keeping sampled clusters only
merge 1:m school_id using "original_dataset", keep(match)
 
  • Stratified sampling can be proportionate or disproportionate:
    • Proportionate stratification: The share of each stratum in the sample matches its share in the population, ensuring that the sample is representative of the overall population and that small groups are not over- or underrepresented by chance. For example, if widowed households comprise 5% of the population, then they comprise 5% of the sample. In Stata, this can be done as follows:
sample x, by(stratum) 
/* stratum denotes the (categorical) stratifying variable (e.g., widow) 
and x denotes the percent to be sampled within each stratum. */

* to draw x rather than x% from each stratum, specify the count option
sample x, count by(stratum)
 
  • Disproportionate stratification: The strata sample is not proportionate to the strata population (e.g., widowed HHs comprise only 5% of the population but 20% of the sample). It is useful when you want to ensure sufficient power to detect heterogeneous treatment effects by stratum. In Stata, this can be done as follows:
sample x if stratum==1
sample y if stratum==2

/* x denotes the percent of stratum-1 observations to keep and 
y the percent of stratum-2 observations to keep; observations 
outside the stratum being sampled are left unchanged */
 

randtreat command

The user-written randtreat command (additional documentation) can perform random assignment of multiple treatments across strata, in equal or unequal ratios, and has several options for dealing with misfits. In particular, the user can decide whether to preserve balance within strata or globally (when the two are at odds), or can specify that misfits’ treatment status be set to missing and dealt with manually.
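
For example, a stratified version of the ⅓ vs. ⅔ design above could look as follows (option names as documented in Carril 2017; treat this as a sketch and check the installed help file):

```stata
* install once: ssc install randtreat
randtreat, generate(treatment) strata(stratum) ///
    unequal(1/3 2/3) misfits(global) setseed(20200520)

* verify the handling of misfits within and across strata
tabulate stratum treatment
```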

Balance tests and re-randomization

Balance tests

A balance test checks whether the randomization “worked” beyond just assigning the right number of units in each treatment arm, by formally testing for differences in observable characteristics between the treatment and control groups. 

Balance tests on covariates are often reported in the first table of an RCT paper. To test for differences, regression is generally preferred to t-tests, because it allows for the correction of standard errors (e.g., clustering, making them robust to heteroskedasticity, bootstrapping) and the inclusion of fixed effects (e.g., enumerator, strata, etc.). Balance test regressions should use the same specification as your final regression when possible. For example, if you stratified the randomization, you will include strata fixed effects in your main regression and should also include them when checking for balance on those covariates that are not used for stratification--this way, you are checking for balance within strata, not across strata.

Suppose for example we conducted stratified cluster randomization. In Stata, testing for balance would look like

reg covariate treatment i.stratum, vce(cluster cluster_id)
 

where the coefficient on treatment indicates whether there is within-strata balance on average (though determining whether there is balance within a given stratum requires either interacting the treatment and stratum variables or restricting the sample to the stratum/strata of interest).
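
Testing covariates one at a time invites multiple-comparison issues, so a common complement is an omnibus test that all covariates are jointly unrelated to treatment. A sketch, with illustrative covariate names:

```stata
* regress treatment status on all covariates and jointly test them
reg treatment cov1 cov2 cov3 i.stratum, vce(cluster cluster_id)
test cov1 cov2 cov3
```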

Balance tests may be most useful when there are reasons to doubt that the randomization was done correctly (see McKenzie 2017 or Altman 1985), and there is some discussion of how informative or useful they are when the research team itself carried out the randomization. Furthermore, there can be drawbacks to balance checks even if you do trust the randomization--conducting baseline surveys and other interactions with research teams can lead to Hawthorne effects, as discussed in Evans (2014) and Friedman & Gokul (2014).13 If baseline data will only be collected for baseline tables, not in order to stratify the sample, an alternative is to collect time-invariant characteristics in the endline (e.g., race, gender, etc.) and check those for balance ex post.

For a comparison of the relative merits of ex-ante stratification, pairwise matching, ex-post re-randomization to achieve balance, etc., see Bruhn & McKenzie (2009); an associated blog post goes into the mechanics of stratification for balance further.

Using the results of a balance test

Questions to consider:

  • How many differences are there (and are there more than you’d expect)? If you are testing for balance at the 5% level, you would expect to see statistically significant differences between the treatment and control groups in roughly 5% of your covariates.
  • What are the magnitudes of the differences? Are the differences economically/practically significant? 
  • In which variables are the imbalances? Consider in particular:
    • Covariates that may be correlated with treatment take-up
    • Covariates that may be correlated with attrition based on previous literature or ex-post observed attrition, which could lead to attrition that is differential by treatment status
    • Covariates that may be correlated with the main outcome variable: Imbalanced covariates are frequently used as controls in analysis, though some researchers include them as is while others recommend de-meaning them first (Imbens & Rubin 2015). See also Bruhn & McKenzie (2009) and Athey & Imbens (2017)
    • The main outcomes: If you find a pre-trial imbalance in a main outcome, you may want to change your analysis to a difference-in-difference approach or consider controlling for the baseline level of the outcome variable in your final regression.
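
The last point can be implemented with an ANCOVA-type specification that controls for the baseline level of the outcome. A sketch, assuming illustrative variable names y_endline and y_baseline and a stratified, clustered design:

```stata
reg y_endline treatment y_baseline i.stratum, vce(cluster cluster_id)
```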

Re-randomization

Due to random chance, balance may not be achieved on key variables with any given random sample draw. This introduces error into the comparison of treatment and control outcomes (see the equation above: when γj ≠ 0, i.e., there are imbalances on variables that are likely to be correlated with the main outcome). Many papers solve this problem by re-randomizing. One approach is to carry out the randomization procedure many times, select only the draws that were balanced, and then randomly select one of these draws for the study (Glennerster & Takavarasha 2013). However, with re-randomization, not every combination of treatment assignments is equally probable; as conventional significance tests assume that each combination is equally likely, this should be accounted for in analysis. 
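
A simplified "draw until balanced" variant of this procedure can be sketched in Stata as follows (an illustration only, assuming variables uniqueid and baseline_income; applications following the approach above would instead store many balanced draws and pick one at random):

```stata
set seed 20200520
local p = 0
while `p' < 0.10 {
    capture drop random treatment
    sort uniqueid, stable
    generate random = runiform()
    sort random
    generate treatment = (_n > _N/2)   // assign half to treatment

    * balance check on a key covariate; redraw if p-value below 0.10
    reg baseline_income treatment
    local p = 2*ttail(e(df_r), abs(_b[treatment]/_se[treatment]))
}
```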

Bruhn & McKenzie (2009) show that not adjusting for randomization method, including re-randomization, results in standard errors that are on average overly conservative (though in a nontrivial number of cases they tested this was not the case). They also recommend including the variables used to check balance as linear covariates in the regression, so that estimation of the treatment effect is conditional on the variables used for checking balance. A challenge with this approach is that controlling for the covariates used to check balance may still not perfectly account for how the assignment combination probabilities changed. This is problematic for calculation of exact p-values if using randomization inference, as the calculation requires knowing the probability with which all potential treatment/control assignments that were possible under the research design could have occurred (Athey & Imbens 2017). As a practical matter, Bruhn & McKenzie (2009) recommend making it clear in balance tables which covariates were targeted for balance, as overall balance will be overstated by only looking at covariates on which balance was achieved through re-randomization. 

An alternative approach is to consider, before implementing randomization, whether there are covariates on which imbalance would not be acceptable, and to stratify on them, so that balance on key covariates is achieved by construction (Athey & Imbens 2017). Moreover, Athey and Imbens make the point that re-randomization in practice turns into a form of stratification; for example, re-randomizing to achieve balance on gender becomes an indirect method of stratifying on gender. As with re-randomization, stratification means that not every combination of treatment assignments is equally probable. Unlike re-randomization, however, the researcher knows exactly how these probabilities have changed and can thus calculate exact p-values if desired.

Discussions on re-randomization include Bruhn & McKenzie (2009), Athey & Imbens (2017), and Glennerster & Takavarasha (2013). Theoretical papers include Morgan and Rubin (2012) and Banerjee, Snowberg, and Chassang (2017).

Last updated February 2022.

These resources are a collaborative effort. If you notice a bug or have a suggestion for additional content, please fill out this form

Acknowledgments

We thank Megan Lang for thoughtful suggestions and comments. Liz Cao copy-edited this resource. All errors are our own.

1.
Potential differences in the error variance between groups--for example when there are heterogeneous treatment effects--suggest the use of Eicker-Huber-White heteroskedasticity-consistent standard errors in the analysis, and this would affect the power calculations as well. 
2.
Again, this assumes homoscedasticity, that is, the error variance is the same in both groups and there are no heterogeneous treatment effects, but with heteroscedasticity, an equal split is in many cases still close to optimal, see e.g., Athey & Imbens (2016).
3.
The full argument against randomization involves priors on the effect sizes; Banerjee et al. (2020) make a formal counter-argument based on satisfying observers with divergent priors.
4.
Random sampling means that each sampling unit has a positive, i.e., non-zero, and known probability of being included in the sample.
5.
See How to Randomize for more information on phase-in and other designs.
6.
An exception is randomization designs to estimate spillover effects, where the share of observational units treated varies between clusters.
7.
More on heterogeneous treatment effects is also covered in EGAP’s corresponding methods guide.
8.
See the experimental design description in Blimpo & Dower (2019).
9.
 See the data analysis resource for more information on when and how to use weights in analysis.     
10.
The World Bank additionally has sample SurveyCTO code for taking random draws of beneficiaries.
11.
 See Bruhn & McKenzie (2011) for an example of the misfits problem with six treatment arms and 72 strata.
12.
In practice, this could be done by creating a new stratum of misfits, then randomly assigning treatments within it (Carril 2017).
13.
The Hawthorne effect describes a situation where individuals alter their behavior simply as a reaction to being observed.
    Additional Resources
    Assigning randomization
    1. J-PAL’s lecture on How to Randomize

    2. EGAP’s corresponding methods guide

    Implementation
    1. J-PAL's lecture on Choosing the Right Sample Size

    2. Stata commands:  randtreat (additional documentation) and sample 

    3. SurveyCTO’s built-in randomization engine

    4. World Bank DIME Wiki resource on SurveyCTO random draws of beneficiaries 

    Altman, Douglas G. 1985. “Comparability of Randomised Groups.” The Statistician 34 (1): 125. doi:10.2307/2987510.

    Angrist, Joshua D., and Jörn-Steffen Pischke. 2013. Mastering 'Metrics: The Path from Cause to Effect. Princeton University Press: Princeton, NJ.

    Ashraf, Nava, James Berry, and Jesse M Shapiro. 2010. “Can Higher Prices Stimulate Product Use? Evidence from a Field Experiment in Zambia.” American Economic Review 100 (5): 2383–2413. doi:10.1257/aer.100.5.2383.

    Athey, Susan and Guido Imbens. 2017. “The Econometrics of Randomized Experiments.” Handbook of Economic Field Experiments, 73–140. doi:10.1016/bs.hefe.2016.10.003.

    Baird, Sarah, J. Aislinn Bohren, Craig Mcintosh, and Berk Ozler. 2014. “Designing Experiments to Measure Spillover Effects.” SSRN Electronic Journal. doi:10.2139/ssrn.2505070.

    Banerjee, Abhijit V., Sylvain Chassang, Sergio Montero, and Erik Snowberg. 2020. “A Theory of Experimenters: Robustness, Randomization, and Balance.” American Economic Review 110 (4): 1206–30. doi:10.1257/aer.20171634.

    Beaman, Lori, Dean Karlan, Bram Thuysbaert, and Christopher Udry. 2013. “Profitability of Fertilizer: Experimental Evidence from Female Rice Farmers in Mali.” doi:10.3386/w18778.

    Biau, David Jean, Brigitte M. Jolles, and Raphaël Porcher. 2010. “P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers.” Clinical Orthopaedics and Related Research 468(3): 885–892. doi:10.1007/s11999-009-1164-4.

    Blimpo, Moussa. 2019. “Asymmetry in Civic Information: An Experiment on Tax Incidence among SMEs in Togo.” AEA Randomized Controlled Trials. doi:10.1257/rct.4394-1.0. Last accessed June 10, 2020.

    Bruhn, Miriam, Dean Karlan, and Antoinette Schoar. 2018. The Impact of Consulting Services on Small and Medium Enterprises: Evidence from a Randomized Trial in Mexico. Journal of Political Economy 126(2): 635-687. https://doi.org/10.1086/696154

    Bruhn, Miriam and David McKenzie. 2009. “In Pursuit of Balance: Randomization in Practice in Development Field Experiments.” American Economic Journal: Applied Economics 1 (4): 200–232. doi:10.1257/app.1.4.200.

    Bruhn, Miriam and David McKenzie. “Tools of the trade: Doing Stratified Randomization with Uneven Numbers in Some Strata." World Bank Development Impact Blog, November 6, 2011. Last accessed June 10, 2020. https://blogs.worldbank.org/impactevaluations/tools-of-the-trade-doing-stratified-randomization-with-uneven-numbers-in-some-strata

    Carril, Alvaro. 2017. “Dealing with Misfits in Random Treatment Assignment.” The Stata Journal: Promoting Communications on Statistics and Stata 17 (3): 652–67. doi:10.1177/1536867x1701700307.

    Deaton, Angus and Nancy Cartwright. 2018. "Understanding and Misunderstanding Randomized Controlled Trials." Social Science & Medicine 210: 2–21. https://doi.org/10.1016/j.socscimed.2017.12.005 [ungated version]

    Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2007. “Using Randomization in Development Economics Research: A Toolkit.” Handbook of Development Economics, 3895–3962. doi:10.1016/s1573-4471(07)04061-2.

    Evans, David. “The Hawthorne Effect: What Do We Really Learn from Watching Teachers (and Others)?” World Bank Development Impact (blog), February 17, 2014. https://blogs.worldbank.org/impactevaluations/hawthorne-effect-what-do-we-really-learn-watching-teachers-and-others. Last accessed June 10, 2020.

    Fisher, Ronald. 1935. The Design of Experiments. Oliver and Boyd: Edinburgh, UK.

    Friedman, Jed and Brinda Gokul. “Quantifying the Hawthorne Effect.” World Bank Development Impact (blog), October 16, 2014. http://blogs.worldbank.org/impactevaluations/quantifying-hawthorne-effect. Last accessed June 10, 2020.

    Glennerster, Rachel and Kudzai Takavarasha. 2013. Running Randomized Evaluations: A Practical Guide. Princeton University Press: Princeton, NJ.

    Heard, Kenya, Elizabeth O’Toole, Rohit Naimpally, and Lindsey Bressler. 2017. Real-World Challenges to Randomization and Their Solutions. J-PAL North America.

    Imbens, Guido W. and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge: Cambridge University Press, 2015. doi:10.1017/CBO9781139025751.

    Kasy, Maximilian. 2016. “Why Experimenters Might Not Always Want to Randomize, and What They Could Do Instead.” Political Analysis 24 (3): 324–38. doi:10.1093/pan/mpw012.

    Lohr, Sharon, and J. N. K. Rao. 2006. “Estimation in Multiple-Frame Surveys.” Journal of the American Statistical Association 101 (475): 1019–1030. www.jstor.org/stable/27590779

    McKenzie, David “Should we require balance t-tests of baseline observables in randomized experiments?” World Bank Development Impact (blog), June 26, 2017. https://blogs.worldbank.org/impactevaluations/should-we-require-balance-t-tests-baseline-observables-randomized-experiments. Last accessed June 10, 2020.

    Morgan, Kari Lock and Donald B. Rubin. 2012. “Rerandomization to Improve Covariate Balance in Experiments.” The Annals of Statistics 40 (2): 1263–82. doi:10.1214/12-aos1008.

    Neyman, Jerzy. 1923. “On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9.” Statistical Science 5 (4): 465–472. Trans. Dorota M. Dabrowska and Terence P. Speed.

    Neyman, Jerzy. "On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection." Journal of the Royal Statistical Society 97, no. 4 (1934): 558-625. Accessed June 15, 2020. doi:10.2307/2342192.

    Rubin, Donald B. 2005. "Causal Inference Using Potential Outcomes: Design, Modeling, Decisions." Journal of the American Statistical Association 100(469): 322-331. DOI 10.1198/016214504000001880

    Wu, Changbao. “Multiple-frame Sampling.” In Encyclopedia of Survey Research Methods, edited by Paul J. Lavrakas, 488–489. California: SAGE Publications, Inc., 2008. http://dx.doi.org/10.4135/9781412963947.