Research Resources

Randomization

Summary

Randomization for causal inference has a storied history. Controlled randomized experiments were invented by Charles Sanders Peirce and Joseph Jastrow in 1884. Jerzy Neyman introduced stratified sampling in 1934. Ronald A. Fisher expanded on and popularized the idea of randomized experiments and introduced hypothesis testing on the basis of randomization inference in 1935. The potential outcomes framework that formed the basis for the Rubin causal model originates in Neyman’s Master’s thesis from 1923.

In this section, we briefly sketch the conceptual basis for using randomization and stratified sampling before outlining the different randomization methods. We then provide code samples and commands to carry out more complex randomization procedures, such as stratified randomization with many treatment arms.

Assigning units in a sample to treatment groups

We begin with random treatment assignment. Assume for the moment that the unit of randomization is given (call it i). The unit might be individuals, groups of people, households, or areas; this is discussed more below under choosing the unit of randomization. Assume also that we are in the simplest case of just one treatment and that the goal is to estimate the average treatment effect (ATE) on some outcome Y.

Formally, we can write the outcome of each unit in either the treatment or control group (using notation from Deaton & Cartwright 2018) as:

$$ Y_i = \beta_iT_i +\sum_{j=1}^J \gamma_jx_{ij} $$

where

  • βi is the treatment effect (which may be unit-specific)1

  • Ti is a treatment dummy

  • xij are j=1,...,J observable and unobservable, unit-specific factors that may affect outcome Y

  • γj indicates the effect of xj on Y and may be positive or negative.

With a given experimental sample of size N and given treatment group assignment, we can take the average of the treatment group (T=1) and the comparison group (T=0) to estimate the ATE (see also the data analysis section):

$$ \bar{Y_1} - \bar{Y_0} = \bar{\beta_1}+\sum_{j=1}^J \gamma_j(\bar{x}_{1j}-\bar{x}_{0j}) $$

$\bar{\beta}_1$ is the average treatment effect; the bar and subscript indicate that this estimate is the average of the unit-level treatment effects in the treatment group. The second term is the “error term” of the ATE estimate: the average difference between treatment and control group that is unrelated to treatment (arising from observable and unobservable differences).

Randomization: Randomized assignment of treatment and control ensures that the xj are uncorrelated with the treatment assignment, and so the ATE estimate is ex ante unbiased: the error term is zero in expectation and $\bar{\beta}_1$ equals the true ATE in expectation. If we were to repeat the experiment with many N-sized samples, the average error term would be zero and the average of $\bar{\beta}_1$ would equal the ATE.

Assignment shares: In any given sample, the error term will likely not be zero and $\bar{\beta}_1$ will not equal the ATE. However, as the sample size N grows, the variance of both around their true means shrinks. As a result, a larger sample size increases the statistical power of tests about the ATE.

  • In the simplest possible regression framework, it is assumed that the treatment effects are homogeneous, so that $\beta_i=\beta$. We can then test the null hypothesis that $\beta=0$, using the fact that the sample ATE above is approximately normally distributed in large samples. Standard power calculation formulas apply.1
  • With randomization inference, we can directly test the exact null that $\beta_i=0$ for all i. To do so, we construct the distribution of the ATE by enumerating all possible treatment assignments in the sample and calculating the hypothetical ATE for each assignment using the realized sample outcomes. The exact p-value is then the probability of observing an ATE at least as large as the realized sample ATE, i.e., the fraction of hypothetical ATEs that are at least as large as the realized sample ATE.
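This enumeration is straightforward for small samples. The following Python sketch (the outcome numbers and assignment are purely hypothetical toy data) lists every way to assign 3 of 6 units to treatment and computes the exact p-value:

```python
from itertools import combinations

# Toy data: realized outcomes Y_i and the realized treatment group (indices)
outcomes = [5.0, 3.0, 4.5, 2.0, 6.0, 2.5]
treated = {0, 2, 4}

def ate(treat_idx):
    """Difference in means between treated and untreated units."""
    t = [outcomes[i] for i in treat_idx]
    c = [outcomes[i] for i in range(len(outcomes)) if i not in treat_idx]
    return sum(t) / len(t) - sum(c) / len(c)

realized = ate(treated)

# Enumerate every possible assignment of 3 of the 6 units to treatment
hypothetical = [ate(set(idx)) for idx in combinations(range(len(outcomes)), 3)]

# Exact p-value: fraction of hypothetical ATEs at least as large as the realized one
p_exact = sum(h >= realized for h in hypothetical) / len(hypothetical)
print(p_exact)  # prints 0.05: 1 of the 20 possible assignments
```

With N units and half treated, the number of assignments grows combinatorially, so in practice the distribution is often approximated by drawing a large random subset of permutations rather than enumerating all of them.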

In both cases, the greatest statistical power to reject the null is achieved when the sample is assigned in equal proportions to the treatment and control group.2 More generally, the optimal randomization design assigns a fixed share of the sample to each group, though the optimal shares need not be equal, for example when treatment arms differ in cost (see the power calculations section and section 4.1 of Duflo et al. (2007) for power calculations with cost considerations).

Improved balance: Any change to the randomization design that can reduce the variance in the error term, $ \sum_{j=1}^J \gamma_j(\bar{x}_{1j}-\bar{x}_{0j}) $, will improve precision and statistical power.

If all the effect sizes 𝛾j  were known and all xj  observed, the researcher could assign units to the treatment and control group so that the total error term is as close as possible to zero (see Kasy 2016). Note that this would actually mean not randomizing. However, in most settings, the elements of the error term are not known or completely observed.3 The researcher can, however, make sure that the units in the treatment group and the control group are balanced on the observable covariates that are likely to be correlated with Y. Balance means that the distribution of those xij  in the treatment group is the same as or similar to that in the control group. The main methods used for this are stratification, also called blocking, and re-randomization; both are described below.

Randomization methods

Conceptually, randomization simply means that every experimental unit has the same probability of being assigned to a given group (e.g., a 50% probability of treatment with two equally sized groups).

Simple randomization/basic lottery: One could literally implement simple randomization with a lottery. For example, one could set up an urn with half black and half red balls, designate red to be the treatment, and draw a ball from this 50/50 urn for each unit to determine treatment assignment. However, for any given finite sample of size N, random variation means that the share of units assigned to the treatment group may not be exactly 50% with this method. Simple randomization is sometimes necessary for selecting a sample (see below for assignment “on arrival”) or desirable to create transparency about the assignment process, but in most cases, permutation randomization (described next) is used.
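A quick Python illustration (entirely hypothetical, with a fixed seed for replicability): drawing an independent 50/50 “ball” for each unit shows that the realized treatment share fluctuates around, but generally does not equal, one half.

```python
import random

random.seed(42)  # fix the seed so the draws are replicable

N = 100
# Simple randomization: one independent 50/50 draw ("with replacement") per unit
assignment = [random.random() < 0.5 for _ in range(N)]

share_treated = sum(assignment) / N
print(share_treated)  # close to, but in general not exactly, 0.5
```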

Permutation randomization: If the targeted sample size N is known, one can improve on simple randomization by assigning exactly half of the experimental units to each group (assuming N is even; see the discussion below on misfits). This could be achieved with an urn with N balls like above, but for each unit the red or black ball is drawn “without replacement,” i.e., not returned to the urn (whereas for simple randomization the ball would be drawn “with replacement,” that is, returned into the urn). This is called permutation randomization, because any random assignment is just a permutation of the assignment of balls (treatment status) to experimental units.
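Drawing without replacement is equivalent to shuffling a list that contains exactly N/2 treatment labels. A minimal Python sketch (the unit names are hypothetical):

```python
import random

random.seed(123)  # fix the seed so the assignment is replicable

units = [f"unit_{i}" for i in range(10)]  # N = 10 experimental units

# Urn with exactly N/2 "treatment" and N/2 "control" balls, drawn without replacement
labels = ["treatment"] * (len(units) // 2) + ["control"] * (len(units) // 2)
random.shuffle(labels)

assignment = dict(zip(units, labels))
n_treated = sum(1 for g in assignment.values() if g == "treatment")
print(n_treated)  # exactly N/2 = 5, by construction
```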

Stratified randomization: Suppose we observe some covariate xj, and we know (or suspect) that the outcome varies with xj, that is, γj ≠ 0. Then any difference in the covariate between the treatment groups will lead to a difference in the average outcome that is unrelated to the actual treatment effect (the error term above). For example, if x is gender and Y is income, and men earn higher incomes than women on average (unfortunately still true in many settings), then a treatment group with a larger share of men than the control group will have higher average income, even if the treatment had no effect on income.

Ex ante, this issue can be prevented by balancing treatment assignment on these covariates. This approach is called block randomization or stratified randomization, and it simply means partitioning the sample into groups (strata or blocks) by values of xj and then carrying out permutation randomization within each stratum. For the gender example above, it would mean conducting the randomization separately within the sample of men and within the sample of women. With just one treatment, this implies that exactly half of all men and half of all women are assigned to the treatment group, so the treatment and comparison groups are balanced on the covariate, gender (for simplicity, this assumes the number of units in each block is even; more on this below in the discussion about misfits).
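Stratified randomization thus just repeats the permutation draw within each stratum. A Python sketch with a hypothetical gender stratification (sample and IDs are made up):

```python
import random

random.seed(2024)  # fix the seed so the assignment is replicable

# Hypothetical sample: unit ID and stratification covariate (gender)
sample = [("id_%02d" % i, "female" if i < 6 else "male") for i in range(12)]

def stratified_assign(units_in_stratum):
    """Permutation randomization within one stratum (even-sized for simplicity)."""
    labels = ["treatment"] * (len(units_in_stratum) // 2) \
           + ["control"] * (len(units_in_stratum) // 2)
    random.shuffle(labels)
    return dict(zip(units_in_stratum, labels))

assignment = {}
for stratum in ("female", "male"):
    members = [uid for uid, g in sample if g == stratum]
    assignment.update(stratified_assign(members))

# Exactly half of each stratum is treated, so gender is balanced by construction
treated_female = sum(1 for uid, g in sample
                     if g == "female" and assignment[uid] == "treatment")
print(treated_female)  # 3 of the 6 women
```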

There may be unobserved factors that affect outcome Y. In this case, we cannot form blocks based on these factors. However, if pre-trial observations of the outcome variable itself are available, and the outcome is reasonably stable, then the research design can simply stratify on pre-trial levels of the outcome variable.

In general, stratification increases the precision of the ATE estimate and therefore increases power, and stratification on pre-trial outcome levels is no different. But note that income stability in the example above is important here: in the extreme case where income is instead independent and identically distributed (iid) and changes from period to period, sorting people by income in a previous period would create strata whose incomes are identical in expectation, so the strata carry no information about the outcome. Because stratification reduces the degrees of freedom (in the data analysis, you would typically use stratum fixed effects), stratification when outcome differences between strata are small could even reduce statistical power.

Assigning units in a population to the sample

Randomization helps get an unbiased estimate of the ATE within the sample and thus supports internal validity, i.e., the estimated effect reflects the causal relationship between treatment and outcome.  External validity refers to the degree to which the estimated effect is generalizable, in other words, reflects a more general causal relationship “out of sample.”  So far we have taken the sample as given, but an RCT also requires selecting a sample from the population. External validity depends on how representative the sample is of the overall population and hence in part on how the sample for the study was selected.

Sometimes it is not necessary to select a sample, because the sample under study is the entire current population (for example, all households eligible for a specific social program at a given point in time). When this is not the case, the researcher should whenever possible attempt to determine the population of interest and then select the sample randomly from this underlying population (see below for some methods for determining the sampling frame, that is, the list of eligible units from which the sample is drawn).4 Up to random error, the sample will then be representative of the underlying population, and the ATE estimate for this population is not biased by systematic selection effects.

Once the sampling frame is determined, from a conceptual standpoint, assigning units from the sampling frame to the RCT sample works exactly the same way as assigning units to the treatment versus control group. Simple randomization is possible, but in most cases, permutation sampling is used, that is, a sample of fixed size (or a given share of the sampling frame) will be randomly selected.

In some other cases, random sampling from the population of interest is not practical or not even possible. So-called “convenience samples” may be selected for logistical, cost, or other external reasons, rather than because they are representative of an underlying population of interest. This could be, for example, all households in a given city or area, everyone who signed up for the experiment on MTurk, or a set of schools that were already scheduled for the next roll-out wave of a program. 

It is worth noting that “convenience” samples are by far the most common in the social sciences. For example, most if not all laboratory experiments in economics are conducted with convenience samples (often students at the university). However,  “convenience” can be a bit of a misnomer, as these designs are not always chosen purely for convenience or to reduce costs. While significant cost and information constraints on the part of the researcher or the partner organization can play a role, in field experiments, important ethical or legal considerations also often determine the sample, or at least restrict the population from which the sample can be selected. 

For example, there may be ethical reasons to not withhold a treatment from a comparison group who would otherwise be eligible to receive it (such as the set of schools already scheduled for the next roll-out wave of a program in the example above). In these instances, other randomization designs may be used; examples of research designs that address such constraints by changing the sampling frame so that the sample involves a population that is not (yet) eligible for the treatment include “randomization at the margin” or “phase-in” designs.5 Using one of these designs may be the only way to conduct a randomized experiment and can still be valuable for learning about the treatment effect, but such samples require greater care in drawing conclusions for the whole population of interest. The same holds true for the case where political constraints or the implementing organization’s priorities (e.g. regional focus) constrain the population from which the sample can be drawn. See below for some ways to improve external validity with convenience samples. See also How to Randomize and the discussions in Glennerster & Takavarasha (2013), Duflo et al. (2007), and  Heard et al. (2017).

Heterogeneous treatment effects and stratified sampling

The key reason for randomly sampling units from the full population of interest is that the treatment effect may be unit-specific (heterogeneous treatment effects6) and differ across different groups. Recall from above that the ATE estimate is just the average of the unit-specific treatment effects βi in the treatment group. Therefore, we generate an unbiased estimate of the ATE for the underlying population only if the treatment group is randomly selected from this population. The precision of this estimate increases with any measures that reduce the sampling variation. Proportionate stratified sampling (that is, selecting into the RCT sample a fixed share of each population stratum, where the share of the stratum in the sample is proportionate to its share in the population) on observed covariates that are suspected to affect the treatment effect reduces the variance of the sample relative to the underlying population. Essentially, stratification both at the sampling stage and the treatment group assignment stage creates treatment and control groups that are on average more similar to the underlying population (if stratified sampling is proportionate) as well as to each other.
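Proportionate stratified sampling can be sketched in a few lines of Python (the population, strata, and 20% sampling share are all hypothetical):

```python
import random

random.seed(7)  # fix the seed so the draw is replicable

# Hypothetical population: 40 urban and 60 rural units
population = [("u%03d" % i, "urban") for i in range(40)] \
           + [("r%03d" % i, "rural") for i in range(60)]

SAMPLE_SHARE = 0.2  # select 20% of each stratum

sample = []
for stratum in ("urban", "rural"):
    members = [uid for uid, s in population if s == stratum]
    k = round(SAMPLE_SHARE * len(members))    # 8 urban, 12 rural
    sample.extend(random.sample(members, k))  # permutation sampling within stratum

print(len(sample))  # 20 units; the 40%/60% stratum shares mirror the population
```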

In the context of heterogeneous treatment effects, there is a second reason to employ stratified sampling, namely when the objective is to detect differences in treatment effects between groups. As mentioned above, with stratified randomization, we typically estimate the treatment effect within strata; we may be additionally interested in comparing the ATE in different strata. This requires sufficient sample size within each stratum. Proportionate stratified sampling can ensure that individuals or groups are not inadvertently underrepresented in the sample due to random sampling, but if the stratum is overall small, this may not be sufficient (see also the power calculation section).  

If there are constraints on the total sample size, or if individuals/groups with the characteristics of interest occur with relatively low frequency in the population, the researcher may decide to use disproportionate stratified sampling (i.e., the frequency with which the types of interest appear in the sample is by design not proportionate to their appearance in the underlying population) and focus on the statistical power for estimating stratum-specific treatment effects. The statistical power to distinguish two treatment effects from each other is typically maximized if the number of units in each stratum is the same. 

Improving external validity with convenience samples

There are a few things the researcher can do to alleviate the constraints that arise when (full) random sampling is not possible, and by extension to improve external validity:

  • As much as possible, document criteria used to select the population. For example, if cost constraints mean that surveyors can visit no more than three villages within a day’s travel from the capital, describe how these villages were chosen.
  • Measure key characteristics of the sample population, such as wealth and income levels, demographics, and other covariates xj that are likely to influence outcome levels (see above). 
  • Within the constrained sampling population, use random sampling to select experimental units. For example, if the plan is to interview up to 600 mothers in the three villages above, these 600 mothers should be randomly sampled from all mothers in the three villages. 

In combination, these measures may facilitate generalizability later, for example by making it easier to combine convenience samples from different studies in a meta-analysis to construct population ATE estimates.

Cluster randomization and spillovers

So far, we have taken the unit of randomization i as given. The unit of randomization is often the same as the unit of observation: one individual, one household, or even one hospital or school, as long as data is collected at that same level. In some cases, however, the unit of randomization contains multiple units of observation. This is called cluster randomization. With cluster randomization, observational units are sorted into groups (clusters), and the randomization occurs at the cluster level rather than at the unit level.

A number of considerations related to the validity of the experiment may go into choosing a unit of randomization different from the unit of observation (see “implementation” below). Conceptually, the most important reason for cluster randomization is the potential for spillovers. Spillovers mean that the outcomes of untreated units are indirectly affected by the treatment given to others. Spillovers are often positive, but they may also be negative, for example if the beneficiaries of a job matching program fill all the available positions, putting untreated job seekers at a disadvantage. 

Formally, spillovers violate the stable unit treatment value assumption (SUTVA). SUTVA means that one unit’s treatment status has no effect on the outcomes of other units. A research design that does not appropriately account for spillovers yields biased ATE estimates. Consider an example where positive spillovers on food consumption from a cash benefit program arise because treated units use some of the cash they receive to increase the food consumption of others (e.g., by inviting them to meals or giving them gifts). There are a couple of ways in which spillovers could affect the ATE estimates, including:

  1. Unintended spillovers on the untreated control: If individuals in the control group are affected by the presence of the program, they no longer represent a good comparison. For example, if some of the cash given to treated households increases food consumption in the control group, we will underestimate the effect of the cash benefit.
  2. Missing spillover effects on the treated: Suppose the goal of the experiment is to estimate the treatment effect of a program in which all units are treated. When the full program is in effect, everyone receives their own cash benefit (and gives some of it away) but also receives gifts from others. An experiment would miss this second effect if the households that would, if treated, make gifts to the treatment group are not themselves treated.

If there is a high potential for spillovers at a given unit of randomization, the best solution can be to randomize treatment assignment at a higher level (i.e., use cluster randomization). As an example, spillovers may occur within a town, but not across towns, and so treatment can be clustered at the town level. In the cash benefit example, this could be the case if most households have their social network within their own geographic area. However, all else equal, clustered designs require a larger sample size to achieve the same level of statistical power (see the power calculations section). 

Cluster randomization methods

All methods used for unit randomization also apply to clusters: permutation randomization generates balance on the number of clusters in each treatment arm, and stratification can improve balance on other cluster-level characteristics. Note that with clusters of different sizes, it can be useful to also stratify on cluster size in order to have similarly-sized treatment groups. Cluster randomization does require adjustments in analysis (namely, clustering standard errors to account for within-cluster correlation) but from a randomization perspective is no more complicated than unit randomization.
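The point about stratifying on cluster size can be illustrated with a Python sketch (cluster names and household counts are hypothetical): clusters are split into above- and below-median size strata, and permutation randomization runs within each stratum, so each arm receives the same number of large and small clusters.

```python
import random

random.seed(99)  # fix the seed so the assignment is replicable

# Hypothetical clusters (e.g., towns) with their number of households
clusters = {"town_a": 120, "town_b": 80, "town_c": 300, "town_d": 40,
            "town_e": 250, "town_f": 60, "town_g": 180, "town_h": 90}

# Stratify on cluster size: above vs. below the median size
sizes = sorted(clusters.values())
median = (sizes[len(sizes) // 2 - 1] + sizes[len(sizes) // 2]) / 2

assignment = {}
for large in (True, False):
    stratum = [c for c, n in clusters.items() if (n > median) == large]
    labels = ["treatment"] * (len(stratum) // 2) + ["control"] * (len(stratum) // 2)
    random.shuffle(labels)
    assignment.update(zip(stratum, labels))

# Each arm gets two large and two small clusters, so arm sizes are similar
n_treated_hh = sum(clusters[c] for c, g in assignment.items() if g == "treatment")
print(n_treated_hh)
```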

Randomization designs to estimate spillover effects 

Under some assumptions, the randomization design can also help us measure spillover effects. This is done by randomizing at the cluster level and varying the “intensity” of treatment for the cluster, that is, the share of observational units in the cluster that receive the treatment, and then measuring treatment effects on both treated and untreated units within the cluster. For more information see Baird et al. (2014) or the discussion in 5.1 and 6.3 of Duflo et al. (2007).
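A two-stage design in the spirit of Baird et al. (2014) can be sketched as follows (a hypothetical Python illustration, with made-up cluster counts and intensity levels): stage 1 randomizes each cluster’s treatment intensity, and stage 2 randomizes units within each cluster at that intensity. Untreated units in partially treated clusters, compared with units in pure-control clusters, identify within-cluster spillovers.

```python
import random

random.seed(11)  # fix the seed so the assignment is replicable

# Hypothetical design: 6 clusters of 10 units each; cluster-level treatment
# intensities of 0%, 50%, or 100% (two clusters per intensity, by permutation)
clusters = {f"c{k}": [f"c{k}_u{i}" for i in range(10)] for k in range(6)}

intensities = [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
random.shuffle(intensities)  # stage 1: randomize intensity across clusters
cluster_intensity = dict(zip(clusters, intensities))

treated = set()
for cname, members in clusters.items():
    k = int(cluster_intensity[cname] * len(members))
    treated.update(random.sample(members, k))  # stage 2: randomize within cluster

# Untreated units in the 50% clusters identify within-cluster spillovers
# when compared against units in the pure-control (0%) clusters.
print(len(treated))  # 0 + 0 + 5 + 5 + 10 + 10 = 30 units treated
```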

Implementation

In this section we will deal with the issues above in reverse order: (1) choosing the unit of randomization, (2) choosing the sampling frame/selecting the sample, and (3) implementing the random assignment of units to treatment groups in practice while dealing with “misfits.” 

Choosing the unit of randomization

Normally, the most natural choice of randomization unit is the observational unit. The unit of observation may not always be the individual. For example, we may not be able to measure individual consumption, but only household consumption; we may only have data on hospital-level occurrence of complications in surgery, but no patient-level data, and so on. 

From a conceptual standpoint, the exception to this rule is spillovers (see above). With spillovers, the unit of randomization should be a cluster large enough to contain all spillovers. All units in the cluster must be treated in order to correctly estimate the full treatment effect (even if, say, not all units in the cluster are interviewed).7

Sometimes it is necessary to choose a unit of randomization other than the observational unit for other reasons besides spillovers.

Unit at which the treatment will be delivered: It is not always possible to assign the treatment at the observational-unit level. For example, if several households share water tanks, a study distributing water purification tablets can only randomize at the water tank level rather than at the household level. Note that the shared water tank creates a form of spillover of the treatment, even if from a conceptual standpoint there may not be any “true” spillovers (i.e., health outcomes for other households may not differ when only some households consume purified water).

Observer and experimental effects: Experimental effects may mean that there are “experimental spillovers” between units in the experiment. For example, the John Henry effect means that individuals in the control group react in some form to being in the experiment and in particular might emulate the treatment group. The difference with standard spillovers is that this is caused by individuals being in the experiment rather than being a property of the treatment itself.

The unit of randomization will affect the sample size needed to detect an effect. In general, moving up in level of randomization decreases the study’s effective sample size, meaning that more observational units are required to achieve a given level of power. For more information see the power calculations section and Choosing the Right Sample Size.
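The power cost of clustering is commonly summarized by the standard design effect, DEFF = 1 + (m - 1)ρ, where m is the number of units per cluster and ρ is the intracluster correlation; dividing the total number of observations by DEFF gives the effective sample size. A short Python sketch of this standard formula (the numbers are hypothetical):

```python
def effective_sample_size(n_clusters, m, rho):
    """Effective sample size under the standard design effect
    DEFF = 1 + (m - 1) * rho, with m units per cluster and
    intracluster correlation rho."""
    deff = 1 + (m - 1) * rho
    return n_clusters * m / deff

# Hypothetical example: 50 clusters of 20 units, intracluster correlation 0.05
n_eff = effective_sample_size(50, 20, 0.05)
print(round(n_eff, 1))  # 1,000 observations behave like ~512.8 independent ones
```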

The sampling frame

In order to create a random sample, we need a list of eligible units from which the sample will be drawn, the so-called sampling frame. 

Pre-existing lists

The ideal situation is one where the sampling frame is given by a pre-existing list, say, from the government or an NGO. If using one, be sure to understand how the list was created (to ensure that it is in fact representative of the larger target population) and when it was last updated. The definition of the observational or randomization unit, say “a business” or “a household,” is not always unambiguous. It is therefore important to clearly define the term and then follow up with a couple of households (or businesses, etc.) to check that the list matches the definition.

Sources of pre-existing lists include:

  • A list of respondents already created by the partner organization 
  • Administrative data (such as a list of hospital patients, students in local schools, etc.) 
  • Local government institutions such as the registry office, national agencies such as ministries, professional organizations, or agencies and institutions such as hospitals

Creating a sampling frame

Sometimes no list exists. This can be the case for populations such as undocumented immigrants or migrant workers, but also informal businesses or clients of a business or users of a specific service. Options for creating a sampling frame without a pre-existing list include: 

  • Following a standardized procedure, for example, the sampling frame could consist of every patient to visit the emergency room in a given time span, random digit dialing, or shoe leather sampling based on randomized distance from randomly chosen points8.
  • Conducting a community meeting to map all of the households or businesses in the area, etc. Note that this method may miss units and therefore may not be appropriate in all areas, for example in larger towns or when populations are mobile.
  • Door-to-door census listing: A census can be expensive and time-consuming but may be the only option. How long it will take will depend on how spread out the units are in the enumeration area, how much information is collected on each unit, and whether there are any administrative hurdles, e.g., permission from local leaders to work in the area. If doing a door-to-door listing exercise, keep the following points in mind:
    • Using the same survey team for the census and the actual data collection can make it easier for enumerators to find households later, but be sure to budget enough time to complete the listing exercise so that the survey start is not delayed. 
    • Collect sufficient contact information so that respondents can be found later. This includes high-quality phone numbers (e.g., the listing team can call the respondent on the spot to verify the number), names and potentially nicknames of the household head and spouse, etc. 
    • Collect all information needed to verify membership in the target population (e.g., eligibility for the tested program) as well as any variables needed for stratification. For example, a study that targets adult women will need to collect gender and age information on all household members and may also need to collect information on, say, household income in order to stratify on this variable.
    • It can be useful to draw a map of the area, with landmarks, and divide the area into zones. This can also help allocate households to surveyors for surveying.
    • Taking GPS readings during the household listing can help surveyors find respondents later but may result in unnecessary work to record readings of households that aren’t included in the final sample.

Multi-frame designs

Research teams may have access to multiple sources of eligible respondents (e.g., lists of customers from different mobile phone companies). If none of the lists is large enough on its own, one option is to pool the lists together to create one sampling frame. Two key advantages of this approach are 1) increasing the sample size (especially for target groups of interest), and 2) lowering the cost of sampling when particular frames are expensive to sample and can be partly replaced by frames that are less expensive to access (Lohr & Rao, 2006).

If pooling multiple frames, it is important to identify ex ante whether respondents appear on multiple lists. This is often done by including questions in the survey that identify all of the possible frames to which the respondent belongs (see guidance from the World Bank under the Guidelines on Sampling Design section for more information). Additionally, the correct calculation of sample weights for estimation and analysis can be complex and should be done with care (Wu, 2008). For a discussion of how to calculate weights in a multi-frame design, see Lohr & Rao (2006).9

Implementing random treatment assignment

Implementing random treatment assignment is easiest when the sample is known, i.e., there is a pre-existing list of experimental units. In this case, researchers typically perform (stratified) permutation randomization based on that list, using a software program such as Stata and a data file containing a unit ID, cluster ID, and stratification variables if applicable. This approach has the considerable benefit of verifiability and replicability, assuming certain steps (described below) are taken. Another option would be to perform a public lottery using an urn or by flipping a coin. Note that stratified randomization is still possible; one would simply assign treatment using an urn or a coin within each stratum. The advantage of this process is that it is very transparent; it may hence be preferable in instances where it is desirable or necessary to actually show participants that their treatment assignment is random. However, the main disadvantage is that it is not replicable. 

In some cases, the sample is not known at the time of random assignment, and a basic lottery for treatment assignment is the only option. For example, units could be assigned on arrival (such as when children enroll in school, patients arrive at the clinic, etc.), and the exact number to be enrolled may be unknown at the time of random assignment. Another example is random digit dialing in phone-based experiments, where it is unknown at the outset how many numbers called will actually be in service. The randomization might be built into the contact, consent, and enumeration protocol on arrival, for example using a coin flip or a draw from an urn, or using SurveyCTO’s built-in randomization engine.10 Note that in both cases stratified randomization is difficult as the final number of units per stratum is unknown at the time of random assignment.

Basic coding procedure for randomization

Regardless of method, the randomization procedure should be verifiable, replicable, and stable, and the randomization outcome saved in a secure file or folder away from other project files.

The basic procedure is always the same:

  1. Create a file that contains only one entry per randomization unit (e.g., one line per household, one line per cluster, etc.). This might mean creating a new file that temporarily drops all but one observational unit per cluster.
  2. Sort this file in a replicable and stable way. (Use stable sort in Stata, i.e., sort varlist, stable)
  3. Set the seed for the random number generator (in Stata, set seed). Make sure that the seed is:
    1. Preserved: Some operations (such as preserve/restore in Stata) erase the seed, and then any random number sequence following that is not determined by the seed anymore and therefore not replicable.
    2. Used only once across parallel operations: Every time the same seed is set, the exact same random number sequence is produced. If, for example, you are assigning daily N-sized batches of units to treatment arms and use the same seed for every batch, the units will be assigned in the same way every day. This could introduce unwanted patterns and imbalances.
  4. Randomly assign treatment groups to each randomization unit, then merge the random assignment back with the original file to obtain a list of all observational units with treatment assignments. 
  5. Save the list of observational units with treatment assignment in a secure location and program your routine so this list cannot be automatically overwritten.
  6. For any even slightly more complex randomization procedure, extensively test balance:
    1. In terms of sample size across treatment arms, within and across strata, to verify the correct handling of misfits (see below)
    2. In terms of covariates across treatment arms, to understand power and sample balance and make sure the stratification was done right (see also below)
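
The checks in step 6 can be run quickly in Stata. A minimal sketch, assuming the randomization file contains the variables treatment and stratum and a covariate such as baseline_income (variable names are illustrative):

```stata
* 6a: sample sizes by treatment arm, within and across strata
tabulate stratum treatment
tabulate treatment

* 6b: covariate balance across treatment arms, within strata
regress baseline_income i.treatment i.stratum
```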

Misfits

The above description of how to create balance using permutation randomization and stratification glossed over some important implementation details and in particular the problem of misfits. Misfits occur when the number of units in a given stratum is not a multiple of the number of treatments to assign (Bruhn & McKenzie 2011). In the simplest case of two groups--a single treatment and the control--and no strata, a misfit occurs when there is an odd number of units. In the two-group case, this can be resolved easily (by randomly determining the misfit’s status after other units have been assigned treatment/control), but it can become difficult to maintain balance within strata and globally as the numbers of treatments and strata, and therefore of misfits, grow.11

A simple example is with two treatment arms, to be assigned in ⅓ and ⅔ proportion, and three strata of size 10. Suppose first that units are assigned to achieve the best within-stratum balance, i.e., to preserve treatment allocation ratios within strata. The closest allocation to a 33.3% and 66.7% assignment within each stratum is 3 and 7 units, respectively. But for the total sample, this means that the assignment is 30% and 70%, so global balance is not as good as it could be.

             N           T1         T2
Stratum 1    10          3          7
Stratum 2    10          3          7
Stratum 3    10          3          7
All          30 (100%)   9 (30%)    21 (70%)

Alternatively, suppose misfits are assigned to achieve global balance, i.e., to preserve treatment allocation ratios globally.12 For at least one stratum, this results in poorer within-stratum balance:

             N           T1         T2
Stratum 1    10          3          7
Stratum 2    10          4          6
Stratum 3    10          3          7
All          30 (100%)   10 (33%)   20 (67%)

Note also that the solution from above (randomly determining the misfit’s status after other units have been assigned treatment/control) is only partially satisfactory. Suppose we randomize the assignment of the misfit in each stratum according to the underlying assignment strategy, i.e., return to the basic lottery for those units. This would mean using permutation sampling to assign 9 units in each stratum, 3 into T1 and 6 into T2, and then drawing the assignment of the 10th unit according to the ⅓ vs. ⅔ probabilities. With many strata, the final total allocation is likely to be close to balanced. However, with few strata, we may well end up with the first allocation above; or worse, we may get an allocation in which all three strata realize a 4:6 assignment, so that we are off the targeted assignment shares both within strata and globally.

It will almost never be the case that each stratum has a number of units that is exactly a multiple of the number of treatment arms, not least because this is at odds with proportionate stratified sampling, so any randomization procedure needs to deal with this issue. 
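
The number of misfits per stratum can be computed up front. A minimal sketch, assuming a stratum variable and treatment arms in equal ratios (here, 3 arms; adjust the modulus to the number of equal-sized assignment cells in your design):

```stata
* number of units per stratum that do not fit evenly into 3 arms
bysort stratum: gen n_misfits = mod(_N, 3)
tabulate stratum n_misfits
```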

Solutions/programming

Manual randomization

As written above, the basic procedure for randomization involves assigning a random order to slots and treatment arms to those slots. In the simplest example, with two treatment arms (one treatment and one control) assigned in equal proportions, the procedure in Stata is as follows:
 

clear all
* Set the version for upward compatibility 
* (e.g., Stata 15 users can use Stata 14) 
version 14.2 
use "randomizationdata.dta", clear
isid uniqueid // check for duplicates in the list of individuals
sort uniqueid, stable // sort on uniqueid for replicability
set seed 20200520 

/* generate a random (pseudo-random) variable consisting of 
draws from the uniform distribution */
generate random = runiform() 
bysort stratum: egen temp = rank(random), unique
bysort stratum: gen size=_N

* assign treatments to observations that fit evenly into treatment ratios:
gen treatment = temp>size/2+.5 

* randomly assign the misfit:
replace treatment = round(random) if temp==size/2+.5
 

As noted above, the procedure for assigning the misfits becomes increasingly tedious as the number of strata and treatments (and, hence, the potential number of misfits) grows, or as treatment allocations change. For example, in the above case there will be at most one misfit per stratum. If the treatment allocation changed--1/3 assigned to treatment and 2/3 to control--there could be two misfits per stratum. The first misfit can be randomly assigned a treatment condition, but the assignment for the second misfit (if there is one) will depend on that of the first to preserve balance. An extreme example of this, with six treatments and 72 strata, is given by Bruhn & McKenzie (2011). When there will be a large number of misfits, an alternative is to use the randtreat command described below.
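
As a sketch of how the manual approach extends to a ⅓ treatment / ⅔ control split, continuing from the variables stratum, random, temp (the within-stratum rank), and size created in the block above, one simple option is to assign each misfit rank by an independent ⅓ draw; note this preserves assignment probabilities but not exact counts:

```stata
* within each stratum, 3*floor(size/3) units fit evenly into a 1:2 split
gen n_fit = 3*floor(size/3)

* first third of the fitting ranks -> treatment, the rest -> control
gen treatment = temp <= n_fit/3

* misfits (ranks above n_fit): treatment with probability 1/3
* (runiform() continues the seeded sequence, so this stays replicable)
replace treatment = (runiform() < 1/3) if temp > n_fit
```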


Sampling (sample command)

In Stata, the sample command randomly selects units, without replacement, from the sampling frame. The default is that units are chosen with an equal probability of selection, but the command can accommodate stratification and different probabilities of selection. Clustered sampling and stratified sampling can be implemented using this command as follows:

  • Clustered sampling: The cluster is the sampling unit. The procedure involves separating the population into clusters/groups (e.g., villages or schools), randomly drawing a sample of clusters, and sampling units in the cluster (either all or a random sample of them). For example, in Stata, using  school_id as the cluster variable:
* treat each cluster as a single observation and drop duplicates 
duplicates drop school_id, force

sample x // x denotes % of clusters in the population to select
sort school_id

* merge sampled clusters back with original list, keeping sampled clusters only
merge 1:m school_id using "original_dataset", keep(match)
 
  • Stratified sampling can be proportionate or disproportionate:
    • Proportionate stratification: The share of each stratum in the sample matches its share in the population, ensuring that the sample is representative of the overall population and that small groups are not over- or underrepresented by chance. For example, if widowed households comprise 5% of the population, then they comprise 5% of the sample. In Stata, this can be done as follows:
sample x, by(stratum) 
/* stratum denotes the (categorical) stratifying variable (e.g., widow) 
and x denotes the percent to be sampled within each stratum. */

* to draw x rather than x% from each stratum, specify the count option
sample x, count by(stratum)
 
  • Disproportionate stratification: The strata sample is not proportionate to the strata population (e.g., widowed HHs comprise only 5% of the population but 20% of the sample). It is useful when you want to ensure sufficient power to detect heterogeneous treatment effects by stratum. In Stata, this can be done as follows:
sample x if stratum==1
sample y if stratum==2

/* x denotes the percent of stratum-1 observations to keep and 
y the percent of stratum-2 observations to keep; observations 
outside the stratum being sampled are left unchanged */
 

randtreat command

The user-written randtreat command (additional documentation) can perform random assignment of multiple treatments across strata, in equal or unequal ratios, and has several options for dealing with misfits. In particular, the user can decide whether to preserve balance within strata or globally (when the two are at odds), or can specify that misfits’ treatment status be set to missing and dealt with manually.
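
For example, a stratified version of the ⅓ vs. ⅔ design above could look as follows (option names as documented in Carril 2017; treat this as a sketch and check the installed help file):

```stata
* install once: ssc install randtreat
randtreat, generate(treatment) strata(stratum) ///
    unequal(1/3 2/3) misfits(global) setseed(20200520)

* verify the handling of misfits within and across strata
tabulate stratum treatment
```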

Balance tests and re-randomization

Balance tests

A balance test checks whether the randomization “worked” beyond just assigning the right number of units in each treatment arm, by formally testing for differences in observable characteristics between the treatment and control groups. 

Balance tests on covariates are often reported in the first table of an RCT paper. To test for differences, regression is generally preferred to t-tests, because it allows for the correction of standard errors (e.g., clustering, making them robust to heteroskedasticity, bootstrapping) and the inclusion of fixed effects (e.g., enumerator, strata, etc.). Balance test regressions should use the same specification as your final regression when possible. For example, if you stratified the randomization, you will include strata fixed effects in your main regression and should also include them when checking for balance on those covariates that are not used for stratification--this way, you are checking for balance within strata, not across strata.

Suppose for example we conducted stratified cluster randomization. In Stata, testing for balance would look like

reg covariate treatment i.stratum, vce(cluster cluster_id)
 

where the coefficient on treatment indicates whether there is within-strata balance on average (though determining whether there is balance within a given stratum requires either interacting the treatment and stratum variables or restricting the sample to the stratum/strata of interest).
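
Testing covariates one at a time invites multiple-comparison issues, so a common complement is an omnibus test that all covariates are jointly unrelated to treatment. A sketch, with illustrative covariate names:

```stata
* regress treatment status on all covariates and jointly test them
reg treatment cov1 cov2 cov3 i.stratum, vce(cluster cluster_id)
test cov1 cov2 cov3
```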

Balance tests may be most useful when there are reasons to doubt that the randomization was done correctly (see McKenzie 2017 or Altman 1985), and there is some discussion of how informative or useful they are when the research team itself carried out the randomization. Furthermore, there can be drawbacks to balance checks even if you do trust the randomization--conducting baseline surveys and other interactions with research teams can lead to Hawthorne effects, as discussed in Evans (2014) and Friedman & Gokul (2014).13 If baseline data will only be collected for baseline tables, not in order to stratify the sample, an alternative is to collect time-invariant characteristics in the endline (e.g., race, gender, etc.) and check those for balance ex post.

For a comparison of the relative merits of ex-ante stratification, pairwise matching, ex-post re-randomization to achieve balance, etc., see Bruhn & McKenzie (2009); an associated blog post goes into the mechanics of stratification for balance further.

Using the results of a balance test

Questions to consider:

  • How many differences are there (and are there more than you’d expect)? If you are testing for balance at the 5% level, you would expect to see statistically significant differences between the treatment and control groups in roughly 5% of your covariates.
  • What are the magnitudes of the differences? Are the differences economically/practically significant? 
  • In which variables are the imbalances? Consider in particular:
    • Covariates that may be correlated with treatment take-up
    • Covariates that may be correlated with attrition based on previous literature or ex-post observed attrition, which could lead to attrition that is differential by treatment status
    • Covariates that may be correlated with the main outcome variable: Imbalanced covariates are frequently used as controls in analysis, though some researchers include them as is while others recommend de-meaning them first (Imbens & Rubin 2015). See also Bruhn & McKenzie (2009) and Athey & Imbens (2017)
    • The main outcomes: If you find a pre-trial imbalance in a main outcome, you may want to change your analysis to a difference-in-difference approach or consider controlling for the baseline level of the outcome variable in your final regression.
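
The last point can be implemented with an ANCOVA-type specification that controls for the baseline level of the outcome. A sketch, assuming illustrative variable names y_endline and y_baseline and a stratified, clustered design:

```stata
reg y_endline treatment y_baseline i.stratum, vce(cluster cluster_id)
```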

Re-randomization

Due to random chance, balance may not be achieved on key variables with any given random sample draw. This introduces error into the comparison of treatment and control outcomes (see the equation above: when γj ≠ 0, i.e., there are imbalances on variables that are likely to be correlated with the main outcome). Many papers solve this problem by re-randomizing. One approach is to carry out the randomization procedure many times, select only the draws that were balanced, and then randomly select one of these draws for the study (Glennerster & Takavarasha 2013). However, with re-randomization, not every combination of treatment assignments is equally probable; as conventional significance tests assume that each combination is equally likely, this should be accounted for in analysis. 
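
A simplified "draw until balanced" variant of this procedure can be sketched in Stata as follows (an illustration only, assuming variables uniqueid and baseline_income; applications following the approach above would instead store many balanced draws and pick one at random):

```stata
set seed 20200520
local p = 0
while `p' < 0.10 {
    capture drop random treatment
    sort uniqueid, stable
    generate random = runiform()
    sort random
    generate treatment = (_n > _N/2)   // assign half to treatment

    * balance check on a key covariate; redraw if p-value below 0.10
    reg baseline_income treatment
    local p = 2*ttail(e(df_r), abs(_b[treatment]/_se[treatment]))
}
```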

Bruhn & McKenzie (2009) show that not adjusting for randomization method, including re-randomization, results in standard errors that are on average overly conservative (though in a nontrivial number of cases they tested this was not the case). They also recommend including the variables used to check balance as linear covariates in the regression, so that estimation of the treatment effect is conditional on the variables used for checking balance. A challenge with this approach is that controlling for the covariates used to check balance may still not perfectly account for how the assignment combination probabilities changed. This is problematic for calculation of exact p-values if using randomization inference, as the calculation requires knowing the probability with which all potential treatment/control assignments that were possible under the research design could have occurred (Athey & Imbens 2017). As a practical matter, Bruhn & McKenzie (2009) recommend making it clear in balance tables which covariates were targeted for balance, as overall balance will be overstated by only looking at covariates on which balance was achieved through re-randomization. 

An alternative approach is to consider, before implementing randomization, whether there are covariates on which imbalance would not be acceptable, and to stratify on them, so that balance on key covariates is achieved by construction (Athey & Imbens 2017). Moreover, Athey and Imbens make the point that re-randomization in practice turns into a form of stratification; for example, re-randomizing to achieve balance on gender becomes an indirect method of stratifying on gender. As with re-randomization, stratification means that not every combination of treatment assignments is equally probable. Unlike re-randomization, however, the researcher knows exactly how these probabilities have changed and can thus calculate exact p-values if desired.

Discussions on re-randomization include Bruhn & McKenzie (2009), Athey & Imbens (2017), and Glennerster & Takavarasha (2013). Theoretical papers include Morgan and Rubin (2012) and Banerjee, Snowberg, and Chassang (2017).

Last updated February 2022.

These resources are a collaborative effort. If you notice a bug or have a suggestion for additional content, please fill out this form

Acknowledgments

We thank Megan Lang for thoughtful suggestions and comments. Liz Cao copy-edited this resource. All errors are our own.

1.
Potential differences in the error variance between groups--for example when there are heterogeneous treatment effects--suggest the use of Eicker-Huber-White heteroskedasticity-consistent standard errors in the analysis, and this would affect the power calculations as well. 
2.
Again, this assumes homoscedasticity, that is, the error variance is the same in both groups and there are no heterogeneous treatment effects, but with heteroscedasticity, an equal split is in many cases still close to optimal, see e.g., Athey & Imbens (2016).
3.
The full argument against randomization involves priors on the effect sizes; Banerjee et al. (2020) make a formal counter-argument based on satisfying observers with divergent priors.
4.
Random sampling means that each sampling unit has a positive, i.e., non-zero, and known probability of being included in the sample.
5.
See How to Randomize for more information on phase-in and other designs.
6.
An exception is randomization designs to estimate spillover effects, where the share of observational units treated varies between clusters.
7.
More on heterogeneous treatment effects is also covered in EGAP’s corresponding methods guide.
8.
See the experimental design description in Blimpo & Dower (2019).
9.
 See the data analysis resource for more information on when and how to use weights in analysis.     
10.
The World Bank additionally has sample SurveyCTO code for taking random draws of beneficiaries.
11.
 See Bruhn & McKenzie (2011) for an example of the misfits problem with six treatment arms and 72 strata.
12.
In practice, this could be done by creating a new stratum of misfits, then randomly assigning treatments within it (Carril 2017).
13.
The Hawthorne effect describes a situation where individuals alter their behavior simply as a reaction to being observed.
    Additional Resources
    Assigning randomization
    1. J-PAL’s lecture on How to Randomize

    2. EGAP’s corresponding methods guide

    Implementation
    1. J-PAL's lecture on Choosing the Right Sample Size

    2. Stata commands:  randtreat (additional documentation) and sample 

    3. SurveyCTO’s built-in randomization engine

    4. World Bank DIME Wiki resource on SurveyCTO random draws of beneficiaries 

    Altman, Douglas G. 1985. “Comparability of Randomised Groups.” The Statistician 34 (1): 125. doi:10.2307/2987510.

    Angrist, Joshua D., and Jörn-Steffen Pischke. 2013. Mastering 'Metrics: The Path from Cause to Effect. Princeton University Press: Princeton, NJ.

    Ashraf, Nava, James Berry, and Jesse M Shapiro. 2010. “Can Higher Prices Stimulate Product Use? Evidence from a Field Experiment in Zambia.” American Economic Review 100 (5): 2383–2413. doi:10.1257/aer.100.5.2383.

    Athey, Susan and Guido Imbens. 2017. “The Econometrics of Randomized Experiments.” Handbook of Economic Field Experiments, 73–140. doi:10.1016/bs.hefe.2016.10.003.

    Baird, Sarah, J. Aislinn Bohren, Craig Mcintosh, and Berk Ozler. 2014. “Designing Experiments to Measure Spillover Effects.” SSRN Electronic Journal. doi:10.2139/ssrn.2505070.

    Banerjee, Abhijit V., Sylvain Chassang, Sergio Montero, and Erik Snowberg. 2020. “A Theory of Experimenters: Robustness, Randomization, and Balance.” American Economic Review 110 (4): 1206–30. doi:10.1257/aer.20171634.

    Beaman, Lori, Dean Karlan, Bram Thuysbaert, and Christopher Udry. 2013. “Profitability of Fertilizer: Experimental Evidence from Female Rice Farmers in Mali.” doi:10.3386/w18778.

    Biau, David Jean, Brigitte M. Jolles, and Raphaël Porcher. 2010. “P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers.” Clinical Orthopaedics and Related Research 468(3): 885–892. doi:10.1007/s11999-009-1164-4.

    Blimpo, Moussa. 2019. “Asymmetry in Civic Information: An Experiment on Tax Incidence among SMEs in Togo.” AEA Randomized Controlled Trials. doi:10.1257/rct.4394-1.0. Last accessed June 10, 2020.

    Bruhn, Miriam, Dean Karlan, and Antoinette Schoar. 2018. The Impact of Consulting Services on Small and Medium Enterprises: Evidence from a Randomized Trial in Mexico. Journal of Political Economy 126(2): 635-687. https://doi.org/10.1086/696154

    Bruhn, Miriam and David McKenzie. 2009. “In Pursuit of Balance: Randomization in Practice in Development Field Experiments.” American Economic Journal: Applied Economics 1 (4): 200–232. doi:10.1257/app.1.4.200.

    Bruhn, Miriam and David McKenzie. “Tools of the trade: Doing Stratified Randomization with Uneven Numbers in Some Strata." World Bank Development Impact Blog, November 6, 2011. Last accessed June 10, 2020. https://blogs.worldbank.org/impactevaluations/tools-of-the-trade-doing-stratified-randomization-with-uneven-numbers-in-some-strata

    Carril, Alvaro. 2017. “Dealing with Misfits in Random Treatment Assignment.” The Stata Journal: Promoting Communications on Statistics and Stata 17 (3): 652–67. doi:10.1177/1536867x1701700307.

    Deaton, Angus and Nancy Cartwright. 2018. "Understanding and Misunderstanding Randomized Controlled Trials." Social Science & Medicine 210: 2–21. https://doi.org/10.1016/j.socscimed.2017.12.005 [ungated version]

    Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2007. “Using Randomization in Development Economics Research: A Toolkit.” Handbook of Development Economics, 3895–3962. doi:10.1016/s1573-4471(07)04061-2.

    Evans, David. “The Hawthorne Effect: What Do We Really Learn from Watching Teachers (and Others)?” World Bank Development Impact (blog), February 17, 2014. https://blogs.worldbank.org/impactevaluations/hawthorne-effect-what-do-we-really-learn-watching-teachers-and-others. Last accessed June 10, 2020.

    Fisher, Ronald. 1935. The Design of Experiments. Oliver and Boyd: Edinburgh, UK.

    Friedman, Jed and Brinda Gokul. “Quantifying the Hawthorne Effect.” World Bank Development Impact (blog), October 16, 2014. http://blogs.worldbank.org/impactevaluations/quantifying-hawthorne-effect. Last accessed June 10, 2020.

    Glennerster, Rachel and Kudzai Takavarasha. 2013. Running Randomized Evaluations: A Practical Guide. Princeton University Press: Princeton, NJ.

    Heard, Kenya, Elizabeth O’Toole, Rohit Naimpally, and Lindsey Bressler. 2017. Real-World Challenges to Randomization and Their Solutions. J-PAL North America.

    Imbens, Guido W. and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge: Cambridge University Press, 2015. doi:10.1017/CBO9781139025751.

    Kasy, Maximilian. 2016. “Why Experimenters Might Not Always Want to Randomize, and What They Could Do Instead.” Political Analysis 24 (3): 324–38. doi:10.1093/pan/mpw012.

    Lohr, Sharon, and J. N. K. Rao. 2006. “Estimation in Multiple-Frame Surveys.” Journal of the American Statistical Association 101 (475): 1019–1030. www.jstor.org/stable/27590779

    McKenzie, David “Should we require balance t-tests of baseline observables in randomized experiments?” World Bank Development Impact (blog), June 26, 2017. https://blogs.worldbank.org/impactevaluations/should-we-require-balance-t-tests-baseline-observables-randomized-experiments. Last accessed June 10, 2020.

    Morgan, Kari Lock and Donald B. Rubin. 2012. “Rerandomization to Improve Covariate Balance in Experiments.” The Annals of Statistics 40 (2): 1263–82. doi:10.1214/12-aos1008.

    Neyman, Jerzy. 1923. “On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9.” Statistical Science 5 (4): 465–472. Trans. Dorota M. Dabrowska and Terence P. Speed.

    Neyman, Jerzy. "On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection." Journal of the Royal Statistical Society 97, no. 4 (1934): 558-625. Accessed June 15, 2020. doi:10.2307/2342192.

    Rubin, Donald B. 2005. "Causal Inference Using Potential Outcomes: Design, Modeling, Decisions." Journal of the American Statistical Association 100(469): 322-331. DOI 10.1198/016214504000001880

    Wu, Changbao. “Multiple-frame Sampling.” In Encyclopedia of Survey Research Methods, edited by Paul J. Lavrakas, 488–489. California: SAGE Publications, Inc., 2008. http://dx.doi.org/10.4135/9781412963947.