Coding resources for randomized evaluations
Summary
This page compiles links to resources on software, user-written commands for randomized evaluations, coding in teams, and writing reproducible code. User-written commands listed below include common checks for randomized evaluations and faster versions of frequently used commands in Stata and R.
Helpful user-written commands for RCTs
User-written programs and code can support checks of whether key steps in a randomized evaluation are running as planned. J-PAL and IPA (Innovations for Poverty Action) have written several commands in Stata and R that run helpful checks and comparisons.
-
Balance checks report whether variables are balanced across treatment and control groups; see more in Randomization.
-
Back checks allow researchers to compare a mini-survey to a larger original survey in order to assess the consistency of survey answers and enumerators’ adherence to survey protocols.
-
High-frequency checks of incoming research data can monitor a number of additional types of indicators and potential red flags.
-
ipacheck – Stata package for running multiple high-frequency checks on research data.
-
Protecting personally identifiable information (PII) is an essential part of data collection and analysis. The commands below may be helpful for scanning for obvious PII; however, a scan that reports no PII does not guarantee that the data are free of it, since individual variables or combinations of variables may still identify respondents. See more in Data de-identification.
-
PII_detection – application and Python script to identify, remove, and/or recode PII from field experiment data sets.
-
stata_PII_scan – Stata program to identify variables in .dta files that are potentially identifying.
-
Comparing two datasets can help check whether data are being recorded and stored correctly. For example, researchers can check for discrepancies between the ways two people entered the same paper-based data into electronic form (a best practice for paper surveys), or check whether data have changed between two versions of what should be the same dataset.
-
cfout – Stata user-written command to compare two datasets.
-
Commands and tips for larger datasets, including faster versions of common commands, may be useful for researchers working with large datasets in Stata.
-
gtools – Stata package of user-written commands for faster versions of collapse, reshape, xtile, tabstat, isid, egen, pctile, winsor, contract, levelsof, duplicates, and unique/distinct.
-
Efficiency suggestions for large datasets – Stata tips for extracting subsets of data and reducing run time for common operations, compiled on the NBER website. See also this guide on using Stata frames, which let you hold multiple datasets in memory at the same time and switch or link data between them.
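To make the dataset comparison described in the list above more concrete, here is a minimal sketch in Python of the underlying idea: match records from two entries of the same data on an ID and report every field where the values differ. This is not how cfout is implemented. The function name `compare_entries`, the variable names, and the sample records are all hypothetical and serve only as illustration.

```python
def compare_entries(first, second, id_key):
    """Compare two lists of records (dicts) that should contain the same data.

    Returns a list of (id, field, value_in_first, value_in_second) tuples.
    Records found in only one list are reported with field set to None.
    """
    first_ids = {row[id_key] for row in first}
    second_by_id = {row[id_key]: row for row in second}
    discrepancies = []

    for row in first:
        other = second_by_id.get(row[id_key])
        if other is None:
            # The record exists in the first entry only.
            discrepancies.append((row[id_key], None, "present", "missing"))
            continue
        for field, value in row.items():
            if field != id_key and other.get(field) != value:
                discrepancies.append((row[id_key], field, value, other.get(field)))

    for row in second:
        if row[id_key] not in first_ids:
            # The record exists in the second entry only.
            discrepancies.append((row[id_key], None, "missing", "present"))

    return discrepancies


# Hypothetical double-entry data: two people typed in the same paper surveys.
entry_a = [
    {"hhid": 1, "age": 34, "village": "A"},
    {"hhid": 2, "age": 51, "village": "B"},
    {"hhid": 3, "age": 27, "village": "A"},
]
entry_b = [
    {"hhid": 1, "age": 34, "village": "A"},
    {"hhid": 2, "age": 15, "village": "B"},
    {"hhid": 4, "age": 40, "village": "C"},
]

discrepancies = compare_entries(entry_a, entry_b, "hhid")
for item in discrepancies:
    print(item)
```

In this made-up example, the sketch flags the transposed age for household 2 and the two households that appear in only one of the entries. Each flagged item would then be resolved against the paper survey.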
Coding with teams working in the social sciences
Because randomized evaluations may involve lengthy coding projects and multiple research staff, it is essential to have clear internal guidelines for how to code. The following guidelines and tools may assist readers with coding questions specific to the social sciences.
-
J-PAL's Data cleaning and management resource and IPA’s data cleaning guide.
-
IPA's resource, “Reproducible research: Best practices for data and code management,” which offers specific guidelines for coding in Stata.
-
Matthew Gentzkow and Jesse Shapiro's guide, “Code and data for the social sciences: A practitioner’s guide,” outlines their best practices for data management and coding with teams, addressing topics like automation, organization, and code style.
-
Software Carpentries’ lesson on version control with Git and the MIT Software Carpentries lesson on GitHub Desktop. See also DIME’s Getting Started with GitHub and Udacity’s course on Version Control with Git. GitHub is a useful tool for managing different versions of code and reviewing changes when working in a team. GitHub Actions can also be used to automate repetitive coding tasks.
-
Gentzkow and Shapiro’s GSLab GitHub shares a set of user-written commands that the lab has created specifically for research staff in economics, such as tablefill, which can flexibly automate table output.
-
Ray Kluender and Ben Marx's slides on general principles and hands-on tips for working with data in Stata.
-
Coding resources from J-PAL (GitHub) and IPA (GitHub) share commands written for social science research teams.
Guidance for reproducible coding
Reproducibility is an important consideration in coding for randomized evaluations. The following resources offer guidelines and tools for reproducibility.
-
J-PAL's Data cleaning and management resource provides coding best practices that address file organization, code organization, commenting, and version control.
-
J-PAL's Randomization resource includes detailed advice for making randomization code reproducible.
-
IPA’s “Reproducible research: Best practices for data and code management” shares motivation and guidelines for reproducible coding.
-
The BITSS Manual of best practices in transparent social science research includes sections on Power Analysis, Practical Considerations, Pre-Analysis Plans, Workflow, Stata-specific Suggestions, Data Publication and Registration, and more.
-
R Markdown is a document format designed specifically for reproducibility and readability.
-
AEA’s ReadMe file template for social science replication packages.
-
DIME’s ietoolkit provides reproducible Stata commands for use across the impact evaluation lifecycle, including making balance tables and creating readme files.
Writing randomization code
Conducting randomization correctly is essential in a randomized evaluation. The following resources provide guidance for writing randomization code and examples.
-
J-PAL's Randomization resource discusses how to implement random sampling and random assignment, with sample code.
-
Writing randomization code in Stata: A guide uses data and an annotated Stata do-file to illustrate, step by step, how to conduct simple randomization using Stata.
-
This user-written guide to writing randomization code in R by Jorge Cimentada is a translation of the guide from Stata into R.
-
The randtreat command for Stata, created by Alvaro Carril, performs random treatment assignment with different numbers of treatments and uneven treatment fractions. It also provides methods to deal with “misfits,” the leftover observations that arise when the observations in a group cannot be divided evenly among the treatments.
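The ideas in this section (a fixed seed, a stable sort order, assignment within strata, and explicit handling of misfits) can be sketched in a few lines of Python. This is an illustrative sketch only. It is not the algorithm used by randtreat or by the guides above, and it handles equal treatment fractions only. The function name `assign_treatment`, the unit IDs, the strata, and the seed value are hypothetical.

```python
import random
from collections import defaultdict


def assign_treatment(units, arms, seed):
    """Randomly assign units to arms within strata, in equal fractions.

    units: list of (unit_id, stratum) pairs with unique, sortable unit_ids.
    arms:  list of arm labels.
    seed:  integer seed, so that rerunning the code reproduces the assignment.
    Returns a dict mapping unit_id to arm.
    """
    rng = random.Random(seed)  # private generator; no dependence on global state

    # Sort on the unique ID first so that the result does not depend on the
    # order in which the input happened to be stored.
    by_stratum = defaultdict(list)
    for unit_id, stratum in sorted(units):
        by_stratum[stratum].append(unit_id)

    assignment = {}
    for stratum in sorted(by_stratum):
        ids = by_stratum[stratum]
        rng.shuffle(ids)

        # Units that fill complete blocks of len(arms) are dealt out in rotation.
        n_full = len(ids) - len(ids) % len(arms)
        for i, unit_id in enumerate(ids[:n_full]):
            assignment[unit_id] = arms[i % len(arms)]

        # Misfits: the leftover units each receive a different, randomly drawn arm.
        misfits = ids[n_full:]
        for unit_id, arm in zip(misfits, rng.sample(arms, k=len(misfits))):
            assignment[unit_id] = arm

    return assignment


# Hypothetical sample: seven households in two strata.
units = [
    (101, "north"), (102, "north"), (103, "north"), (104, "north"),
    (201, "south"), (202, "south"), (203, "south"),
]
arms = ["control", "treatment"]

assignment = assign_treatment(units, arms, seed=20230501)
for unit_id in sorted(assignment):
    print(unit_id, assignment[unit_id])
```

Because the generator is seeded and the units are sorted before shuffling, calling the function again with the same inputs returns the same assignment. The "south" stratum has three units, so one of them is a misfit whose arm is drawn at random rather than fixed by the rotation.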
A note on learning and code reuse
While the resources, commands, and guides listed above and below cover a wide array of material, no collection of resources can be fully comprehensive. If you are working on code for a randomized evaluation and cannot find a written resource that covers your topic, try searching existing repositories of replication data and code from similar projects. Replication code is often cutting-edge and can provide an idea of how to tackle problems that haven’t filtered into written resources yet. Some repositories that contain replication code from randomized evaluations include:
-
Harvard Dataverse, including:
-
Dataverses of research organizations, like J-PAL, IPA, and IFPRI
-
Dataverses of social science journals, like the QJE and the American Political Science Review
-
OpenICPSR, including:
-
Yale’s ISPS Data Archive
Last updated May 2023.
These resources are a collaborative effort. If you notice a bug or have a suggestion for additional content, please fill out this form.
Thanks to Sam Ayers, Rose Burnam, Aileen Devlin, Laura Feeney, Louise Geraghty, Mike Gibson, Sarah Kopper, Chloe Lesieur, and Evan Williams for their suggestions and advice. This work was made possible in part by support from Arnold Ventures. Any errors are our own.
Additional Resources
-
IPA's Stata Trainings | IPA offers four levels of self-guided modules to learn Stata.
-
Online Stata Tutorial at DSS | A series of exercises and guides to working in Stata, created by Professor Oscar Torres-Reyna at Princeton.
-
Short introduction to Stata | Professor Germán Rodriguez
A tutorial for new users with an emphasis on data management and graphics.
-
Stata cheat sheets | Stata.com
Data scientists Tim Essam and Laura Hughes have created "cheat sheets" on using Stata for data science tasks and analysis. These may be of interest to both novice and advanced Stata users.
-
Stata resources from UCLA | The UCLA Institute for Digital Research and Education
These resources are organized by topic. The search functionality allows for looking up resources on specific commands.
-
Statalist.org | The official forum for Stata questions and answers. Searching for a question can often lead to solutions that others have used successfully for similar issues.
-
Base R Cheat Sheet | RStudio
Cheat sheets by RStudio that provide an overview of basic R commands.
-
Data Analysis for Social Scientists | MIT Economics MicroMasters program
This course, part of J-PAL and MIT's MicroMasters Program in Data, Economics, and Development Policy, uses R to discuss methods for harnessing data to answer questions of economic and policy interest.
-
fastR | GitHub user Matloff
A good introduction to many different concepts, including data types, graphics, regressions, and text processing.
-
GIS in R | Nick Eubank
An introduction to working with geographic data in R, with lessons on data types, combining geographic data, plotting, and more.
-
Methods Guides | EGAP
EGAP’s methods guides provide R code for implementing the methods introduced, including advanced randomization techniques, sampling, and meta-analysis.
-
Randomization Inference with ri2 | Alexander Coppock
An R package for performing randomization-based inference for experiments.
-
A website compiling the blogs of data analysts who share the work they do with R, including examples of data analysis and visualization.
-
R for data science | Hadley Wickham and Garrett Grolemund
A comprehensive guide to cleaning, analysing and managing data in R for beginners and advanced learners.
-
R for Statistics 571 | Bret Larget
Bret Larget teaches the basics of R and some statistical applications of R, including statistical tests and data visualization.
-
R resources from UCLA | UCLA Institute for Digital Research and Education
The UCLA Institute for Digital Research and Education provides a library of somewhat advanced resources and tutorials for R.
-
Short Introduction to R | Germán Rodriguez
A short introduction to R for new users, with an emphasis on fitting linear and generalized linear models.
-
The Tidy Tuesday repository | Thomas Mock
A repository that updates weekly with new datasets to practice data cleaning and visualization with.