
Coding resources for randomized evaluations

Authors
Contributors
Chloe Lesieur
Summary

This page compiles links to resources on software, user-written commands for randomized evaluations, coding in teams, and writing reproducible code. User-written commands listed below include common checks for randomized evaluations and faster versions of frequently used commands in Stata and R.

Helpful user-written commands for RCTs

User-written programs and code can support checks of whether key steps in a randomized evaluation are running as planned. J-PAL and IPA (Innovations for Poverty Action) have written several commands in Stata and R that run helpful checks and comparisons.

  • Balance checks report whether variables are balanced across treatment and control groups; see more in Randomization.

    • orth_out – Stata command for exporting summary statistics or orthogonality (balance) tables. IPA wrote and maintains this command and provides a tutorial for this and related commands.

  • Back checks allow researchers to compare a mini-survey to a larger original survey in order to assess the consistency of survey answers and enumerators’ adherence to survey protocols.

    • bcstats – Stata program for analyzing back check data by comparing it to original survey data.

    • bcstatsR – R version of the Stata command bcstats.

  • High-frequency checks run on research data as it comes in and can monitor a range of additional indicators and potential red flags.

    • ipacheck – Stata package for running multiple high-frequency checks on research data.

  • Protecting personally identifiable information (PII) is an essential part of data collection and analysis.[1] The commands below may help scan for obvious PII; however, a scan that reports no PII does not guarantee that no variable or combination of variables conveys personally identifiable information. See more in Data de-identification.

    • PII_detection – application and Python script to identify, remove, and/or recode PII from field experiment data sets.

  • Comparing two datasets can help check whether data are being recorded and stored correctly. For example, researchers can check for discrepancies between the way two people entered the same paper-based data into electronic form (a best practice for paper surveys), or check whether data have changed between two versions of what should be the same dataset.

    • cfout – Stata user-written command to compare two datasets.

  • Commands and tips for larger datasets, including faster versions of common commands, may be useful for researchers working with large datasets in Stata.

    • gtools – Stata package of user-written commands that provide faster versions of collapse, reshape, xtile, tabstat, isid, egen, pctile, winsor, contract, levelsof, duplicates, and unique/distinct.

    • Efficiency suggestions for large datasets – Stata tips for extracting subsets of data and reducing run time for common operations, compiled on the NBER website.
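
To make the dataset comparison described in the list above more concrete, the following is a minimal Python sketch of a double-entry check. It is not the cfout implementation: the function name `compare_entries`, the `id` key, and the sample records are all hypothetical, and the sketch assumes that record IDs are unique and that both entries fit in memory as lists of dictionaries.

```python
def compare_entries(first, second, id_key="id"):
    """Return a list of (id, field, first_value, second_value) discrepancies.

    `first` and `second` are lists of dicts, one dict per record, standing in
    for two independent entries of the same paper surveys (hypothetical data).
    """
    first_by_id = {row[id_key]: row for row in first}
    second_by_id = {row[id_key]: row for row in second}

    discrepancies = []

    # Records that appear in only one of the two entries are flagged as a whole.
    for record_id in sorted(set(first_by_id) ^ set(second_by_id)):
        in_first = record_id in first_by_id
        discrepancies.append((
            record_id,
            id_key,
            record_id if in_first else None,
            None if in_first else record_id,
        ))

    # Records that appear in both entries are compared field by field.
    for record_id in sorted(set(first_by_id) & set(second_by_id)):
        row_a = first_by_id[record_id]
        row_b = second_by_id[record_id]
        for field in sorted(set(row_a) | set(row_b)):
            if field == id_key:
                continue
            value_a, value_b = row_a.get(field), row_b.get(field)
            if value_a != value_b:
                discrepancies.append((record_id, field, value_a, value_b))

    return discrepancies


# Hypothetical example: the same three paper surveys keyed in twice.
entry_a = [
    {"id": 1, "age": 34, "village": "A"},
    {"id": 2, "age": 27, "village": "B"},
    {"id": 3, "age": 51, "village": "C"},
]
entry_b = [
    {"id": 1, "age": 34, "village": "A"},
    {"id": 2, "age": 72, "village": "B"},  # digits transposed in the second entry
    {"id": 4, "age": 45, "village": "D"},  # ID mistyped in the second entry
]

for discrepancy in compare_entries(entry_a, entry_b):
    print(discrepancy)
```

Each reported tuple points to a record and field that would need to be reconciled against the paper form. A real workflow would also need to handle duplicate IDs and differences in formatting (for example, trailing spaces or letter case), which this sketch ignores.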

Coding with teams working in the social sciences

Because randomized evaluations may involve lengthy coding projects and multiple research staff, it is essential to have clear internal guidelines for how to code. The following guidelines and tools may assist readers with coding questions specific to the social sciences.

Guidance for reproducible coding

Reproducibility is an important consideration in coding for randomized evaluations. The following resources offer guidelines and tools for reproducibility.

Writing randomization code

Conducting randomization correctly is essential in a randomized evaluation. The following resources provide guidance for writing randomization code and examples. 

  • J-PAL's Randomization resource discusses how to implement random sampling and random assignment, with sample code.

  • Writing randomization code in Stata: A guide uses data and an annotated Stata do-file to illustrate, step by step, how to conduct simple randomization using Stata.

  • This user-written guide to writing randomization code in R, by Jorge Cimentada, translates the Stata guide above into R.

  • The Stata command randtreat, created by Alvaro Carril, performs random treatment assignment with different numbers of treatments and uneven treatment fractions. It also provides methods to deal with “misfits,” the leftover observations that arise when observations cannot be divided evenly among treatment groups.
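
Some of the ideas in the resources above can be illustrated with a short sketch, written here in Python rather than Stata or R. It is not the randtreat algorithm, nor code from any of the guides listed: the function name `assign_treatment`, the seed value, and the arm labels are hypothetical. It shows three habits that randomization code commonly follows: sorting units by a unique ID before drawing random numbers, fixing the seed, and deciding explicitly what happens to the "misfits" left over when units do not divide evenly among arms. Here misfits are sent to randomly chosen arms, which is only one of several possible approaches.

```python
import random


def assign_treatment(unit_ids, arms=("control", "treatment"), seed=20210301):
    """Randomly assign units to arms in (as near as possible) equal numbers.

    Returns a dict mapping each unit ID to an arm. Hypothetical helper,
    for illustration only.
    """
    # Sort first so the result does not depend on the order of the input.
    ids = sorted(unit_ids)
    if len(ids) != len(set(ids)):
        raise ValueError("unit IDs must be unique")

    # A private, seeded generator makes the draw reproducible.
    rng = random.Random(seed)
    rng.shuffle(ids)

    per_arm, n_misfits = divmod(len(ids), len(arms))

    # Every arm receives `per_arm` slots ...
    slots = [arm for arm in arms for _ in range(per_arm)]
    # ... and each leftover unit ("misfit") goes to a different random arm.
    slots += rng.sample(list(arms), n_misfits)

    # The IDs are in random order, so pairing them with the slots is random.
    return dict(zip(ids, slots))


# Hypothetical example: 7 households, 2 arms, so 1 misfit.
households = [f"HH{i:03d}" for i in range(1, 8)]
assignment = assign_treatment(households)

counts = {arm: list(assignment.values()).count(arm)
          for arm in ("control", "treatment")}
print(counts)

# Re-running with the same seed reproduces the assignment,
# even if the input arrives in a different order.
assert assign_treatment(list(reversed(households))) == assignment
```

With seven units and two arms, one arm ends up with four units and the other with three. Reproducibility across software versions is not guaranteed, so recording the Python version alongside the seed is a sensible precaution.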

Last updated March 2021.

These resources are a collaborative effort. If you notice a bug or have a suggestion for additional content, please fill out this form.

Acknowledgments

Thanks to Sam Ayers, Rose Burnam, Jack Cavanagh, Aileen Devlin, Laura Feeney, Louise Geraghty, Mike Gibson, Sabhya Gupta, Sarah Kopper, Chloe Lesieur, and Evan Williams for their suggestions and advice. This work was made possible in part by support from Arnold Ventures. Any errors are our own.

[1] Data security procedures for researchers provides more information on protecting PII.

Additional Resources
Self-guided learning in Stata
  1. IPA's Stata Trainings

    IPA offers four levels of self-guided modules for learning Stata.

  2. Online Stata Tutorial at DSS | Professor Oscar Torres-Reyna, Princeton

    A series of exercises and guides to working in Stata.

  3. Short introduction to Stata | Professor Germán Rodriguez

    A tutorial for new users with an emphasis on data management and graphics.

  4. Stata cheat sheets | Stata.com

    Data scientists Tim Essam and Laura Hughes have created "cheat sheets" on using Stata for data science tasks and analysis. These may be of interest to both novice and advanced Stata users.

  5. Stata resources from UCLA | The UCLA Institute for Digital Research and Education

    These resources are organized by topic. The search functionality allows for looking up resources on specific commands.

  6. Statalist.org

    The official forum for Stata help and questions. Searching for a question often leads to solutions that others have used successfully for similar issues.

Self-guided learning in R
  1. Base R Cheat Sheet | RStudio

    A cheat sheet by RStudio that provides an overview of basic R commands.

  2. Data Analysis for Social Scientists | MIT Economics MicroMasters program

    This course, part of J-PAL and MIT's MicroMasters Program in Data, Economics, and Development Policy, uses R to discuss methods for harnessing data to answer questions of economic and policy interest.

  3. fastR | GitHub user Matloff

    A good introduction to many different concepts, including data types, graphics, regressions, and text processing.

  4. GIS in R | Nick Eubank

    An introduction to working with geographic data in R, with lessons on data types, combining geographic data, plotting, and more.

  5. Randomization Inference (RI) | The Comprehensive R Archive Network (direct download)

    An R package for performing randomization-based inference for experiments.

  6. R-bloggers

    A website compiling the blogs of data analysts who share the work they do with R, including examples of data analysis and visualization.

  7. R for data science | Hadley Wickham and Garrett Grolemund

    A comprehensive guide to cleaning, analyzing, and managing data in R for beginners and advanced learners.

  8. R for Statistics 571 | Bret Larget

    Bret Larget teaches the basics of R and some statistical applications of R, including statistical tests and data visualization.

  9. R resources from UCLA | UCLA Institute for Digital Research and Education

    The UCLA Institute for Digital Research and Education provides a library of somewhat advanced resources and tutorials for R.

  10. Short Introduction to R | Germán Rodriguez

    A short introduction to R for new users, with an emphasis on fitting linear and generalized linear models.

  11. The Tidy Tuesday repository | Thomas Mock

    A repository updated weekly with new datasets for practicing data cleaning and visualization.
