Data visualization

Summary

Data visualization can be helpful at many stages of the research process, from data reporting to analysis and publication. Relative to regression tables, cross-tabs, and summary statistics, data visualizations are often easier to interpret, more informative, and more accessible to a wider range of audiences. This resource discusses key considerations for creating effective data visualizations and provides guidance for making design choices. We present some common graphic options and include further resources to explore this topic.

Data visualization principles

1. Plan for updates: Automate everything

"A rule of research is that you will end up running every step more times than you think. And the costs of repeated manual steps quickly accumulate beyond the costs of investing once in a reusable tool."
(Gentzkow and Shapiro, 2014) 

Data visualization workflow is important for efficiency and portability

  • Automate everything that can be automated.
  • Design your workflow so that when you add new data or tweak a regression you don’t need to manually recreate every figure and table.
  • Minimize copying and pasting figures and tables from one program into another (e.g., copying a graph from Stata output into Microsoft Word makes updating the graph impossible to automate, and often results in a lower-quality figure as well).
  • General workflow: Code > output > formatted table, graph, or map

Automating the workflow that generates and updates tables and figures and compiles them into a finished document or presentation saves time and improves reproducibility. A World Bank blog post (listed under Coding resources below) discusses the importance of automation and proposes workflows for coding tables in Stata.
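
As a minimal sketch of the "Code > output > formatted graph" workflow, the do-file fragment below redraws a figure and overwrites the exported file each time it runs, so the figure never has to be copied by hand. The dataset (Stata's bundled auto data), the variables, and the file name price_mpg.png are illustrative assumptions, not part of the original resource.

```stata
* Illustrative example only: dataset, variables, and file name are assumptions
sysuse auto, clear

* Draw the figure entirely from code
twoway (scatter price mpg) (lfit price mpg), ///
    ytitle("Price (USD)") xtitle("Mileage (mpg)") ///
    legend(off)

* Overwrite the exported file every time the do-file runs
graph export "price_mpg.png", width(2000) replace
```

A document that links to the exported file, rather than containing a pasted copy, picks up the new version whenever the code is rerun.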

2. Present interpretable information 

All of the information in your data visualization should be human-interpretable. Regardless of audience, there are some things required to make tables and figures interpretable, such as picking which parts of raw software outputs are meaningful, and converting coefficients to quantities of interest if needed. However, how you do this can depend on your audience. Consider your audience (journal readers, a presentation audience, external partners) and how they will be encountering and digesting the information you present. You should always include context to help your audience interpret information, whether this is spoken as part of your presentation, or written as titles and captions to your table or graph. 

  • Main principle: Present only human-interpretable information
  • Most tables and figures should be self-contained and self-explanatory. Experienced readers often look at tables and figures first. 
  • Watch out for common hard-to-interpret numbers that appear in tables:
      • Logit coefficients: We don’t think in log-odds; transform these to predicted probabilities and display graphically.
      • Goodness-of-fit measures often don’t mean much even to an expert reader.
  • Think about both numbers and chart graphics that might add clutter and be hard to interpret (see section 4).
  • Avoid jargon and technical abbreviations. 
  • Use color and transparency to help make your point: our eyes easily interpret differences in color and transparency (see section 6 for more on choosing colors).
  • Include interpretive text such as figure captions and clear, descriptive titles, to guide your audience.
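
One way to turn logit coefficients into predicted probabilities and display them graphically is sketched below with Stata's margins and marginsplot commands. The dataset and model (whether a car is foreign as a function of mileage, from the bundled auto data) are hypothetical stand-ins for your own outcome and covariates.

```stata
* Illustrative example only: dataset and model are assumptions
sysuse auto, clear

* Coefficients from this model are in log-odds
logit foreign mpg

* Predicted probabilities at chosen values of mpg
margins, at(mpg = (15(5)40))

* Plot the predicted probabilities with their confidence band
marginsplot, recast(line) recastci(rarea) ///
    ytitle("Predicted probability of foreign make") ///
    xtitle("Mileage (mpg)") title("")
```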

3. Present data responsibly by providing scale and context

The numerical and visual context that data is presented in can change how it is interpreted, and researchers sharing their own visualizations of data should avoid inadvertently distorting the reader or viewer’s perception of the data. 

The labeling of the axes can change how we see the data presented in a chart. If the axis labels do not start at zero, consider whether this distorts the relative size of the differences or trends depicted. However, zero is not always a meaningful value, so judgment is needed here.
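
The sketch below compares Stata's default axis range with one forced to include zero, so you can judge which gives the more honest picture. It uses the bundled uslifeexp dataset as an illustrative assumption. For life expectancy, zero is arguably not a meaningful value, which is exactly the kind of judgment call described above.

```stata
* Illustrative example only: dataset and variables are assumptions
sysuse uslifeexp, clear

* Default: the y-axis range follows the data
twoway line le year, name(default_axis, replace)

* Alternative: force the y-axis to start at zero
twoway line le year, yscale(range(0 80)) ylabel(0(20)80) ///
    name(zero_axis, replace)
```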

It is also important to present enough context. Summary statistics alone may not capture the differences between datasets that look very different when plotted.

Example: How can this graph be improved?

This image shows an example of a messy bar graph of lifetime earnings.
This image shows a clean example of a bar graph of lifetime earnings.
Source: Schwabish (2014).

The difference between the bars in the first graph is visually misrepresented because the y-axis does not start at zero. There are also other design considerations that make the graph confusing: the combination of different colors and different patterns in each bar is unnecessary and distracting, and the legend clutters the graphic and could easily be replaced by clear axis labels. The second graph from Schwabish (2014) revises the original colorful bar chart to be clearer and easier to interpret correctly. 

4. Eliminate junk 

A clear and high-impact data visualization conveys exactly the information needed, without distracting extra clutter that can make interpretation difficult. When evaluating a graphic to eliminate junk, consider the following principles:

  • Maximize the data-to-ink ratio by using as little ink as possible to show your data
  • Remove non-data ink, e.g., extra gridlines
  • Remove redundant data
  • Remove indicators you don’t need
  • For summary statistics and regression coefficients, display only as many decimal places as necessary or relevant given the scale of your outcome variable. For example, income in USD should almost never include any decimal places at all, nor should percentages; years of education seldom requires more than one decimal place.
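
Display formats are one way to control decimal places in Stata output. The sketch below uses the bundled auto dataset as an illustrative assumption. The formats shown (no decimals for prices, one decimal for mileage) are example choices, not recommendations from the original resource.

```stata
* Illustrative example only: dataset and formats are assumptions
sysuse auto, clear

* Prices in whole dollars with thousands separators; mileage to one decimal
format price %9.0fc
format mpg %4.1f

* The format option makes the output use each variable's display format
summarize price mpg, format
tabstat price mpg, statistics(mean sd) format
```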

Example: Revising a graphic 
Schwabish (2014) includes several examples of how to revise graphics to create more effective data visualizations, including the example below on how to transform a 3D chart. In the original version on the left, the chart uses a 3D format to show two-dimensional data, the bar colors are not easily distinguishable, and the legend is small and difficult to read. The revised version makes design choices that present the data more clearly and readably.

This image shows how to shift a bar graph from three dimensions to two dimensions.

5. Represent uncertainty with care

All estimates come with uncertainty, but it can be difficult to depict this in graphs and figures. Jessica Hullman and Matthew Kay discuss what goes into visualizing uncertainty in the first post of their series on the topic (listed in the references below). When choosing how to represent uncertainty in a visualization, think about which statistic you will use. For example, confidence intervals can be clearer and easier to interpret on a graph than p-values.

In the figure below, the confidence intervals clearly depict the uncertainty but do not distract from the point estimates. 

This image shows a clean way to display uncertainty in a graph.
Source: Finkelstein and Notowidigdo, 2019

Considerations when graphing uncertainty: 

  • You DO want to represent uncertainty
  • Represent uncertainty without creating visual clutter
  • Represent uncertainty in a way that will make sense to your audience (consider their statistical background and familiarity with the concepts)
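
A minimal sketch of plotting point estimates with confidence intervals follows. It assumes the bundled auto dataset and approximates the 95% intervals as the mean plus or minus 1.96 standard errors. Both are illustrative assumptions; use the interval that matches your own estimator.

```stata
* Illustrative example only: dataset and normal-approximation CI are assumptions
sysuse auto, clear

* Group means and standard errors of the mean
collapse (mean) mean_price = price (semean) se_price = price, by(foreign)

* Approximate 95% confidence bounds
gen lb = mean_price - 1.96 * se_price
gen ub = mean_price + 1.96 * se_price

* Thin gray caps for the intervals, solid markers for the point estimates
twoway (rcap lb ub foreign, lcolor(gs8)) ///
       (scatter mean_price foreign, mcolor(black)), ///
       xlabel(0 "Domestic" 1 "Foreign") xscale(range(-0.5 1.5)) ///
       ytitle("Mean price (USD)") xtitle("") legend(off) ///
       note("Bars show approximate 95% confidence intervals (mean +/- 1.96 SE).")
```

Drawing the intervals in a lighter color than the markers keeps the uncertainty visible without drawing attention away from the point estimates.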

6. Use color thoughtfully 

The software you use for data analysis and coding your data visualizations likely has preset default colors and layouts for graphs and charts. For example, standard Stata graph background colors add junk to the page, as mentioned in section 4. Changing default line colors or backgrounds with your audience and intended publication medium in mind can result in a cleaner, more polished graphic and avoid having to redo the color scheme to meet journal or audience requirements later. 

  • Use sequential or diverging color schemes to show increasing or decreasing values or levels. 
  • For qualitative data visualizations, use a color palette designed for qualitative data.  
  • Viewers may be color blind. Use color blind-friendly palettes, which are sometimes required for accessibility. (For example, U.S. government agencies may require Section 508 compliance for colors.)
  • Journals may require grayscale. Check that your colors will translate well to grayscale. 

There are many tools for picking color palettes, but colorbrewer2.org is a good starting point. Its palettes are designed for working with maps, but are applicable to any graphic. Select options for color blind-friendly colors, as well as sequential, diverging, or qualitative color palettes, and output exact colors in HEX, RGB, or CMYK format to be used directly in code. 
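
As a sketch of using exact palette colors in code, the example below passes RGB triplets to Stata's color options and replaces the default background with white. The two colors are taken from ColorBrewer's qualitative Dark2 palette. The dataset and labels are illustrative assumptions.

```stata
* Illustrative example only: dataset, labels, and palette choice are assumptions
sysuse auto, clear

* RGB triplets from ColorBrewer's Dark2 palette; white background, no gridlines
twoway (scatter price mpg if foreign == 0, mcolor("27 158 119")) ///
       (scatter price mpg if foreign == 1, mcolor("217 95 2")), ///
       legend(order(1 "Domestic" 2 "Foreign")) ///
       graphregion(color(white)) ylabel(, nogrid) ///
       ytitle("Price (USD)") xtitle("Mileage (mpg)")
```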

See the Resources section below for additional color tools. 

7. Choose the type of visualization based on the information you want to convey 

Choose first between a table and a figure, and then, if using a chart, choose a type of chart that works well for the number of variables, as well as their type (continuous or discrete).

  • Use a table when you need to show exact numerical values, you want to allow for multiple localized comparisons, or you have relatively few numbers to show. Tables are often a clearer and more information-dense choice. 
  • Use a graph when you want to reveal a pattern or relationship among key variables. Graphs can be more memorable, making them good for when you need to get the audience’s attention, or for highlighting important takeaways. 
  • You may also consider pairing a graph and a table.
  • Think about your final output and what it will look like. If you are creating a graphic at baseline that you plan to reproduce at midline and/or endline, keep in mind how you will incorporate the data from future survey rounds.
  • Avoid pie charts: it is difficult for the audience to visually distinguish the difference in size of each wedge. A good alternative for showing shares of a whole is a stacked bar chart.
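
The stacked bar alternative to a pie chart can be sketched as follows, with each bar summing to 100 percent. The dataset and the grouping variables (repair record within domestic and foreign cars, from the bundled auto data) are illustrative assumptions.

```stata
* Illustrative example only: dataset and variables are assumptions
sysuse auto, clear

* Helper variable so that summing it counts observations
gen byte one = 1

* One stacked bar per origin; segments are shares of each repair category
graph bar (sum) one, over(rep78) over(foreign) ///
    asyvars stack percentages ///
    ytitle("Share of cars (%)") ///
    legend(title("Repair record (1978)"))
```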

8. Consider your audience

Consider who will view your graphic and their level of familiarity with the data and context to inform the design of your visualization. Will your graphics serve as internal reference for your team as part of an ongoing project, or will they be part of a report, presentation, or publication to be disseminated more broadly?

  • What is the level of technical knowledge of your audience?
  • What are the language skills of your audience? How will this inform the text you will include? 
  • Avoid jargon, especially if it may not be familiar to the audience. Always define any technical abbreviations if they are necessary and appropriate for your audience. 
  • Consider how your audience will consume the information and how this affects your design choices (such as use of color).
      • Will they view images on a computer, as printed materials, or via another medium?

Coding data visualizations in Stata and R

In the table below, we list some graphics you may create or come across in your work, along with Stata and R commands to generate them. While most of the R commands below focus on the ggplot2 package, a number of code options exist. For detailed sample code and output examples for these graphics and a wide range of others, refer to the Stata and R resources at the end of this page.

Tip: If you see a figure or graphic you like in a published paper and want to learn how it is coded or replicate it for your own purposes, look for the paper’s replication files to find the code.

Table 1. Cheat sheet for common graphics

Type of graphic                | Stata command                         | ggplot2 command (R)
Density plot                   | histogram; kdensity                   | geom_histogram(); geom_density()
Box plot                       | graph hbox; graph box                 | geom_boxplot()
Scatter plot                   | twoway scatter                        | geom_point()
Scatter plot with fitted line  | twoway (scatter a b) (lfit a b)       | geom_point() + geom_abline()
Regression coefficient plot    | coefplot                              | coefplot
Line graph                     | twoway line                           | geom_line()
Area chart                     | twoway area                           | geom_area()
Bar graph                      | graph bar; graph hbar; twoway bar     | geom_bar()
Stacked bar graph              | graph bar a b, stack                  | geom_bar(position = "stack")
Clustered bar graph            | graph bar a b, over(survey_round)     | geom_bar(position = "dodge", stat = "identity")
Bar graph with standard errors | twoway (bar) (rcap) iemargins         | geom_bar() + geom_errorbar()
Maps                           | maptile; spmap                        | geom_map; geom_point; tmap

Other graphics: 
Explore other graphic types and be creative as you think about how to best represent your data. For example, rather than a bar chart, you might consider using a proportional Venn diagram to show sample sizes across survey rounds and highlight continuity and attrition (Stata: pvenn2; R: VennDiagram package).

Pulling it all together

This page contains many resources helpful for understanding and creating clear and effective data visualizations, and there are countless more resources available through searching online. 

To see how the principles outlined in the first section can be applied to a typical J-PAL Research Associate’s work, consider the binned scatter plot below, created in Stata by J-PAL Research Associate Robbie Dulin. Note that the default Stata background colors and graph layout specifications have been customized.

This image shows a clean example of graph customization and bin graphing
In BPNT districts, subsidy is concentrated among fewer households
(Source: Robbie Dulin)

Stata code used to create the above visualization:

* Set the graph style (requires the user-written grstyle package: ssc install grstyle)
grstyle init
grstyle set imesh, horizontal compact
grstyle set color "228 92 36%75" ///
    "44 172 156%75" "242 196 20%75"
grstyle set legend 2, inside

preserve
* Assign each observation to a 10-point percentile bin, labeled by its midpoint
gen bins = .
forval i = 10(10)100 {
    replace bins = `i' - 5 if `i' - 10 < ///
        percentile_udb_miss_to_100 & ///
        percentile_udb_miss_to_100 <= `i'
}
* Weighted mean subsidy and total weight by bin and treatment status
collapse (mean) totsub (rawsum) ///
    FWT [aw = FWT], by(bins treated)

* Duplicate every row; the copies get a missing bin and the opposite treatment status
expand 2
replace bins = . if _n > (_N/2)
recode treated (0=1) (1=0) if bins == .
* Relabel the top bin from 95 to 100
replace bins = 100 if bins == 95

* Plot the bin means, with markers scaled by the total weight in each bin
format totsub %9.0fc
twoway (scatter totsub bins if ///
    treated == 0 [fw = FWT], ///
    mcolor("228 92 36%50")) ///
    (scatter totsub bins if treated == 1 ///
    [fw = FWT], mcolor("44 172 156%50")), ///
    legend(order(1 "Rastra" 2 "BPNT") size(medium)) ///
    ylabel(#6) ///
    xtitle("PMT Score Bin (Bin Size = 10)" ///
    "Markers Scaled by Number of Households in Bin") ///
    xsc(titlegap(2) range(105)) ///
    ytitle("Average Subsidy Value in Bin (rp)")
restore
 

Last updated July 2020.

These resources are a collaborative effort. If you notice a bug or have a suggestion for additional content, please fill out this form.

Acknowledgments

We thank Mike Gibson and Sarah Kopper for helpful contributions.

Additional Resources
Data visualization resources
  1. From Data to Viz: walks through types of graphs, provides code and caveats

  2. Fundamentals of Data Visualization, Claus O. Wilke

  3. Data stories: a podcast about data visualization with Enrico Bertini and Moritz Stefaner

  4. Hacks: create simple LaTeX and Markdown tables interactively

  5. Datawrapper

  6. Data Visualization Society

  7. Google’s Data Studio for dashboards and reports

  8. Resources by Edward Tufte, a statistician and political scientist who specializes in data communication: Tufte's Rules, and more books and work by Tufte

  9. J-PAL's data visualization RST lecture (J-PAL internal resource)

Coding resources
  1. Stata cheat sheets

  2. Stata globals for J-PAL colors (J-PAL internal resource)

  3. Data Workflow: complex Stata graphs with replication code

  4. Stata user-written graph command for binned scatters

  5. Stata user-written command to make customizing graphs easier

  6. Automated table workflow in Stata, from the World Bank Blogs by Luiza Andrade, Benjamin Daniels, and Florence Kondylis

  7. R cheat sheets

  8. Reference guide for the ggplot2 package in R 

  9. Quick-R by DataCamp: Graphs

  10. “How do I?” by Sharon Machlis: practical guide to coding in R

  11. Data visualization for social science: a practical introduction with R and ggplot2

  12. R graphics cookbook

  13. R Markdown: The Definitive Guide, by Yihui Xie, J. J. Allaire, and Garrett Grolemund 

  14. The R Markdown Cookbook, by Yihui Xie and Christophe Dervieux 

  15. Tables in R using the gt package

Revising and polishing tables and graphics
  1. Kastellec and Leoni’s paper on using graphics instead of tables to convey results visually

  2. Appendix B, “Statistical graphics for research and presentation,” in: Gelman, Andrew, and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2006.

References

Finkelstein, Amy, and Matthew J. Notowidigdo. 2019. "Take-up and targeting: Experimental evidence from SNAP." The Quarterly Journal of Economics 134(3): 1505-1556. https://doi.org/10.1093/qje/qjz013

Gentzkow, Matthew, and Jesse M. Shapiro. 2014. "Code and data for the social sciences: A practitioner’s guide." Chicago, IL: University of Chicago.

Hullman, Jessica, and Kay, Matthew. “Uncertainty + visualization, explained.” The Midwest Uncertainty Collective. Last accessed June 9, 2020. 

Jones, Damon, David Molitor, and Julian Reif. 2019. "What do workplace wellness programs do? Evidence from the Illinois workplace wellness study." The Quarterly Journal of Economics 134(4): 1747-1791. https://doi.org/10.1093/qje/qjz023

Ottaviano, Gianmarco IP, and Giovanni Peri. 2008. "Immigration and national wages: Clarifying the theory and the empirics." No. w14188. National Bureau of Economic Research.

Schwabish, Jonathan A. 2014. "An economist's guide to visualizing data." Journal of Economic Perspectives 28(1):209-34. DOI: 10.1257/jep.28.1.209

Stinebrickner, Ralph, and Todd Stinebrickner. 2014. "Academic performance and college dropout: Using longitudinal expectations data to estimate a learning model." Journal of Labor Economics 32(3): 601-644. DOI: 10.1086/675308
