Data visualization

Summary

Data visualization can be helpful at many stages of the research process, from data reporting to analysis and publication. Relative to regression tables, cross-tabs, and summary statistics, data visualizations are often easier to interpret, more informative, and more accessible to a wider range of audiences. This resource discusses key considerations for creating effective data visualizations and provides guidance for making design choices. We present some common graphic options and include further resources to explore this topic.

Data visualization principles

1. Consider your audience

Consider who will view your graphic and how familiar they are with the data and context, and let that inform the design of your visualization. Will your graphics serve as an internal reference for your team as part of an ongoing project, or will they be part of a report, presentation, or publication to be disseminated more broadly?

  • What is the level of technical knowledge of your audience?
  • What are the language skills of your audience? How will this inform the text you will include? 
  • Avoid jargon, especially if it may not be familiar to the audience. Always define any technical abbreviations if they are necessary and appropriate for your audience. 
  • Consider how your audience will consume the information (e.g., printed materials, computer, projector) and how this impacts your design choices (e.g., color, which can look different on computers versus when printed or projected).
  • The visualization that convinces you or the research team of some fact might not be the visualization that convinces your audience, so be prepared to modify your visualization.

2. Choose the type of visualization based on the information you want to convey 

Choose first between a table and a figure, and then, if using a figure, choose a type of chart that works well for the number of variables, as well as their type (continuous or discrete). For more guidance on how to choose the type of graph, see the United Nations' guide to Making Data Meaningful or data-to-viz.

  • Use a table when you need to show exact numerical values, you want to allow for multiple localized comparisons, or you have relatively few numbers to show. Tables are often a clearer and more information-dense choice. 
  • Use a graph when you want to reveal a pattern or relationship among key variables, see the shape or distribution of data, or perceive trends in variables. Graphs can be more memorable, making them good for when you need to get the audience’s attention, or for highlighting important takeaways. 
  • You may also consider pairing a graph and a table. For an example visualizing average monthly temperature, see here.
  • Think about your final output and what it will look like. If you are creating a graphic at baseline that you plan to reproduce at midline and/or endline, keep in mind how you will incorporate the data from future survey rounds.
  • Avoid pie charts: it is difficult for the audience to visually distinguish the difference in size of each wedge. A good alternative for showing shares of a whole is a stacked bar chart.

3. Plan for updates: Automate everything

"A rule of research is that you will end up running every step more times than you think. And the costs of repeated manual steps quickly accumulate beyond the costs of investing once in a reusable tool."
(Gentzkow and Shapiro, 2014) 

A well-designed data visualization workflow is important for efficiency and portability:

  • Automate everything that can be automated.
  • Design your workflow so that when you add new data or tweak a regression you don’t need to manually recreate every figure and table.
  • Edits to graphs in Stata's Graph Editor can be replicable and automated. See this World Bank blog post for more information.
  • Minimize copying and pasting figures and tables from one program into another (e.g., copying a graph from Stata output into Microsoft Word makes updating the graph impossible to automate, and often results in a lower quality figure as well).
  • General workflow: Code > output > formatted table, graph, or map

Automating the workflow that generates and updates tables and figures, and compiles them into a finished document or presentation saves time and improves reproducibility. This World Bank blog post discusses the importance of automation, and proposes workflows for coding tables in Stata. 
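As a rough illustration of the "Code > output > formatted table" flow, the sketch below builds an aligned plain-text table directly from results held in code, so that rerunning the script after a data update regenerates the table with no manual editing. It is written in Python only to keep it short; the same idea applies to Stata or R. The function name and the numbers are hypothetical.

```python
def render_table(header, rows, decimals=2):
    """Return an aligned plain-text table; floats are rounded consistently."""
    def fmt(value):
        return f"{value:.{decimals}f}" if isinstance(value, float) else str(value)

    cells = [[fmt(v) for v in row] for row in [header] + rows]
    widths = [max(len(r[i]) for r in cells) for i in range(len(header))]
    lines = []
    for r in cells:
        # Left-align the first column (labels), right-align the numbers.
        parts = [r[0].ljust(widths[0])]
        parts += [c.rjust(w) for c, w in zip(r[1:], widths[1:])]
        lines.append("  ".join(parts))
    return "\n".join(lines)


# Hypothetical results; in practice these would come from the analysis step.
results = [["Control", 0.412, 1200], ["Treatment", 0.4876, 1187]]
table = render_table(["Group", "Take-up", "N"], results)
print(table)
```

Writing the returned string to a file that the report or slide deck includes would complete the chain, so the document picks up new numbers each time the code runs.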

4. Present interpretable information 

All of the information in your data visualization should be human-interpretable. Regardless of audience, there are some things required to make tables and figures interpretable, such as picking which parts of raw software outputs are meaningful, and converting coefficients to quantities of interest if needed. How you do this can depend on your audience, however. Consider your audience (journal readers, a presentation audience, external partners) and how they will encounter and digest the information you present. You should always include context to help your audience interpret information, whether this is spoken as part of your presentation, or written as titles and captions to your table or graph. 

  • Main principle: Present only human-interpretable information
  • Most tables and figures should be self-contained and self-explanatory. Experienced readers often look at tables and figures first. 
  • Watch out for common hard-to-interpret numbers that appear in tables:
      • Logit coefficients: We don’t think in log-odds; transform these to predicted probabilities and display graphically
      • Goodness of fit measures often don’t mean much even to an expert reader
  • Think about both numbers and chart graphics that might add clutter and be hard to interpret (see principle 6).
  • Avoid jargon and technical abbreviations. 
  • Use color and transparency to help make your point: our eyes easily interpret the difference between colors and transparency (see principle 8 for more on choosing colors).
  • Include interpretive text such as figure captions and clear, descriptive titles, to guide your audience.
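The logit bullet above can be made concrete with a minimal sketch of the inverse-logit transformation, which turns a linear predictor in log-odds into a predicted probability. The intercept and coefficient are hypothetical values for a model with a single binary treatment indicator; a real model would also need values for any other covariates.

```python
import math


def predicted_probability(intercept, coef, x):
    """Inverse-logit of the linear predictor intercept + coef * x."""
    log_odds = intercept + coef * x
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical logit estimates for a binary treatment indicator.
intercept, coef = -1.0, 0.8
p_control = predicted_probability(intercept, coef, 0)
p_treated = predicted_probability(intercept, coef, 1)
print(f"Control: {p_control:.0%}, treated: {p_treated:.0%}")
```

With these made-up estimates, a coefficient of 0.8 in log-odds corresponds to a move from roughly 27% to 45%, which is a far easier statement for most audiences to digest.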

Example: How can this table be improved?

Image shows a poorly constructed table.
Image shows a well constructed table.
Source: United Nations' Guide to Making Data Meaningful (2009)

The information in the first table is not easily interpretable. The percent sign after each value clutters the table, the N/A values are not explained, and the source of the data is not apparent. In the second table, the title and footer indicate the source and format of the values and explain the N/A values, which reduces the text in the table itself. Other improvements include keeping only the lines that separate the table's components (header, data, footnote, and source), right-justifying all values, and giving them the same number of decimal places.

5. Present data responsibly by providing scale and context

The numerical and visual context that data is presented in can change how it is interpreted, and researchers sharing their own visualizations of data should avoid inadvertently distorting the reader or viewer’s perception of the data. 

The labeling of the axes can change how we see the data presented in a chart. If an axis does not start at zero, consider whether this distorts the relative size of the differences or trends depicted. However, zero is not always a meaningful value, so judgment is needed here; see the Economist’s Why you sometimes need to break the rules in data viz for a discussion of when starting an axis at a value other than zero is useful.

It is also important to present enough context. Summary statistics alone may not capture the differences between datasets that are very different when presented visually. 

Example: How can this graph be improved?

This image shows an example of a messy bar graph of lifetime earnings.
This image shows a clean example of a bar graph of lifetime earnings
Source: Schwabish (2014).

The difference between the bars in the first graph is visually misrepresented because the y-axis does not start at zero. There are also other design considerations that make the graph confusing: the combination of different colors and different patterns in each bar is unnecessary and distracting, and the legend clutters the graphic and could easily be replaced by clear axis labels. The second graph from Schwabish (2014) revises the original colorful bar chart to be clearer and easier to interpret correctly. Alternatively, if the goal of the graphic is to show the increase in expected earnings in each category compared to “Finish no school” you could reformat the graph to show the difference in expected earnings for each category, relative to “Finish no school”. This would ensure the y-axis starts at zero, and would make the differences in expected earnings more salient.
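The distortion from a truncated axis can be put in numbers. A bar's drawn height is proportional to its value minus the axis minimum, so the ratio of two drawn heights drifts away from the ratio of the underlying values as the axis minimum rises. The sketch below uses hypothetical values, not the numbers in Schwabish's figure.

```python
def drawn_height_ratio(value_a, value_b, axis_min=0.0):
    """Ratio of the drawn bar heights when the axis starts at axis_min."""
    if min(value_a, value_b) <= axis_min:
        raise ValueError("both values must exceed the axis minimum")
    return (value_a - axis_min) / (value_b - axis_min)


# Hypothetical values, not taken from the figure above.
high, low = 60.0, 50.0
ratio_from_zero = drawn_height_ratio(high, low)
ratio_truncated = drawn_height_ratio(high, low, axis_min=40.0)
print(ratio_from_zero, ratio_truncated)
```

Here the taller bar is 1.2 times the shorter one when the axis starts at zero, but it is drawn twice as tall when the axis starts at 40.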

6. Eliminate junk 

A clear and high-impact data visualization conveys exactly the information needed, without distracting extra clutter that can make interpretation difficult. When evaluating a graphic to eliminate junk, consider the following principles: 

  • Maximize the data-to-ink ratio by using as little ink as possible to show your data
  • Remove non-data ink, e.g., extra gridlines 
  • Remove redundant data
  • Remove indicators you don’t need
  • For summary statistics and regression coefficients, display only as many decimal places as necessary or relevant given the scale of your outcome variable. For example, income in USD should almost never include decimal places, nor should percentages, and years of education seldom requires more than one decimal place.
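One way to keep decimal places consistent is to set them once per variable type and apply that rule everywhere a number is displayed. The sketch below is a minimal illustration; the mapping from variable type to decimal places follows the examples in the last bullet, and the function name and category labels are hypothetical.

```python
# Decimal places per variable type, following the examples above.
DECIMALS = {"usd": 0, "percent": 0, "years": 1}


def format_value(value, kind):
    """Format a number with the decimal places chosen for its type."""
    return f"{value:,.{DECIMALS[kind]}f}"


print(format_value(1234.567, "usd"))     # income in USD
print(format_value(42.6, "percent"))     # a percentage
print(format_value(8.4321, "years"))     # years of education
```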

Example: Revising a graphic 
Schwabish (2014) includes several examples on how to revise graphics to create more effective data visualizations, including the example below on how to transform a 3D chart. In the original version on the left, the chart uses a 3D format to show two-dimensional data, the bar colors are not easily distinguishable, and the legend is small and difficult to read. The revised version makes design choices that showcase a clearer and more readable presentation of the data.

This image shows how to shift a bar graph from three dimensions to two dimensions.

7. Represent uncertainty with care

All estimates come with uncertainty, but it can be difficult to depict this in graphs and figures. Jessica Hullman and Matthew Kay write more about what goes into visualizing uncertainty in this first post of their series on the topic. When choosing how to represent uncertainty in a visualization, think about what statistic you will use. For example, confidence intervals can be clearer and easier to interpret than p-values on a graph. 

In the figure below, the confidence intervals clearly depict the uncertainty but do not distract from the point estimates. 

This image shows a clean way to display uncertainty in a graph.
Source: Finkelstein and Notowidigdo, 2019

Considerations when graphing uncertainty: 

  • You DO want to represent uncertainty
  • Represent uncertainty without creating visual clutter
  • Represent uncertainty in a way that will make sense to your audience (consider their statistical background and familiarity with the concepts)
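For readers who want to draw intervals like those in the figure, the sketch below computes the endpoints of a confidence interval from a point estimate and its standard error. It assumes a normal approximation, which may not suit small samples or every estimator, and the estimate and standard error are hypothetical.

```python
from statistics import NormalDist


def confidence_interval(estimate, std_error, level=0.95):
    """Normal-approximation confidence interval around a point estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return estimate - z * std_error, estimate + z * std_error


# Hypothetical treatment effect and standard error.
lower, upper = confidence_interval(0.10, 0.04)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

The two endpoints are what error bars or shaded bands would span around the point estimate.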

8. Use color thoughtfully 

The software you use for data analysis and coding your data visualizations likely has preset default colors and layouts for graphs and charts. For example, standard Stata graph background colors add junk to the page (see principle 6). Changing default line colors or backgrounds with your audience and intended publication medium in mind can result in a cleaner, more polished graphic and avoid having to redo the color scheme to meet journal or audience requirements later. 

  • Use sequential or diverging color schemes to show increasing or decreasing values or levels. 
  • For qualitative data visualizations, use a color palette designed for qualitative data.  
  • Viewers may be color blind. Use color blind-friendly palettes, which are sometimes required for accessibility. (For example, U.S. government agencies may require Section 508 compliance for colors.) Consider using a color blindness simulator to determine if your visualization is difficult to see.
  • Journals may require grayscale. Check that your colors will translate well to grayscale.
  • Limit the number of colors used. If your visualization has more than seven colors, consider ways to recategorize the data to reduce the number (e.g., recategorize several groups as an “other” category). 

There are many tools for picking color palettes, but colorbrewer2.org is a good starting point. Its palettes are designed for working with maps, but are applicable to any graphic. Select options for color blind-friendly colors, as well as sequential, diverging, or qualitative color palettes, and output exact colors in HEX, RGB, or CMYK format to be used directly in code. 
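A quick way to screen a palette for grayscale printing is to convert each color to an approximate gray level and flag pairs that land close together. The sketch below uses the common luma weights 0.299, 0.587, and 0.114 for red, green, and blue. The three HEX codes are the RGB triples from the Stata example at the end of this page, ignoring their transparency. The 20-level threshold is an arbitrary assumption, not a standard.

```python
def hex_to_rgb(code):
    """Convert a HEX color such as '#E45C24' to an (R, G, B) tuple."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


def gray_level(rgb):
    """Approximate gray level (0 to 255) using common luma weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def close_pairs(palette, min_gap=20):
    """Index pairs of colors whose gray levels differ by less than min_gap."""
    grays = [gray_level(hex_to_rgb(c)) for c in palette]
    return [(i, j)
            for i in range(len(grays))
            for j in range(i + 1, len(grays))
            if abs(grays[i] - grays[j]) < min_gap]


palette = ["#E45C24", "#2CAC9C", "#F2C414"]
print(close_pairs(palette))
```

Under this approximation, the first two colors fall within a few gray levels of each other. A grayscale version of a graph using them would likely need different marker shapes or line patterns to stay readable.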

See the Resources section below for additional color tools. 

Coding data visualizations in Stata and R

In the table below, we list some graphics you may create or come across in your work, along with Stata and R commands to generate them. While most of the R commands below focus on the ggplot2 package, a number of code options exist. For detailed sample code and output examples for these graphics and a wide range of others, refer to the Stata and R resources at the end of this page.

Tip: If you see a figure or graphic you like in a published paper and want to learn how it is coded or replicate it for your own purposes, look for the paper’s replication files to find the code.

Table 1. Cheat sheet for common graphics

Type of graphic                 Stata command                       ggplot command (R)
Density plot (histogram)        histogram                           geom_histogram()
Density plot (kernel density)   kdensity                            geom_density()
Box plot                        graph hbox; graph box               geom_boxplot()
Scatter plot                    twoway scatter                      geom_point()
Scatter plot with fitted line   twoway scatter a b || lfit a b      geom_point() + geom_abline()
Regression coefficient plot     coefplot                            coefplot
Line graph                      twoway line                         geom_line()
Area chart                      twoway area                         geom_area()
Bar graph                       graph bar; graph hbar; twoway bar   geom_bar()
Stacked bar graph               graph bar a b, stack                geom_bar(position = "stack")
Clustered bar graph             graph bar a b, over(survey_round)   geom_bar(position = "dodge", stat = "identity")
Bar graph with standard errors  twoway (bar) (rcap); iemargins      geom_bar() + geom_errorbar()
Maps                            maptile                             geom_map(); geom_point()
Maps                            spmap                               tmap

Other graphics: 
Explore other graphic types and be creative as you think about how to best represent your data. For example, rather than a bar chart, you might consider using a proportional Venn diagram to show sample sizes across survey rounds and highlight continuity and attrition (Stata: pvenn2; R: VennDiagram package).

Table 2. Cheat sheet for common tables

Type of table       Stata command   R package::command
Summary statistics  outreg2         stargazer::stargazer()
Summary statistics  estout          gt::gt()
Summary statistics  asdoc           knitr::kable()
Regression tables   outreg2         stargazer::stargazer()
Regression tables   estout          jtools::export_summs()

 

Pulling it all together

This page contains many resources helpful for understanding and creating clear and effective data visualizations, and there are countless more resources available through searching online. 

To see how the principles outlined in the first section can be applied to a typical J-PAL Research Associate’s work, consider this example of a binned scatter plot created in Stata by J-PAL Research Associate Robbie Dulin. Note that the default Stata background colors and graph layout specifications have been customized. 

This image shows a clean example of graph customization and bin graphing
In BPNT districts, subsidy is concentrated among fewer households
(Source: Robbie Dulin)

Stata code used to create the above visualization:

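* Set a clean graph style. grstyle is a user-written package
* (ssc install grstyle; it also needs palettes and colrspace).
grstyle init
grstyle set imesh, horizontal compact
grstyle set color "228 92 36%75" ///
"44 172 156%75" "242 196 20%75"
grstyle set legend 2, inside

* Assign each percentile to the midpoint of its 10-point bin (5, 15, ..., 95)
preserve
gen bins = .
forval i = 10(10)100 {
  replace bins = `i' - 5 if `i' - 10 < ///
  percentile_udb_miss_to_100 & ///
  percentile_udb_miss_to_100 <= `i'
}
* Weighted mean subsidy and total weight by bin and treatment group
collapse (mean) totsub (rawsum) ///
FWT [aw = FWT], by(bins treated)

* Duplicate the collapsed rows, set bins to missing in the copies,
* and swap the copies' treatment status
expand 2
replace bins = . if _n>(_N/2)
recode treated (0=1) (1=0) if bins ==.
* Plot the top bin at x = 100 instead of 95
replace bins = 100 if bins == 95

* Scatter average subsidy against bin, one series per group,
* with markers scaled by the total weight in each bin
format totsub %9.0fc
twoway (scatter totsub bins if ///
     treated == 0 [fw = FWT], ///
     mcolor("228 92 36%50")) ///
     (scatter totsub bins if treated == 1 ///
     [fw = FWT], mcolor("44 172 156%50")), ///
     legend(order(1 "Rastra" 2 "BPNT") size(medium)) ///
     ylabel(#6) ///
     xtitle("PMT Score Bin (Bin Size = 10)" ///
    "Markers Scaled by Number of Households in Bin") ///
     xsc(titlegap(2) range(105)) ///
     ytitle("Average Subsidy Value in Bin (rp)")
restore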
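For readers less familiar with Stata, the binning loop near the top of the code can be sketched in Python as follows. Each percentile in the interval (i - 10, i] is assigned the bin midpoint i - 5, and values outside (0, 100], or missing, get no bin. The function name is hypothetical, and the sketch covers only the binning step, not the collapse or the plot.

```python
def bin_midpoint(percentile):
    """Midpoint of the 10-point bin containing percentile, or None."""
    if percentile is None:
        return None
    for upper in range(10, 101, 10):
        # Same condition as the Stata loop: upper - 10 < percentile <= upper
        if upper - 10 < percentile <= upper:
            return upper - 5
    return None


print([bin_midpoint(p) for p in (0, 3.2, 10, 10.5, 100)])
```

A value that sits exactly on a bin edge, such as 10, falls into the lower bin, and a percentile of exactly 0 is left unbinned, as in the Stata loop.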
 

Last updated December 2020.

These resources are a collaborative effort. If you notice a bug or have a suggestion for additional content, please fill out this form.

Acknowledgments

We thank Mike Gibson and Sarah Kopper for helpful contributions.

    Additional Resources
    Data visualization resources
    1. From Data to Viz: walks through types of graphs, provides code and caveats

    2. Fundamentals of Data Visualization, Claus O. Wilke

    3. Data stories: a podcast about data visualization with Enrico Bertini and Moritz Stefaner

    4. Hacks: create simple LaTeX and Markdown tables interactively

    5. Datawrapper

    6. Data Visualization Society

    7. Google’s Data Studio for dashboards and reports

    8. Resources by Edward Tufte, a statistician and political scientist who specializes in data communication: Tufte's Rules, and more books and work by Tufte

    9. J-PAL's data visualization RST lecture (J-PAL internal resource)

    10. The United Nations’ guide to Making Data Meaningful

    11. Color Oracle’s Color Blindness Simulator

    12. TidyTuesday: A weekly social data project in R, focused on learning the tidyverse and ggplot packages. 

    13. Storytelling with data’s visual battle: table vs graph

    14. The Data Visualization Checklist

    Revising and polishing tables and graphics
    1. Kastellec and Leoni’s paper on using graphics instead of tables to convey results visually

    2. Appendix B, “Statistical graphics for research and presentation” in: Gelman, Andrew, and Jennifer Hill. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, 2006.

    References

    Finkelstein, Amy, and Matthew J. Notowidigdo. 2019. "Take-up and targeting: Experimental evidence from SNAP." The Quarterly Journal of Economics 134(3): 1505-1556. https://doi.org/10.1093/qje/qjz013

    Gentzkow, Matthew, and Jesse M. Shapiro. 2014. "Code and data for the social sciences: A practitioner’s guide." Chicago, IL: University of Chicago.

    Hullman, Jessica, and Matthew Kay. “Uncertainty + visualization, explained.” The Midwest Uncertainty Collective. Last accessed June 9, 2020. 

    Jones, Damon, David Molitor, and Julian Reif. 2019. "What do workplace wellness programs do? Evidence from the Illinois workplace wellness study." The Quarterly Journal of Economics 134(4): 1747-1791. https://doi.org/10.1093/qje/qjz023

    Ottaviano, Gianmarco IP, and Giovanni Peri. 2008. "Immigration and national wages: Clarifying the theory and the empirics." No. w14188. National Bureau of Economic Research.

    Pearce, Rosamund. “Why you sometimes need to break the rules in data viz.” The Economist. Last accessed November 12, 2020. 

    Schwabish, Jonathan A. 2014. "An economist's guide to visualizing data." Journal of Economic Perspectives 28(1):209-34. DOI: 10.1257/jep.28.1.209

    Schwabish, Jonathan A. 2012. "Visualizing Data: An Economists’ Guide to Presenting Data." Presented at the American Association for Budget and Program Analysis Conference, Washington, D.C., December 15, 2012.

    Stinebrickner, Ralph, and Todd Stinebrickner. 2014. "Academic performance and college dropout: Using longitudinal expectations data to estimate a learning model." Journal of Labor Economics 32(3): 601-644. DOI: 10.1086/675308
