Research Resources

Repository of measurement and survey design resources


This is a list of resources on measurement and survey design relating to various topics. Topics are organized alphabetically for ease of navigation using the sidebar. We rely on crowd-sourced material to maintain this page and update it regularly. Feedback on the page, or suggestions of resources to add or remove, can be submitted using this form.

General resources

Welcome to the J-PAL Repository of Measurement and Survey Design Resources. The purpose of this repository is to provide an introduction to the measurement and survey design resources available in a particular subject or for a specific type of question. It is a companion piece to our Introduction to measurement and indicators and Survey design pages, compiling resources that discuss and provide guidance on specific issues related to the concepts introduced in those resources. In each section, you will find a list of resources, moving from more general to more specific, that are meant to introduce readers to the measurement tools, difficulties, and solutions in that particular topic. 

These lists do not aspire to comprehensiveness but rather attempt to tackle the main points in measurement within a topic. They cover a variety of media, from blogs and academic papers to books and journal issues. While heavy synthesis is outside the scope of this resource, short descriptions of the papers, as well as introductions to each section written by J-PAL staff with expertise in that area, are provided for the reader’s ease of use.

The General section includes resources that provide an overview of measurement concepts and guidance on survey design. It also includes a few repositories of sample questionnaires and resources on remote surveying in light of the shifts during the Covid-19 pandemic. Some of the following sections are divided by research area, for example, financial inclusion, corruption, and environment & energy. Others, like the forthcoming section on measuring consumption, assets, income, prices, and poverty, cut across several research topics. 

The current repository includes only a portion of the total number of sections, and we will continue to release more sections in phases over the coming months, so please check back for more. 

As with our other research resources, this is meant to be a living resource. If you have feedback on any of the resources in this repository, or if you have recommendations for resources and/or further subjects to add, please fill out this form.

General measurement resources

Practical guides to designing and implementing surveys

Sample questionnaires

This section contains sources for finding sample questionnaires across sectors; topic-specific sample questionnaires can also be found under the relevant topic header. Note that project data is typically stored with the questionnaire and can be found by following the links below.

  • Datahub for Field Experiments in Economics and Public Policy, by J-PAL and IPA; Harvard Dataverse -- Includes survey questionnaires, research results, replication code, and documentation from studies conducted by J-PAL- and IPA-affiliated researchers.
  • The World Bank’s Microdata Catalog -- Central cache of questionnaires and data used and produced by the World Bank.
  • Model surveys from the Demographic and Health Surveys Program -- DHS questionnaire modules on household composition, biomarkers, and behavior related to the health of women, men and children.
  • The International Food Policy Research Institute's Database of Household and Community Surveys -- A library of datasets resulting from household and community surveys conducted by IFPRI, many of which are open access and contain the survey questionnaires.
  • International Household Survey Network -- A catalogue of surveys, research data and documentation from the IHSN.

Phone surveys

General guides

Research on response rates and mode effects

Practical survey guides

Covid-specific resources

Agriculture

The vast majority of people living in low- and middle-income countries depend on agriculture for their livelihoods, and high reliance on agriculture has been attributed to the substantial income gaps between high- and low-income countries. As a result, governments, aid organizations, and others spend hundreds of millions annually on programs to increase agricultural productivity.

Yet conclusions about production, input productivity, and profits are hindered by the difficulty of accurately measuring inputs and output. Given the informality of the sector, smallholder agricultural labor is rarely recorded as it is performed; researchers must therefore rely on self-reports that can vary widely depending on the respondent answering the question (e.g., Kilic et al. 2021) or the recall period (Arthi et al. 2018; Beegle et al. 2012). Though purchases of fertilizer and seeds are easier to measure accurately, these inputs can be of unobserved and mixed quality (see, e.g., Bold 2017 and Michelson et al. 2021 for contradictory findings), further complicating accurate productivity estimates. Self-reported land size suffers from measurement bias that varies with plot size and is thought to be at least partially responsible for the puzzle of an inverse relationship between land size and productivity that has plagued agricultural economists for decades. Fortunately, the proliferation of GPS devices in recent years has enabled better land size measurement.
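
One mechanical source of this inverse relationship is division bias: when productivity is computed as output divided by a noisily self-reported land size, the same measurement error enters the size variable and the denominator of productivity, inducing a negative correlation even when none exists. A minimal simulation (all parameters illustrative, not calibrated to any dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# True plot sizes (hectares) and output: yield per hectare is
# unrelated to true plot size by construction.
true_size = rng.lognormal(mean=0.5, sigma=0.6, size=n)
yield_per_ha = rng.lognormal(mean=1.0, sigma=0.3, size=n)
output = yield_per_ha * true_size

# Self-reported size = true size times multiplicative noise.
reported_size = true_size * rng.lognormal(mean=0.0, sigma=0.4, size=n)

# Correlation between log size and log productivity:
# roughly zero with true sizes, clearly negative with reported sizes.
corr_true = np.corrcoef(np.log(true_size), np.log(output / true_size))[0, 1]
corr_reported = np.corrcoef(np.log(reported_size),
                            np.log(output / reported_size))[0, 1]
print(round(corr_true, 2), round(corr_reported, 2))
```

Even with no true size-productivity relationship, the noisy denominator alone produces a sizable negative correlation, which is one reason GPS-measured plot sizes matter.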

Self-reported output is subject to similar concerns of recall bias and measurement error that vary with plot size (e.g., Abay et al. 2019), particularly when crop production is consumed by the household (rather than only sold), as is the case for smallholder farmers. Crop cuts provide a more accurate measure but are costly and must be taken correctly to avoid miscalculations based on the “edge effect” (Bevis & Barrett 2020). Satellite imagery offers a promising alternative that is both low-cost and objective, though it may require ground-truthing (Lobell 2013; Lobell et al. 2019). Yet even when inputs and outputs are measured accurately, the endogenous nature of input use and the many other factors that affect the production function, from highly localized and costly-to-measure soil and growing conditions to farmer skill, require careful econometric analysis for credible inference.

These measurement challenges and others have broad policy implications. Mismeasurement of farm-level production scales up to mismeasurement of aggregate production (Gollin & Udry 2021). This in turn affects calculations of the potential gains from large-scale policies that aim to increase production (such as through higher chemical input use), to commercialize smallholder farming (such as through better connections to input and output markets), or to move people out of the agricultural sector altogether.

Carletto et al. (2021) surveys the latest research on measurement error in agriculture, offering actionable suggestions for data collection and analysis, and is a good starting point for readers aiming to design a study or analyze data.

- Sarah Kopper, Associate Director of Research

General resources

Input use, productivity, and production

Land: Size, ownership, and fertility
Skills and knowledge
Other inputs

Satellite and geographic measurement

Corruption in governance and service provision


Corruption, the misuse of positions or breaking of rules by bureaucrats and elected officials for private gain, is difficult to measure due to its illicit and often secretive nature. Directly observing corrupt activities like bribery by government officials, neglect of official duties, or tax avoidance can be challenging, as officials may change or conceal their behavior in response to being monitored. Further, traditional survey approaches are unlikely to elicit truthful responses, as officials may be unwilling to confess to corruption due to social desirability bias. Alternative methods of measuring corruption that rely on asking about citizens’ experience with corrupt officials (e.g., “have you paid a bribe for a service before?”) or their perceptions of corruption may be biased, outdated, or incomparable across contexts (Olken 2009). Perception-based indices and rankings may also provide limited insight into the type, causes, or consequences of corruption in a given context (Banerjee, Mullainathan, and Hanna 2012).

Given these challenges, open questions around measuring corruption include: How can researchers measure corruption without distorting public officials’ behavior or eliciting a biased response? What is the best way of measuring social norms around corruption? What types of corruption, if any, can citizen reports shed valuable insights on? Can e-governance reforms that improve the collection of administrative data also improve our ability to measure corruption? More reliable measures of corruption help us better answer policy-relevant questions like the effects of corruption on the efficiency of public service delivery, and the effectiveness of anticorruption policies and programs.

While measuring corruption is difficult, researchers have made remarkable progress in the past few years, using a variety of innovative approaches that directly measure corruption and begin to tackle some of these questions. Examples include:

  • Surveyors accompanying truck drivers on delivery routes, dressed as their assistants, to record bribes paid to police at checkpoints (Olken and Barron 2009)
  • Combining GPS-tracked company vehicle data with administrative data to measure corrupt behavior among bureaucrats of a large public service provider (Schonholzer et al., ongoing)
  • Comparing villagers’ perceptions of corruption to an objective measure (e.g. the difference between government-reported expenditure for a road building project and the estimated cost of actually building the road according to independent engineers; Olken 2009).

The papers that follow include many more examples of methods that can be used to measure corruption in governance and service provision, including through audits, public expenditure tracking surveys, market inference, and more. For a discussion on the different measurement approaches and their applicability, see the MITx Micromasters Course on Political Economy and Economic Development.

- Aimee Barnes, Policy Associate, and Eliza Keller, Senior Policy & Communications Manager, for the J-PAL Political Economy and Governance sector


General resources

  • Paper: Eight questions about corruption, by Jakob Svensson (2005) -- Provides a definition of corruption and then discusses the level of corruption in different countries, the different ways to reduce corruption, and the impact of corruption on growth.
  • Paper: Corruption in developing countries, by Benjamin Olken and Rohini Pande (2012) -- A review of the different measurement techniques and the existing evidence. 
  • Paper: Section 4 (measurement), of Corruption, by Abhijit Banerjee, Sendhil Mullainathan, and Rema Hanna (2012) -- A review of different measurement methods and their application in the literature.
  • Book: New advances in experimental research on corruption, edited by Danila Serra, Leonard Wantchekon, R. Mark Isaac, and Douglas A. Norton; Emerald Group Publishing Limited, Vol. 15 (2012) -- Reviews the research on corruption measurement and reduction generated from laboratory and field experiments. [Gated]
  • Paper: Survey techniques to measure and explain corruption, by Ritva Reinikka and Jakob Svensson (2003) -- Reviews the use of Public Expenditure Tracking Surveys (PETS), provider surveys, and enterprise surveys for measuring corruption in education, health, and private businesses.
  • Book: Are you being served? New tools for measuring services delivery, edited by Samia Amin, Jishnu Das, and Markus Goldstein; the World Bank, Vol. 1  (2008) -- Examples of using different methods and tools for measuring public service delivery.
  • Book: Advances in experimental political science, edited by James N. Druckman and Donald P. Green; Cambridge University Press, Vol. 1 (2021) -- A comprehensive guide to the latest experimental methods, data collection, analysis, and challenges. [Gated]
  • Book: Corruption: What everyone needs to know, by Ray Fisman and Miriam A. Golden; Oxford University Press (2017) -- An overview of corruption and its causes and consequences with examples from around the world. [Gated]

International indicators of corruption and governance

Specific approaches to measuring corruption

Through perception
  • Paper: Corruption perception vs corruption reality, by Benjamin Olken (2009) -- Examines the reliability of villagers’ perceptions of corruption in a project by comparing them with “missing expenditure,” the difference between officially reported project expenditure and an independent estimate of its actual cost. [Gated published version]
  • Paper: Parochial politics: Ethnic preferences and politician corruption, by Abhijit Banerjee and Rohini Pande (2009) -- Uses expert surveys to measure perceptions about how corrupt a candidate is. They report a high correlation between journalists’ perceptions of candidates and actual data.
Through survey estimates of bribes
Through direct observation
By comparing estimated and actual expenditure
From market inference
Using audits
Through other methods

Measuring discrimination

Improving diversity, equity, and inclusion in a particular context requires first understanding the nature and extent of bias and discrimination. For instance, there may be disproportionately fewer individuals of a certain marginalized group employed in a certain industry. This could be due to discrimination within the industry’s hiring practices, to other factors that reduce the ability of individuals from within the marginalized group to gain the skills necessary to enter the industry, or both. Disentangling these drivers is important for determining what types of interventions will effectively address this issue. 

This section provides resources and tools to help researchers and practitioners better measure bias and discrimination. This includes a handbook chapter and numerous papers providing an overview of the types of research methods often used to identify discrimination in different contexts, such as audit and correspondence studies, list randomization, and more. It also includes more practical considerations for implementing these methods, including insight into when each can be especially useful and their limitations. In some cases, this overview is provided in the context of specific types of discrimination, such as race or gender, as well as in certain thematic areas, such as housing. It also provides a number of resources for those seeking to measure discrimination within the labor market. These include a synthesis of RCT evidence on hiring discrimination, as well as a book chapter and several papers that provide practical suggestions and methodologies for using many of the tools mentioned above specifically in labor market settings.
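
One of these methods, list randomization (the item count technique), can be sketched in a few lines: a control group reports how many of several innocuous statements apply to them, a treatment group receives the same list plus the sensitive statement, and the difference in mean counts estimates the sensitive item's prevalence without any respondent revealing it individually. A toy simulation with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000  # respondents per arm

# Assumed true share holding the sensitive trait (illustrative only).
true_prevalence = 0.25

# Each respondent endorses some of 3 innocuous items; the treatment
# arm's count also includes the sensitive item when it applies.
innocuous = rng.binomial(3, 0.5, size=2 * n)
sensitive = rng.random(2 * n) < true_prevalence

control_counts = innocuous[:n]
treatment_counts = innocuous[n:] + sensitive[n:]

# The difference in mean counts estimates the sensitive item's prevalence.
estimate = treatment_counts.mean() - control_counts.mean()
print(round(estimate, 2))  # close to true_prevalence
```

The price of this anonymity is statistical: the innocuous items add noise, so list experiments need considerably larger samples than direct questions for the same precision.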

- Anupama Dathan, Policy Manager for the J-PAL Health sector

General overview

Labor market discrimination

Education

Please note that this section focuses on measurement challenges associated with K-12 education. While some of these may also be applicable to higher education, measurement in higher education presents unique challenges that are not included in this section. 

Beyond the intrinsic value-add of education, schooling's association with higher income, better health, more civic participation, and other benefits (Duflo 2001) has made increasing access to and quality of education a significant policy concern globally and a primary focus of RCTs in the social sciences. Education field experiments have predominantly centered on finding cost-effective ways to improve participation and learning outcomes, and a growing body of evidence has begun to reveal general themes about "what works." However, despite substantial advancements over the last decade, difficulties in measurement often limit our understanding of the impact of social programs. 

At the broadest level, much work has been done to identify measurement issues caused by the high potential for spillover effects in education studies and to identify both potential solutions in study design and when they are feasible (see Muralidharan 2017 for a discussion). However, studies of specific outcomes reveal other potential measurement errors. For example, attendance is a standard indicator for measuring student participation and teacher absenteeism. However, administrative data in general is subject to several biases (Feeney et al., 2015), and administrative data in education has been found in certain cases to have systematic errors (Singh, 2021). When feasible, unannounced visits can help combat this issue (Muralidharan 2017). Similarly, test scores are the most popular indicators of learning outcomes. However, test construction often lacks transparency and systematic design. This can impact the replicability, scalability, and generalizability of a study, as well as cause other measurement issues, such as underestimating impact because the test was too difficult for those in both treatment and control conditions (see Singh (2015) for a discussion, Muralidharan, Singh, and Ganimian (2016) for a demonstration, and Muralidharan (2017) for essential principles of test design). Kautz et al. (2014) provide a review of the literature on measuring non-cognitive skills: grit, agency, and other skills may be crucial to economic success but may not be adequately captured by traditional tests. They also provide an overview of some tools for measuring non-cognitive skills. The subsection on “Skills and Effort” in the Labor subsection also provides guidance on measuring skills as they relate to economic output. 
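
The too-difficult-test problem can be illustrated with a deliberately stylized simulation: suppose the test only registers achievement above some floor, with everyone below it scoring zero. The standardized effect recovered from scores then shrinks even though the true program effect is unchanged (a sketch with invented numbers, not a model of any particular assessment):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Latent achievement in standard-deviation units;
# the true program effect is 0.3 SD.
control = rng.normal(0.0, 1.0, n)
treated = rng.normal(0.3, 1.0, n)

def measured_effect(floor):
    """Standardized treatment effect recovered from a test that only
    registers achievement above `floor` (everyone below scores zero)."""
    c = np.maximum(control - floor, 0.0)
    t = np.maximum(treated - floor, 0.0)
    return (t.mean() - c.mean()) / c.std()

print(round(measured_effect(-3.0), 2))  # well-targeted test: ~0.30
print(round(measured_effect(2.0), 2))   # far too hard: most score zero, effect shrinks
```

A well-targeted test recovers close to the true 0.3 SD, while the floored test substantially attenuates it, which is one reason test difficulty should be piloted against the sample's actual achievement distribution.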

Besides student skills and performance, it can also be difficult to measure and compare the quality or effort of teachers. One commonly used method, comparing students' pre- and post-test scores, may suffer from the biases enumerated in Rothstein (2008). Guarino et al. (2012) provide a comparison of value-added methods to other measures of teacher quality. 

There are crucial policy implications to these measurement errors and the many others not mentioned. This section provides real-world datasets from the World Bank and national and regional assessments as well as resources that discuss different educational outcomes, options for measuring these outcomes, challenges that arise when measuring different effects, and strategies to overcome potential issues. Please see Muralidharan (2017) and the other resources below for a more in-depth discussion of the significant measurement issues discussed here.

- Demitria Wack, Policy Associate on J-PAL's Education sector, and Thanh Nguyen, Senior Education Manager


  • World Bank EdStats -- Data on over 4000 indicators from 214 economies using data collected from 1970 to the present.
Datasets of test scores:

General resources

Measuring cognitive skills: Developing valid and reliable tests

Measuring student participation and non-cognitive skills

Teacher effort and quality

Household spending on education

Energy and environment


Research in environment, energy, and climate change encompasses a range of topics, each with measurement challenges and opportunities unique to the sector. Topics in the sector include greenhouse gas emission reductions; measures to help people cope and live with the effects of climate change; pollution reduction and sustainable natural resource management; and access to affordable, reliable, and clean energy sources.

Generating evidence in environment, energy, and climate change is becoming increasingly urgent as emissions increase, global warming progresses, and communities start to feel the effects of a changing climate. Climate change is highly inequitable, with low-income communities being hardest hit by climate and weather shocks while having the fewest resources to adapt. To better understand the impacts of programs, technologies, and policies, researchers are exploring new ways of combining different sources of data. Combining remote sensing data, satellite data, or administrative data collected by governments and utility companies with ground-truth and survey data has the potential to unlock insights about the efficacy of climate solutions. Technological innovations in, for example, sensor technology can produce more granular data on air quality and pollution (Khanna, 2000; O’Neill et al., 2003), allowing for new and innovative combinations with data on welfare losses and health. Policies and research on environment and climate change often face challenges in using available data to predict and measure the impacts of environmental shocks, a challenge that is being met with developments in predictive modeling to inform policy and humanitarian interventions.

Lastly, understanding human behavior and household-level measures to face environmental and energy challenges as well as mitigate and adapt to climate change opens up more questions. Researchers tackle these questions by studying incentive structures (Jayachandran et al., 2017; Hanna et al., 2016), effective regulation enforcement and monitoring (Duflo et al., 2013; Ghanem and Zhang, 2014), and energy consumption and conservation behavior (Burgess et al., 2020; Lee et al., 2020).

- Maike Pfeiffer, Policy Associate, and Andrea Cristina Ruiz, Policy Manager, for the J-PAL Environment, Energy, and Climate Change sector

Environment: General resources

The reading lists for MIT’s Environmental policy and economics (Allcott, Spring 2011) and Environmental economics and government responses to market failure (Greenstone, Spring 2005) are good primers for measurement issues in environmental economics. Though its measurement discussion focuses more on cost-benefit analysis, the Energy economics (Joskow, Spring 2007) course also has a reading list that may be helpful.

  • Book section: Environment modules, by Dale Whittington, in Designing household survey questionnaires for developing countries; World Bank, Volume 2, 5-30 (2000)  -- An overview of measurement issues surrounding indicators relevant for environmental policy, including sections on contingent valuation, measuring resource use, and capturing environmental priorities. Has a particular focus on LSMS methods and includes example questionnaire modules.

Environment: Measuring benefits

General resources
  • Paper: Nonmarket valuation of environmental resources: An interpretive appraisal, by V. Kerry Smith (1993) -- A literature review of the pros and cons of various methods of nonmarket valuation of environmental resources; covers both indirect (e.g., revealed preference) and direct (e.g., WTP surveys) methods. [Gated]
  • Book: The measurement of environmental and resource values, by A. Myrick Freeman III, Joseph A. Herriges, and Catherine L. Kling; RFF Press, Vol. 3 (2014) [direct download link] -- A graduate-school level textbook that covers the valuation of environmental resources. Includes chapters on topics including, but not limited to, hedonic pricing, environmental quality as a factor input, and valuing longevity and health.
Hedonic method

Environment: Measuring bads and costs

Measuring pollution using air quality sensors
Measuring pollution using audits
Measuring deforestation using satellite imagery
Anthropometric measurement of costs

Energy: General resources

  • Paper: Does household electrification supercharge economic development?, by Kenneth Lee, Edward Miguel, and Catherine Wolfram (2020) -- Challenges to measuring electrification (and combining measures of electrification).

Energy: Access and use

Energy demand

Financial inclusion


Financial inclusion, or access by individuals and businesses to quality and affordable financial products and services that meet their needs, is an increasingly common goal among policymakers. Measuring financial inclusion can help policymakers assess the current state of financial inclusion, set goals, and monitor progress towards achieving them. Moreover, research on these topics helps build our understanding of how interventions can support financial inclusion efforts. Data on financial inclusion has grown and improved in quality over the past decade. 

To measure financial inclusion, researchers usually collect data from users (demand side) or providers (supply side). On the demand side, surveys can capture an individual’s, household’s, or business’s access to and use of financial services. One challenge with surveys is that they rely on self-reported data: respondents may not remember some financial decisions and may not want to provide accurate responses given the sensitivities of talking about one’s finances. On the supply side, researchers can use administrative data from financial service providers or regulators. However, one limitation is that data from a single financial institution does not provide a full picture of an individual’s financial behavior, particularly if they use multiple institutions or none at all, which is particularly common for low-income individuals. Finally, there are efforts to develop indices that capture the many dimensions of financial inclusion and aggregate measures of financial inclusion at a macro level. 

This section discusses various approaches to measuring financial inclusion: using survey data, using administrative or non-survey data, constructing indices of financial inclusion, and using aggregate/macroeconomic measures of financial inclusion. 

- Mikaela Rabb and Sam Carter, former Senior Policy Associates for the J-PAL Finance sector

General resources

Using survey/administrative data

Using survey data
Using administrative/non-survey data

Constructing indices or using aggregate measures

Constructing indices
Aggregate/macroeconomic measures

Gender

Outcomes that indicate progress in the sphere of gender equality are often difficult to define and measure. For example, abstract concepts such as women’s agency and empowerment are difficult to define, making it challenging to identify indicators that capture meaningful changes in them. Indicators may additionally need to be tailored to the local context to accurately measure women’s agency or empowerment in any given region (Donald et al., 2017), while self-reported indicators of gender equality or empowerment, such as individuals’ gender attitudes, may be subject to reporting bias. For instance, participants may report having more progressive attitudes than they actually do if they believe that is aligned with what the surveyor wants to hear or with generally accepted social norms (J-PAL Women’s empowerment measurement guide). Due to intra-household power dynamics, women may answer questions differently than men, or may answer differently depending on who is present for the interview (Goldstein, 2011). In addition, researchers and practitioners must always consider the potential unintended effects and ethical implications of collecting sensitive information related to gender, such as information on women’s experience of gender-based violence (IPA’s GBV survey guide).

Some important questions that research on measuring gender-related outcomes attempts to answer are: i) how are some abstract concepts related to gender equality, such as women’s agency and empowerment, defined and measured? ii) what are some standardizable indicators of gender-related outcomes that can also be tailored to local contexts? iii) how can researchers and practitioners design survey instruments to reduce reporting bias and gather accurate information related to gender? iv) what are some ethical issues that researchers and practitioners must consider while collecting sensitive information related to gender?

This section includes an array of resources that provide answers to the above questions. The section begins with a list of repositories of measures of women’s agency and empowerment (e.g., EMERGE website, UCSD) and examples of questionnaires and survey instruments that minimize bias and help gather accurate information (e.g., appendices of the J-PAL Women’s empowerment measurement guide; DHS women’s status module). It also lists several resources on defining and measuring important outcomes related to women’s agency (e.g., Kabeer, 1999; Donald et al., 2017), household decision-making power (e.g., Glennerster and Walsh, 2017; Doss and Quisumbing, 2018), women’s economic empowerment (e.g., Glennerster and Diaz-Martin, 2017; Anand et al., 2019), and women’s empowerment in agriculture (Koolwal, 2019; Lambrecht et al., 2017). This section also includes resources that delve into various sources of bias in surveys (e.g., Goldstein, 2013; Jayachandran, 2017) and others on ethical considerations when collecting sensitive information (e.g., IPA’s GBV Survey Guide; WHO and PATH GBV Guide).

Gender issues cut across many different sections. While we try to provide a comprehensive overview of measurement issues related to gender, it may be advisable to refer to particular sections to learn more about how to measure gendered experience in a particular field. Also note that guides on measurement of gender-based discrimination are included in the larger section on measuring discrimination.

- Yvette Ramirez, Policy Manager for the J-PAL Gender sector

Data sources

  • EMERGE website, UCSD -- EMERGE records and tests different measures of women’s empowerment, including measures of autonomy, outcomes, and agency.
  • Women’s Empowerment in Agriculture Index (WEAI), International Food Policy Research Institute -- Survey-based index designed to measure the empowerment, agency, and inclusion of women in the agricultural sector.


Empowerment, autonomy, agency, and household decision-making

Agency, autonomy, economic empowerment, and decision-making are interwoven with each other and form part of the much larger construct of female empowerment. As such, the allocation of papers into buckets is necessarily somewhat arbitrary.

  • Paper: Measuring women’s agency, by Aletheia Donald, Gayatri Koolwal, Jeannie Annan, Kathryn Falb, and Markus Goldstein (2017) -- Evaluates different methodologies for measuring agency against a multi-dimensional framework of agency as goal-setting, ability to achieve goals, and acting on goals. Also assesses how each method adapts to the sub-Saharan context.
  • Paper: Using machine learning and qualitative interviews to design a five-question women’s agency index, by Seema Jayachandran, Monica Biradavolu, and Jan Cooper (2021) -- Creates a five-question index of women’s agency. The questions are chosen based on their correlation with coded qualitative interviews of women in Haryana, India. The paper also provides a short literature review of the different methods used to measure agency.
  • Paper: The SWPER index for women's empowerment in Africa: development and validation of an index based on survey data, by Ewerling et al. (2017) -- Based on Demographic and Health Survey data on partnered women from 34 African countries, the SWPER index includes 15 questions on three dimensions of empowerment: attitudes to violence, social independence, and decision-making.
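
Indices like these are typically built by extracting a principal component from correlated survey items. A minimal sketch of that construction on simulated binary responses (all data here are synthetic; real applications use the actual survey items and validated loadings):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data: six binary survey items all driven by one latent
# "empowerment" trait plus item-specific noise.
latent = rng.normal(size=n)
items = np.column_stack(
    [(latent + rng.normal(scale=0.8, size=n)) > 0 for _ in range(6)]
).astype(float)

# Standardize items and take the first principal component as the index.
z = (items - items.mean(axis=0)) / items.std(axis=0)
_, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))
index = z @ eigvecs[:, -1]  # eigenvector of the largest eigenvalue

# The index should track the latent trait (the sign of a PC is arbitrary).
print(round(abs(np.corrcoef(index, latent)[0, 1]), 2))
```

The simulated index correlates strongly with the latent trait it was built to capture; in practice, validation also involves checking loadings, dimensionality, and comparability across contexts.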
Household decision-making
Economic empowerment and labor force participation
Measuring women’s employment and empowerment in agriculture

      Effect of intra-household differences on measurement

      Gender-based violence

      Gender preferences



      Field experiments in health economics help to answer a variety of questions related to the take-up and delivery of health products and services. From helping to better determine the factors that motivate individuals to adopt healthy behaviors to identifying innovations that improve the delivery of health services, this type of research is an important input to strengthening health systems and improving health outcomes around the world. Accurately measuring baseline, intermediate, and final health outcomes is a critical component of determining whether a given policy or program was effective. Some metrics, such as HIV prevalence, can be measured through relatively straightforward tests. But other outcomes are trickier to measure. For instance, child malnutrition is a key predictor of mortality. What is the best measure of malnutrition rates? Height-for-age, weight-for-height, mid-arm circumference, iron deficiency anemia, and more can all be appropriate in certain situations. Which one should a researcher choose given the context and their research questions? Use of modern contraceptives is an important measure of fertility, but respondents may be tempted to report regular use, even if this is untrue, if they feel they should be using them. How can researchers avoid this type of desirability bias?

      This section, categorized according to outcomes and health conditions, compiles resources to guide researchers through these and other health measurement challenges. Produced by experts including the World Health Organization, UNICEF, and pioneering researchers in the field, these resources range from survey design guides to best practices for measuring tricky outcomes. In instances where multiple metrics may be appropriate, they also provide suggestions on how to help determine the best indicator(s).

      - Anupama Dathan, Policy Manager for the J-PAL Health sector

      General resources

      • Questionnaire: Model surveys from the Demographic and Health Surveys program -- Provides a high-level overview of the DHS’s four main questionnaires (Man, Woman, Household, and Biomarker), with links to current and past modules.
      • Book section: Health modules by Paul Gertler, Elaina Rose, and Paul Glewwe, in Designing household survey questionnaires for developing countries; World Bank, Volume 1: 177-216 (2000) -- An overview of indicators relevant for health policy, a discussion of survey methods used to capture those indicators, and annotated example questionnaire modules. Has a particular focus on LSMS methods.
      • Online course: J-PAL’s Measuring health outcomes in field surveys course -- Contains lectures and interactive material on all aspects of measuring health outcomes in field surveys: measuring individual and population health, selecting health indicators and measurement tools, questionnaire development, and practical and ethical issues for data collection.
      • Paper: The impact of recall periods on reported morbidity and health seeking behavior, by Jishnu Das, Jeffrey Hammer, and Carolina Sánchez-Paramo (2012). -- An experimental comparison of different recall periods on different reported health outcomes, including morbidity, doctor visits, time spent sick, and use of self-medication. Includes an exploration of the effects among different subgroups of the sample. [Gated published version]
      • Blog post: Quantifying the Hawthorne effect, by Jed Friedman and Brinda Gokul (2014) -- A compilation and short literature review of papers attempting to quantify the Hawthorne effect in health studies.

      Health indicators

      Conventional indicators
      Composite measures
      Anthropometric data
      Early childhood development (general)
      Early childhood development (cognitive)
      • Journal issue: ScienceDirect’s collection of articles on the Bayley-III Scale -- A collection of journal articles on the Bayley-III Scale, an instrument designed to assess the developmental functioning of infants, toddlers, and young children aged between 1 and 42 months; contains articles on the individual scales that make up the Bayley-III, as well as an international review of research that employs it and reviews of similar developmental tools.
      • Questionnaire: Tools from the U.S. Bureau of Labor Statistics:
        • The Peabody Picture Vocabulary Test -- An overview of the Peabody Picture Vocabulary Test, which measures verbal ability and scholastic aptitude for individuals 2.5-40 years of age. Includes links to the instrument itself, its technical report, and similar cognitive development tools. 
        • The Home Observation for Measurement of the Environment (HOME) -- An overview of the Home Observation for Measurement of the Environment module, which measures the quality of a child’s home environment. Includes links to the instrument itself, its technical report, and similar tools from the BLS.
      Early childhood development (physical)
      Sexual and Reproductive Health

      Healthcare quality

      Healthcare quality/patient satisfaction
      Using audits and mystery shoppers

      Housing stability and homelessness


      Housing instability is both a function of and a catalyst for poverty. Maintaining stable housing is a necessary prerequisite in many cases for health, employment, education, and a host of other fundamental needs. The scope and complexity of housing instability and homelessness highlight the need for rigorous evidence on the effectiveness of strategies to prevent and reduce homelessness. A first step in generating this evidence is defining and measuring homelessness and housing instability adequately. 

      Unfortunately, the measurement of housing instability is complicated by the coexistence of multiple definitions and the absence of a widely established measurement system. For instance, in the United States, children who share housing with others (living “doubled up”) qualify for assistance under some programs, but not others. Moreover, estimates of the number of people experiencing homelessness can vary by orders of magnitude depending on which definition one uses; including children who are living doubled up increases estimates of the number of children experiencing homelessness by a factor of 10 relative to the standard “point in time” (PIT) count. An emerging literature looks at how to measure housing stability using techniques for reaching mobile populations and consumer reference data (e.g., Phillips (2020); Kalton (2001)).

      Further challenges arise in counting people who are unsheltered (those sleeping outside or in places not meant for human habitation), typically part of the PIT count in the United States; some studies have found that PIT counts can understate the rate of unsheltered homelessness by as much as 50 percent (e.g., Evans, Phillips, and Ruffini (2019)).
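The capture-recapture methods listed later in this section estimate the size of a hard-to-count population from the overlap between two counts. As a minimal sketch (all numbers hypothetical), the two-sample Chapman/Lincoln-Petersen estimator can be written as:

```python
def lincoln_petersen(n1, n2, m):
    """Chapman's bias-corrected two-sample estimate of total population size.

    n1: individuals "marked" in the first pass (e.g., planted decoys),
    n2: individuals observed in the second pass (the count itself),
    m:  marked individuals re-observed in the second pass.
    """
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

# Hypothetical plant-capture exercise: 50 decoys placed among the population,
# 400 people counted, 40 of the decoys recaptured
estimate = lincoln_petersen(50, 400, 40)  # ~498, so the count of 400 misses ~20%
```

In a plant-capture study, the “marked” sample consists of decoys planted among the population before the count, and the recapture rate indicates what share of the true population the count reached.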

      The resources included below cover survey and administrative data methods for counting people experiencing housing instability and homelessness, covering topics from oversampling to ensure adequate representation of minority groups to methods for including hard-to-reach subpopulations. A reflection of J-PAL’s internal expertise, the resources below center around housing instability and homelessness in the United States; we welcome suggestions for additional resources to include, particularly those based in or relevant to other countries.

      - Rohit Naimpally, Senior Research and Policy Manager, for J-PAL's Reducing and Preventing Homelessness Initiative

      General resources

      Using administrative data

      Survey methods

      Capture-recapture methods (plant capture, mark-recapture etc.)



      While governments invest substantial resources in active labor market policies, evidence on the effectiveness of many of these programs is inconclusive. Reliable micro- and macro-economic data on labor market outcomes is essential for policymakers to understand the needs in their labor markets and to assess the impact of their policies. Key areas for research include training people in skills demanded by the labor market, helping them search for work, and reducing discrimination. 

      An important challenge for research in the labor space is inconsistency in the ways researchers measure key labor market indicators such as work, employment, unemployment, inactivity, skills, ability, and productivity. Small differences in survey features--such as the design of a questionnaire, the length of the labor module, the way the survey is implemented or even the wording of a question--can have outsize effects on labor market statistics (Dillon et al. 2012). A second challenge is that reliable administrative data is limited in low-income countries given the prevalence of self-employment or seasonal work and the complexity of capturing migration. People living in low-income countries usually have a portfolio of formal and informal activities. Tracking their earnings can be a sensitive task. 

      Labor questionnaire modules typically rely on single keyword questions, activity lists, or time diaries. A common concern about the accuracy of labor data relates to statistics on women and youth: these two groups tend to engage more often in “atypical” types of work or in domestic work, which can lead to systematic under- or over-reporting depending on the wording of a question. Collecting accurate data on skills and job satisfaction is also challenging (Friedman, 2012). Cognitive skills measurements such as Raven and Stroop tests are usually reliable and consistent. Soft skills measures are usually more subjective because they are self-reported, though psychometric tests and exercises are increasingly being used to enhance the accuracy of those measures (Laajaj and Macours, 2017). 
      The International Labour Organization has published definitions of the key concepts in labor that serve as a reference for surveys. However, not all statistical offices across the world use those definitions (Desiere and Costa, 2019), and it is difficult to harmonize key labor outcome variables. 

      This section compiles resources that explore and discuss these issues, moving from general resources and measurement challenges to resources and tools for measuring skills and effort, productivity, and job satisfaction.

      - Victoire Fribourg, Policy Associate, and Lisa Corsetto, Policy Manager, for the J-PAL Jobs and Opportunity Initiative

      General resources

      • Book section: Employment modules by Julie Anderson Schaffner, in Designing household survey questionnaires for developing countries; World Bank, Volume 1: 177-216 (2000) -- Details the key policy concern and the required data, and provides a few prototypes of employment modules.
      • Paper: Employment data in household surveys, by Sam Desiere and Valentina Costa (2019) -- Discusses the methodological challenges related to measuring employment indicators and reviews the different kinds of surveys used in different studies.

      Bias and sensitivity in labor statistics

      Skills and effort


      Job satisfaction

      Microenterprises and firms


      Firms do not simply provide goods and services for the economy; they also generate jobs and secure income for workers. Firm-related policies can potentially have large impacts on poverty alleviation through quality employment. However, firms are an understudied area in the experimental economics literature and in policy analysis. Part of the challenge stems from the onerous cost of conducting rigorous firm-level research at scale, while additional constraints persist on the data, measurement, and methodological fronts. These challenges are more pronounced for low- and middle-income countries and the informal sector, given the scarcity of existing reliable data and more nuanced measurement challenges. It is therefore important for researchers to leverage and build on existing firm-level datasets before embarking on their own data collection efforts.

      There are three central questions on how to measure the performance of microenterprises and firms, and their contribution to economic development: 

      1. What are some of the most commonly used firm-level datasets currently available, and how have researchers been using them to study the role of firms in development?
      2. What are the most accurate and cost-effective techniques to obtain representative and comparable firm-level indicators across countries and time?
      3. How can researchers leverage innovative survey instruments and data sources to measure firm characteristics and outcomes that are not reflected in existing official statistics or directly observable (e.g., entrepreneurial activity, business and management practices, balance sheet data, productivity etc.)? 

      This chapter compiles resources that explore and discuss these issues, starting with an overview of existing datasets, followed by a general description of measurement resources, and concluding with more specific sections regarding the measurement of profits, inventories, business practices, and entrepreneurship.

      - Siena Harlin, Senior Policy & Communications Associate, and Daniela Muhaj, Senior Policy & Research Associate, for the J-PAL Firms sector


      As noted in the introduction to this section, collecting new, high-quality data on firms can be complex and expensive; further, the universe of existing data on firms across the world is rich and varied. Therefore, before moving on to sections more in line with the structure of the rest of the resource, we provide a non-comprehensive introduction to some of the already-available datasets and data sources that researchers have used to conduct research on firms. 

      Accounts data

      A few organizations, most prominently Dun & Bradstreet and Bureau van Dijk (Orbis), maintain proprietary datasets on public and private companies and entities across the globe. While the exact variables that the organizations track may differ, the datasets work by assigning each firm in their database a unique identifier and then tracking harmonized statistics.

      Data from statistical agencies

      Data on firms from governmental statistical agencies generally comes in two forms: survey data and administrative records. Examples of the former include the Small Business Pulse Survey and the Annual Survey of Entrepreneurs from the U.S. Census; these surveys aim for representativeness rather than completeness, and generally provide rich data on a specific aspect of firms in their country. Administrative records, like Brazil’s RAIS, generally have much larger and more comprehensive samples, but may contain fewer or less detailed variables. Specific examples include:

      VAT data

      Tax registry data, especially data on Value Added Taxes (VAT), have become prominent in the study of firm behavior, dynamics, and outcomes. VAT data are especially useful because the tax generally creates a “paper trail” from firms selling raw materials through to the final retailer of a good or service, which is particularly useful for studying firm linkages, buyer-supplier networks, and tax compliance. Specific examples of research using VAT data include:
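The paper-trail logic can be illustrated with a toy cross-check of hypothetical filings (firm names and amounts invented for illustration): the same transaction should appear in the seller’s reported sales and the buyer’s reported purchases, so the two sides can be matched and discrepancies flagged:

```python
# Hypothetical VAT filings, keyed by (seller, buyer): each transaction should
# appear both in the seller's reported sales and the buyer's reported purchases
seller_reports = {("FirmA", "FirmB"): 100.0, ("FirmA", "FirmC"): 50.0}
buyer_reports = {("FirmA", "FirmB"): 100.0, ("FirmA", "FirmC"): 30.0}

# The VAT paper trail: mismatches between the two sides of the same
# transaction flag potential misreporting or evasion
discrepancies = {
    pair: seller_reports[pair] - buyer_reports.get(pair, 0.0)
    for pair in seller_reports
    if abs(seller_reports[pair] - buyer_reports.get(pair, 0.0)) > 1e-9
}
# Here only the FirmA-FirmC transaction is flagged, with a 20.0 gap
```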

      Unstructured/alternate data

      Beyond the standard forms of data described above, recent innovations in data collection, processing, and analysis have allowed researchers interested in firm dynamics to broaden the sources and types of data they work with. In particular, recent research has used text, satellite, and network data to broaden the types of questions, outcomes, and models available to researchers studying firms.

      Non-data resources

      Business practices
      • Blog post: Measuring entrepreneurship, part (I) and part (II) by Markus Goldstein and Francisco Campos (2012) -- These two blog posts together form a short literature review of the measurement issues associated with measuring entrepreneurship in developing countries, as well as many of the standard methods and recent innovations in entrepreneurship measurement.
      • Book section: Who are the microenterprise owners? Evidence from Sri Lanka on Tokman versus De Soto, by Suresh de Mel, David McKenzie, and Christopher Woodruff (2010), in International Differences in Entrepreneurship -- Contains a description of a microenterprise survey conducted in Sri Lanka that sought to gather data on the characteristics of microenterprise owners.
      • For research using characteristics of entrepreneurs, see also de Mel, McKenzie, and Woodruff (2009), who measure innovation, and Cole, Sampson, and Zia (2010), who measure risk preferences.

      Poverty, consumption, and income


      Accurate and precise measures of income, consumption, assets, prices, and poverty are pivotal in ensuring that social programs are equitably and efficiently targeted and in the estimation of some of the most common primary outcomes of policy interventions. However, complicating this measurement are two larger facts. First, the measurement issues surrounding these five topics are highly, and to a certain extent mechanically, interlinked; measurement error in one can filter through to estimates of the others. Second, the indicators associated with each concept vary dramatically by context. Whereas indicators like the HIV-positivity rate or the concentration of CO2 in the atmosphere can generally be transferred across contexts, the goods and services that make up consumption bundles, income schemes, and asset classes vary over time and geography.

      This section compiles resources to guide researchers through the above issues and other cross-cutting topics such as respondent and recall effects, but also delves into more specific topics. It is split into five subsections, each focused on one of the five topics above, and each starting with resources that form a general introduction to the measurement of that topic and proceeding with papers and guides that address specific challenges within it. The first, consumption, covers both general and food consumption and provides resources on issues like the effect of the level of reporting and the length of the reference period (e.g., Beegle et al., 2012), differences in reporting from individuals vs. households (e.g., Sununtnasuk and Fielder, 2017), and difficulties and tools associated with measuring hunger (e.g., Friedman et al., 2014). The subsection on assets guides readers through issues in, and innovative tools for, valuation (e.g., Kochar, 2000 and Marx, Stoker, and Suri, 2016), using assets in the estimation of other indicators such as poverty and inequality (e.g., Filmer and Pritchett, 2001 and McKenzie, 2005), and measuring asset indicators when there are incentives to misreport (e.g., individuals’ rights to land as in FAO et al., 2019). 

      The income subsection provides resources on income measurement in general but focuses on aspects of income for which reliable and precise measures are more difficult to obtain, including non-labor and informal-wage income (McKay, 2000) and income expectations (Hensel, n.d.). The subsection on prices contains discussion of common issues facing researchers collecting and using price data, including its variation by season (e.g., Gilbert et al., 2016) and by unobserved quality differences (e.g., McKelvey, 2011). Finally, the poverty subsection focuses on three larger issues: how best to estimate, predict, and target poverty (e.g., Elbers, Lanjouw, and Lanjouw, 2003; Brown, Ravallion, and van de Walle, 2018; and Banerjee et al., 2016); the extent and prevalence of questionnaire (e.g., Kilic and Sohnesen, 2017) and respondent (e.g., Silverio-Murillo, 2018) effects on poverty measures; and considerations for using adult equivalence scales for measuring child poverty (e.g., Ravallion, 2015). 
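As one concrete illustration of the asset-based approach of Filmer and Pritchett (2001) cited above, households can be scored on the first principal component of their asset ownership indicators. A minimal sketch with made-up data (the asset list and ownership patterns are hypothetical):

```python
import numpy as np

# Hypothetical ownership indicators (1 = owns): radio, bicycle, fridge, phone
assets = np.array([
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 1],
])

# Standardize each indicator, then score households on the first
# principal component of the standardized indicators (the asset index)
z = (assets - assets.mean(axis=0)) / assets.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))
weights = eigvecs[:, -1]   # eigenvector with the largest eigenvalue
if weights.sum() < 0:      # fix the arbitrary sign so ownership scores high
    weights = -weights
scores = z @ weights       # higher score = greater asset wealth
```

Here the household owning all four assets receives the highest score and the household owning none the lowest, so the scores can proxy for relative wealth when consumption or income data are unavailable.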

      - Jack Cavanagh, Senior Research, Education, and Training Associate, and Ximena Mercado Garcia, MSRP Intern


      General consumption resources
      Food consumption and hunger


      General resources
      Physical assets
      Measuring inequality with assets




      General resources
      Poverty mapping/imputation
      Questionnaire and respondent effects
      Targeting and proxy indicators
      Scale sensitivity of poverty measures

      Recall periods and interview effects


      This section includes resources on measurement error from recall periods and interview effects. 

      As covered in our Introduction to measurement and indicators resource, recall bias can arise in questions posed to survey respondents about events, processes, or decisions that occurred in the past. Questions like “how much fertilizer did you use last year?” and “how many times were you sick in the last month?” can be very helpful in gathering survey data on infrequent or lumpy events, and extending the time horizon can help better capture variation in the underlying concept. But there are tradeoffs. The longer the time horizon, and the less noteworthy the event, the harder it may be for respondents to remember accurately, and the more likely they are to use potentially biasing mental heuristics like anchoring to guide their answers (Godlonton, Hernandez, and Murphy 2018; see also Table 3 in Introduction to measurement and indicators for a list of common biasing heuristics). Further, the extent of measurement error may differ between subgroups if there are interactions between the variable separating the subgroups and the concept being measured – for example, differing rates of visits to doctors and perceptions of the normality of illness between poorer and richer households can cause comparisons of health measures between the two groups to differ both quantitatively and qualitatively depending on whether a short or long recall period is used (Das, Hammer, and Sanchez-Paramo 2012). 

      The subsection on recall bias compiles resources that seek to measure the extent to which these biases operate in different research areas, including agriculture (Beegle et al. 2012), health (Das et al. 2012), and microenterprises (de Mel et al. 2014). It also includes resources that introduce novel ways to reduce the chance of recall period bias. These largely involve finding innovative ways to increase the frequency of data collection so that short-term memory dominates: among others, Wiseman et al. (2005) provide a guide to using diaries to collect data in resource-poor settings, and de Mel et al. (2014) use a novel technology (RFID chips) to collect high-frequency data in microenterprises.

      Resources in the second subsection, on interview and question effects, explore the extent to which the survey or interview questions themselves can influence respondent behavior. There are several channels through which this could bias measurement: question-behavior effects cover situations in which asking respondents about intentions or behavior impacts the behavior itself, e.g., through shifting perceptions of social desirability (Dholakia 2010; Fitzsimons and Moore 2008), a response “freezing” effect in panel surveys (Bridge et al. 1977), or a “self-prophesying” effect in situations where the respondent is asked about future behaviors or intentions (Smith et al. 2003). Interview effects, on the other hand, can occur even without any elicitation of intentions or other behavioral prompts (see, e.g., Zwane et al. (2011) for an example). The papers in this subsection provide theory and evidence for when these effects are most likely to influence behavior (Feldman and Lynch, Jr. 1988) and recommendations for question modifications to ameliorate some of the effects (Fitzsimons and Moore 2008).

      - Jack Cavanagh, Senior Research, Education, and Training Associate, and Sarah Kopper, Associate Director of Research

      Recall periods

      Interview/question effects

      Sensitive questions


      Sensitive information can be among the hardest data to gather accurately and ethically, but at the same time this data can often be the most informative for answering certain research questions, particularly in studies relating to health, gender, and crime, violence, and conflict. What makes a question sensitive depends on culture and context, but information relating to identity, illegal activities, and socially unacceptable behavior is almost always sensitive. Complicating accuracy, respondents may not answer truthfully due to social desirability bias or embarrassment, or because they feel that a different answer is strategic. Beyond this, they may choose not to answer at all due to privacy concerns or discomfort. On the ethical side, extra considerations (such as enumerator demographics and training, interview environment, and availability of referral resources) are necessary when asking sensitive questions; further, the questions themselves might actually have the ability to harm research participants through various channels, including the potential for retraumatization. For more on the ethics of sensitive questions or human subjects research in general, see our resources on Survey design and the Ethical conduct of randomized evaluations.

      Therefore, the main questions that research on measuring sensitive subjects attempts to answer are: i) What are the best techniques to get accurate information on sensitive topics, and which situations is each technique best suited for? and ii) What additional ethical considerations do research teams need to take into account when measuring sensitive questions, and how do those considerations differ across survey media? 

      This section compiles resources that attempt to answer those questions; it begins with resources that provide an overview of what sensitive questions are (e.g., Blair (2015)) and why they are difficult to measure (e.g., Özler (2013) and Fitzsimons and Moore (2008)), moves on to a set of papers that compare and validate current methods for measuring sensitive topics (e.g., Chuang et al. (2020)), and then finally contains sections on three of the most popular measurement techniques: list randomization, randomized response technique, and implicit association testing. In these sections you will find articles discussing each method’s use and studies of their validity (see Droitcour et al. (2004), Blair, Imai, and Zhou (2015), and Kondylis et al. (2019), respectively, for a higher-level introduction of each technique).
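Two of the estimators behind the techniques named above are simple enough to sketch directly (all response data hypothetical): list randomization recovers prevalence as the difference in mean item counts between treatment and control lists, and the forced-response variant of the randomized response technique recovers it by inverting the known randomization probability:

```python
from statistics import mean

# List experiment: the control group sees 4 innocuous items, the treatment
# group sees those plus the sensitive item; respondents report only how
# many items apply to them, never which ones.
control = [2, 1, 3, 2, 2, 1, 2, 3]      # hypothetical item counts
treatment = [3, 2, 3, 2, 3, 2, 2, 4]

# Estimated prevalence of the sensitive behavior = difference in means
list_prevalence = mean(treatment) - mean(control)

# Randomized response (forced-response design): with probability p_truth the
# respondent answers truthfully; otherwise they are instructed to say "yes".
# P(yes) = p_truth * pi + (1 - p_truth), so pi is recoverable in aggregate
# even though no individual answer is revealing.
def rrt_prevalence(share_yes, p_truth):
    return (share_yes - (1 - p_truth)) / p_truth

pi_hat = rrt_prevalence(0.4, 0.75)
```

Both estimators trade statistical efficiency for privacy: prevalence is identified only at the group level, which is exactly what makes them suitable for sensitive topics.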

      - Jack Cavanagh, Senior Research, Education, and Training Associate and Sarah Gault, Training Manager

      General resources

      Validating indirect response survey methods

      List randomization

      Randomized Response Technique (RRT)

      Implicit association testing (IAT)

      Subjective questions


      Subjective questions assess subjective psychological states and are not verifiable by external observation or records. They aim to measure a number of different things:

      • Beliefs about the object (“Is your child’s school providing them with an adequate education?”)
      • Expectations, i.e., plans for future actions (“Do you plan to enroll your child in the same school next year?” )
      • Attitudes about the object (attitudes are distinct from beliefs in that they measure judgements on normative instead of positive issues: “Your neighbor is trying to decide if they should send their child to secondary school or have them work instead. If the child attends secondary school they potentially could work in a higher paying job in the future, but they wouldn’t be earning money for the family in the present. Ultimately, your neighbor decides to send their child to school. Do you agree with their decision?”)

      While social scientists have made strides in the past decades in improving measurement of subjective questions, important challenges still persist on the definition and quality of indicators. These include the difficulty of creating and comparing measures of subjective welfare, like life satisfaction, happiness, and subjective poverty measures (Friedman 2011); the increased time it often takes to get point estimates for subjective questions (Delavande, Giné, and McKenzie 2011); and the reliability and precision of subjective expectations (McKenzie 2016).

      This section provides resources discussing advances and challenges in the measurement of two specific topics: 1) subjective wellbeing, and 2) subjective expectations. The resources on subjective wellbeing focus on its definition, guidelines on measurement, and best practices for constructing comparable indicators. They explore measurement issues related to multiple facets of wellbeing, including but not limited to meaning in life and autonomy (Samman 2007), hope and aspirations (Wydick 2013), and social connectedness (Zavaleta, Samuel, and Mills 2014). The subjective expectations subsection contains resources that give a general overview of the subject (Manski 2004 and Attanasio 2009), as well as papers discussing recent advances in methods (Delavande, Giné, and McKenzie 2011). Cutting across both subsections are discussions of the extent to which subjective measures may vary across time and space, potentially confounding attempts to create comparable indicators (Kahneman and Krueger 2006; Beegle et al. 2012). For a discussion of subjective questions in the context of survey design, see the J-PAL research resource on Survey design.

      - Daniela Muhaj, Senior Policy and Research Associate, and Sarah Gault, Training Manager

      Measuring wellbeing

      Measuring subjective expectations

      Using games to measure trust, preferences, and risk aversion


      There has been a growing interest in understanding how social norms and preferences affect behavior and decision making and thereby economic and political phenomena such as economic growth, poverty, or corruption. Understanding how they evolve and whether they are malleable is therefore of similar interest. 

      However, measuring norms and preferences such as trust, fairness, or risk aversion is not straightforward. Answers to survey questions asking directly about an individual’s preference may not be accurate due to either general biases such as interviewer effects (Binswanger 1980) and social desirability bias, or more specific biases that arise because of differences in conceptualizing abstract concepts like “trust” (Glaeser et al. 2000; Meki n.d.).

      To mitigate these challenges, researchers often use behavioral experiments or games to measure preferences. During these experiments, participants are asked to choose between different options - preferably in an incentive-compatible way - and preferences are derived from individual choices within the experiment instead of relying on survey questions, often yielding more accurate measures of the concept of interest. Experiments to elicit preferences or norms can either be carried out in university labs (so-called lab experiments) or in more naturalistic settings (often referred to as lab-in-the-field experiments).
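As an illustration of deriving preferences from incentivized choices, consider a multiple price list in the style of Holt and Laury (2002): the row at which a participant switches from the safe to the risky lottery brackets their risk-aversion coefficient. A sketch under CRRA utility (payoff values follow the classic design; the helper names are our own):

```python
import math

def crra_utility(x, r):
    """CRRA utility; r is the coefficient of relative risk aversion."""
    return math.log(x) if r == 1 else x ** (1 - r) / (1 - r)

def switch_row(r):
    """First row (1-10) where a CRRA agent prefers the risky lottery in a
    Holt-Laury-style price list: the safe option pays 2.00 or 1.60, the
    risky option pays 3.85 or 0.10, with P(high payoff) = row / 10."""
    for row in range(1, 11):
        p = row / 10
        eu_safe = p * crra_utility(2.00, r) + (1 - p) * crra_utility(1.60, r)
        eu_risky = p * crra_utility(3.85, r) + (1 - p) * crra_utility(0.10, r)
        if eu_risky > eu_safe:
            return row
    return None  # never switches: extremely risk averse

# A risk-neutral agent (r = 0) switches at row 5; a risk-averse agent
# with r = 0.5 holds out until row 7
```

In practice the logic runs in reverse: the researcher observes each participant’s switch row and infers the interval of r consistent with it, which is what makes the choices (rather than self-reports) the measure of risk preference.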

      While both types of experiments help solve many of the issues that arise in survey measurements of norms and preferences, there are tradeoffs between the two. Lab experiments ensure tight control over the setting and thereby allow the researcher to eliminate potential confounders. However, they might not be a good proxy of how individuals make decisions in real life and often involve non-representative samples (e.g., university students). On the other hand, while lab-in-the-field experiments may be better able to mimic real-life decisions using a more theoretically relevant population, they provide less control over the setting; results can therefore be noisier and/or more situation-specific, and thus less generalizable and replicable (Gneezy and Imas 2017).
      Finally, there are further considerations to keep in mind when using games to measure preferences and norms. For example, measures from experimental games have been shown to be sensitive to the timing of the experiment (Zelenski et al. 2003), its name (Libermann et al. 2004) and set of choice options (List 2007). Other papers address the difficulty of disentangling related concepts (e.g., Ashraf et al. 2006) or the validity of the proposed measures more generally (e.g., Dean and Sautmann 2019).

      This section compiles resources that provide an overview of, and introduce innovative solutions to, these and similar difficulties in using games to measure norms and preferences. It begins with resources that provide an introduction to using experiments in economics, listing common games and highlighting challenges. It then contains two sections that discuss the measurement of trust, cooperation, and fairness, and of risk and time preferences, respectively.

      Katharina Kaeppel, Senior Research and Training Associate and Michala Riis-Vestergaard, Postdoctoral Training Associate

      General resources

      Using distributional games to measure trust, cooperation, and fairness

      Risk and time preferences

      Last updated June 2022. These resources are a collaborative effort. If you notice a bug or have a suggestion for additional content, please fill out this form.


      We thank Aimee Barnes, Sarah Baum, Sam Carter, Anupama Dathan, Maya Duru, Sarah Gault, Nilmini Herath, Eliza Keller, Tithee Mukhopadhyay, Kyle Murphy, Rohit Naimpally, William Pariente, Maike Pfeiffer, Mikaela Rabb, Andrea Cristina Ruiz, Emily Sylvia, and Caroline Tangoren for helpful review and comments, and Manvi Govil and Ximena Mercado Garcia for their help copy-editing the resource. Any errors are our own.

