Implementation monitoring

Summary

Researchers should monitor the implementation of a program to preserve the integrity of the program and to collect additional information that can inform the generalizability of its results. A variety of methods are available to researchers, such as administrative data, site visits, and focus group discussions. This resource provides an overview of monitoring methods, how to select indicators to monitor, and how to choose monitors. For more information on how to set up and carry out a monitoring plan, see our companion resource, Real-time monitoring and response plans.

Introduction

In order to understand the effect of a program, researchers need to make sure it is actually implemented—and implemented as designed. For example, did the program reach the correct participants? At the correct time? In the correct quantity? At a high level of quality? If aspects of the program are not implemented correctly, researchers may falsely conclude that the program had no effect, when in reality it was never implemented as designed. Implementation requires careful monitoring to ensure that the program maintains fidelity to its design.

Monitoring the implementation of an intervention can help mitigate threats to internal validity. While randomization ensures internal validity at the beginning of the study, there are several threats (such as attrition, spillovers, non-compliance, and evaluation-driven effects) that can arise over the course of the experiment and bias estimates of the treatment effect. If these threats are anticipated, researchers can modify the study design at the outset to avoid them—see the “Threats and Analysis” lecture from our Evaluating Social Programs (ESP) course or the resource on randomization for examples. However, it is also important to monitor whether these threats are occurring during the implementation of the intervention in order to address them in real-time or to shape analysis ex-post.

The data collected as part of implementation monitoring can also inform our understanding of the external validity of our results. By providing information on a program’s implementation details, researchers can make it easier for decision-makers to assess whether the program might work in their context.

Indicators to monitor

The exact indicators to measure during monitoring will depend on both the broad sector of the intervention (e.g., health, agriculture, etc.) and the specific research questions that the project is trying to answer. Researchers and implementers should identify the points in a study when it is most important to follow protocols (in other words, where there are the most concerns about things going wrong). If some version of the intervention exists prior to the evaluation, assess the status quo before the study begins to establish implementation benchmarks for future comparison. It may be helpful to consider the CART principle developed by IPA: monitoring data should be credible, actionable, responsible, and transportable:

  • Data should be credible in the sense that it is high quality and believable. Suppose the intervention is a cash incentive for teachers if they attend school for a certain percentage of school days. A daily photo of the teacher in their classroom might be a more credible measure of their attendance than that teacher’s self-reported attendance.
  • The data should be actionable—the point of monitoring at the beginning of the program is to course-correct any issues so that the evaluation can take place as intended. If research teams collect too much data, they might not be able to analyze it quickly enough to report back to implementers.1 Research teams should choose a few key indicators that focus on things the implementing partner can control. Consider the research questions, the relative severity of possible concerns, and the available budget and staff capacity in order to choose which indicators to monitor actively and how frequently to check them. For example, if the intervention is a cash transfer delivered via mobile money, research teams may choose to focus on a count of the mobile money payments sent by the partner organization and the number of eligible respondents who received the money. If researchers find that too many recipients are not receiving the money, they may need to redesign the intervention or examine where the issues in money delivery are occurring. While other aspects of implementation may influence the effectiveness of the intervention (such as the ease of using the mobile money application, the number of vendors in a village who accept mobile money, etc.), the partner organization has the most control over the disbursement of mobile money.
  • To be responsible, the benefits of collecting monitoring data should outweigh the costs. Establishing a system for monitoring can be costly—it often requires designing survey forms, training enumerators, surveying respondents, and more. Research teams should choose indicators that provide the most information while minimizing the burden on respondents. Administrative data can often be utilized in monitoring at low cost and burden, but delays in getting access to it may mean it is not available at the right moment. For example, the data a mobile money provider collects as part of its routine operations (e.g., the amount of money sent, the names of senders and recipients) can be used to monitor an intervention without further burdening the respondents.
  • While the main focus should be monitoring the intervention, the data can be transportable if it generates knowledge for other programs. Research teams and organizations can help other implementers by sharing the information they collect. For example, suppose researchers tested two models of delivering in-school tutoring and found that before-school tutoring was more effective than after-school tutoring because after-school tutoring sessions had lower attendance from both teachers and students. This implementation detail can be shared with other organizations and researchers who are designing in-school tutoring programs.

The indicators collected serve two purposes: to inform researchers about program fidelity and internal validity during the intervention and to inform analysis and reporting after the intervention. To illustrate how these indicators differ and how they are used, first consider the case study below:

Suppose the program involves certain branches of a large, nationwide bank offering special savings accounts to households earning below a certain income cutoff. The research team distributes flyers advertising the accounts to eligible households in regions served by treatment branches, informing them of the following features of these accounts: 

  1. The accounts’ interest rate is higher than that for standard savings accounts offered by the same banks.
  2. Households saving above a certain amount are awarded a cash bonus that is randomly determined at the bank level. 

Randomization is done at the bank level, and the sample of bank branches is chosen such that there are buffer zones between treatment and comparison banks.

Selecting indicators

During the implementation of an intervention, multiple indicators will need to be collected and reviewed. The process is iterative; research teams might need to revise the indicators they collect to match local conditions, or to respond to some external factor. Guidance on how to monitor your indicators can be found below.

  • To improve implementation quality and/or ensure program fidelity in real-time, look at indicators that can illustrate whether and how each step of the implementation is carried out. 
    • In the case study example, researchers would need to check if: 
      • The correct banks were giving out the accounts 
      • The correct households were receiving the flyers at the predetermined time, as specified in the research design
      • Households who saved the correct amount received the correct randomly determined bonus 
    • Researchers should also examine both extensive margins of implementation quality (e.g., did a treatment group bank branch ever offer the savings account) and intensive margins (e.g., how many times did a treatment group bank branch offer savings accounts).  
  • To avoid threats to internal validity, researchers need to consider how each threat applies to their project.2 Note that while addressing and resolving threats to internal validity can add to the cost of the project (e.g., by delaying the project to allow for survey modifications, increased monitoring, etc.), doing so means researchers can be confident in the results of the study.
    • Attrition can be measured by the number of respondents who cannot be found in each survey wave. Researchers should consider collecting contact information to make it easier to find participants in the next round.3 This can be done cost-effectively through phone surveys, before the start of a future round of surveys.
    • Compliance can be monitored by comparing a list of respondents who were actually treated to a list of respondents who were randomly assigned to the treatment group. The intervention design affects how researchers can determine who was actually treated. For example, if the intervention is a mobile money transfer, researchers can access administrative data to see which accounts received the money. If the intervention is a training on agricultural practices, the implementing staff can collect an attendance list. Researchers should monitor the treatment status data to find implementers who either allow too many comparison respondents to take up the treatment or fail to deliver it to assigned treatment respondents, and retrain them to prevent further issues with compliance. Keep in mind that in some research designs, it is important that enumerators do not know which respondents belong to which group; see enumerator-driven effects below for more information.
    • Spillovers can be tricky to measure. Researchers will need to think through the ways that the comparison group can be impacted by the treatment group taking up the intervention and monitor the comparison group accordingly. In the running example, one possible spillover is that the bank allocates funds away from comparison bank branches in order to afford the higher interest rates and bonuses in treatment banks. To monitor this, researchers would need the number of accounts comparison banks opened and the amount of money in the accounts, both of which could be obtained through the bank's administrative data. 
    • Evaluation-driven effects occur when respondents change their behavior in response to the evaluation instead of the intervention. These can be difficult to monitor, but in general, research teams should minimize the interactions between enumerators and respondents in order to reduce the risk of these effects. These effects include:
      • Observer-driven effects occur when respondents—in treatment, comparison, or both groups—change their behavior in response to being observed.4 Respondents may be able to tell the goal of the project based on the interview questions and moderate their behavior accordingly. One strategy to measure observer-driven effects is to include a “pure comparison” group, which is randomly selected to not receive the treatment and to be interviewed fewer times than either the comparison or treatment group. For example, in one study on the impact of teacher bonus pay on student achievement, researchers created two comparison groups: neither group received the intervention, but while the main comparison group received periodic monitoring (at the same level as the treatment group), the pure comparison group was not monitored (Muralidharan & Sundararaman, 2011). The pure comparison group allowed the researchers to determine whether treatment effects were driven by the intervention or by the monitoring (i.e., the observer effect).
      • Enumerator effects occur when respondents change their behavior in response to an enumerator rather than the intervention. Respondents may react to both the behavior and characteristics of enumerators when reporting answers. For example, female respondents may answer sensitive questions less accurately when being surveyed by a male enumerator (Di Maio & Fiala, 2018). The behavior of the enumerator (such as their reactions to respondents’ answers) may also influence the responses they receive. For example, if enumerators have a sense that the research project seeks to understand nutrition, they may subtly respond positively when respondents report eating foods with greater nutritional value. This could influence respondents to misreport their true food consumption. To avoid enumerator effects stemming from enumerator characteristics, researchers can consider piloting the survey to determine if enumerator characteristics systematically influence responses. To avoid enumerator effects stemming from enumerator behavior, researchers should conduct thorough surveyor training and run data quality checks to detect any enumerator effects.
    • Contamination occurs when some external factor other than the treatment influences outcomes. For example, if an NGO moves into study areas and starts offering a program similar to the one being studied, respondents’ outcomes could change due to the NGO’s program and not the one being researched. Monitoring contamination often requires a deep understanding of local conditions. Research teams should conduct focus groups with implementation staff or local government officials to get a sense of whether other interventions are occurring in the evaluation location.
  • To inform ex-post analysis and reporting, thoroughly document any threats to internal validity that were found during the course of the intervention. This information will shape how the data analysis is conducted (e.g., estimation of the LATE and the ITT) and can also help the implementing partner fine-tune their program. Often, this does not involve collecting new data, but rather consists of documenting the issues encountered while monitoring threats to internal validity.

  • To inform ex-post generalizability and scaling-up, collect data on the costs of the intervention and contextual details. Costing indicators might include the number of staff needed to implement an intervention, their salaries, the cost of inputs, and measures of time and effort spent on the intervention, while contextual details might include local conditions and implementation capacity (e.g., number of schools, population density, views on vaccinations).

    • J-PAL has resources on collecting cost-effectiveness data, including guidelines and templates, and a framework for how to determine generalizability. As pointed out in Holla (2019), collecting this data may be challenging: partner organizations may not be willing to share it, or may fear how it will be used. It can be beneficial to discuss the plan to collect costing data with the partner organization before the intervention begins.
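To make the compliance check described above concrete, here is a minimal sketch in Python. It compares a random-assignment list against an administrative record of who actually received the intervention (e.g., a mobile money transfer log). All household IDs, and the split between the two lists, are invented purely for illustration.

```python
# Hypothetical compliance check: compare random assignment records to
# administrative data on who actually received the intervention.
# All household IDs below are invented for illustration.

assigned_treatment = {"HH001", "HH002", "HH003", "HH004"}
assigned_comparison = {"HH005", "HH006", "HH007", "HH008"}

# IDs appearing in the (hypothetical) mobile money provider's transfer log
actually_treated = {"HH001", "HH002", "HH004", "HH006"}

# Assigned to treatment but never treated (non-compliance in the treatment group)
untreated = assigned_treatment - actually_treated

# Comparison households that took up the treatment (crossovers)
crossovers = assigned_comparison & actually_treated

# Share of the treatment group that was actually treated
compliance_rate = len(assigned_treatment & actually_treated) / len(assigned_treatment)

print(sorted(untreated))   # households to flag for follow-up with implementers
print(sorted(crossovers))  # comparison households to investigate
```

In practice, both lists would come from the study's assignment file and the partner's administrative data rather than hard-coded sets, and the flagged IDs would feed the retraining conversations with implementers described above.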

For general guidance on which indicators to measure, see J-PAL’s Introduction to measurement and indicators resource.

Methods for monitoring

Researchers can use a variety of subjective and objective methods to monitor implementation. Evaluations often employ many or all methods of monitoring. A description of each method can be found below with an application to the case study, followed by a table that provides resources on how to conduct that method, its associated pros and cons, and links to papers that employ that method.

Monitoring methods

  • Adding survey questions to planned survey rounds, such as the midline or endline surveys, can provide information on compliance, spillovers, and take up for ex-post analysis. While the midline and endline surveys may occur too late to fix implementation issues while the program is being implemented, the monetary cost of adding questions to existing surveys is low and the information generated is important to document for later analysis. Note, however, that adding questions to a survey by definition lengthens the survey, which can increase the costs of surveyor training or reduce the productivity of surveyors.

  • Administrative data can be used by research staff to monitor compliance, take-up, and spillovers. Administrative data can require some upfront work to be utilized—for example, any survey data collected must contain an indicator that links to the administrative data. The research team should verify that important terms (e.g., household, family, district, etc.) are defined similarly by the administrative data source and the research team. While some administrative data can provide real-time feedback, data that is generated infrequently can instead delay analysis. Administrative data can be difficult to acquire, so research teams should work to procure access well ahead of the intervention.

  • Focus groups or interviews conducted with key informants (such as implementers, participants, etc.) can help research staff determine critical aspects that influence the quality of the intervention. These include cultural and local context factors that may not be evident from the program alone, as well as the perception of the program by those impacted by, or involved in, the intervention. Speaking with implementers and key informants can reveal information that isn’t reported by respondents or enumerators and may be difficult to observe, such as logistical challenges in delivering the intervention, pushback from local leaders on who receives the program, confusion about the rules of the program, or cultural issues that may affect how the program and/or research is being delivered and received. However, interviews and focus groups can be more costly (in terms of both money and time) than other forms of monitoring. When designing the focus group questions, consider the hierarchy of the implementing organization and determine which questions are appropriate for each level of staff. Focus group discussions should function in tandem with the other methods of monitoring. In the case study, researchers might find low levels of take-up in several branches and conduct focus groups with bank staff or participants to understand what is causing the issue.

  • High-frequency monitoring (such as through mobile phone surveys) allows the research team to monitor compliance, take-up, and attrition through frequent, short engagements with respondents. Sometimes these surveys are part of a larger back-check survey—see the data quality checks resource for more information. Because of their delivery method, mobile phone surveys should consist of a relatively small number of questions on important concepts. The delivery method of high-frequency monitoring can also limit the type of questions asked; if sent via text message, respondents may be constrained to entering only numbers or short text responses (in contrast to in-person surveys or surveys conducted via phone call, where longer responses and additional types of data, such as photos or audio recordings, can be collected). However, mobile phone surveys can typically be deployed at relatively low cost. To limit attrition, research teams can use mobile phone surveys to keep respondents’ contact information up to date, especially if there are long gaps between survey rounds.

  • Other objective, observable data might be available to incorporate into the project. A common example is the use of photographs, which have been used to monitor teacher attendance and community meetings with police. Other projects have used remote sensors to monitor fuel emissions from cookstoves. Research teams should think carefully about how they will use this data, particularly in the early stages of the experiment: with a large sample size, it may prove difficult to analyze the teachers’ attendance photos, for example.5 The cost of such data depends largely on whether researchers can access existing infrastructure to obtain the data.

  • Site visits allow research staff to directly observe the quality of a program's implementation. They can also be useful for documenting issues that will inform ex-post analysis. Keep in mind that site visits require careful planning—for example, they are most useful when research staff show up unannounced—and can induce Hawthorne effects (Evans, 2014). An alternative approach is the “mystery shopper” model, where trained surveyors who are unknown to the implementer pose as potential clients or participants.6
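As a small illustration of how monitoring data like these can be summarized, the sketch below computes the attrition indicator discussed above: the share of baseline respondents who could not be re-interviewed in each later survey wave. All respondent IDs and wave rosters are hypothetical.

```python
# Hypothetical attrition tracker across survey waves.
# All respondent IDs below are invented for illustration.

baseline = {"R01", "R02", "R03", "R04", "R05"}
waves = {
    "midline": {"R01", "R02", "R04", "R05"},
    "endline": {"R01", "R02", "R05"},
}

def attrition_rate(baseline_ids, wave_ids):
    """Share of baseline respondents missing from a later wave."""
    lost = baseline_ids - wave_ids
    return len(lost) / len(baseline_ids)

rates = {wave: attrition_rate(baseline, found) for wave, found in waves.items()}
print(rates)  # rising attrition between waves may call for extra tracking effort
```

A real pipeline would read the wave rosters from survey data exports; the point is simply that a per-wave attrition rate is a cheap, frequently computable indicator that can trigger extra respondent-tracking effort before the next major survey round.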

Table 1: Monitoring methods

Adding survey questions to a planned survey
Pros:
  • Low cost if added to existing endline survey or back-checks
  • Quantifiable (can use in analysis)
  • Can provide real-time feedback on uptake, implementation (depending on study)
Cons:
  • May occur too late to make changes to implementation (e.g., if the questions are added to the endline survey)
  • May not provide information on why a program is or isn’t working
  • Some questions may be manipulatable (is there an incentive to misreport?)

Administrative data
Pros:
  • Can be generated at low cost, especially if the implementing partner is already generating it
  • Can provide real-time feedback on uptake, implementation (depending on study, data, and partners)
Cons:
  • Definitions may differ by partner organization (e.g., what constitutes a household?)
  • Depending on the indicators captured by the partner, may not provide information on why a program is or isn’t working
  • May not provide real-time feedback and instead cause delays in analysis
  • Documentation can be poor
  • Potentially limited set of indicators

Focus groups
Pros:
  • Can provide valuable insight to inform research design
  • Can provide insight into why an intervention is or isn’t working
  • Can check for respondents’ perceptions of treatment quality
  • Can provide qualitative, anecdotal evidence
Cons:
  • May be more costly than using admin data or adding survey questions
  • Can be difficult to administer and moderate

High-frequency monitoring (e.g., through mobile phones, short surveys)
Pros:
  • Can be deployed rapidly and implemented at a low cost
  • Can be used at a high frequency over a period of time (especially between major survey waves) to constantly monitor dynamic programs
  • Can be adapted swiftly to changing circumstances
Cons:
  • (If using mobile phones) May be less effective due to under-coverage of groups with poor network connections or limited access to phones
  • Known to be affected by high levels of non-response and attrition
  • Can be limited by the practical length of the interview

Objective observable data
Pros:
  • Can be low-cost if built on an already existing system
Cons:
  • Researchers may have limited control over variables collected
  • Some data (sensor data, photographs) may be difficult to work with in analysis

Site visits
Pros:
  • Research staff observe firsthand how the intervention is being implemented
  • Can check for problems with treatment quality, implementer effects
Cons:
  • Can be more costly (financially and in terms of time) than using admin data or adding questions to an endline survey or back-checks
  • Of little value if not done properly (e.g., implementers notified of visit in advance)
  • Can induce Hawthorne effects

Selecting monitors

Researchers face a choice when selecting who should conduct monitoring. When creating a monitoring plan, researchers should seek to pair each monitor with the monitoring tool that minimizes the burden of monitoring on respondents and reduces the risk of biasing the evaluation.

With study participants, researchers should be careful to avoid experimenter effects caused by frequent interaction with study staff. Consider utilizing additional survey questions built onto a planned survey (e.g., adding back-check questions to the endline survey) or mobile phone-based surveys to collect information on a small number of key indicators. Focus groups with study participants may reveal valuable insights but should be planned carefully—research staff may clue participants in to the study design and cause social desirability bias to influence participants' responses.7

Because they directly implement the program, implementation staff can provide valuable feedback on why it is or is not working. Consider using implementation staff to generate administrative data, such as attendance records from a training, or to collect objective data, such as photos of participants during the training.

In addition to conducting surveys, enumerators and research staff can also conduct site visits and facilitate focus groups. Care should be taken to ensure that research staff do not induce Hawthorne effects or social desirability bias. Researchers should also periodically conduct open-ended conversations with field staff to get a general sense of how the evaluation is going.

Administrative data can often be collected via existing systems (e.g., mobile banking applications, student attendance records, etc.). Researchers should create processes for accessing these systems and test them before the intervention goes to the field.

Once the research team has chosen their indicators, monitoring tools, and monitors, the next step is to create a monitoring plan. For information on how to create one, see the Real-time monitoring and response plans resource.

Last updated February 2021.

1.
Collecting too much data also poses ethical concerns. Burdensome surveys (those that ask too many questions, or take too much of the respondent’s time) can violate the beneficence principle. See the Ethics resource for more information. 
2.
Note that many threats to internal validity can (and should) be minimized or avoided through program design––see the threats and analysis lecture from our ESP course, or the resource on randomization for further information.
3.
See the respondent tracking section of the survey logistics resource for more information.
4.
If the observer-driven effects occur in the treatment group, they are referred to as Hawthorne effects, named for a study at the Hawthorne Works electrical plant in which researchers determined that increases in productivity came from the workers being observed rather than from the changes to working conditions being studied. If observer effects occur in the comparison group, they are known as John Henry effects, in reference to an American folklore tale (Duflo et al., 2007).
5.
For examples of research using satellite and other image data, see Donaldson & Storeygard, 2016: https://www.aeaweb.org/articles?id=10.1257/jep.30.4.171
6.
 For an example of how to utilize the mystery shopper model, see Steinman et al., 2012: https://link.springer.com/article/10.1007/s11414-012-9275-1
7.
Social desirability bias occurs when survey respondents give an answer they think is common in their context, or socially acceptable, rather than their true answer.
Additional Resources

Di Maio, Michele and Nathan Fiala. “Be Wary of Those Who Ask: A Randomized Experiment on the Size and Determinants of the Enumerator Effect.” World Bank Policy Research Working Paper No. 8671, (2018). https://ssrn.com/abstract=3299769

Donaldson, Dave, and Adam Storeygard. 2016. "The View from Above: Applications of Satellite Data in Economics." Journal of Economic Perspectives, 30 (4): 171-98. DOI: 10.1257/jep.30.4.171

Evans, David. “The Hawthorne Effect: What Do We Really Learn from Watching Teachers (and Others)?”. World Bank Development Impact (blog). February 2014. Last accessed August 08, 2020. 

Holla, Alaka. “Capturing cost data: a first-mile problem” World Bank Development Impact (blog). April, 2019. Last Accessed August 08, 2020.

Muralidharan, Karthik, and Venkatesh Sundararaman. "Teacher Performance Pay: Experimental Evidence from India." Journal of Political Economy 119, no. 1 (2011): 39-77. Accessed February 24, 2021. doi:10.1086/659655
