Evaluating technology-based interventions
This resource provides guidance for evaluations that use technology as a key part of the intervention being tested. Examples of such interventions might include automated alerts embedded into an Electronic Medical Record, or a text messaging platform facilitating communication between teachers and parents. This resource discusses challenges that J-PAL North America has seen across several studies and describes steps that may help mitigate them. This resource is not targeted to studies in which the only technological component is enrollment or survey administration, though some of the guidance may be applicable to those as well.
Technology-based interventions can offer many benefits, in particular the ability to standardize treatment across participants and sites. This can make it easier to replicate the interventions with fidelity across contexts, though the context itself can still change the effectiveness of the intervention (Pane et al. 2014; Roschelle et al. 2016). If interventions are shown to be effective, being technology-based also offers unique opportunities for rapid scaling and replication.
Evaluating technology-based interventions, however, poses some unique challenges. Generally, these challenges center around:
- Research and technology design: designing both the intervention and the study well, and integrating them with one another;
- Study implementation: executing the study, in particular being vigilant about randomization fidelity;
- Interaction between the technology, study, and individuals: accounting for participant adoption of the technology, and ensuring that all implementers are aware of the study.
This is by no means a comprehensive list, and even in taking all these measures, challenges may still arise. If any readers have additional studies or suggestions they would like to contribute to this resource, we encourage them to reach out to the J-PAL North America team at [email protected].
Research and technology design
- Try to study pre-existing interventions instead of designing new ones. Researchers have repeatedly reported that developing new interventions is much slower and costlier than anticipated. One-time development of new technologies is usually insufficient, and these technologies may also have maintenance costs as underlying platforms or systems change.
- Work with a partner who has experience designing, building, and maintaining tech interventions. There is a world of technology-building expertise to draw on, and it is difficult for researchers to be experts in both creating and evaluating a technology-based intervention. Working with a partner, we recommend that researchers:
- Carefully assess the software platform early in the process. Software companies, especially those who are new to research, may promise more flexibility in their programming than is actually available. The more researchers probe earlier on, the faster they may be able to discover gaps between what is required for the study, and what is possible for the company to build.
- Ask early on about the timeline and cost of development—and budget extra. Even for updating old interventions, both timing and cost often end up being greater than anticipated. It helps to have buffer time and funding in case unexpected developer time is needed.
- Consider testing parts of the technology in different treatment arms. Because technological interventions can be standardized across participants in a way that interventions requiring human service delivery cannot, they allow for more precise study of mechanisms. With a large enough sample size, multiple treatment arms with different versions of the intervention may isolate these mechanisms (Oreopoulos and Petronijevic 2018).
- Use process and outcomes data collected from the system. For example, researchers may collect information on: whether participants open screens, how frequently they log in, how many times they click on the screen, whether they use specific software, and other process measures that give insight into how the intervention is used.
- Interim measures can also be actively obtained on the technology platform (e.g. through assessments or surveys). If researchers use these data as outcomes, however, they must consider whether and how to obtain similar measures for the control group.
Case study: One research team concluded that a tech firm was a good fit based on several conversations and their initial use of the platform. Only several months into their collaboration did the researchers discover that the platform was not capable of incorporating conditional (if-then) logic, a basic feature of most programming languages. This substantially complicated design and implementation of the planned intervention, as well as randomization.
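Process measures like those described above can often be computed from raw event logs exported by the platform. The sketch below is purely illustrative: the log format, event names, and participant IDs are hypothetical, not a real platform's schema.

```python
from collections import Counter

# Hypothetical event log rows: (participant_id, event_type, date)
events = [
    ("p1", "login", "2024-01-03"),
    ("p1", "screen_open", "2024-01-03"),
    ("p1", "login", "2024-01-10"),
    ("p2", "login", "2024-01-05"),
]

def engagement_summary(events):
    """Summarize two simple process measures per participant:
    total logins and number of distinct active days."""
    logins = Counter(pid for pid, etype, _ in events if etype == "login")
    active_days = {}
    for pid, _, day in events:
        active_days.setdefault(pid, set()).add(day)
    return {pid: {"logins": logins.get(pid, 0),
                  "active_days": len(days)}
            for pid, days in active_days.items()}

summary = engagement_summary(events)
```

Summaries like this can feed directly into the monitoring tables discussed below, and the same log can later support take-up analyses (e.g., signing up versus logging in repeatedly).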
Triple-check the randomization procedure and sample
Before the study launches, the technology related to both intervention delivery as well as randomization should be heavily scrutinized. Implementing partners likely have expertise in running programs, but may have less experience with testing how random assignment fits within these programs. Researchers play a critical role in flagging the importance of piloting randomization and building robust and comprehensive test cases to ensure the fidelity of assignment.
- Pilot both the intervention itself and the randomization procedure. Researchers can work with implementing partners to run and explicitly document checks on treatment dosages for all individuals tracked in the study (i.e., treatment, control, refused/other). These tests should be done before the study begins.
- Check the exact logic statement used to implement random assignment. For a given random assignment strategy, there are a number of ways it can be translated into a logic statement or code. It is essential that the researchers are provided the exact logic statement that is used to determine whether or not the intervention is active for an individual or group. All variables and technology employed in the logic statement should be examined and tested to ensure definitions are correct. See the Appendix for an example of such a statement.
Case Study: In one study, the intervention was embedded into a broader technological platform and was meant to be available only to the list of participants in the treatment group. Six months into the study, it was discovered that the intervention was actually delivered to all individuals enrolled in the study, based on a flawed definition of a variable that was meant to define the study population.
Case Study: In another study, the random assignment was done incorrectly twice, in different ways. Researchers were alerted to this based on data from regular checks, and eventually had the developer create a video to explain the exact random assignment code so it could be verified.
- Check test cases in a “live” environment, or as close to one as possible. It is important to test on the newest version of the software and on the live platform, whether this means in an online forum, in an Electronic Medical Record (EMR), or on an updated device. For example, some hospitals have a “training” environment for the EMR where doctors can practice putting in test orders and getting a sense of the workflow. However, this environment may differ slightly from the “live” EMR that has actual patient information and where physicians can put in real orders and notes.
Verify random assignment regularly
Work with the partner to develop (preferably automated) monitoring tables, as well as a regular system for reviewing those tables. Monitoring tables display indicators of treatment intensity and/or key steps of implementation—such as proportion assigned to treatment, balance across treatment groups for different characteristics, or the duration of a certain step in an intervention—for a set of cases. An unexpected change in the results displayed in these tables could indicate a problem with delivery of the intervention.
Case study: In one study, researchers asked for regular checks on the fidelity of implementation, including results of t-tests to check 50/50 random assignment. These checks alerted them twice to mistakes in random assignment early in the study, including once where treatment and control assignments were alternated sequentially.
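Automated checks like those in the case study can be simple to implement. The sketch below is an illustrative Python example, not any study's actual monitoring code; it runs two checks on an ordered sequence of assignments. Notably, the sequential-alternation error from the case study would pass a 50/50 share check exactly, but a runs test on the ordering catches it.

```python
import math

def monitor_assignment(assignments, expected_share=0.5, z_crit=2.58):
    """Two automated checks on an ordered 0/1 assignment sequence:
    (1) share check: is the treatment share near the expected share?
    (2) Wald-Wolfowitz runs test: does the ordering look non-random?
    Assumes both treatment and control groups are non-empty."""
    n = len(assignments)
    n1 = sum(assignments)   # number assigned to treatment
    n0 = n - n1             # number assigned to control
    # (1) normal-approximation z-statistic for the treatment share
    share = n1 / n
    se = math.sqrt(expected_share * (1 - expected_share) / n)
    z_share = (share - expected_share) / se
    # (2) runs test: a "run" is a maximal streak of identical assignments
    runs = 1 + sum(a != b for a, b in zip(assignments, assignments[1:]))
    mu = 2 * n1 * n0 / n + 1
    var = 2 * n1 * n0 * (2 * n1 * n0 - n) / (n ** 2 * (n - 1))
    z_runs = (runs - mu) / math.sqrt(var)
    return {"share_flag": abs(z_share) > z_crit,
            "runs_flag": abs(z_runs) > z_crit}

# Strict alternation is exactly 50/50, so the share check passes,
# but the runs test flags the non-random ordering.
alternating = [i % 2 for i in range(1000)]
result = monitor_assignment(alternating)
```

Running checks like these on every monitoring extract, rather than once at launch, is what allowed the researchers in the case study to catch errors early.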
Be aware of software updates
Some questions to consider: When the operating system on a phone or computer changes, does the intervention's code need to change, or do software packages need to be re-loaded? If the intervention is embedded into something like an EMR, do version changes to the platform affect outcomes collection in any way? Operating system changes can affect data storage, the precision of digits stored, randomization algorithms, and pseudo-random seed generators.
- Understand your developer’s update schedule. Developers often have set release dates for updates; knowing their timing allows researchers to account for it in designing the study. Even for updates unrelated to the research, it is useful for researchers to be notified of software version changes, downtimes, and testing procedures for version upgrades, along with their dates.
Case study: In a study on Clinical Decision Support for Radiology Imaging, with every version change in the EMR, the IT team needed to re-verify the list of physicians randomized into the treatment versus control groups and notify the research team. The intervention was also added to the list of packages to “reload” with each system reset or upgrade. This was a simple and automatic process after the first manual addition of the intervention to the list.
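One way to insulate random assignment from operating-system or library changes to pseudo-random generators, mentioned above, is to generate assignments once, persist them, and read them back thereafter, rather than re-deriving them from a seed on every run. A minimal sketch, in which the file path, format, and function name are all hypothetical:

```python
import csv
import os
import random

def get_or_create_assignments(ids, path, seed=42):
    """Assign treatment once and persist the result to disk. On later
    runs, read the stored assignments back, so that a library or OS
    upgrade that changes the random number generator cannot silently
    reshuffle the treatment/control split."""
    if os.path.exists(path):
        with open(path, newline="") as f:
            return {row["id"]: int(row["treated"])
                    for row in csv.DictReader(f)}
    # First run only: shuffle IDs and treat the first half.
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    assignments = {pid: int(i < half) for i, pid in enumerate(shuffled)}
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "treated"])
        writer.writeheader()
        for pid, treated in assignments.items():
            writer.writerow({"id": pid, "treated": treated})
    return assignments
```

The stored file then serves as the single source of truth for monitoring checks, analysis, and any re-verification after a platform upgrade (as in the case study above).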
Interaction between the technology, study, and individuals
Make sure everyone who can change the intervention knows about the randomized evaluation
With all randomized evaluations, buy-in from implementing staff is important—but with a tech intervention, even staff who are not involved directly with the randomized evaluation may make changes that affect the evaluation.
- IT staff should know how to answer help desk emails and calls related to the intervention or study.
- The implementing partner should have a clear point of contact on the research team in case of problems or changes (e.g., power outages).
Case study: Three months into one study, a newly-hired engineer at the intervention’s tech company noticed that the intervention at the study site was implemented slightly differently from those at the company’s other sites. Assuming this was an error, they changed programming of the intervention at the study site to match the other locations, which changed the inclusion/exclusion criteria for the study. Researchers did not realize this was the case until a significant amount of time had passed, when they noticed discrepancies in the data collected.
Try to think separately about questions of adoption versus efficacy
Adoption of new technology is often very low. Evaluating technology adoption and efficacy simultaneously can make it difficult to interpret results if low adoption leaves questions of efficacy underpowered. This is a strong reason to work with implementing partners whose technology has already demonstrated high fidelity of usage (if the goal is to test efficacy).
Case study: Researchers were hoping to evaluate an app to be used in a defined target population. Outreach was done by a combination of email, text reminders, and peer ambassadors. Although 8,000 participants were emailed a link to access the app, only 4% signed up and 1% logged in two or more times. The research team paused the original study and turned efforts toward assessing other options for increasing take-up.
- Ways to improve take-up will vary depending on the intervention, but could include: social media advertisements, working with community representatives or organizations, recruiting participants in-person and setting aside time for them to engage with the interventions, and adding short-term incentives for interactions with the technology.
- Test for take-up and account for it in your power calculations. This can be done in a number of ways, including piloting the app and holding focus groups with participants (both those who do and do not use the app), building process measures into data collection (e.g., signing up versus logging in), and examining historical data based on this specific intervention.
- Piloting the intervention is especially critical if it is new or changed significantly. It can be especially useful to gather qualitative feedback and have discussions with participants to see how and when they use the technology. This feedback is most informative when the intervention is piloted wherever it is meant to be used (e.g., in a clinic, at a school, etc.).
- Low take-up has a large effect on statistical power for studies using intent-to-treat.1
- Consider opt-out versus opt-in adoption of technology. If allowed by the IRB of record, automatic enrollment may be an option to dramatically increase take-up of an intervention.
- It is critical to remember that repeated use of the technology (and thus delivery of the intervention) may still be low, even if all participants are automatically “enrolled.”
Case study: In a text message alert system to parents about child performance and school attendance, researchers found that automatic enrollment resulted in 96% adoption and <4% attrition from the treatment group. This is compared to 8% adoption in the “simplified adoption” group, where parents had to respond “yes” to a text message to adopt the technology (Bergman and Rogers 2017).
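The footnoted effect of take-up on power can be made concrete with a standard approximation: in an intent-to-treat analysis with no control-group crossover, the detectable effect is diluted by the take-up rate, so the required sample scales with 1/take-up². A back-of-the-envelope sketch (5% two-sided significance, 80% power, effect in standard-deviation units; an illustration, not a substitute for full power calculations):

```python
def required_n_per_arm(effect_size, take_up, alpha_z=1.96, power_z=0.84):
    """Approximate per-arm sample size for a two-arm trial analyzed by
    intent-to-treat. Imperfect take-up dilutes the detectable effect to
    take_up * effect_size (assuming no control-group crossover), so the
    required sample grows with 1/take_up**2. Round up in practice."""
    itt_effect = take_up * effect_size
    return 2 * ((alpha_z + power_z) / itt_effect) ** 2

full_takeup = required_n_per_arm(effect_size=0.2, take_up=1.0)   # ~392 per arm
low_takeup = required_n_per_arm(effect_size=0.2, take_up=0.25)
# Quartering take-up multiplies the required sample per arm by 16.
```

This is why a pilot estimate of take-up (e.g., the 4% sign-up rate in the case study above) can change whether a study is feasible at all.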
Appendix
In this appendix, we explore a hypothetical research study to demonstrate how slight differences in code can have impacts on the composition of the study sample. We suggest tests designed to validate whether the code produces the intended sample.
Consider an evaluation of an education-related intervention for elementary school students. A data set of all students exists, but is not directly accessible to researchers. In order to identify the sampling universe for the study and assign students to interventions, they must describe the study inclusion criteria and randomization procedures to the holders of the data set. Researchers will receive a de-identified data set consisting of the study sample.
The researchers decide to focus on public school students in the second, third, fourth, or fifth grades for whom English is their first language. To simplify for the data holder, they provide the following inclusion criteria:
- In 2nd, 3rd, 4th, or 5th grade
- Attends a public school
- Does not participate in English as a Second Language (ESL) classes
In 2nd, 3rd, 4th, or 5th grade
Error may be introduced if, for example, a typo results in the use of a strict rather than a non-strict inequality, or the wrong boundary in a range function:

Incorrect (excludes 2nd graders):
grade > 2 & grade <= 5

Correct:
grade >= 2 & grade <= 5

Incorrect (includes 1st graders):
inrange(grade, 1, 5)

Correct:
inrange(grade, 2, 5)
Attends a public school
Error may be introduced if the code does not properly account for all types of schools included in the dataset. For example, the first condition below may improperly include non-public school types other than "private", while the second may improperly exclude students who attend public schools recorded under a different code (e.g., public magnet or charter schools):

School_type != "private"
School_type == "public"
Does not participate in English as a Second Language (ESL) classes
ESL == 0
ESL != 1

These two conditions may produce different results if ESL is ever missing for students not enrolled in ESL courses: in Stata, a missing value is treated as larger than any number, so ESL != 1 evaluates to true for missing values while ESL == 0 does not.
Interaction of inclusion criteria
Researchers should verify that the code combines or applies inclusion/exclusion criteria appropriately. For example, if the first statement below were implemented in Stata, it would include all charter school students regardless of grade or ESL status, because & binds more tightly than | in Stata's order of operations.
Using parentheses to subset criteria can ensure that the program executes operations in the intended order, and may make the code easier for humans to read. For example:
Incorrect (includes all charter school students):
if inrange(grade, 2, 5) & ESL != 1 & School_type == "public" ///
    | School_type == "charter"

Correct:
if inrange(grade, 2, 5) & ESL != 1 ///
    & (School_type == "public" | School_type == "charter")
Researchers may request the code used to create the de-identified data set in order to review the translation of inclusion criteria to code. We also suggest researchers request a codebook including data definitions, summary statistics, and all possible values of each definition included in the code. Further, we suggest the use of test cases to assess the code. For example, researchers can test edge cases – cases just inside or just outside of eligibility criteria.
For this example, a request for a codebook and testing might include:
- Students by grade:
  - Test inclusion/exclusion of all edge cases: 1st, 2nd, 5th, and 6th grades
- ESL status:
  - Codebook for all possible values of "ESL" variable
  - Test inclusion/exclusion of: a student who attends an ESL course, and a student who does not
- School type:
  - Codebook for all possible values of "School_type" variable and related indicators
  - Test inclusion/exclusion of students at each possible school type, and students where school type is missing
- Test cases for all possible interactions of the above, e.g.:
  - A 3rd grade student enrolled in ESL courses at a public school
  - A 3rd grade student enrolled in ESL courses at a private school
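A checklist like the one above can be turned into an automated test harness. In the sketch below, the eligibility function is a hypothetical stand-in for the data holder's actual code, and each edge case is checked against the researchers' intended inclusion decision:

```python
# Hypothetical eligibility function mirroring the appendix criteria:
# 2nd-5th grade, at a public (including charter/magnet) school, not in ESL.
def eligible(grade, school_type, esl):
    return (2 <= grade <= 5
            and school_type in ("public", "charter", "magnet")
            and esl != 1)

# Edge cases from the checklist: (inputs, expected inclusion decision).
cases = [
    ((1, "public", 0), False),   # 1st grade: just below the range
    ((2, "public", 0), True),    # 2nd grade: lower boundary
    ((5, "public", 0), True),    # 5th grade: upper boundary
    ((6, "public", 0), False),   # 6th grade: just above the range
    ((3, "private", 0), False),  # private school excluded
    ((3, "charter", 0), True),   # public charter school included
    ((3, "public", 1), False),   # enrolled in ESL courses
]
results = [eligible(*args) == want for args, want in cases]
```

Running each case through the actual extraction code (and documenting the results) gives researchers direct evidence that the written criteria were translated correctly, before the study sample is drawn.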
We are grateful to Peter Bergman, Ben Castleman, Jennifer Doleac, Amy Finkelstein, Louise Geraghty, and Sam Haas for their insight and advice. Chloe Lesieur copy-edited this document. This work was made possible by support from the Alfred P. Sloan Foundation and Arnold Ventures.
Further resources
Power Calculations 101: Dealing with Incomplete Take-Up | World Bank blog
David McKenzie’s blog post provides a more complete illustration on the effect of the first stage on power.
Conduct power calculations | J-PAL North America Evaluation Toolkit
This resource outlines key principles, provides guidance on identifying inputs for calculations, and walks through a process for incorporating power calculations into study design.
References
Bergman, Peter, and Todd Rogers. 2017. “The Impact of Defaults on Technology Adoption, and Its Underappreciation by Policymakers.” CESifo Working Paper Series, no. 6721 (November). https://ssrn.com/abstract=3098299.
Oreopoulos, Philip, and Uros Petronijevic. 2018. “Student Coaching: How Far Can Technology Go?” Journal of Human Resources 53 (2): 299–329. https://doi.org/10.3368/jhr.53.2.1216-8439R.
Pane, John F., Beth Ann Griffin, Daniel F. McCaffrey, and Rita Karam. 2014. “Effectiveness of Cognitive Tutor Algebra I at Scale.” Educational Evaluation and Policy Analysis 36 (2): 127–44. https://doi.org/10.3102/0162373713507480.
Roschelle, Jeremy, Mingyu Feng, Robert F. Murphy, and Craig A. Mason. 2016. “Online Mathematics Homework Increases Student Achievement.” AERA Open 2 (4): 2332858416673968. https://doi.org/10.1177/2332858416673968.