Twelve iterative A/B tests to optimize tutoring for scale
This is a guest post authored by Youth Impact’s Noam Angrist, Elle Brooks, Colin Crossley, and Claire Cullen.
In Maun, Botswana, a tutor makes her daily calls to students who are falling behind in basic maths. She notices something interesting: whenever a parent joins in—listening, asking questions, helping their child think through a maths problem—the student seems to learn faster. She mentions this at her weekly team meeting, and several other tutors agree—they’ve seen the same thing.
By the next school term, this observation is being tested. Students are randomly split into two groups: half the students receive the tutoring program as normal, while in the other half, tutors invite caregivers to “take over” halfway through the call. At the end of term, the results are striking: students whose caregivers took over performed much better.
This improvement was possible because A/B testing had been built into the program’s systems, embedding rigorous experimentation into day-to-day implementation and delivering results in real time.
When scaling, it's hard to maintain impact. What if it wasn't?
In global education, this story is familiar: a program proves effective in a randomized controlled trial (RCT), but when scaled, its impact drops. This usually happens because early pilots tend to be more intensive, better resourced, and more closely monitored than large-scale programs. But what if the opposite were possible? What if, as a program scales, it not only gets cheaper but also gets better?
ConnectEd, a phone tutoring program, has been shown to be effective in multi-country RCTs. But rather than viewing RCTs as the end of the evidence journey, we see them as the foundation for continuous improvement. We always ask: “how can it work even better?”
Rigorous, rapid, and regular: the A/B testing approach
In a series of 12 iterative A/B tests, we compared two versions of our tutoring program to see which performed better: version A, the “status quo”, and version B, which contained a slight variation, such as asking a parent to take over the phone call. At the end of each test, results were compared, and whichever version delivered the most impact for the least money was generally adopted as the new “status quo”. In this way, each test built on the last, honing the most cost-effective model cycle after cycle.
The A/B testing approach is characterized by three ‘Rs’: it is rigorous, because it uses randomization just like RCTs; rapid, because each test runs within a single school term; and regular, because it is embedded as an ongoing practice rather than run as a one-off project.
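For readers who think in code, here is a minimal sketch of what a single test cycle might look like. It is purely illustrative: the function names (deliver_status_quo, deliver_variation, measure_outcome) are hypothetical stand-ins for real program delivery and assessment, and the decision rule below compares only learning outcomes, whereas the process described above also weighs cost.

```python
import random
import statistics

def run_ab_cycle(students, deliver_status_quo, deliver_variation, measure_outcome):
    """One illustrative A/B cycle: randomize students between version A (status quo)
    and version B (a slight variation), deliver each version, and compare average
    learning outcomes at the end of the term."""
    # Random assignment, as in an RCT
    shuffled = random.sample(students, len(students))
    half = len(shuffled) // 2
    group_a, group_b = shuffled[:half], shuffled[half:]

    for student in group_a:
        deliver_status_quo(student)
    for student in group_b:
        deliver_variation(student)

    mean_a = statistics.mean(measure_outcome(s) for s in group_a)
    mean_b = statistics.mean(measure_outcome(s) for s in group_b)

    # The better-performing version becomes the new "status quo" for the next cycle
    winner = "B" if mean_b > mean_a else "A"
    return winner, mean_a, mean_b
```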
What we learned from 12 tests
Not every A/B test “improves” a program: of the 12 A/B tests we report on in our new paper, seven led to a measurable improvement, each in the range of 9 to 30 percent. That adds up to a huge cumulative difference when delivering at national scale.
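To get a rough sense of why gains of this size matter at scale, one can compound them. The back-of-the-envelope calculation below assumes the seven improvements stack multiplicatively, which is a simplification (some tests cut costs while others raised impact), so the figures are illustrative rather than results from the paper.

```python
# Illustrative only: assumes seven improvements compound multiplicatively
low, high = 1.09, 1.30
print(f"Seven 9% gains compound to roughly {low**7 - 1:.0%}")    # ~83%
print(f"Seven 30% gains compound to roughly {high**7 - 1:.0%}")  # ~527%
```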
The tests focused either on reducing costs or on improving impact; all aimed to make tutoring more cost-effective and therefore more scalable.
- Cost-reducing tests. These aim to simplify or remove components of a program while preserving its impact. For example, in one experiment we tested a change in schedule from 20 minutes of tutoring per week to 40 minutes every two weeks. Learning results stayed the same (the program was just as effective), but operational costs dropped by 11 percent.
- Effectiveness-enhancing tests. These add a component to the program at low marginal cost to increase impact. For example, in the caregiver participation case mentioned earlier, parents were invited to co-lead phone calls. This change didn’t increase costs, but it doubled the program’s effectiveness.
A new model for learning organizations
The benefits of A/B testing go beyond improved outcomes and lower costs. When we spoke to practitioners—tutors and program leads—we found that A/B testing also helped align their beliefs with evidence. In other words, implementers became more data-driven in their decision-making. A/B testing is not just a tool for researchers—it’s a learning system for frontline implementers. It helps organizations build a culture of evidence, one small test at a time. It also builds research considerations into programming and program considerations into research, syncing evidence and action.
What's next?
Few social and public sector organizations currently use iterative A/B testing, but the approach is spreading. At Youth Impact we’ve made it core to how we work: we’ve run 75 A/B tests in just a few years, and we now support partners to embed A/B testing into their own systems too. A growing cohort of implementers, researchers and funders are turning to A/B testing to make evidence-based programs more adaptive, scalable, and cost-effective. We think iterative A/B testing is the future: one where education programs don’t lose power as they grow but gain it, reaching more children, more effectively, for every dollar spent.