The Impact of Automated Writing Evaluation on Learning and Access to College in Brazil
Artificial intelligence has the potential to support teachers in completing time-intensive, subjective tasks, such as grading essays, and to provide individualized feedback to students on their writing. However, evidence on whether and how this affects students’ learning and educational trajectories remains limited. To fill this gap, researchers are evaluating whether programs that use natural language processing (NLP) and machine learning to score and comment on essays can improve learning and increase access to college for secondary students in public schools in Brazil.
Improving learning is one of the most pressing goals for educational policy in developing countries. One challenge governments face is that human capital and time constraints may limit teachers’ capacity to provide individualized feedback to students and to tailor classes to differences in students’ learning levels. These constraints may be particularly serious in the case of lengthy written essays, for which grading and feedback require interpretative skills and tasks that are especially time-intensive.
While there is a large evidence base on how educational technologies (ed tech) can support teaching and help alleviate those constraints in a variety of settings, there is only limited evidence on the effectiveness of ed tech focused on writing. In particular, evidence on the effects of pedagogy programs based on automated writing evaluation (AWE) systems is scant, especially in developing countries. Is AWE-based ed tech effective in improving students’ writing skills and, as a result, can it increase access to college in Brazil? What is the relative cost-effectiveness of different combinations of artificial and human intelligence in overcoming barriers to effective writing pedagogy?
Context of the evaluation
Improving education quality is a pressing policy goal in Brazil. According to the 2015 PISA exam—a worldwide study of students’ scholastic performance—the average 15-year-old Brazilian student scored 407 points on reading, compared to an average of 493 points in all OECD countries.
Responding to the need to improve education quality in Brazil, the implementing partner in this evaluation was launched in 2015 with the mission of improving literacy among school-aged youth from underprivileged backgrounds by applying artificial intelligence to linguistics. Its main product is a pedagogical program that provides feedback on writing skills to students and teachers, using an automated writing evaluation (AWE) system combined with validation of feedback by human essay graders. The scoring is based on the National Secondary Education Exam (“Exame Nacional do Ensino Médio,” ENEM), which has been increasingly used as an entry exam by many post-secondary institutions in Brazil. Differences in the essay score account for the largest share of the public-private achievement gap in the exam.
In 2017, according to the Brazilian Educational Census, 14,891 public schools (out of 18,407) with at least one class of high school seniors had a computer lab and Internet connection. While the quality of the infrastructure of these schools might be low, one of the advantages of the provider’s AWE technologies is that they are based on a platform that works well with poor internet connections. Given that the cost of sharing online access to automated essay scoring is very low, this algorithm could represent a cost-effective way of improving writing skills among school-aged youth, even in contexts with low internet connectivity.
Details of the intervention
Researchers are partnering with the implementer in order to measure the impact of the AWE-based programs on students’ writing skills. The evaluation will take place in 178 public schools with computer access in the state of Espírito Santo. Schools will be randomly assigned to one of three groups:
Platform with the AWE system and human essay graders (standard treatment): Students will write essays on the platform, which will provide them with instantaneous feedback on syntactic text features, such as spelling mistakes and the practice of “writing as you speak,” and with a general overview of their achievement, based on a performance bar composed of five levels. About three days after submitting their essays, students will receive a final grade from a human scorer, including the final ENEM essay grade on a 1000-point scale, comments on the skills valued in the exam, and a general comment on essay quality. Students are scored based on their ability to: adhere to the rules of formal written Portuguese; respond to the essay prompt; select, relate, organize, and interpret data and arguments in defense of a point of view; use argumentative linguistic structures; and elaborate on an intervention to solve the problem in question.
Platform with the AWE system only (alternative treatment): In this treatment arm, students will write essays on the provider’s platform, but the writing task is completed in a single session, based only on interactions with the artificial intelligence. After submitting their essays, students receive instantaneous feedback on text features and the five-level performance assessment (as in the first treatment arm). They are also presented with the AWE-predicted grade on a 1000-point scale and with comments selected from the implementer’s database, which contains specific comments reflective of the student’s score.
Comparison group: Students do not have access to the provider’s platform.
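To make the feedback flow described above concrete, the sketch below shows how a platform of this kind might combine per-competency scores into an ENEM-style total, map that total onto a five-level performance bar, and pick a comment from a score-indexed database. The five-competency structure and the 1000-point total follow the public ENEM rubric (each competency scored 0–200); the level cutoffs, the comment bank, and all function names are illustrative assumptions, not the implementer’s actual logic.

```python
# Hedged sketch of an AWE-style feedback pipeline. The five competencies and
# the 1000-point total mirror the public ENEM rubric; the 200-point level
# bands and the comment bank below are hypothetical placeholders.

COMPETENCIES = [
    "formal written Portuguese",
    "response to the essay prompt",
    "selection and organization of arguments",
    "argumentative linguistic structures",
    "proposed intervention",
]

# Hypothetical comment bank keyed by performance level (1-5).
COMMENT_BANK = {
    1: "The essay needs substantial revision across most competencies.",
    2: "Basic structure is present, but arguments need development.",
    3: "Adequate essay; strengthen cohesion and argumentation.",
    4: "Strong essay with minor issues in one or two competencies.",
    5: "Excellent command of all five competencies.",
}

def total_score(competency_scores):
    """Sum the five competency scores (each 0-200) into a 0-1000 total."""
    if len(competency_scores) != len(COMPETENCIES):
        raise ValueError("expected one score per competency")
    for s in competency_scores:
        if not 0 <= s <= 200:
            raise ValueError("each competency is scored 0-200")
    return sum(competency_scores)

def performance_level(total):
    """Map a 0-1000 total onto an assumed five-level bar (200-point bands)."""
    return min(5, total // 200 + 1)

def feedback(competency_scores):
    """Bundle total, level, and a level-matched comment, as the text describes."""
    total = total_score(competency_scores)
    level = performance_level(total)
    return {"total": total, "level": level, "comment": COMMENT_BANK[level]}
```

For example, `feedback([120, 140, 160, 120, 100])` yields a total of 640 and, under the assumed cutoffs, level 4. In the standard treatment a human grader would later replace the predicted grade; in the alternative treatment this automated output is the only feedback the student receives.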
The primary short-run goal of the evaluation is to document the impacts of both treatments on ENEM essay scores. Through the standard treatment, researchers will be able to estimate the effect of an intervention developed to overcome relevant challenges of effective writing pedagogy. The main idea is that a combination of additional inputs from artificial and human intelligence can help compensate for the limited time teachers have to support the development of students’ writing skills. Such constraints are arguably particularly tight for Portuguese-language teachers in Brazilian public schools, because grading and providing feedback on essays is a time-intensive, nonroutine task, and because these teachers usually have to divide their time among teaching grammar, literature, and writing.
Nevertheless, scaling up an intervention like the standard program would necessarily entail large additional costs. Therefore, estimating the effects of the alternative treatment will provide evidence on the potential of a relatively easily scalable program to improve students’ writing skills.
Finally, the differences in impact between the two programs will shed light on the state of the art of artificial intelligence, in particular natural language processing (NLP) algorithms, and its potential to emulate human intelligence without supervision. The ENEM essay is an interesting setting in which to study this issue, since the exam rewards both low-level abilities (such as command of grammar rules) and high-level abilities (such as global coherence). Therefore, a closer look at how the treatments differentially affect different types of abilities will bring valuable information on the current potential and limitations of artificial intelligence.
Researchers will administer surveys and use administrative data to capture the effects and the channels of impact of both treatments on writing skills. In the future, they will also study medium- and long-term outcomes, including students’ enrollment in, progress through, and completion of postsecondary education, as well as labor market outcomes such as formal employment and earnings.
Results and policy lessons
Study ongoing; results forthcoming.