THE COMPUTER MOVES INTO ESSAY GRADING: UPDATING THE ANCIENT TEST.

When it comes to reading essays and rating them, computers have not played much of a role to date. But that may change, according to Mr. Page and Ms. Petersen, because – for the first time ever – a blind test has demonstrated that a computer can simulate the judgment of a group of human judges on a brand-new set of essays.

It is now obvious that computers can do a raft of things once reserved for humans. But when it comes to reading essays and rating them, computers have not played much of a role. Of course, we've all grown accustomed to spell-checkers and grammar-checkers. And we've even learned to live with their limitations. But we all know that, for evaluations of real live student papers, only a literate human reader can give an essay a proper grade. Or do we?

Research on computer rating of essays has continued since an article on the topic appeared in the Kappan almost 30 years ago.(1) At that time everyone was surprised to learn that a computer could do as well as a single human judge. That early work led to much research and federal support, but the world was plainly not ready for practical applications.(2) Computers were terribly expensive and rare, and software was poor and scarce. And in those early years, nothing had been achieved like today's success in essay reading.

In the past three years, however, there has once again been much activity in this area, and some successes have even seemed to promise practical applications.(3) In 1994 the Educational Testing Service (ETS) collaborated with us to arrange a blind test of the latest work of Project Essay Grade (PEG).

The Experiment

In mid-1994 we and our colleagues undertook a unique test of computer essay grading. It had previously been shown that Project Essay Grade could succeed in a research setting – but, until there was some blind test of the system, no one could be sure how it would work in practice.

The test we worked out used 1,314 essays supplied by ETS that had been composed on computers. These essays were written by college students taking the computer-based writing assessment that is part of the Praxis Series: Professional Assessments for Beginning Teachers, a program used as part of the teacher licensing process in 33 states. This constituted the largest set of essays yet analyzed by PEG.

ETS divided the essays randomly into two groups: 1,014 research essays and 300 test essays. Along with the research essays, ETS sent two human ratings that had already been collected in the operational scoring of the Praxis Series. These we used formatively to fine-tune the computer program. (ETS also sent rich data about those students' objective test scores and other nonpersonal information.) Although ETS had two ratings for the test essays as well, no human ratings, objective test scores, or other data about the test essays were sent to the research team.

Thus, when our experiment began, two human scores existed for each of the 1,314 essays. But even two human judges do not give a very accurate rating of essays, so ETS collected four more ratings for the 300 test essays and for 300 essays randomly chosen from among the 1,014 research essays. These 600 essays now had a total of six human ratings each, and none of the judges knew whether the essays they read were research or test essays.

Here, then, was our unprecedented test. From the 300 test essays alone, without human ratings or background information, PEG would try to predict the average scores of the six human judges. All of us could then evaluate the performance of PEG.
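To make the procedure concrete, the short Python sketch below shows the kind of analysis involved: fit a regression of human ratings on essay features using the research essays, then apply the fitted model blind to the held-out test essays and check the agreement of its predictions with the six-judge average. The particular surface features, the ordinary least-squares fit, and the function names are illustrative assumptions on our part; this article does not enumerate PEG's actual measures.

    import numpy as np

    def extract_features(essay):
        """Turn one essay into a vector of simple surface features (illustrative only)."""
        words = essay.split()
        word_lengths = [len(w) for w in words] or [0]
        return np.array([
            len(words),                          # essay length in words
            float(np.mean(word_lengths)),        # average word length
            essay.count(","),                    # comma count
            essay.count(";"),                    # semicolon count
            len(set(w.lower() for w in words)),  # vocabulary size
        ], dtype=float)

    def fit_scoring_model(research_essays, research_ratings):
        """Ordinary least-squares regression of human ratings on essay features."""
        X = np.array([extract_features(e) for e in research_essays])
        X = np.column_stack([np.ones(len(X)), X])  # intercept term
        beta, *_ = np.linalg.lstsq(X, np.asarray(research_ratings, dtype=float), rcond=None)
        return beta

    def predict_scores(beta, test_essays):
        """Apply the fitted model blind to the held-out test essays."""
        X = np.array([extract_features(e) for e in test_essays])
        X = np.column_stack([np.ones(len(X)), X])
        return X @ beta

    def evaluate(predicted, six_judge_means):
        """Agreement between predictions and the average of the six human judges."""
        return float(np.corrcoef(predicted, six_judge_means)[0, 1])

The essential point of the design is the separation: the model sees human ratings only for the research essays, and its predictions for the 300 test essays are judged against ratings it never saw.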
If the predicted ratings surpassed what two human judges could do in the way of predicting the scores of other judges, then the quality of PEG's performance would be reassuring. Otherwise, it would be back to the drawing board.

The bottom line is that the computer program did predict human judgments well – perhaps even better than three human judges. In practical terms, these findings are very encouraging for large-scale testing programs. …
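One way to read the criterion described above (whether PEG's predictions surpass what two human judges can do in predicting other judges' scores) is to ask whether the predictions correlate with a set of held-out judges at least as strongly as the average of two human judges does. The exact statistic is not spelled out in this excerpt, so the pairing of judges and the use of simple correlations below are assumptions for illustration only.

    import numpy as np

    def human_benchmark(judge_ratings, pair=(0, 1)):
        """Correlate the average of two judges with the average of the remaining judges.

        judge_ratings: array of shape (n_essays, 6), one column per human judge.
        """
        two = judge_ratings[:, list(pair)].mean(axis=1)
        rest = np.delete(judge_ratings, list(pair), axis=1).mean(axis=1)
        return float(np.corrcoef(two, rest)[0, 1])

    def peg_benchmark(predicted, judge_ratings, pair=(0, 1)):
        """Correlate PEG's predicted scores with the same held-out judges."""
        rest = np.delete(judge_ratings, list(pair), axis=1).mean(axis=1)
        return float(np.corrcoef(predicted, rest)[0, 1])

    # On this reading, PEG "surpasses two human judges" for the test essays when
    # peg_benchmark(...) exceeds human_benchmark(...).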