So, for this to be even a feasible solution, EdX needs to show more proof of concept...like, here's a sample question, here's the 100 sample answers that were used to machine-learn against, and here's how the auto-grader graded these 10 different answers (both good and bad).
Why should anyone have faith that EdX has cracked the perfect mix of machine learning and NLP and other associated technologies needed to provide accurate assessments of essays? Even Google has trouble guessing intent from time to time. Wolfram Alpha even more so. If the engineers at these companies can't always get it right -- and it's not just engineering talent, but data and data analysis -- why should a school entrust one of its most important functions to EdX?
Grading is something critical to get right, not just "almost there." Think of how much time you spent arguing with a professor that your score deserved an 8 instead of a 6, enough points to bring you from a B to a B+ for the semester... Think of the incentive to do so (career prospects). If the machine is ever wildly off in even one case, would you ever take its assessments as gospel? Multiply yourself by 20 or 50 or whatever a typical professor's course load is, and now a ton of lag has been introduced into the grading workflow.
Obviously, there are ways to mitigate this. One would be to write questions that are so narrowly focused that there are very clearly right and wrong answers...Which of course raises the problem of: why not just give everyone multiple choice tests, then?
The sad thing is that even if these machine graders were empirically better than human graders, they can't just be better; they have to be almost perfect. If a plane's autopilot failed, on the whole, 1 out of 1,000 times compared to 5 out of 1,000 times for human pilots... how much more pissed do you think victims' families are going to be when they find out their loved ones died because of an algorithmic malfunction rather than just pilot error? People, for obvious reasons, don't like thinking of their work or their destinies being defined by deterministic machines... so if those machines aren't 99.99% right, then the fallout and pushback may cost schools more than the savings in professor-grading time.
> Think of how much time you spent arguing with a professor that your score deserved an 8 instead of a 6, enough points to bring you from a B to a B+ for the semester.
I spent zero time doing this, and I never even realized how widespread it was to try to negotiate grades until I helped TA a few classes. Now I see how much professors hate the students who do.
Seriously, what job is going to care that you got a B vs B+ in a class? If that subject is really so important to their mission, they are probably looking for A+ candidates anyway.
GPA is, unfortunately, one metric that is used to quickly sift applicants. So one B+ over a B may not matter for one course, but can be quite significant over the course of four years.
Additionally, certain scholarships require a minimum GPA be met. A B vs. a B- can be a huge difference to someone faced with losing their tuition check.
Even if they're perfect, it's not really good enough outside of, say, standardized testing, because they can't integrate what they learn about individual students from grading their papers and then use it to tutor/advise/grade the student.
A holistic understanding of the individual student and how all students are doing in a course is important. It allows the professor to tailor the course or see where weaknesses in understanding are.
Also, I would imagine there are some exceptional essays which, because of their unique nature, would slip by the grading system. Subjectivity is a plus and a minus of having human markers.
Critical precision in grading is only necessary in competitive high-stakes academic environments where everyone is striving for excellence (or at least a shot at belonging to the social or technical elite). If you consider an environment where the point is simply to train a crowd of semi-skilled workers to a level of basic white-collar functionality (which is what most mid- to lower-tier colleges are approaching, and what MOOCs will probably become), widespread use of this technology becomes much more plausible and acceptable.
There are a few reasons I'm not excited about this sort of thing, but there's one big reason where I think it would be a distinct improvement over the status quo: standardized testing.
Most standardized tests in the US, especially the ones deployed at a national scale, are designed for grading first and testing second. This is a simple concession to feasibility: if you're trying to evaluate a few dozen million students, Scantron is the only current tool that is time-efficient, cost-efficient, and consistent ("fair") at scale. The constraints imposed by that tool are significant, and tying so many incentives to those tests has warped the whole education system.
Automated essay grading, no matter how imperfect, would still be an expansion of the available testing techniques that could be deployed at the national scale. It would expand the range of skills that are measurable, and therefore incentivized to teach.
The cynical way of putting that is, no matter how shitty computers are at grading essays, they're still an improvement when the competition is multiple-choice questions. I have decided to be excited by that.
Instead of bothering with grammar, unique personal takes on subject matter, and other complicated things, teachers can just teach students how to spew out incomprehensible Markov-chain crap peppered with high vocabulary.
I have to disagree with you. This will make standardized testing worse, not better. With human grading, the humanities managed to avoid (to an extent) the extreme standardization of what should and should not be taught. There was still room for creativity. Now the type of writing that cannot be graded by an algorithm will get even less room in our curricula. Goodbye to poetry, satire, and other creative endeavors.
If this is deployed nationwide, how long before students start submitting carefully crafted nonsense that tricks the dopey grading algorithm into giving them an A? My bet: not long at all.
“My first and greatest objection to the research is that they did not have any valid statistical test comparing the software directly to human graders,” said Mr. Perelman.
That is a good enough reason not to use the software. Even if it does eventually pass such a test, I doubt it would ever be better than the best human grader; at most it would be better than the average grader.
I'm not sure a "good" result for such software would be much progress though -- any success would be the result of students writing for a human grader. If you actually employ the software grader, students write with that in mind and start looking for "tricks" and "holes" in the algorithm that result in a good grade but diverge from what we humans regard as good writing.
You're assuming 100% automation. Having a person skim for obvious bad-faith essays is probably enough to combat such cheating while still saving 90% of the effort of fully grading each essay.
Of course there are a million-and-one possible problems, but the upside is also massive, especially if an automated grader is used to supplement human graders. Some examples:
- A student can iterate and improve their essays on their own using an auto-grader. They can get it up to a decent level just through iteration before having to get a human involved.
- Human graders are highly subjective.
- Human graders tend to be strongly affected by factors such as "number of hours since they last ate."
- Different humans have different levels of harshness; a machine could help calibrate these (see the sketch after this list).
- Outside, say, the top 10% of colleges, the vast majority of human graders suck. Especially for standardized tests, they typically get paid near minimum wage and the qualifications are along the lines of "have a degree." While automated graders will probably never be as good as the best graders, they don't have to be.
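To make the calibration point concrete, here's a rough sketch with invented numbers (not anything EdX actually does): standardize each grader's scores against that grader's own mean and spread, so a harsh grader's 4 and a lenient grader's 7 become comparable.

    from statistics import mean, stdev

    # Hypothetical raw scores from two graders with different harshness.
    raw_scores = {
        "grader_a": [6, 7, 8, 7, 9],  # lenient
        "grader_b": [3, 4, 5, 4, 6],  # harsh
    }

    def calibrate(scores):
        """Z-score each grader's marks so systematic harshness washes out."""
        m, s = mean(scores), stdev(scores)
        return [round((x - m) / s, 2) for x in scores]

    for grader, scores in raw_scores.items():
        print(grader, calibrate(scores))
    # Both graders produce the same calibrated sequence: the 2-point
    # offset between them disappears.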
So overall, I'm pretty excited. Even a half-baked solution has a lot of potential value.
> A student can iterate and improve their essays on their own using an auto-grader. They can get it up to a decent level just through iteration before having to get a human involved.
If awarding people degrees based on essays whose form is dictated by trial and error rather than actual comprehension and interpretation is an upside...
I also hold the opinion that it's much easier to learn programming if you can get feedback from the compiler in a few seconds or minutes as opposed to waiting two weeks between each compile.
Feedback is more than a letter grade. Real teachers provide insights and direction on material like an essay to stimulate critical thought and development. How well can a computer grading system instill this type of learning in students? Can it at all?
I actually had to use something like this for a class in college… maybe 2 years ago? It was the worst thing. It would auto-grade, tell you your score, and how you could improve it. However, there was almost no way to actually respond to the grade. It would often tell you that you didn't cover a specific topic, even when you had. There was no way of pointing out where you had discussed it, no way in the program of asking for a human review, nothing.
All this, of course, was exacerbated by the class being run by possibly the laziest professor I had all throughout college (all essays were graded by this, all tests in class were taken by clicker, all notes were PowerPoint slides provided by the book's publisher). Possibly, in the hands of someone who actually cared about teaching, a tool like this wouldn't have been so bad, but it seems to me that someone who cared about teaching would just grade manually to begin with.
As an educator, I appreciate technology that provides students a quicker way to get feedback, but this software only reinforces students' frustration with essays: no one reads them except the professor. Most are written to be read by one person. Artists are able to share their work, but for most majors that require writing, assignments are meant to be forgotten. This is exactly the reason why students who major in these fields graduate without a portfolio of their work.
My husband and I developed https://chalktips.com/ to solve this problem. We wanted to make essays and school work for college students engaging and shareable. Students publish booklets and slideshows as part of their assignment. Students can tweet or share their work in Facebook or Tumblr.
Last semester as part of the final, I had professionals from different fields comment on student work, and the feedback from students was amazing.
Using AI to comment on essays might be efficient and probably comparable to having ONE person grade an assignment. But it doesn't make use of the power of community. It also does not address the fundamental problem with essays: after it's graded, so what?
We currently have 1,400 users with more than 2,500 booklets and slideshows published. Students are embracing the platform. Of course, I'm a community college teacher without the clout of MIT professors. But we're hopeful that more students start demanding that they are given assignments that have utility after the course is over.
The problematic aspects of this software are applicable to MOOCs and online education in general. Truly valuable, high-level education requires a level of precise, individually calibrated feedback in order to train students how to think and express themselves critically and rigorously (there are also obvious signalling, credentialing, and networking benefits that will never be replicable by MOOCs, but that's not relevant to a discussion of pure educational quality).
Computer-driven mass education will never be able to provide that level of instruction. At best, it's a tool for bringing the workforce up to the basic level of competence necessary to function in an information economy, much like how the original purpose of the public education system was to crank out semi-skilled factory workers and low-level clerks. This may be a valuable end - it may help to stem the commoditization of the BA and the attendant tsunami of student debt - but there needs to be wider acknowledgement that it is actually fulfilling a purpose that is different from that of traditional higher education.
One big advantage of this, if it ever becomes decent: iteration.
Make it available to students. They can write an essay, submit, and get feedback on it. Repeat.
It'd be difficult to get a machine to bring someone to great writing. But clear, concise prose that gets to the point? That seems reachable. And that would be an improvement over the status quo.
In other words, machines won't anytime soon be able to know that a Wodehouse is superior to an Orwell. But an Orwell to, say, a Thomas Friedman or typical ninth grader? Totally.
It would be fun to hook this up to, say, a Markov chain with weights adjusted by a genetic algorithm, or other sort of machine learning with text generation, to automatically produce an essay that scores high.
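A minimal sketch of the Markov-chain half of that idea, leaving out the genetic-algorithm weight tuning (the toy corpus and chain order here are invented for illustration; a real attack would train on actual high-scoring essays):

    import random
    from collections import defaultdict

    # Toy stand-in for a pile of high-scoring essay text.
    corpus = ("the author employs a nuanced rhetorical strategy to underscore "
              "the thematic tension between individual agency and societal "
              "expectation thereby illuminating the broader human condition").split()

    # First-order word-level Markov chain: map each word to its observed successors.
    chain = defaultdict(list)
    for prev, nxt in zip(corpus, corpus[1:]):
        chain[prev].append(nxt)

    def generate(start, length=20):
        """Emit text that is locally plausible but globally says nothing."""
        words = [start]
        for _ in range(length - 1):
            followers = chain.get(words[-1])
            if not followers:
                break
            words.append(random.choice(followers))
        return " ".join(words)

    print(generate("the"))

The genetic-algorithm part would just wrap this in a loop: feed each output through the auto-grader, keep the highest-scoring variants, mutate, repeat.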
But going into a writing class to be graded for the subjective quality of my writing doesn't really make sense, does it? I mean it boils down to whether I get lucky with the grader, whether his subjective evaluation of my work is positive, and it tells me almost nothing.
I now know Professor Manning likes my writing. But Professor Sieboldt doesn't.
I honestly don't think subjective art should be graded. There is no use in grading it. For example, imagine I want to hire a painter. An objective appraisal of his skill would be useful for that. A for Stroke Quality, B+ for Perspective Drawing. Fair enough. But if I want him to create an artistic work for me, I'd look through his portfolio first. Do I like the images he's done before? Do I like his style of painting? What Professor Manning thinks about his style is absolutely useless to me, and it's even worse if that opinion is mixed into the grade for technical prowess, because then all the grades are completely useless.
But Hackers and Painters, right? We do it too, don't we? 5 years of Java experience, 7 years of PHP, and 94 years of Haskell. So is that 5 years of Java, or 5x1 year of Java, or Java as something you do for 2 hours every 2 months? "5 years of Java" tells you very little, doesn't it? As does "A+ for Stylistic Expression."
We're grading the Arts like the Sciences. Judge a fish by its ability to climb a tree and...
They're facing the same problem as testing (diagnosis) in large scale health care: How to test (and teach really) without having to take on the bothersome, unscalable task of actually getting to know the student.
Wouldn't it be great if we could just figure out "education" once, code it up, and then let it run on all of the students? It's easy, really: all we need are identical students.
Well, I referenced a Harvard Business School public journal and it got flagged as being from Wikipedia. So I do not believe we should let an automated system mark our essays just yet.
As a precursor to grading/giving feedback on an essay, I usually pull out a Naive Bayes network for topic classification and see if their report is correctly classified. It's kind of fun and a good indicator of what I'm about to read.
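For the curious, a minimal sketch of that kind of check, here using scikit-learn (an assumption; the commenter doesn't say what tooling they use), with placeholder topics and training snippets:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Placeholder training data: snippets labeled with their assigned topic.
    train_texts = [
        "supply demand equilibrium price elasticity",
        "market inflation monetary policy interest rates",
        "mitosis cell membrane protein synthesis",
        "dna replication enzyme chromosome genetics",
    ]
    train_topics = ["economics", "economics", "biology", "biology"]

    # Bag-of-words features feeding a multinomial Naive Bayes classifier.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_topics)

    # If a submission doesn't classify as its assigned topic, that's a hint
    # it may be off-prompt before anyone reads a word of it.
    essay = "the enzyme catalyzes dna replication within the cell"
    print(model.predict([essay])[0])  # expected: biology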