The Grand Mystery Of GMAT Scoring

It almost doesn’t even matter if the score is high or low. When GMAT students see their practice test scores, they almost always give the same response: “But I got x% right, how could my score be y?!” Whether they think the score is too high (how could 13 wrong translate into the 92nd percentile on quant?) or too low (“I heard you could miss 13 on quant and still get over the 90th percentile; why was mine only 52nd?”), students universally distrust the scores they see on their tests. And those who don’t are often more dangerous in their thinking: they look for ways to use the response pattern to game the system (“so as long as I get the first five questions right, the rest don’t matter?!”). So let’s end the debate once and for all:

There’s a huge difference between “percent correct” and “percentile” (or score); they’re related, but much more like third cousins than brothers.

The basics of adaptive scoring

One of the main reasons that people so often equate percent correct with score – even though they know the GMAT is a computer-adaptive test – is that most explanations of a CAT scoring system are crude enough to support that thinking. If your understanding is “get a question right, the next one is harder; get that one wrong, the next question is easier”, you’re not “wrong”, but you’re only partially there. That thinking seems to support a percent-correct calculation in a lot of ways: if you get a “Level 5” question right, you’d get a Level 6 question next. Get that right and you’d see Level 7, and then if you get the next two wrong you’d be back at Level 5 again. Two right and two wrong in any order, you’d think, would land you in the same place.

But that’s not really how it works. Say you did, in fact, get the first “average” question right. At this point the system already has good evidence (albeit one data point’s worth) to suggest that you’re above average. So its next question is going to try to determine how far above average you are, not to confirm whether its earlier data point was correct. If you answer that next question wrong, you don’t go back to “average”. At this point the system has two data points with which to evaluate you: one that says “above average” and another that says “but not way, way above average”. So the system will likely place you closer to the 70th percentile, not all the way back at the 50th.
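To see the difference concretely, here’s a minimal sketch in Python of how an adaptive scorer can weigh two data points at once. It assumes a simplified one-parameter logistic model and a grid-search maximum-likelihood estimate; the difficulty numbers are invented for illustration, not GMAT values:

```python
import math

def p_correct(theta, b):
    """Simplified one-parameter logistic model: the probability that a
    test-taker of ability theta answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def likelihood(theta, responses):
    """Joint likelihood of a response pattern at ability theta.
    responses: list of (item_difficulty, answered_correctly) pairs."""
    prob = 1.0
    for b, correct in responses:
        p = p_correct(theta, b)
        prob *= p if correct else (1.0 - p)
    return prob

def estimate_ability(responses):
    """Maximum-likelihood ability estimate via a simple grid search."""
    grid = [i / 100.0 for i in range(-300, 301)]
    return max(grid, key=lambda theta: likelihood(theta, responses))

# Right on an average item (difficulty 0), then wrong on a harder one
# (difficulty 1): the estimate lands between the two, not back at 0.
print(estimate_ability([(0.0, True), (1.0, False)]))  # 0.5
```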

We’ve always been fans of the “Twenty Questions” analogy. If you were playing the road-trip game of Twenty Questions and had narrowed the candidate down to “a US president”, you might try to determine which era by asking whether he was president after 1900. If the answer comes back “yes”, then you’ll likely want to figure out just how recently he was president, segmenting the remaining 113 years by asking “was he president after 1960?” If the answer comes back “no”, you wouldn’t start thinking back to presidents in the 1860s, because your data points already suggest that it’s not one of those; instead, you’ll narrow your search even further, between 1900 and 1960.
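In code, that narrowing process is just a bisection. The sketch below is purely illustrative: the year range, the stopping rule, and the yes/no callback are all invented for the example:

```python
def narrow_down(took_office_after):
    """Twenty-Questions-style bisection over presidency years.
    took_office_after: callback answering "did he take office after
    year X?" with True or False."""
    low, high = 1789, 2013
    while high - low > 4:            # stop within roughly one term
        mid = (low + high) // 2
        if took_office_after(mid):
            low = mid + 1            # discard the earlier years
        else:
            high = mid               # discard the later years
    return low, high

# Suppose the mystery president is Kennedy, who took office in 1961:
print(narrow_down(lambda year: 1961 > year))  # (1958, 1961)
```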

That’s in large part what the GMAT scoring algorithm is doing. If its data suggests that you’re above average, it’s looking to pin down a score in the same way, with one important caveat: it knows that you’ll give it bad information from time to time. Sometimes you’ll make a silly mistake on a problem you should get right, and sometimes you’ll guess right on a question you should get wrong. So the algorithm has to be more flexible than the game of Twenty Questions, and it achieves that flexibility by using probability and statistics.
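Here’s a hedged illustration of that flexibility, using the same simplified logistic model as the earlier sketch. Because every answer is weighed probabilistically rather than treated as gospel, one careless miss among a strong set of responses lowers the estimate only partially; the numbers are artifacts of the toy model, not real GMAT values:

```python
import math

def p_correct(theta, b):
    # Same simplified one-parameter logistic model as the earlier sketch.
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def log_likelihood(theta, responses):
    return sum(math.log(p_correct(theta, b) if correct
                        else 1.0 - p_correct(theta, b))
               for b, correct in responses)

def estimate_ability(responses):
    grid = [i / 100.0 for i in range(-300, 301)]
    return max(grid, key=lambda t: log_likelihood(t, responses))

# A strong test-taker: right on items up through difficulty 2.0,
# wrong on the two hardest items.
strong = [(0.0, True), (0.5, True), (1.0, True), (1.5, True),
          (2.0, True), (2.5, False), (3.0, False)]

# The same test-taker with one careless miss on an easy item added.
careless = strong + [(0.5, False)]

# One bad data point lowers the estimate, but nowhere near to average:
print(estimate_ability(strong))    # roughly 2.6
print(estimate_ability(careless))  # roughly 2.0
```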

The ABCs of adaptive scoring

Let me preface this by saying that the following are all facets of “Item Response Theory”, the driver of computer-adaptive testing and the basis for the GMAT scoring algorithm. The GMAT may include its own tweaks and nuances, so we won’t claim that this is exactly how the GMAT does things, but rather how the theory works and, therefore, a likely approximation of what the GMAT does.

With Item Response Theory, each question has three values of its own:

A-value – measures how heavily that question is weighted. Not all questions carry the same weight in the “weighted average” calculation of your score. Ideally each question would work as absolutely as those in Twenty Questions: if you answer “yes” to “are you above the 50th percentile?”, that should be it. You’re above average! But in reality, the probability that your ability is above average is less than 100% based on that lone data point. Some questions give a much stronger indication than others, however, and the A-value serves to measure that, telling the computer how much stock to put in your answer to that question.

B-value – you could call this one the “difficulty” metric. This number tells the computer the ability level at which the question is the best determinant of ability. If the B-value is around the 75th percentile, then that’s an ideal question to give to someone at that ability level…but also a fair question, just a little less telling, for someone the system believes to be around the 65th percentile.

C-value – this accounts for the probability of answering correctly by a lucky guess, letting the system discount a right answer it shouldn’t fully trust (the sketch below shows how the three values combine).
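In standard Item Response Theory, these three values are the discrimination (a), difficulty (b), and guessing (c) parameters of the three-parameter logistic (3PL) model. The sketch below shows the textbook formula; per the caveat above, we’re not claiming the GMAT uses this exact formula or these numbers, just showing how the three values interact:

```python
import math

def p_correct_3pl(theta, a, b, c):
    """Standard three-parameter logistic (3PL) model from Item Response
    Theory: the probability that a test-taker of ability theta answers
    this item correctly.
      a -- discrimination: how sharply the item separates ability levels
      b -- difficulty: the ability level the item measures best
      c -- guessing floor: even the weakest test-taker can guess right"""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# A discriminating, hard, five-choice multiple-choice item:
a, b, c = 2.0, 1.0, 0.20

for theta in (-1.0, 0.0, 1.0, 2.0):
    print(theta, round(p_correct_3pl(theta, a, b, c), 2))
# -1.0 0.21   far below the difficulty, you're left with a lucky guess
#  0.0 0.3
#  1.0 0.6    at the item's difficulty, the most informative zone
#  2.0 0.9    far above it, the probability approaches 1
```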
