I finally got a response back from my actuarial brother-in-law, who says "my statistics are weak". Hmm -- been managing for too long, Dave?
There are three possible ways to improve the criteria if you don't care about absolute scores but do care about accuracy of ranking:
One is to compare each judge individually against the statistics of the group consisting of all the
other judges in that circle (this has the potential to improve any scoring system).
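A rough sketch of that first idea in Python, in case it helps. The function name, the dict layout, and the example numbers are all made up for illustration; the only assumption is that every judge scores the same pilots in the same order.

```python
from statistics import mean

def leave_one_out_deviations(scores):
    """scores maps judge name -> list of scores, one per pilot (same pilot order).

    Returns, for each judge, the per-pilot difference between the average of
    all the OTHER judges and that judge's own score.
    """
    result = {}
    for judge, own in scores.items():
        others = [s for name, s in scores.items() if name != judge]
        # Average of the other judges' scores for each pilot.
        others_avg = [mean(column) for column in zip(*others)]
        result[judge] = [avg - mine for avg, mine in zip(others_avg, own)]
    return result

# Hypothetical scores: three judges, three pilots.
scores = {"A": [80, 70, 60], "B": [82, 72, 62], "C": [90, 60, 70]}
deviations = leave_one_out_deviations(scores)
```

The point of leaving the judge out is that his own score no longer pulls the group average toward himself, which matters most when the circle is small.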
Another is to subtract out the judge's and the group's average score, i.e. instead of comparing:
group_average_pilot_i - judge_score_pilot_i,
compare
(group_average_pilot_i - group_average_all_pilots) - (judge_score_pilot_i - judge_average_all_pilots).
This would mean that a judge who consistently scores 5 points higher, but otherwise always tracks the group, would get a better rating, while a judge whose average score always matches the group, but who is either more random or trends differently than the group, would still stand out and get a worse rating.
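To make the difference concrete, here is a small Python sketch of the mean-subtracted comparison. The function name and the numbers are hypothetical; the judge's average is taken over all the pilots he scored.

```python
from statistics import mean

def centered_deviations(group_avg, judge_scores):
    """Per-pilot deviation after removing each side's own average.

    group_avg[i]    -- group average score for pilot i
    judge_scores[i] -- this judge's score for pilot i
    """
    group_mean = mean(group_avg)      # group average over all pilots
    judge_mean = mean(judge_scores)   # judge's average over all pilots
    return [(g - group_mean) - (j - judge_mean)
            for g, j in zip(group_avg, judge_scores)]

# Hypothetical numbers for three pilots.
group = [80, 70, 60]
offset_judge = [85, 75, 65]    # always 5 points high, otherwise tracks the group
erratic_judge = [70, 80, 60]   # same average as the group, different trend

offset_result = centered_deviations(group, offset_judge)     # [0, 0, 0]
erratic_result = centered_deviations(group, erratic_judge)   # [10, -10, 0]
```

The judge with the constant offset comes out with zero deviation everywhere, while the one whose average matches but whose trend differs still shows up.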
A third, which seemed really bright when I thought of it but seems questionable on reflection, is to take all the judges' scores and rank each pilot individually for each judge. Then give each judge one point for every other judge's rank that he matched for each pilot, and zero points for being off. Or if you really want to be nasty, zero points for being off by one and negative 1 point for being off by more than one.
For four pilots and three judges, this would work out to something like:
overall | judge 1 | judge 2 | judge 3 |
1st     | 1st     | 1st     | 3rd     |
2nd     | 2nd     | 2nd     | 1st     |
3rd     | 3rd     | 3rd     | 2nd     |
4th     | 4th     | 4th     | 4th     |
Scored against the overall ranking, judges 1 & 2 would each get four points, while judge 3 would either get 1 point or would get 0 points, depending on whether you wanted to be nasty about missing more than one place.
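Here is a minimal Python sketch that reproduces those numbers. It assumes each judge is scored against the overall column, and that each row of the table is one pilot; the pilot labels P1 to P4 and the function name are invented for the example.

```python
def rank_points(overall, judge, nasty=False):
    """Points for one judge, comparing his placings to the overall placings.

    overall, judge -- dicts mapping pilot -> placing (1 = first place)
    nasty          -- if True, being off by more than one place costs a point
    """
    points = 0
    for pilot, place in overall.items():
        miss = abs(judge[pilot] - place)
        if miss == 0:
            points += 1          # exact match
        elif nasty and miss > 1:
            points -= 1          # off by more than one place
        # off by exactly one place (or any miss when not nasty): zero points
    return points

# The four-pilot, three-judge table, one entry per row.
overall = {"P1": 1, "P2": 2, "P3": 3, "P4": 4}
judge_1 = {"P1": 1, "P2": 2, "P3": 3, "P4": 4}
judge_2 = {"P1": 1, "P2": 2, "P3": 3, "P4": 4}
judge_3 = {"P1": 3, "P2": 1, "P3": 2, "P4": 4}

print(rank_points(overall, judge_1))              # 4
print(rank_points(overall, judge_3))              # 1
print(rank_points(overall, judge_3, nasty=True))  # 0
```

Scoring each judge against every other judge individually instead of against the overall column would give different totals, so which comparison to use is a choice that has to be made up front.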
The downside to this method is that it's somewhat coarse -- someone who's clinging to the threshold of giving a wrong placement, but is just within it, will get the same score as someone who's judging exactly like the rest of the group. For a large number of pilots it may not be too bad.
Wasn't there some huge study done for RC or full-scale aerobatics judging that went into fine detail on the statistics involved, and had some recommended formula for ranking?