Author Topic: Another Nats Item (Read 1796 times)

Howard Rush · « **on:** March 15, 2015, 03:11:03 AM »

The formula used for the last seven Nats to evaluate judge performance assumes that each judge sees the same number of contestants each round. This was a pretty valid assumption until the last couple of Nats. The problem is that a judge's evaluation score tends to get worse the more people he judges. Below is the formula we’ve been using.

I think it should be normalized by dividing again by the number of flights judged by that given judge. I can give only a weak theoretical explanation, but I’m fairly confident that this is correct, having looked at some randomly generated scores and also some real contest data. This change also requires using a weighted (by number of flights) average of judge evaluation scores from each day’s judging, rather than lumping all the flights together. I also multiplied the judge scores by 1,000 to make them look nicer: e.g. 54, rather than .054.
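A minimal sketch of the flight-weighted average described above. It assumes each day's evaluation score has the form spelled out later in this thread (total notches off in ranking, divided by the square of the number of flights, times 1,000). The function names and the sample numbers are hypothetical, and this is not the Nats program's actual code.

```python
def daily_score(total_notches_off, flights):
    """One day's evaluation score: 1000 * notches / flights**2 (lower is better)."""
    return 1000.0 * total_notches_off / flights ** 2


def weighted_overall(days):
    """Flight-weighted average of daily scores.

    `days` is a list of (total_notches_off, flights) pairs, one per judging day.
    """
    total_flights = sum(flights for _, flights in days)
    weighted_sum = sum(
        flights * daily_score(notches, flights) for notches, flights in days
    )
    return weighted_sum / total_flights


# Hypothetical judge: 20 flights on day one, 10 flights on day two.
example_days = [(40, 20), (5, 10)]
overall = weighted_overall(example_days)
print(round(overall, 1))
```

The day with more flights counts for more in the overall number, which is the point of weighting rather than lumping all flights into one ranking.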

This is kinda esoteric and not very significant. The main reason for bringing it up is that I put the revised formula into the Nats program, and anything that goes into that program is public and objective. The other reason for bringing it up is my ongoing request for theoretical input.

I plugged last year’s Nats data into the revised formula. It made very little difference to the judge rankings. Two guys moved up one notch each, and two guys moved down one notch each.

Dennis Moritz · « **Reply #1 on:** March 15, 2015, 01:45:11 PM »

Judges get tired.


Tim Wescott · « **Reply #2 on:** March 15, 2015, 03:24:44 PM »

I need to dig out my statistics book and do some thinking.

I do know that whatever you're doing, it doesn't have a name in normal statistics -- you're not estimating the mean and variance and then reforming a score from that, and you're not estimating the root-mean-squared average error. The judge's score is kind of an aggregate something-or-other.

I would be inclined to estimate the mean and standard deviation of the error separately, then weight how much of each you care about (the mean error tells you if the judge is consistently high or low; the standard deviation tells you how widely the judge's scores get splattered).
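A rough sketch of that idea, assuming the "error" is the judge's score minus some reference score for the same flight. The function names, the weights, and the sample scores are all made up for illustration; how to weight bias against spread is left open here.

```python
from statistics import mean, pstdev


def error_stats(judge_scores, reference_scores):
    """Return (mean error, standard deviation of error) for one judge."""
    errors = [j - r for j, r in zip(judge_scores, reference_scores)]
    return mean(errors), pstdev(errors)


def combined(bias, spread, bias_weight=1.0, spread_weight=1.0):
    """Hypothetical single figure: weighted sum of |bias| and spread."""
    return bias_weight * abs(bias) + spread_weight * spread


reference = [500.0, 505.0, 490.0, 515.0]
high_judge = [505.0, 510.0, 495.0, 520.0]     # always 5 points high
scatter_judge = [500.0, 515.0, 480.0, 515.0]  # right on average, but splattered

high_bias, high_spread = error_stats(high_judge, reference)
scatter_bias, scatter_spread = error_stats(scatter_judge, reference)
```

Here the first judge shows a bias of 5 with no spread, and the second shows no bias with a spread of about 7, so the two kinds of error stay visible separately.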

This is all complicated by the fact that you're not measuring the judge's performance against a standard yardstick: you're measuring the judge against an aggregate score, which means that one bad judge will make all the judges in the group look bad (in the extreme case, with just two judges, they'd always score exactly the same, even if one is Paul Walker and the other is me).

I dunno what the right answer is -- this is why I was regretting not pulling my brother-in-law-the-actuary into control line.

Tim Wescott · « **Reply #3 on:** March 15, 2015, 10:58:59 PM »

Is the goal of scoring the judges' performance to ensure that the points awarded in the various circles are the best representation of the "ideal" score? In other words, would you like the scores for the first two days (it's two days, right?) to be as close as possible to what you'd get if you had one big contest in one circle?

Or is the goal of scoring the judges' performance to ensure that any group of judges drawn from the judging pool is the most likely to get the correct ranking of contestants in each circle, without regard to how a score in circle A may compare to a score in circle B?

Because if you're aiming for the former, that's not necessarily a bad algorithm to use. But if you're aiming for the latter, I think you're leaving things out.

Howard Rush · « **Reply #4 on:** March 16, 2015, 12:26:12 AM »

Maybe the second. The event director can use the data from the four qualification circles to allocate the judges on Friday, when all the Open guys fly one flight before each of two sets of judges, and the Advanced guys do the same before each of two other sets of judges. Then, he can use data from the qualification days and Friday to allocate judges on Saturday, when the top five Open guys fly off, the kids fly off, and the winners of each age category fly for the Jim Walker trophy.

Tim Wescott · « **Reply #5 on:** March 17, 2015, 05:02:09 PM »

I finally got a response back from my actuarial brother-in-law, who says "my statistics are weak". Hmm -- been managing for too long, Dave?

There are three possible ways to improve the criteria if you don't care about absolute scores but do care about accuracy of ranking:

One is to compare each judge individually against the statistics of the group consisting of all the other judges in that circle (this has the potential to improve any scoring system).
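As a sketch of that first option: the reference for a judge is the average of the other judges' scores on each flight, so his own scores don't pull the reference toward him. The judge names, the data layout, and the numbers are hypothetical.

```python
def leave_one_out_average(scores_by_judge, judge):
    """Per-flight average over every judge in the circle except `judge`.

    `scores_by_judge` maps a judge's name to his list of scores,
    one score per flight, in the same flight order for every judge.
    """
    others = [
        scores for name, scores in scores_by_judge.items() if name != judge
    ]
    return [sum(column) / len(column) for column in zip(*others)]


circle = {
    "A": [500.0, 510.0],
    "B": [490.0, 520.0],
    "C": [480.0, 500.0],
}
reference_for_a = leave_one_out_average(circle, "A")
print(reference_for_a)
```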

One is to subtract out the judge's and the group's average score, i.e. instead of comparing:

group_average_pilot_i - judge_score_pilot_i,

compare

(group_average_pilot_i - group_average_all_pilots) - (judge_score_pilot_i - judge_score_all_pilots).

This would mean that a judge who scores consistently 5 points higher, but otherwise always tracks the group would get better scores, while a judge whose average score always matches the group but who is either more random or trends differently than the group would still stand out, and get a worse score.
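A small sketch of the centered comparison, using the expression above directly. The function name and the sample scores are made up for illustration.

```python
def centered_errors(judge_scores, group_averages):
    """(group_i - group_mean) - (judge_i - judge_mean) for each pilot i."""
    judge_mean = sum(judge_scores) / len(judge_scores)
    group_mean = sum(group_averages) / len(group_averages)
    return [
        (g - group_mean) - (j - judge_mean)
        for j, g in zip(judge_scores, group_averages)
    ]


group = [500.0, 505.0, 490.0, 515.0]
high_judge = [505.0, 510.0, 495.0, 520.0]  # tracks the group, 5 points high

raw = [g - j for j, g in zip(high_judge, group)]
centered = centered_errors(high_judge, group)
```

With these numbers every raw difference is -5, while every centered difference is 0, so the constant offset no longer counts against the judge.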

A third, which seemed really bright when I thought of it but seems questionable on reflection, is to take all the judges' scores and rank each pilot individually for each judge. Then give each judge one point for every other judge's rank that he matched for each pilot, and zero points for being off. Or if you really want to be nasty, zero points for being off by one and negative 1 point for being off by more than one.

For four pilots and three judges, this would work out to something like:

overall	judge 1	judge 2	judge 3
1st	1st	1st	3rd
2nd	2nd	2nd	1st
3rd	3rd	3rd	2nd
4th	4th	4th	4th

Judges 1 & 2 would each get four points, while judge 3 would either get 1 point or would get 0 points, depending on whether you wanted to be nasty about missing more than one place.
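The point totals above can be reproduced with a short sketch. It assumes each judge's placings are compared against the "overall" column of the table, since that is the reading that yields four points for judges 1 and 2; the function name and the `nasty` flag are made up for illustration.

```python
def match_points(judge_ranks, overall_ranks, nasty=False):
    """One point per exact placing match.

    With nasty=True, being off by more than one place costs a point;
    being off by exactly one place scores zero either way.
    """
    points = 0
    for judge_rank, overall_rank in zip(judge_ranks, overall_ranks):
        off = abs(judge_rank - overall_rank)
        if off == 0:
            points += 1
        elif nasty and off > 1:
            points -= 1
    return points


overall = [1, 2, 3, 4]
judge_1 = [1, 2, 3, 4]
judge_3 = [3, 1, 2, 4]

print(match_points(judge_1, overall))              # 4
print(match_points(judge_3, overall))              # 1
print(match_points(judge_3, overall, nasty=True))  # 0
```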

The downside to this method is that it's somewhat coarse -- someone who's clinging to the threshold of giving a wrong placement, but is just within it, will get the same score as someone who's judging exactly like the rest of the group. For a large number of pilots it may not be too bad.

Wasn't there some huge study done for RC or full-scale aerobatics judging that went into fine detail on the statistics involved, and had some recommended formula for ranking?

Randy Powell · « **Reply #6 on:** March 17, 2015, 06:08:14 PM »

Maybe there is an "I'm tired" corrective factor.

Howard Rush · « **Reply #7 on:** March 17, 2015, 06:15:47 PM »

Before I spend much effort reading what you wrote, do you know how the method we use works?

Quote from: Tim Wescott on March 17, 2015, 05:02:09 PM

Wasn't there some huge study done for RC or full-scale aerobatics judging that went into fine detail on the statistics involved, and had some recommended formula for ranking?

Yes, and we use a judge evaluation method derived from that, but we use the traditional method of averaging the judges' scores for each flight to determine contestant placing. To switch to the pilot-scoring method the full-scale guys use would require a rules change and would be difficult for folks to accept. I'm one of those who'd have difficulty accepting it. I haven't seen any mathematical justification for it.

Tim Wescott · « **Reply #8 on:** March 17, 2015, 06:31:16 PM »

Quote from: Howard Rush on March 17, 2015, 06:15:47 PM

Before I spend much effort reading what you wrote, do you know how the method we use works?

Is it not what you posted in the opening post to this thread?

Howard Rush · « **Reply #9 on:** March 17, 2015, 06:40:49 PM »

Quote from: Tim Wescott on March 17, 2015, 06:31:16 PM

Is it not what you posted in the opening post to this thread?

It is. Offhand, it looks like the things you discuss are notions off the top of the head of somebody who hadn't looked at what we actually do. You could convince me otherwise.

I doubt if I was clear that we compare the ranking from 1 to n of all the flights in the round by each judge to the scoreboard ranking from 1 to n. A judge's score is based on the total number of notches he was off in ranking each flight. In your example, judges 1 and 2 would score 0 and judge 3 would score 1,000 * (2 + 1 + 1 + 0) / 4².
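The arithmetic for that example, as a minimal sketch. It takes the description above at face value (total notches off, divided by n squared, times 1,000); the function name is hypothetical and this is not the Nats program's code.

```python
def rank_offset_score(judge_ranks, scoreboard_ranks):
    """1000 * (total notches off in ranking) / n**2, where n is the flight count."""
    n = len(judge_ranks)
    notches = sum(abs(j - s) for j, s in zip(judge_ranks, scoreboard_ranks))
    return 1000 * notches / n ** 2


scoreboard = [1, 2, 3, 4]
judge_1 = [1, 2, 3, 4]
judge_3 = [3, 1, 2, 4]

print(rank_offset_score(judge_1, scoreboard))  # 0.0
print(rank_offset_score(judge_3, scoreboard))  # 250.0
```

Lower is better here: a judge whose ranking matches the scoreboard exactly scores 0.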

Tim Wescott · « **Reply #10 on:** March 18, 2015, 02:10:32 AM »

Quote from: Howard Rush on March 17, 2015, 06:40:49 PM

I doubt if I was clear that we compare the ranking from 1 to n of all the flights in the round by each judge to the scoreboard ranking from 1 to n. A judge's score is based on the total number of notches he was off in ranking each flight. In your example, judges 1 and 2 would score 0 and judge 3 would score 1,000 * (2 + 1 + 1 + 0) / 4².

Ah. I read "placement" and my brain said "score". You meant "placement" or "ranking". I'm not sure if your division by n² is the best arrangement (although I'll believe you that it's better than division by n¹), but unless and until I play with some numbers I can't argue with it.

Derek Barry · « **Reply #11 on:** March 18, 2015, 12:48:31 PM »

I say draw numbers from a hat. I think it would work just as well as the old formula, maybe better...

Derek
