Mr. Buk,
Yes, it is a very interesting discussion. Almost everyone knows what the "mean" is, but few know exactly what its limitations are.
Why use three or five judges if you don't think any of the scores will be "wrong"?
I think the correct term is "extreme value" instead of "wrong value". The extreme value, different from the other values in the group, is not necessarily wrong, but the probability of it being right is very low.
The reason you use more than one judge is that THE SCORE IS A MATTER OF OPINION, not a measurement of an objective number. They may (and do) weight different types of errors differently, but *not incorrectly*. If a particular flier manages to get good scores from a group of competent judges all weighting different errors fairly, that indicates that the pilot is making fewer errors of all types, and therefore should win. If you just had one judge, and that one judge weighted one aspect of a flight more heavily, that selects (more or less at random) for the guy who does that one thing better, not the one who does a reasonable range of things well.
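To put some toy numbers on that, here's a quick Python sketch - the pilots, error counts, and judge weights are all invented for illustration, not from any real contest - showing how a panel of judges who weight errors differently but fairly favors the well-rounded flier, while a single judge with a pet error type picks the specialist:

```python
# Toy model: two pilots, three error types, made-up error counts.
# Lower total deduction = better flight.
pilots = {
    "well_rounded": {"shapes": 2, "intersections": 2, "bottoms": 2},
    "specialist":   {"shapes": 5, "intersections": 5, "bottoms": 0},
}

# Three judges who weight the same error types differently (all "fair").
panel = [
    {"shapes": 3, "intersections": 2, "bottoms": 2},
    {"shapes": 2, "intersections": 3, "bottoms": 2},
    {"shapes": 2, "intersections": 2, "bottoms": 3},
]

def deduction(errors, weights):
    return sum(errors[e] * weights[e] for e in errors)

for name, errors in pilots.items():
    per_judge = [deduction(errors, w) for w in panel]
    print(name, per_judge, "panel average:", sum(per_judge) / len(per_judge))
# well_rounded averages 14 points off, specialist about 23.3 - panel picks
# the well-rounded flier.

# A single judge who weights "bottoms" very heavily picks the specialist,
# even though the panel as a whole would not.
bottoms_judge = {"shapes": 1, "intersections": 1, "bottoms": 10}
for name, errors in pilots.items():
    print(name, "single-judge deduction:", deduction(errors, bottoms_judge))
# well_rounded loses 24 points, specialist only 10.
```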
Moreover, judges in real life may in fact be looking for the same sorts of errors, but deduct more (or less) *per error* than someone else. That results in the "high judge/low judge" phenomenon. There are judges whose scores are known to run high or low but nonetheless rank the fliers in the correct order - someone might give a mediocre flight a 525 and a great one 570, someone else might give the same flights a 425 and a 470, but BOTH rank the flights in the right order. There is no reason to believe that the 525 is more "correct" than the 425. **
Those are AMA scores, of course, but replace them with 900 and 1100 for FAI and the same point applies.
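Here's the same 525/570 vs. 425/470 example from the paragraph above as a trivial Python check, just to make the point that an offset in one judge's raw scores doesn't change the order:

```python
# The "high judge / low judge" effect: one judge's raw scores run 100
# points below the other's, but both rank the flights the same way,
# so neither is more "correct".
high_judge = {"mediocre": 525, "great": 570}
low_judge  = {"mediocre": 425, "great": 470}

def ranking(scores):
    # Flights ordered best-first by this judge's raw scores.
    return sorted(scores, key=scores.get, reverse=True)

print(ranking(high_judge))  # ['great', 'mediocre']
print(ranking(low_judge))   # ['great', 'mediocre'] - identical order
```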
This is the essential fallacy of blindly throwing out the high and low scores on a particular flight, and also the fallacy of using the median raw score. This has long been understood, and that is why there is judge selection from one round to the next based on "tracking".
Whatever method you use (mean or median), you substitute the scores of each judge (3, 5, 7, or better 20 or more) with only one number. I didn't understand when you say "...and should be removed (much less two of them)? ..."
In your example case, with 3 judges and using the median, you use only one score - the one in the middle. The other two are not used, despite the fact that one of the judges saw far more errors, or deducted more for a particular set of errors. With the average (or sum, same thing) the score is directly determined from all of the judges' inputs.
It's a degenerate case, but that also makes the flaw of the method obvious.
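To spell out the degenerate case with made-up three-judge scores, using Python's statistics module:

```python
# Degenerate three-judge case: the median keeps only the middle score,
# while the mean (or sum) responds to every judge's input.
import statistics

scores = [470, 520, 525]   # made-up raw scores from three judges

print(statistics.median(scores))  # 520 - the 470 and the 525 are ignored
print(statistics.mean(scores))    # 505 - all three judges move the result

# If the low judge had seen even more errors, the median wouldn't budge:
print(statistics.median([430, 520, 525]))  # still 520
print(statistics.mean([430, 520, 525]))    # ~491.7 - the mean reflects it
```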
Brett
** p.s. By the way, this suggests that what you might do is wait until you get all the scores from a round, execute a tracking program, and then exclude a particular judge's scores from EVERYBODY's flights, then calculate and post the scores. This seems to directly address the possibility that some judge was intentionally biased, which should show up as his scores reflecting a bias towards a particular flier. This is the essence of 99% of the judging complaints, particularly the "judge is my flying buddy" or "West Coast judge packing" complaints. Even this is fallacious reasoning - there's absolutely no way, from studying the scores, of distinguishing intentionally biased scores from a legitimate preference for the same type of flying/array of errors.
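For what it's worth, here's a rough Python sketch of that procedure - the fliers, scores, and crude rank-agreement measure are all invented here, and a real tracking program would presumably use a proper rank statistic, but it shows the mechanics: score the round, find the judge who tracks worst against the panel consensus, and drop that judge's column from every flight before recomputing:

```python
# Sketch of the post-round procedure: score the whole round, "track" each
# judge against the panel consensus, drop the worst-tracking judge's
# scores from EVERY flight, then recompute. All numbers are made up.
scores = {                      # flier -> [judge A, judge B, judge C]
    "flier1": [560, 555, 430],
    "flier2": [540, 535, 450],
    "flier3": [520, 515, 470],
}
n_judges = 3

def ranking(get_score):
    """Fliers ordered best-first by the given scoring function."""
    return sorted(scores, key=get_score, reverse=True)

consensus = ranking(lambda f: sum(scores[f]))

def disagreements(judge):
    """Count of flier pairs this judge orders opposite to the consensus."""
    pos = {f: i for i, f in enumerate(ranking(lambda f: scores[f][judge]))}
    cons = {f: i for i, f in enumerate(consensus)}
    fl = list(scores)
    return sum(1 for i in range(len(fl)) for j in range(i + 1, len(fl))
               if (pos[fl[i]] - pos[fl[j]]) * (cons[fl[i]] - cons[fl[j]]) < 0)

worst = max(range(n_judges), key=disagreements)
print("dropping judge", worst)   # judge C (index 2) ranks the round backwards

# Recompute every flier's total without the dropped judge's column.
for f, s in scores.items():
    kept = [x for i, x in enumerate(s) if i != worst]
    print(f, sum(kept))
```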
A real example with real names for once, with a made-up hypothetical case. Judge Peabody and pilots Urtnowski and Fancher, same circle. Just say for argument's sake that we had 5 judges, Peabody and 4 others. Peabody and Windy are both from the same area, have many of the same contest experiences, see the same people fly most of the time, and have the same basic idea of how stunt should be flown. Peabody sees Fancher twice a year; they have no common experience, see different people fly, and may have very different ideas on how errors should be weighted. The other 4 judges have no common experiences with EITHER Ted or Windy or anyone else.
Get out there at the NATs. Fly some flights: 4 of the judges track perfectly all day, all ranking the 40 flights they see in the same order. Say they all think Ted was first and Windy was third in the round. Peabody's scoring sticks out: it ranks Windy 1st and Ted 10th. Aha, we see what is going on, that antichrist of stunt is a ringer for Windy and is killing Ted out of spite, right?
But of course that's jumping to a conclusion. It is ABSOLUTELY IMPOSSIBLE to make that conclusion from looking at the scores after the fact. There's no way to distinguish this apparent "cheating" from a case where, naturally enough, Peabody is looking for or weighting the errors differently, and it would be perfectly reasonable to expect that Windy has concentrated on removing the types of errors that are heavily weighted by people in the Northeast, while Ted has concentrated on removing other types of errors. It is entirely legitimate (and inevitable) that both the absolute and relative weighting of different types of errors may differ from judge to judge, so the "anomalous" tracking that Peabody's results seem to show may be cheating, or may be the result of looking for different errors and legitimately deducting for them.
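You can demonstrate the indistinguishability with a toy calculation - the error counts, weights, and base score below are all invented - where two completely different hypotheses about Peabody produce the exact same pair of scores:

```python
# Two hypotheses, one observation: the same scores can come from a judge
# legitimately weighting errors differently OR from a deliberate bias.
windy_errors = {"corners": 4, "bottoms": 1}   # few heavily-weighted errors
ted_errors   = {"corners": 1, "bottoms": 4}

def score(errors, weights, bias=0):
    return 600 - sum(errors[e] * weights[e] for e in errors) + bias

# Hypothesis 1: Peabody honestly weights corners lightly, bottoms heavily.
h1 = (score(windy_errors, {"corners": 2, "bottoms": 10}),
      score(ted_errors,   {"corners": 2, "bottoms": 10}))

# Hypothesis 2: Peabody weights both error types evenly at 8.4 points
# each, but pads Windy's flight by 24 points on purpose.
h2 = (score(windy_errors, {"corners": 8.4, "bottoms": 8.4}, bias=24),
      score(ted_errors,   {"corners": 8.4, "bottoms": 8.4}))

print(h1)  # (582, 558)
print(h2)  # (582.0, 558.0) - identical scores, so the numbers alone
           # cannot tell an honest difference in weighting from cheating
```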
That doesn't mean the tracking method used (which has always looked for these sorts of "anomalies") is incorrect or shouldn't be used. We think it picks judges that have a balanced range of error weights. But it does mean you can't ever determine that someone is doing it on purpose.
I would note that if you replace "Peabody" with "McClellan" and "Windy" with "Bob Baron" you have the gist of the "Anatomy of a Team Trials", where careful mathematical analysis of scores was used to leap to idiotic conclusions about Gary's motivations.
And for the record, I have never considered Peabody's scores of my flying to be erroneous or biased, despite our extremely contentious relationship; I just used him as an example. I did have a bit of a problem with his comments at the 2002 Judge Training on the topic of the "West Coast Hourglass", but even that could have been a discussion of the geometry. He was wrong, as I showed in my SN article "Fun facts about the Hourglass", but being wrong and judging unfairly are two entirely different things. - bb