Topic: Number of Judges for a Stunt Contest

Howard Rush · « **on:** March 12, 2015, 05:35:57 PM »

I've been trying to figure out some stuff about stunt scoring. One question I had was how increasing the number of judges affects accuracy in placing contestants. I figured that if one could represent each judge's accuracy by a probability distribution, one could calculate the effect of number of judges judging a flight. I don't remember enough probability to figure this out analytically, so I decided to brute-force it with a Monte Carlo simulation.

One problem is picking a probability distribution function to represent how accurately a judge scores a maneuver. I assumed that a judge's performance could be represented by a simple function, which may not be too pure an assumption. I didn't consider the high-scoring-judge-low-scoring-judge issue, which doesn't matter (to be proven separately), nor favoritism, nor other judging vagaries. Smart people told me to use a binomial distribution, so I did, sorta. The first figure shows Probability Function 1, which is a binomial distribution (actually a density function, also known (by Wikipedia, anyhow) as a probability mass function). That looked like pretty flaky judging performance, so I squished it to make two other functions, Probability Function 2 and Probability Function 3. I whacked off the tails of each function to save calculation time: I limited errors to +/- 8 for Probability Function 1, for example. Which, if any, do you think comes closest to representing a good stunt judge? I ran data for all three. Note that these show error probabilities per maneuver, not per flight.

I decided to determine how likely it is for a panel of judges to err in the relative placing of two guys who flew patterns that should have scored two points apart, which is the sort of point spread one sees in big-league stunt. The second plot shows the upchuck of 30,000 flights for each of the 18 combinations of number of judges and probability functions. It took that many for the worst cases to converge. I was surprised to see that the better the judges, the better the effect of having lots of them. My conclusion is that the number of judges should be maximized for the Nats Open top-20 day and for the Nats Open finals. I think histograms of the 18 cases would be interesting, and I might calculate them if there's any interest.
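For anyone who wants to poke at the same question, here is a minimal Monte Carlo sketch of the kind of run described above. It is not the program used for the plots. The 15-maneuver pattern, the score of 35 per maneuver, the 16-trial fair-coin binomial, and the choice to average the judges' flight totals are all assumptions made for illustration, and ties are counted as wrong placements.

```python
import random

def judge_error(rng, trials=16, limit=8):
    """One judge's error on one maneuver: a centered binomial, clipped to +/- limit."""
    heads = bin(rng.getrandbits(trials)).count("1")  # Binomial(trials, 0.5)
    return max(-limit, min(limit, heads - trials // 2))

def panel_score(rng, true_maneuver_scores, n_judges):
    """Average over judges of each judge's noisy flight total."""
    totals = [
        sum(s + judge_error(rng) for s in true_maneuver_scores)
        for _ in range(n_judges)
    ]
    return sum(totals) / n_judges

def misrank_rate(n_judges, spread=2, flights=2000, seed=1):
    """Fraction of flight pairs where the truly better pilot does not come out ahead."""
    rng = random.Random(seed)
    better = [35] * 15      # hypothetical 15-maneuver pattern
    worse = list(better)
    worse[0] -= spread      # the worse flight is `spread` points lower in total
    wrong = 0
    for _ in range(flights):
        if panel_score(rng, better, n_judges) <= panel_score(rng, worse, n_judges):
            wrong += 1
    return wrong / flights

if __name__ == "__main__":
    for n in (1, 3, 6):
        print(n, "judges:", misrank_rate(n))
```

Swapping in a narrower error distribution (fewer `trials`) is the rough equivalent of the squished Probability Functions 2 and 3.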

Tim Wescott · « **Reply #1 on:** March 12, 2015, 07:28:33 PM »

If you had used a Gaussian distribution for this you would have found the math easy enough to be calculated by hand.

What you didn't look for is what happens if you have n judges, with n-1 of them having a "good" point spread, and one of them having a point spread twice as bad. Assuming that by "point spread" I mean the deviation, then if you start with some number of "good" judges (expected point spread = 2) and one "bad" judge (expected point spread = 6), then the expected point spread of the average of the judges' scores will be dominated by the one bad judge out to a group of three judges, and will still be significantly worse even when you have nine "good" judges and one "bad" one.

See da graph -- it shows the expected deviation in score for groups of various sizes, first assuming one bad judge (call him "Tim") among a bunch of good ones (Call them "Paul", "Mike", "Steve", "Mark", "Dave", etc.), second assuming that everyone is a "good" judge.

So your available energy for dealing with judges may best be spent by finding good ones and making them better, rather than trying to find huge masses of them to judge contests (not to mention, if you do manage to round up ten judges, a lot of them are going to be wishing they could be flying and not judging).

Howard Rush · « **Reply #2 on:** March 12, 2015, 08:18:18 PM »

Interesting, Tim. Thanks.

Quote from: Tim Wescott on March 12, 2015, 07:28:33 PM

If you had used a Gaussian distribution for this you would have found the math easy enough to be calculated by hand.

By you, maybe. Show me, please. It would be interesting to make comparisons of the same thing. You are looking at a different matter. Even so, it looks like having more judges dilutes the bad judge rather well.

I thought about doing a run with judges using different probability functions among them. Maybe I'll do that with one rogue judge.

I should also show standard deviations for my data. It would be easier than making histograms. I like the how-often-do-the-judges-get-it-wrong calculation, which is maybe more interesting to the usual stunt person than standard deviations.

Quote from: Tim Wescott on March 12, 2015, 07:28:33 PM

So your available energy for dealing with judges may best be spent by finding good ones and making them better, rather than trying to find huge masses of them to judge contests ...

Actually, much of my available energy for dealing with judges lately has been spent on the Nats judge-ranking formula. I came up with a change to deal with a problem caused indirectly by dwindling Nats Advanced attendance (with or without Expert). I'll put something out on that presently.

Tim Wescott · « **Reply #3 on:** March 12, 2015, 11:03:32 PM »

I tried to answer this, but when I got up to about half a page and realized that I had a totally incomprehensible summary of the first two or three chapters of any decent book on statistics, I stopped.

Do you want to learn statistics? They teach it to psychology majors, so it can't be that hard -- (hello Randy Powell!).

The really, really huge summary, which you should only use as a motivation to go learn this stuff, is that:

Just about everything practical, in time, ends up looking like the Gaussian distribution (the binomial distribution for 15 trials looks pretty close)
If something has a Gaussian distribution, you only have to care about the mean and variance
There are some easy to apply rules about what to do with mean and variance when you add or subtract Gaussian random variables, and when you multiply them by constants
If you apply these rules, you'll find out that when you add two Gaussian random variables, the variance of the sum is dominated by the greater variance of the two random variables -- this means that averaging random variables works best if they all have the same variance
If you make use of something called the "error function" (Excel almost certainly calls it "erf"), then you can compute your probability of a wrong placement directly, without Monte Carlo simulation

Howard Rush · « **Reply #4 on:** March 12, 2015, 11:16:03 PM »

Here are standard deviations for the above.

Tim Wescott · « **Reply #5 on:** March 12, 2015, 11:17:57 PM »

Quote from: Howard Rush on March 12, 2015, 08:18:18 PM

By you, maybe. Show me, please. It would be interesting to make comparisons of the same thing. You are looking at a different matter. Even so, it looks like having more judges dilutes the bad judge rather well.

When you add a bunch of random variables, the variance of the result is the sum of the variances of the individual variables. Variance is the square of standard deviation, which is why one judge who splatters his scores all over will unduly influence the overall result.

So, having lots of judges does dilute the effect of one bad one, but you're still better off not having the one bad judge, particularly if those good judges are scarce.

Quote from: Howard Rush on March 12, 2015, 08:18:18 PM

Actually, much of my available energy for dealing with judges lately has been spent on the Nats judge-ranking formula. I came up with a change to deal with a problem caused indirectly by dwindling Nats Advanced attendance (with or without Expert). I'll put something out on that presently.

Well, if the Nats judge-ranking formula causes you to pick good judges, then your energy is being well spent.

Tim Wescott · « **Reply #6 on:** March 12, 2015, 11:52:05 PM »

I will now descend into math ("maths" if this were an English-speaking, rather than an American-speaking group. This is why the Americans won the war -- having only one math, instead of lots of them, makes it easier for our engineers to perform calculations).

Consider two random, Gaussian variables, x and y. Because they are Gaussian, each one has a mean and a standard deviation. If you take them as flight scores, then their means are what the scores "should" be, and their standard deviation is a measure of how much each score can differ from that mean.

Contestant x wins over contestant y if the expression x > y is true. (Let's not think about ties -- just please no). For x > y to be true, the expression x - y > 0 must be true.

Now define the random variable w = x - y. The mean of w can be found from m_w = m_x - m_y -- in other words, the mean of w is the difference of the means of x and y. The variance of w can be found as the sum of the variances of x and y: v_w = v_x + v_y.

Because x and y are Gaussian, w is Gaussian. This means that w is completely defined by its mean and variance.

The probability that w > 0 is true is equal to (1 + erf(m_w / (s_w * sqrt(2))))/2, where s_w is the standard deviation of w, defined as s_w² = v_w, and erf(x) is the error function as defined in the Scilab help files (everyone defines erf differently -- it's special that way).

If the mean of x is greater than the mean of y, then pilot 'x' "should" win. So the probability of an error in placement in this case is simply* the probability that the actual, judged and calculated x is less than the actual judged and calculated y -- in other words, the probability that w < 0, which is (1 - erf(m_w / (s_w * sqrt(2))))/2.

Some example probabilities of errors are:

nearly 0.5 when m_w is nearly zero -- in other words, when it's a dead heat, who wins is a coin toss
about 0.40 when m_w/s_w = 0.25
about 0.31 when m_w/s_w = 0.5
about 0.16 when m_w/s_w = 1
about 0.023 when m_w/s_w = 2

This means that when the standard deviation of the difference in judged scores is equal to the true difference in scores, there's about a 1 in 6 chance that the ranking between those two people will be wrong.
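The calculation is short enough to check directly. The sketch below uses Python's `math.erf`, which is the standard error function (2/sqrt(pi) times the integral of exp(-t^2) from 0 to x). With that definition the ratio m_w/s_w has to be divided by sqrt(2) before it goes into erf to get a normal tail probability. The function name is invented for the example.

```python
import math

def p_wrong_placement(mean_diff, std_diff):
    """P(w < 0) for a Gaussian w with the given mean and standard deviation."""
    return 0.5 * (1.0 - math.erf(mean_diff / (std_diff * math.sqrt(2.0))))

for ratio in (0.0, 0.25, 0.5, 1.0, 2.0):
    print(f"m_w/s_w = {ratio:4.2f}: {p_wrong_placement(ratio, 1.0):.4f}")
```

A spreadsheet's cumulative normal function should give the same numbers if erf is not handy.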

Where this whole analysis really crashes into the Rocks of Reality is the fact that you're making a noisy measurement (via judging) of a quantity (how "good" the flight is), which itself is random: how well I (or Paul Walker, or Howard) fly on any given day is subject to variation not only in the air conditions and all sorts of other uncontrollable external factors, but also in how well the pilot does on that particular flight. So you can toss numbers around all day, and get a better understanding of what might be -- but even if we were all trained statisticians who graduated in the tops of our respective classes, we'd still disagree on what actually is.

* People like to sprinkle mathematical calculations with the word "simple" or "simply". I don't know if it's because we're just reveling in actually having figured the stuff out, or if we like it when friends read what we've written and say "What?!? SIMPLE?".

Howard Rush · « **Reply #7 on:** March 13, 2015, 01:00:17 AM »

Pretty cool, Tim. Using your method, I came within 5% of what my crude simulation gave for the probabilities of error in ranking guys two points apart in actual flying. That will make it much quicker (a millisecond vs. 10 hours) to repeat the exercise for different point spreads and judge combinations. I reckon the small difference in results came from the non-Gaussianity of my probability functions.

FLOYD CARTER · « **Reply #8 on:** March 13, 2015, 11:55:09 AM »

Mathematics can be invoked all over the place. How to account for all judges being biased? Either overtly or without collusion? I'm sure this isn't universal, and probably doesn't apply to major contests where close scrutiny is probable. If a pool of local judges knows a particular contestant, and has a preconceived bias, then how can that pool of judges be expected to render a fair score?

I have seen this effect in action. There is no cure, except for a contestant to retire from competition and perhaps wait for a new generation of judges to evolve.

F.C.

phil c · « **Reply #9 on:** March 13, 2015, 11:59:17 AM »

Howard, you've got data from at least one set of NATS score sheets. Simply use that data and calculate the variance statistics for, say, the top twenty -- variance due to pilot, maneuver, judge. No need to guesstimate what kind of judging spread you've got.

It must be bad weather in Seattle if you've got time on your hands to do this kind of stuff.

Phil C

ps(or PA?) You generally need about 20 judges, minimum, to get the chance of a wrong placement down below 3-5%.

Tim Wescott · « **Reply #10 on:** March 13, 2015, 12:54:35 PM »

Quote from: FLOYD CARTER on March 13, 2015, 11:55:09 AM

How to account for all judges being biased?

I believe that part of the Nats judge selection process is designed to detect and reject judges that are inconsistent -- judging bias toward one pilot, unless it's shared by all the judges, would show up as inconsistency. One could also explicitly look for such bias.

Of course, if all the judges were equally biased toward some famous (or infamous) pilot, then a blind algorithm could not pick up on this.

Quote from: phil c on March 13, 2015, 11:59:17 AM

Howard, you've got data from at least one set of NATS score sheets. Simply use that data and calculate the variance statistics for, say, the top twenty -- variance due to pilot, maneuver, judge. No need to guesstimate what kind of judging spread you've got.

I suspect that the sample would be too small to get really good statistics. The result would be a "guesstimate".

Quote from: phil c on March 13, 2015, 11:59:17 AM

ps(or PA?) You generally need about 20 judges, minimum, to get the chance of a wrong placement down below 3-5%.

Upon what science do you base this claim?

Howard Rush · « **Reply #11 on:** March 13, 2015, 02:13:47 PM »

Quote from: phil c on March 13, 2015, 11:59:17 AM

Howard, you've got data from at least one set of NATS score sheets. Simply use that data and calculate the variance statistics for, say, the top twenty -- variance due to pilot, maneuver, judge. No need to guesstimate what kind of judging spread you've got.

It must be bad weather in Seattle if you've got time on your hands to do this kind of stuff.

Simply? I don't know how. You tried to lead me to something before, but either I didn't understand it or it wasn't useful. I forget which. I did do a normalized cross correlation analysis of World Champs judging, something that Bill Lee put me onto. I'll send it to you. The impetus for my doing the number-of-judges study above was your and Dr. Buffalano's mention of judge accuracy compared to pilot accuracy and the benefit of lots of judges. People have been using a maximum of six judges per circle at the Nats because: 1) that's all that fit on the circle without being too far off the upwind point, 2) it's hard to get a lot of judges to volunteer, and 3) six is the maximum that the program can handle.

Seattle weather has been excellent. My wife has been out of town for three weeks, though, so I've used the time to work on the Nats program and think about this stuff.

Tim Wescott · « **Reply #12 on:** March 13, 2015, 02:21:02 PM »

On the one hand, I appreciate all of the science that people are willing to apply to this problem.

On the other, if you're going to compete -- or even seriously spectate -- in a subjective event like this, I think you should just accept that anyone who's good enough to make it into the top ten or top five is absolutely awesome, and if you get to that point and they just draw names out of a hat for the winner, you should accept that it may be as accurate as any other method.

Howard, I did make an effort to solve this problem for you, without even knowing it: fifteen years ago, I tried to get my brother-in-law interested in and spun up on flying control line. On the plus side, he's an actuary (i.e., he calls the bets for insurance companies -- think "Jimmy the Greek"). On the minus side, he's almost totally mechanically inept, so he couldn't keep his interest up.

If you could get an actuary interested in this, then you'd get about the best statistical analysis possible.

Howard Rush · « **Reply #13 on:** March 13, 2015, 02:23:05 PM »

Quote from: Tim Wescott on March 13, 2015, 12:54:35 PM

I believe that part of the Nats judge selection process is designed to detect and reject judges that are inconsistent -- judging bias toward one pilot, unless it's shared by all the judges, would show up as inconsistency. One could also explicitly look for such bias.

Paul had a term in the Nats judge-evaluation formula to identify outliers and ding their scores. He took it out, I guess because it didn't look like favoritism is a problem.

Howard Rush · « **Reply #14 on:** March 13, 2015, 02:31:11 PM »

Quote from: Tim Wescott on March 13, 2015, 02:21:02 PM

If you could get an actuary interested in this, then you'd get about the best statistical analysis possible.

From time to time I have lunch with Dr. Ramesh, a highfalutin' statistics guy from Boeing's reliability group. I ask him about important stunt statistical matters, but he isn't that interested. He also wants to go to the wrong restaurant.

Randy Powell · « **Reply #15 on:** March 13, 2015, 02:41:09 PM »

I think you guys are really overthinking this. But what, you say, an engineer overthinking something. Nah, that never happens.

Tim Wescott · « **Reply #16 on:** March 13, 2015, 03:44:43 PM »

Engineers are not prone to overthinking. Everyone else is prone to underthinking.

'nuff said.

Howard Rush · « **Reply #17 on:** March 13, 2015, 03:46:53 PM »

Quote from: Randy Powell on March 13, 2015, 02:41:09 PM

I think you guys are really overthinking this. But what, you say, an engineer overthinking something. Nah, that never happens.

Next time you are in a metal tube in the stratosphere, consider the consequences of that vehicle having been designed by people only capable of underthinking.

Tim Wescott · « **Reply #18 on:** March 13, 2015, 03:58:39 PM »

'struth -- no one accuses the engineering staff of overthinking when there's a bunch of reporters and cameras standing around a smoking hole in the ground with little bits of plane and passenger strewn about.

John Leidle · « **Reply #19 on:** March 14, 2015, 12:33:26 PM »

The Engineering degrees are fine with me but I'm more interested if the judge actually knows when the maneuver starts & stops, 45 from 60 degrees, how many inverted laps are required & what laps are scored & what ever else the rule book has written. Playing favorites or dislikes has no place either... how accurate do we want to make this?
John

Tim Wescott · « **Reply #20 on:** March 14, 2015, 01:23:22 PM »

Quote from: John Leidle on March 14, 2015, 12:33:26 PM

The Engineering degrees are fine with me but I'm more interested if the judge actually knows when the maneuver starts & stops, 45 from 60 degrees, how many inverted laps are required & what laps are scored & what ever else the rule book has written. Playing favorites or dislikes has no place either... how accurate do we want to make this?

Engineering jokes aside, I think that Howard is addressing the event in the national Stunt community that comes right after the Walker Cup flyoffs -- it's the one where folks start to complain about how the judging is biased, about how so-and-so (who always seems to be the locutor's best friend, or the locutor himself) should have won, about how the AMA is just a bunch of selfish, quad-copter-flying jerks, etc. Having a reasoned, well thought out means of selecting judges, and an impartial process for sorting people out in qualifying, helps to defray the (apparently and unfortunately) inevitable grousing.

Personally I'm pleased to see people working hard to remove as much variation in the process as possible, even though I do think that as long as you have humans as administrators, judges and pilots that you will never, ever, get a perfect process.

Howard Rush · « **Reply #21 on:** March 14, 2015, 02:35:40 PM »

Since Tim showed me an easy way to do it, I calculated the probabilities of judges getting the outcome wrong if the actual flights are five points apart, in addition to the previous calculation of the actual flights being two points apart. These numbers are based on variances taken from the earlier Monte Carlo run. This pretty much guarantees that the guy who came in last place at last year's Nats Open finals deserved to come in last.
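A table like this takes only a few lines once the erf formula is in hand. The sketch below is not the calculation behind the posted numbers. It assumes every judge has the same standard deviation on a flight total, that both flights are scored by the same size panel with independent errors, and the value 7.75 is a made-up placeholder rather than a variance taken from the Monte Carlo run.

```python
import math

def p_wrong(spread, judge_flight_std, n_judges):
    """Chance an n-judge panel ranks two flights backwards."""
    # Each flight's panel average has variance sigma^2 / n; the difference doubles it.
    s_w = math.sqrt(2.0 * judge_flight_std ** 2 / n_judges)
    return 0.5 * (1.0 - math.erf(spread / (s_w * math.sqrt(2.0))))

JUDGE_FLIGHT_STD = 7.75  # hypothetical per-judge standard deviation of a flight total

for spread in (2, 5):
    row = ", ".join(
        f"{n}: {p_wrong(spread, JUDGE_FLIGHT_STD, n):.3f}" for n in range(1, 7)
    )
    print(f"{spread}-point spread -> {row}")
```

Plugging in a measured standard deviation for each probability function would give one row per case.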

peabody · « **Reply #22 on:** March 14, 2015, 03:09:23 PM »

What do the numbers on the left mean?

Howard Rush · « **Reply #23 on:** March 14, 2015, 03:21:49 PM »

Quote from: peabody on March 14, 2015, 03:09:23 PM

What do the numbers on the left mean?

The probability that the judges will pick the wrong guy as the winner when one guy actually flies a two-(or five-) point better flight. A coin toss would have a probability of .5.

John Leidle · « **Reply #24 on:** March 14, 2015, 08:25:40 PM »

Last place in Open last year..... that be me again.

phil c · « **Reply #25 on:** March 14, 2015, 09:32:21 PM »

Quote from: Tim Wescott on March 13, 2015, 12:54:35 PM

.....
Upon what science do you base this claim?...

Student's t test to evaluate the difference between two Gaussian population distributions -- all the judges' scores for each maneuver for a pilot. In the stunt case I've never seen a table that would even give a value for fewer than 5 samples.
I did a little comparison. Suppose you have 5 judges who are nearly perfect -- 4 score every maneuver the same within a 4 pt range, the fifth scores everything 1 pt higher. That gives the result that a 3 pt difference between scores would be 95% significant, a 1 out of 20 chance of being wrong. If you go to 20 judges you get down close to a 1.5 pt difference. You need to go to a phenomenal number of judges to get down near a 1 pt difference.

If you go to 20 judges, 18 of whom are absolutely perfect and score every maneuver exactly the same and 2 who track the same scores but vary +/- 1 pt randomly between maneuvers, you can get the confidence interval down to 1 pt. That won't happen in my grandkids' lifetime.

Howard Rush · « **Reply #26 on:** March 15, 2015, 12:28:11 AM »

Quote from: phil c on March 14, 2015, 09:32:21 PM

Suppose you have 5 judges who are nearly perfect -- 4 score every maneuver the same within a 4 pt range, the fifth scores everything 1 pt higher.

If the 5th guy is similarly accurate, but consistently scores every flier one point per maneuver more than he deserves, then those five judges will rank the fliers correctly.
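A quick way to see this: a constant offset from one judge adds the same amount to every pilot's average, so the order cannot change. The sketch below uses made-up flight totals for three hypothetical pilots, and `rank_pilots` is just a name invented for the example.

```python
def rank_pilots(score_sheets):
    """Rank pilots by the average of the judges' totals, best first."""
    pilots = score_sheets[0].keys()
    averages = {
        p: sum(sheet[p] for sheet in score_sheets) / len(score_sheets) for p in pilots
    }
    return sorted(averages, key=averages.get, reverse=True)

# Hypothetical flight totals from five judges for three pilots.
sheets = [
    {"A": 560, "B": 557, "C": 549},
    {"A": 558, "B": 559, "C": 551},
    {"A": 562, "B": 556, "C": 550},
    {"A": 559, "B": 558, "C": 548},
    {"A": 561, "B": 557, "C": 552},
]

# Same sheets, but the fifth judge gives every pilot 15 more (1 point x 15 maneuvers).
biased = sheets[:4] + [{p: s + 15 for p, s in sheets[4].items()}]

print(rank_pilots(sheets))
print(rank_pilots(biased))
```

This only covers a judge who is high by the same amount for everyone. A judge who is high for some pilots and not others is a different problem.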

Brett Buck · « **Reply #27 on:** March 15, 2015, 01:23:48 AM »

Quote from: Howard Rush on March 12, 2015, 05:35:57 PM

I've been trying to figure out some stuff about stunt scoring. One question I had was how increasing the number of judges affects accuracy in placing contestants. I figured that if one could represent each judge's accuracy by a probability distribution, one could calculate the effect of number of judges judging a flight. I don't remember enough probability to figure this out analytically, so I decided to brute-force it with a Monte Carlo simulation.

One problem is picking a probability distribution function to represent how accurately a judge scores a maneuver. I assumed that a judge's performance could be represented by a simple function, which may not be too pure an assumption. I didn't consider the high-scoring-judge-low-scoring-judge issue, which doesn't matter (to be proven separately), nor favoritism, nor other judging vagaries.

I think this is not a good assumption. I don't think it changes the answer but I don't buy the premise.

My objection to the premise is that there's no reason to believe that the differences between the judges' scores or rankings are the result of random errors, of *any* distribution. In fact, I think a far bigger factor is the tendency to weight errors differently from judge to judge. Those aren't errors in any sense of the word I understand. They are preferences and a matter of weighting, not skill or "accuracy".

Assuming I am correct, then it STILL makes sense to use as many *trained* judges as possible (within reason), because the chances of them having a reasonably even collective weighting for all the errors are increased. Ideally you would also somehow select for a range of "weighting preferences" but most of the attempts to determine that have been more-or-less failures or wildly selective. As far as I know, it is not mathematically quantifiable.

In the time it has been done, the Open Final has used as many as 7 judges, and (mistakenly, in my opinion) tossed high and low. Too many, and it becomes unwieldy because there isn't enough room for them to view the maneuvers from close enough to the right place.

Brett

Howard Rush · « **Reply #28 on:** March 15, 2015, 01:44:53 PM »

Quote from: Brett Buck on March 15, 2015, 01:23:48 AM

I think this is not a good assumption. I don't think it changes the answer but I don't buy the premise.

My objection to the premise is that there's no reason to believe that the differences between the judges' scores or rankings are the result of random errors, of *any* distribution. In fact, I think a far bigger factor is the tendency to weight errors differently from judge to judge. Those aren't errors in any sense of the word I understand. They are preferences and a matter of weighting, not skill or "accuracy".

Assuming I am correct, then it STILL makes sense to use as many *trained* judges as possible (within reason), because the chances of them having a reasonably even collective weighting for all the errors are increased. Ideally you would also somehow select for a range of "weighting preferences" but most of the attempts to determine that have been more-or-less failures or wildly selective. As far as I know, it is not mathematically quantifiable.

If you hypothesize two guys being two points apart for the exercise of seeing the effect of number of judges, wouldn't the preferences shake out into a similar distribution amenable to analysis? Maybe there would be negative correlation between a corner-happy judge and one who emphasizes shapes. Beats me.

Quote from: Brett Buck on March 15, 2015, 01:23:48 AM

In the time it has been done, the Open Final has used as many as 7 judges, and (mistakenly, in my opinion) tossed high and low.

I don't think that's a matter of opinion. Tossing out high and low scores is a bad idea.
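One way to check this under the thread's own assumptions is to simulate it. The sketch below assumes seven judges whose errors are independent Gaussian noise of equal size, which is exactly the premise being argued about, so it says nothing about panels with a genuinely rogue judge. It compares the variance of the plain average with the variance of the average after dropping the highest and lowest scores.

```python
import random
import statistics

def compare_estimators(n_judges=7, noise_std=1.0, trials=20000, seed=3):
    """Variance of the plain average vs. the average with high and low tossed."""
    rng = random.Random(seed)
    plain, trimmed = [], []
    for _ in range(trials):
        scores = sorted(rng.gauss(0.0, noise_std) for _ in range(n_judges))
        plain.append(sum(scores) / n_judges)
        trimmed.append(sum(scores[1:-1]) / (n_judges - 2))
    return statistics.pvariance(plain), statistics.pvariance(trimmed)

var_plain, var_trimmed = compare_estimators()
print(f"plain average: {var_plain:.4f}, high/low tossed: {var_trimmed:.4f}")
```

With Gaussian noise the plain average should come out with the smaller variance, since tossing two scores throws away information without removing any bias.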

Tim Wescott · « **Reply #29 on:** March 15, 2015, 03:53:57 PM »

Quote from: Brett Buck on March 15, 2015, 01:23:48 AM

My objection to the premise is that there's no reason to believe that the differences between the judges' scores or rankings are the result of random errors, of *any* distribution. In fact, I think a far bigger factor is the tendency to weight errors differently from judge to judge. Those aren't errors in any sense of the word I understand. They are preferences and a matter of weighting, not skill or "accuracy".

It is not uncommon in signal processing (and by extension, control system design) to treat deterministic but intractable deviations from ideal as random noise ("errors" in your parlance). Consider quantization noise, which takes a device with a perfectly well defined but nonlinear input/output relationship, and treats it as a gizmo that adds "random noise" to a perfectly linear input/output relationship. If you define a deviation from the ideal (or just a deviation from the consensus) as "error", then you can go forth from there.

Quote from: Howard Rush on March 15, 2015, 01:44:53 PM

If you hypothesize two guys being two points apart for the exercise of seeing the effect of number of judges, wouldn't the preferences shake out into a similar distribution amenable to analysis? Maybe there would be negative correlation between a corner-happy judge and one who emphasizes shapes. Beats me.

I've seen a lot of successes with treating something that is theoretically deterministic (ah, if only I knew how Judge A weighed a nice sharp corner vs. a bottom that's off by a few inches!) but practically intractable as a random process. So from that perspective I think that starting with the assumption that a judge's score is some "ideal" score with some random number added on to it is a valid way to do a useful analysis, even if the assumption itself is incorrect.

Tim Wescott · « **Reply #30 on:** March 15, 2015, 04:06:04 PM »

This is slightly off topic, but just as a thought experiment, let's posit the following:

Say that at the Spring Tune Up, the judging happens to be perfect, whatever that is. Meaning that the two judges judge each maneuver identically to each other and fairly to the pilot. Whatever the "ideal" score that the pilot should have been awarded for each maneuver, each judge rounds it to the nearest whole number and writes it down.

Since we're deep into fantasy land, let's assume that I've improved about three years' worth by then, and that I'm perfectly consistent. I show up with a perfect airplane (remember, I'm dreaming) and get 20 appearance points. Then I go out and I fly a pattern, each maneuver of which is "ideally" worth 35.51 points.

I get 585 points (36 * 15 + 20 + 25) for a flight that "ideally" was worth 577.65.

Howard flies too. Now we've all noticed that Howard is a wee bit less consistent than other top flyers. So let's say that Howard does 10 maneuvers that are "ideally" worth 35.49 points, and 5 maneuvers that are "ideally" worth 36.49 points.

So Howard gets 575 points for a flight that "ideally" was worth 582.35.

I win -- with an inferior flight and "perfect" judging.

So maybe having "imperfect" judging isn't such a bad thing...
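The arithmetic in this thought experiment is easy to replay. The sketch below assumes both pilots get the same 20 appearance points and the same extra 25 points that appear in the sum above, which is what the posted totals imply. The function and constant names are invented for the example.

```python
APPEARANCE = 20
EXTRA_POINTS = 25  # the "+ 25" in the post's sum, assumed equal for both pilots

def flight_totals(ideal_maneuver_scores):
    """Return (awarded, ideal) totals when each maneuver is rounded to a whole number."""
    awarded = sum(round(s) for s in ideal_maneuver_scores) + APPEARANCE + EXTRA_POINTS
    ideal = sum(ideal_maneuver_scores) + APPEARANCE + EXTRA_POINTS
    return awarded, round(ideal, 2)

tim = flight_totals([35.51] * 15)
howard = flight_totals([35.49] * 10 + [36.49] * 5)
print("Tim:", tim)        # awarded 585, ideal 577.65
print("Howard:", howard)  # awarded 575, ideal 582.35
```

The ten-point swing comes entirely from rounding: fifteen maneuvers rounded up by 0.49 each against fifteen rounded down by 0.49 each.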

Tim Wescott · « **Reply #31 on:** March 15, 2015, 06:34:28 PM »

Quote from: phil c on March 14, 2015, 09:32:21 PM

Student's t test to evaluate the difference between two Gaussian population distributions -- all the judges' scores for each maneuver for a pilot. In the stunt case I've never seen a table that would even give a value for fewer than 5 samples.
I did a little comparison. Suppose you have 5 judges who are nearly perfect -- 4 score every maneuver the same within a 4 pt range, the fifth scores everything 1 pt higher. That gives the result that a 3 pt difference between scores would be 95% significant, a 1 out of 20 chance of being wrong. If you go to 20 judges you get down close to a 1.5 pt difference. You need to go to a phenomenal number of judges to get down near a 1 pt difference.

If you go to 20 judges, 18 of whom are absolutely perfect and score every maneuver exactly the same and 2 who track the same scores but vary +/- 1 pt randomly between maneuvers, you can get the confidence interval down to 1 pt. That won't happen in my grandkids' lifetime.

The t distribution does not apply in this case -- it is for evaluating the validity of conclusions that are based on data drawn from a single random process with a fixed mean and variance.

In this case, we are positing judges whose deviation from ideal is more accurately modeled as

y = ax + b + n

where y is the score, x is the "ideal" score, a is a number close to, but not necessarily equal to one, b is some overall bias inherent in that particular judge, and n is random noise with a deviation that is peculiar to that judge. Neither the fact that the "gain" varies from judge to judge, nor the fact that the judge's variation from ideal varies from judge to judge is taken into account in Student's t test.
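To make the model concrete, here is a small sketch that generates scores from y = ax + b + n for a few invented judges. The judge names and their gain, bias, and noise values are purely hypothetical, and the noise is drawn as Gaussian only for convenience.

```python
import random

def judged_score(ideal, gain, bias, noise_std, rng):
    """One judge's score for one maneuver under y = a*x + b + n."""
    return gain * ideal + bias + rng.gauss(0.0, noise_std)

rng = random.Random(0)
# Hypothetical judges: (gain a, bias b, noise standard deviation).
judges = {
    "steady": (1.00, 0.0, 0.5),
    "generous": (1.00, 1.0, 0.5),
    "stretchy": (1.10, -3.0, 1.5),
}

for name, (a, b, s) in judges.items():
    scores = [judged_score(x, a, b, s, rng) for x in (30.0, 35.0, 38.0)]
    print(name, [round(y, 1) for y in scores])
```

A judge with a gain above one spreads good and bad maneuvers further apart than the others do, which is a different thing from being noisy or being generous across the board.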

phil c · « **Reply #32 on:** March 17, 2015, 01:13:38 PM »

When you're dealing with people scoring something you never get a Gaussian or any other kind of standard deviation model in scores. Regardless of the scoring method used, judges tend to center on certain spots on a scale, whether it's 10-40, 0-10, or marking a straight line. For 10-40 they'd first tend toward whole fractions: 19, 20, 21; 29, 30, 31. They avoid the ends, so 39 or 11 would often be more common. In between there would tend to be clumps around 5 point divisions such as 24, 25, 26 or 29, 30, 31. With a 10pt scale as in F2B, clumps occur around the whole numbers and .5's. With enough judges, maybe even clumps around .25's.

If you really wanted to, especially at a large contest like the Worlds or the NATS, you could calculate the types of deviations that actually occurred across many variables- pilot, maneuver, judge, flight number, even wind direction, sun angle, as much trashy data as you wanted to gather.

Working with this kind of data for many years, we found (and this has been studied too) that a Gaussian distribution works OK. In an analysis of variance across the main variables, in this case pilot, maneuver, judge, maybe first round or second round, real differences will show up. If they aren't pilots, maneuvers, and judges in that order, then Houston, we have a problem. The biggest problem in industry is that the folks in charge don't want to see their favorite theory or product dumped and ask for another round of experiments, if they can afford it and have the time. This whole area of testing psychology is why so much educational research, nutritional research, climate research, and even drug research is so problematical. The data is very fuzzy, the methods used are often untried, the problems poorly defined, and the testing poorly structured.

If you are comfortable lumping together individual judging accuracy, individual judges' varying criteria, lack of concentration, getting tired, along with the pilot's jitters, the 3600 or so judgements made in a NATS top twenty are more than enough to figure out what is going on.
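As a rough idea of what a first pass at that could look like, the sketch below computes how much the average score varies across pilots, across maneuvers, and across judges. It is not a full analysis of variance with significance tests, and it runs on synthetic data only: the 20 pilots, 15 maneuvers, 6 judges, single round, and all the effect sizes are made up for illustration.

```python
import random
import statistics
from collections import defaultdict

def factor_variances(records):
    """Variance of the mean score at each level of pilot, maneuver and judge."""
    result = {}
    for index, factor in enumerate(("pilot", "maneuver", "judge")):
        groups = defaultdict(list)
        for record in records:
            groups[record[index]].append(record[3])
        level_means = [statistics.mean(v) for v in groups.values()]
        result[factor] = statistics.pvariance(level_means)
    return result

# Synthetic data only: records are (pilot, maneuver, judge, score).
rng = random.Random(7)
pilot_skill = [rng.gauss(0, 2.0) for _ in range(20)]
maneuver_difficulty = [rng.gauss(0, 1.0) for _ in range(15)]
judge_bias = [rng.gauss(0, 0.5) for _ in range(6)]
records = [
    (p, m, j, 33 + pilot_skill[p] + maneuver_difficulty[m] + judge_bias[j] + rng.gauss(0, 1.0))
    for p in range(20) for m in range(15) for j in range(6)
]
print(factor_variances(records))
```

With real score sheets the `records` list would be read from the tabulation data instead of generated.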

I can pretty much guarantee any hard core PA flyer will not like the results.

Howard Rush · « **Reply #33 on:** March 18, 2015, 01:38:47 AM »

Quote from: John Leidle on March 14, 2015, 08:25:40 PM

Last place in Open last year..... that be me again.

I was talking about 5th in the top 5. It was a good solid 5th out of 5.