I began thinking about this again for two reasons. One, Adrian Peterson just came within 9 yards of Eric Dickerson's season rushing record. With 348 rushes for Peterson and 379 for Dickerson, and treating the rounding error on each play as independent, the combined rounding errors have a standard deviation of 7.8 yards. That leaves about a 12% chance that the 9-yard difference is entirely due to rounding errors.
The other reason is that Brian Burke pointed out in the comments of the original article that the rounding errors of plays in the NFL are not independent. The total yardage gain for each drive has to round off to the correct figure. From Brian's comment:
"One other way to state this is that if a team has 2 plays in a row, and one goes for 4.5 yards but is scored as 4, and the next goes for 5.5 yds, it can't be scored as 5. It must be scored as a 6 yd gain because the ball is very clearly 10 yds further down field, not 9."
I wanted to try to account for this constraint and see how much difference it would make.
Note: the following is mostly dry and math-related, so if you want to skip it, I estimate the chance of rounding errors covering the 9 yard difference between Dickerson and Peterson at about 14%.
In the previous article, we started with the assumption that the rounding error for a single play followed a uniform distribution from -.5 to .5 yards. While this may not be perfectly true, it is probably close enough to work with. We saw that from there, the combined rounding error for two independent plays forms a triangular distribution ranging from -1 to 1 yards.
[Figure: Rounding Error for One Play]
[Figure: Rounding Error for Two Plays]
With Burke's restriction, however, the combined rounding error for two plays has the same error distribution as for one play, the uniform distribution from -.5 to .5 yards.
This has an interesting effect on the second play of the drive. While the sum of all rounding errors for the drive has to follow the same uniform distribution as the first play alone, the error for the second play alone no longer follows the same uniform distribution.
Let's say that the first play goes for 4.5 yards and gets scored as a five yard gain. The next play goes for 4.8 yards, so the total for the drive is 9.3 yards. If the total yardage credited to the two plays has to be 9 yards, that means the second play must be scored as a 4 yard gain. That's a rounding error of -.8 yards, outside of the original uniform distribution. In fact, the rounding error for the second play alone follows the same triangular distribution as the error for two independent plays.
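The widening of the second play's error can be checked with a quick simulation. This is only a sketch under assumptions of my choosing: the scorer rounds the drive's running total to the nearest yard and credits each play with the change in that rounded total, and each gain is drawn uniformly between 0 and 10 yards (an illustrative choice, not real play data). The names `round_half_up`, `play_errors`, and `variance` are hypothetical helpers.

```python
import math
import random

def round_half_up(x):
    """Round to the nearest integer, with exact halves rounding up."""
    return math.floor(x + 0.5)

def play_errors(gains):
    """Per-play rounding errors when the credited yards must add up to
    the rounded running total of the drive."""
    errors = []
    true_total = 0.0
    credited_total = 0
    for gain in gains:
        true_total += gain
        new_credited_total = round_half_up(true_total)
        credited = new_credited_total - credited_total
        errors.append(credited - gain)  # credited minus actual yardage
        credited_total = new_credited_total
    return errors

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
# Assumed gain distribution: uniform on 0 to 10 yards, for illustration only.
sims = [play_errors([rng.uniform(0, 10), rng.uniform(0, 10)])
        for _ in range(100_000)]
first = [e[0] for e in sims]
second = [e[1] for e in sims]

print(variance(first))               # should be close to 1/12
print(variance(second))              # should be close to 1/6
print(max(abs(e) for e in second))   # approaches 1 yard
```

Calling `play_errors([4.5, 4.8])` reproduces the example above: errors of .5 and -.8 yards.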
*****
The rounding error for any single play depends on two things: the total rounding error of the drive up to that point, and the precise yardage gain of the play. If the total rounding error before the play is -.5 yards and the gain is 1.3 yards, that means you have to round up to 2 yards, for an error of .7 yards, to keep the total error within the -.5 to .5 range (-.5 + .7 = .2, whereas rounding down would give -.5 - .3 = -.8). Together, those two factors determine the rounding error for every single play.
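As a minimal sketch of that rule, here is a small function that takes the drive's total rounding error so far and the precise gain, and returns the credited yardage. The name `score_play` and the convention that error means credited minus actual yardage are my own assumptions for illustration.

```python
import math

def score_play(prior_error, gain):
    """Return (credited_yards, play_error, new_total_error).

    prior_error is credited minus actual yardage for the drive so far.
    The credited gain is the whole number of yards that keeps the
    drive's total error within half a yard of zero.
    """
    credited = math.floor(gain - prior_error + 0.5)
    play_error = credited - gain
    return credited, play_error, prior_error + play_error

credited, play_error, total_error = score_play(-0.5, 1.3)
print(credited, round(play_error, 2), round(total_error, 2))  # 2 0.7 0.2
```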
As a result, each individual play after the first has the same triangular error distribution as the second play. They all start off with the same uniform distribution for the total rounding error of the drive, so there is nothing that would change the distribution of errors for any single play the more plays you add before it (plays that end in a touchdown are an exception, because the only rounding error will be from the starting point).
In the previous article, we found the distribution for the total rounding error of a series of plays by adding the variances of the individual distributions. As Brian pointed out, that only works if the distributions are independent of each other, which they are clearly not if the total error distribution for the drive never grows when we add plays. Consecutive plays are highly correlated, to the point that any number of consecutive plays adds no variance to the total error distribution.
What about non-consecutive plays, though? We know that the total error for the team's drive can't exceed .5 yards, but the error for subsets of the drive can (for example, any single play except the first can have an error of more than .5 yards). What if we want to know the combined error for the first, third, and fifth plays?
If these errors were independent, we would simply add the variances for the individual plays, which are 1/12 for the first play, 1/6 for the third play, and 1/6 for the fifth play (1/12 is the variance of the uniform distribution, 1/6 is the variance of the triangular distribution). That gives a total variance of 5/12. Now the question is how much of that variance is reduced by correlation between the individual errors.
Intuitively, we would think there should be some correlation. If the first play has a negative rounding error, and the total rounding error after three plays is as likely to be positive as negative, then it stands to reason that the second and third plays are more likely to have a positive rounding error than a negative rounding error.
That is true of the second play. It is not, however, true of the third play. The reason is that at the start of every play after the first, the total rounding error is going to follow the same uniform distribution. Whether the first play has a rounding error of -.5 or 0 or .5, the total error distribution after the second play is going to be exactly the same. All of the correlation that goes into re-centering the total error distribution at zero is absorbed by the second play alone.
Put another way: If a play starts just short of a hash mark, it is no more or less likely to end just short of another hash mark than it is to end just past another hash mark. This is the nature of the uniform distribution of errors. A negative rounding error after one play makes a negative cumulative rounding error after the following play no more likely than a positive one.
So while the errors for consecutive plays are correlated, the errors for non-consecutive plays are not. You can simply add the variances as we did in the previous article.
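A simulation along these lines can illustrate the claim. It is a sketch, not the simulation used for this article. It assumes five-play drives with each gain drawn uniformly between 0 and 10 yards, and it assumes each play is credited with the change in the rounded running total. The helper names are hypothetical.

```python
import math
import random
from itertools import accumulate

def play_errors(gains):
    """Per-play rounding errors when credited yards must match the
    rounded running total of the drive."""
    totals = list(accumulate(gains))
    # Total drive error after each play: rounded total minus true total.
    total_errors = [math.floor(t + 0.5) - t for t in totals]
    # Each play's error is the change in the total drive error.
    return [after - before
            for before, after in zip([0.0] + total_errors, total_errors)]

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def correlation(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / math.sqrt(variance(xs) * variance(ys))

rng = random.Random(1)
# Assumption for illustration: five plays per drive, gains uniform on 0 to 10.
drives = [play_errors([rng.uniform(0, 10) for _ in range(5)])
          for _ in range(100_000)]
e1, e2, e3, e4, e5 = (list(col) for col in zip(*drives))

corr_12 = correlation(e1, e2)   # consecutive plays
corr_13 = correlation(e1, e3)   # non-consecutive plays
var_135 = variance([a + b + c for a, b, c in zip(e1, e3, e5)])
var_all = variance([sum(d) for d in drives])
print(corr_12, corr_13, var_135, var_all)
```

Under these assumptions, `corr_12` should come out clearly negative and `corr_13` near zero. `var_135` should land near 5/12 (about .417), and `var_all` near 1/12 (about .083).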
Let's return to Adrian Peterson. He ran 348 times this year, and we want to know the total variance for the rounding error distribution for those 348 plays. We can use the following rules to find the total variance (each of these rules has been confirmed by simulation):
-the first play of a drive adds 1/12 to the total variance
-any play after the first play of the drive adds 2/12 to the total variance, assuming Peterson did not also rush on the play before
-any play immediately following another Peterson rush adds 0 variance, so that any string of consecutive plays adds only the variance of the first play (i.e. if the string started on the first play, the whole thing adds 1/12, otherwise the whole thing adds 2/12)
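The three rules above can be sketched as a short function. The input format is my own assumption, not the layout of any real play-by-play database: each drive is a list of booleans, one per play, that is True when the runner we are tracking carried the ball. The name `rushing_variance` is hypothetical.

```python
from fractions import Fraction

def rushing_variance(drive):
    """Variance of the rounding error on one runner's carries in a drive.

    drive: list of booleans, one per play in order, True when the
    tracked runner carried on that play (hypothetical input format).
    """
    total = Fraction(0)
    for i, carried in enumerate(drive):
        if not carried:
            continue
        if i == 0:
            total += Fraction(1, 12)   # first play of the drive
        elif not drive[i - 1]:
            total += Fraction(2, 12)   # carry that follows some other play
        # A carry right after the runner's own carry adds nothing.
    return total

example_drive = [True, True, False, True, False, False, True, True]
print(rushing_variance(example_drive))  # 5/12
```

Summing this over every drive in a season gives the total variance for that runner's rushes under these three rules.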
Using the same play-by-play (PBP) database published by Brian Burke, we can sort each of Peterson's rushes under one of these three rules. Doing so gives a total variance for the rounding error of Peterson's rushes of 410/12, rather than the 348/12 we would get assuming each play was independent. It may seem counterintuitive that this restriction can increase the rounding error because it introduces correlation between errors, but remember that it also widens the distribution of errors on each play, and the correlation between errors only holds for consecutive plays.
I can't repeat this analysis on Dickerson's season because I don't have PBP data, but let's assume that Dickerson's rounding error distribution widened similarly to Peterson's. If Peterson's rounding error has a variance of 410/12 and Dickerson's is something like 440/12, that means there is about a 14% chance that rounding errors cover the entire 9 yard difference in their credited totals.
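For anyone who wants to check the arithmetic, here is a rough sketch of the calculation. It assumes the difference between the two season-long rounding errors is approximately normal, which seems reasonable for sums over hundreds of plays, and it treats the two seasons as independent. The function name `prob_rounding_covers` is hypothetical.

```python
import math

def prob_rounding_covers(gap_yards, variance_a, variance_b):
    """Chance that the difference in rounding errors exceeds gap_yards.

    Assumes the difference is approximately normal with mean zero and
    variance equal to the sum of the two season variances.
    """
    sd = math.sqrt(variance_a + variance_b)
    # One-sided upper tail of the normal distribution.
    return 0.5 * math.erfc(gap_yards / (sd * math.sqrt(2)))

# Independent plays: variance of 1/12 per rush.
independent = prob_rounding_covers(9, 348 / 12, 379 / 12)
# With the drive constraint: 410/12 for Peterson, an assumed 440/12 for Dickerson.
constrained = prob_rounding_covers(9, 410 / 12, 440 / 12)
print(f"{independent:.3f} {constrained:.3f}")
```

The first figure should come out at roughly 12% and the second at roughly 14%.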
-CAVEATS-
This analysis incorporates only the restriction that the total error for each drive has to remain within the -.5 to .5 yard range at all times. There are likely additional restrictions, for example:
-Burke's comment also mentions that consecutive non-scoring drives will be constrained with each other, which would mean the first play of some drives could also follow the triangular distribution instead of the uniform distribution
-The impact of touchdowns reducing the variance of scoring plays (because the end point is known exactly) is not considered in the 410/12 variance figure. Changing the variance on Peterson's 12 touchdown rushes would still round the final estimate to 14%, though.
-There could also be a first down restriction, i.e. that a series of downs can't be rounded up to 10 yards no matter what, so that 9.9 would have to be rounded down to 9 if the scorers can't credit 10 yards without reaching the first down marker (I don't know if this is true, but I am guessing it might be).
Also, as with the previous article, this is only addressing rounding error, not spotting errors on the part of the officials or misjudgments about where the ball is by the scorer.
My best guess would be that the actual chance that Peterson out-gained Dickerson, once you incorporate the spotting errors, etc., is probably closer to 20% or so. The spotting errors are probably the biggest additional factor, and if they have a standard deviation of about 6 inches to a foot or so and are independent of the rounding errors, that would give something like a 17-22% chance. That's purely guesswork on the magnitude of spotting errors, though.