the **spread** the (data) science of sports

Sun 09 February 2014

Making the best use of all of the available data

Now that we've hit the offseason, I thought I'd publish some of my earlier football analytics work. The following was a spec project demonstrating the use of multilevel regression models to forecast quarterback performance. It's a very simple model and could be an excellent jumping-off point for someone to take up as a project. The model was built during the break between the 2012 and 2013 seasons, so now I have a chance to go back and look at how it performed in projecting 2013 numbers. You can find all of the code on GitHub.

Projecting the passing productivity for quarterbacks is particularly important for an NFL team, as quarterbacks are frequently the highest-paid or second-highest-paid player on the roster. Important decision points include the draft and the end of the third year of the player's contract, when a decision is usually made about exercising a fifth-year contract option.

Lisk 2008, Stuart 2013, and Burke 2011 each approach the question in separate ways, but all are focused on the question of quarterback age curves. Each tries to identify the age at which quarterbacks are likely to peak, coming to slightly different conclusions.

While identifying the average peaking age is useful for GMs, the richness of individual player histories is not represented when condensing the data like this. I instead approached this as a question appropriate for a statistical technique called "multilevel modeling." This technique uses all of the available information about each player, takes into account that performance is correlated year-upon-year, and creates both an overall estimate for the "average QB" as well as an individualized model for each QB.

Many existing approaches make arbitrary decisions about which players to
include and exclude in their models along the lines of "I only selected
players who played more than four seasons, started at least 12 games,
and had at least 100 attempts in a season." Doing this biases your model
because you are doing what is known as **selecting on the dependent
variable**. You are trying to model some measure of success and you are
deciding who is in your sample by the values on this variable. If you
only model successful people, you will get a biased view of what drives
success.

Think about it this way. You're trying to build a model that will help
you forecast how players will do because you're deciding whether or not
to spend millions of dollars on them. You hope that this model will
generalize and you will be able to use it in many decisions. You need
the best, most unbiased information possible. If you let information
from outside of the model leak into the model, it will decrease the
ability of the model to generalize.
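
To make the selection problem concrete, here is a minimal simulation sketch in Python (the original analysis was done in R). Everything in it is made up for illustration: the league-wide mean of 4.5, the spread of 1.5, and the cutoff of 5.0 are assumptions, and filtering directly on the outcome is a simplification of filtering on attempts or games started.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical league: each quarterback season gets an ANY/A-like score
# drawn from a normal distribution (mean 4.5 and spread 1.5 are made up).
all_seasons = [random.gauss(4.5, 1.5) for _ in range(5000)]

# "Selecting on the dependent variable": keep only seasons that already
# look successful, here anything above an arbitrary cutoff.
CUTOFF = 5.0
selected = [score for score in all_seasons if score > CUTOFF]

full_mean = statistics.mean(all_seasons)
selected_mean = statistics.mean(selected)

print(f"mean of all seasons:      {full_mean:.2f}")
print(f"mean of selected seasons: {selected_mean:.2f}")
print(f"share of seasons kept:    {len(selected) / len(all_seasons):.0%}")
```

The selected sample looks far better than the league as a whole, so any "typical" curve fit to it would describe survivors rather than the players you are actually deciding on.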

Multilevel modeling allows you to include all of your observations. The model has two components: a "population-level" component, which models a typical player in the absence of any specific information, and player-specific adjustments (effects) that move that player's projections up or down. This is useful for two reasons:

- First, there is no need to disregard players with very short careers. Their impact on the overall model is weighted less than that of players with longer careers, but they still contribute information to the model of the "average" player.
- Second, it is interactive and allows for projections for both a) unobserved years for existing players, and b) unobserved years for players not in the model. Obviously, uncertainty for the latter projections is higher than for existing players, but it is a useful starting point.
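
The way career length changes the weighting can be sketched with the textbook partial-pooling formula for a random-intercept model. This is a simplified illustration in Python, not the model fit in this post: the function name `shrunken_mean`, the population mean, and both variance values are hypothetical.

```python
def shrunken_mean(player_scores, population_mean, between_var, within_var):
    """Partial-pooling estimate of one player's mean performance.

    The player's own average gets weight
        between_var / (between_var + within_var / n)
    so players with few seasons are pulled toward the population mean.
    """
    n = len(player_scores)
    if n == 0:
        # No data on the player: fall back to the "average QB".
        return population_mean
    player_mean = sum(player_scores) / n
    weight = between_var / (between_var + within_var / n)
    return weight * player_mean + (1 - weight) * population_mean


# Hypothetical numbers: two players who both average 7.0,
# one with a single season and one with eight seasons.
one_season = shrunken_mean([7.0], population_mean=4.5,
                           between_var=1.0, within_var=2.0)
eight_seasons = shrunken_mean([7.0] * 8, population_mean=4.5,
                              between_var=1.0, within_var=2.0)
print(round(one_season, 2), round(eight_seasons, 2))
```

The one-season player stays much closer to the population mean than the eight-season player, even though their raw averages are identical.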

The interesting thing about the problem of projecting quarterback performance is how much it benefits from a standard approach to modeling nested and correlated data, rather than trying to reinvent the wheel. Using an overall aging curve for a random player is less useful from a management perspective, as it is almost never the case that a player is signed or re-signed without any existing information. Rather than using a general rule-of-thumb about the player's contract, it makes much more sense to use all of the available information for both the player in question and all other players in that position.

Using the `XML` package for the R statistical language, I scraped names
and draft dates for all players drafted at quarterback from 1980 - 2012
(the "modern era") from Pro Football Reference. I then scraped the
passing information for each player for each year (example using 2012)
and matched the draft information to passing statistics.

Following previous analyses, I used adjusted net yards per attempt
(**ANY/A**) as the measurable outcome of quarterback performance. The
formula for computing this is:

(pass yards + 20*(pass TD) - 45*(interceptions thrown) - sack yards)/(passing attempts + sacks)
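
As a quick sketch, the formula translates directly into a small Python function. The function name `any_a` and the example stat line are hypothetical and only there to show the arithmetic.

```python
def any_a(pass_yards, pass_td, interceptions, sack_yards, attempts, sacks):
    """Adjusted net yards per attempt, following the formula above."""
    dropbacks = attempts + sacks
    if dropbacks == 0:
        raise ValueError("ANY/A is undefined with no attempts and no sacks")
    adjusted_yards = (pass_yards + 20 * pass_td
                      - 45 * interceptions - sack_yards)
    return adjusted_yards / dropbacks


# Made-up season line: 4000 yards, 30 TD, 10 INT,
# 200 sack yards lost, 550 attempts, 30 sacks.
example = any_a(4000, 30, 10, 200, 550, 30)
print(round(example, 2))
```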

I then estimated a multilevel regression model using the player's age at each measurement of ANY/A, the player's age squared (to capture the non-linear rate at which age affects performance), and a variable that was set equal to 1 if the player was a starter (generously defined as starting more than 8 games in a season) and 0 otherwise.
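
The three predictors are simple to build for each player-season. Here is a minimal Python sketch; the function name `season_features` and the dictionary keys are illustrative, not names taken from the original R code.

```python
def season_features(age, games_started, starter_threshold=8):
    """Build the predictors described above for one player-season."""
    return {
        "age": age,
        "age_sq": age ** 2,  # lets the age effect bend instead of staying linear
        # Starter is "generously defined": more than 8 starts in the season.
        "starter": 1 if games_started > starter_threshold else 0,
    }


full_season = season_features(age=24, games_started=16)
half_season = season_features(age=24, games_started=8)
print(full_season, half_season)
```

Note that exactly 8 starts does not count as a starter under the "more than 8 games" rule.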

The model produces an intercept for the overall model, an intercept for each player, an overall slope for each variable, and a slope for each variable for each player. The intercept can be thought of as the starting estimate of ANY/A in the absence of any information about age or player. The slopes can be thought of as the effect of each additional year of age (or year squared) on ANY/A plus a fixed amount of influence for being a starter.

Because the model is interactive, it is not as easy to report the findings in a short summary of "quarterback age peaks." However, here are the so-called "fixed effects", the coefficients that are stable from player to player and season to season. Note that I don't really like the language of "fixed" and "random" effects, as they tend to vary in definition from discipline to discipline.

~~~~ {tabindex="0"}
Fixed effects:
            Estimate Std. Error t value
(Intercept) -9.13057    5.82462  -1.568
age          0.82139    0.39566   2.076
I(age^2)    -0.01299    0.00654  -1.986
starter      1.81190    0.22154   8.179
~~~~
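
To see what these coefficients imply on their own, here is a small Python sketch that evaluates the population-level curve and finds where the quadratic in age tops out. It ignores every player-specific adjustment, so treat it as a description of the "average QB" only; the names `FIXED`, `population_any_a`, and `peak_age` are made up for this example.

```python
# Fixed-effect estimates copied from the table above.
FIXED = {
    "intercept": -9.13057,
    "age": 0.82139,
    "age_sq": -0.01299,
    "starter": 1.81190,
}


def population_any_a(age, starter):
    """Population-level ANY/A projection (no player-specific adjustment)."""
    return (FIXED["intercept"]
            + FIXED["age"] * age
            + FIXED["age_sq"] * age ** 2
            + FIXED["starter"] * starter)


def peak_age():
    """Age at which the population-level quadratic reaches its maximum."""
    # Vertex of a*x^2 + b*x + c is at x = -b / (2a).
    return -FIXED["age"] / (2 * FIXED["age_sq"])


print(round(peak_age(), 1))
print(round(population_any_a(25, starter=1), 2))
print(round(population_any_a(25, starter=0), 2))
```

The peak of this curve lands in the early thirties, but the standard errors on the age terms are large and each player's own coefficients shift the curve, so it should not be read as a single "quarterback peak age."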

Then, each player has individual coefficients (called "random effects")
that add or subtract from these coefficients to adjust to the
information given about each player.

Because there are coefficients for each player, I can't really just spit
out a list of them all here. However, we can look and see how well the
model fits the training data (both good and bad):

~~~~ {tabindex="0"}
Player          Actual Y4  Estimated Y4  Actual Y5  Estimated Y5
Tom Brady            5.94          5.49       6.92          6.47
Rex Grossman         5.21          5.49       3.91          5.00
Byron Leftwich       5.34          5.62       2.69          3.86
David Carr           3.77          5.54       4.57          5.49
~~~~

Now that the 2013 season's over, we can take a look and see how well the
model did in predicting new data. Let's take a look at players who were
rookies in the 2012 season first.

~~~~ {tabindex="0"}
               name age starter anya_12 pred_2013 actual_2013
47      Andrew Luck  23       1    5.66  5.155733        6.06
195  Brandon Weeden  29       1    4.98  5.480730        4.51
245  Brock Osweiler  22       0    3.00  2.856016        4.83
1142   Kirk Cousins  24       0    7.53  4.351131        3.67
1434     Nick Foles  23       0    5.13  3.839286        9.18
1590 Russell Wilson  24       1    7.01  5.476200        7.10
1610   Ryan Lindley  23       0    1.89  2.466979         DNP
1612 Ryan Tannehill  24       1    5.23  4.998627        5.00
~~~~

Some hits and some misses here. We would expect the predictions to be weaker for rookies, as they only have one year of data and the overall league average will pull strongly on their predictions. The model predicted Luck to take a little bit of a slide in 2013, but he actually improved. Weeden and Cousins both have overly optimistic predictions, while Osweiler's prediction was too low, though none of them played many games. Nick Foles obviously exceeded everyone's expectations, and Russell Wilson continued to impress. Tannehill is on the money.
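
One way to put a number on "hits and misses" is to compute the error for each rookie and the mean absolute error across them. This Python sketch uses the predicted and actual 2013 values from the rookie table above and leaves out Ryan Lindley, who did not play; the variable names are illustrative.

```python
# (name, predicted 2013 ANY/A, actual 2013 ANY/A) from the 2012 rookie table.
rookies_2012 = [
    ("Andrew Luck", 5.155733, 6.06),
    ("Brandon Weeden", 5.480730, 4.51),
    ("Brock Osweiler", 2.856016, 4.83),
    ("Kirk Cousins", 4.351131, 3.67),
    ("Nick Foles", 3.839286, 9.18),
    ("Russell Wilson", 5.476200, 7.10),
    ("Ryan Tannehill", 4.998627, 5.00),
]

# Positive error: the player beat the projection.
# Negative error: the projection was too optimistic.
errors = {name: actual - pred for name, pred, actual in rookies_2012}
mae = sum(abs(e) for e in errors.values()) / len(errors)

for name, err in sorted(errors.items(), key=lambda item: item[1]):
    print(f"{name:15s} {err:+.2f}")
print(f"mean absolute error: {mae:.2f}")
```

With only seven players, a single outlier such as Foles dominates the average, so the per-player errors are more informative than the summary number.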

Next, let's look at the players who ended their third year in 2012 and compare their 2013 predictions with their actual performance. I limited this to players who had not already left the league.

~~~~ {tabindex="0"}
          name age starter anya_12  pred_13 actual_2013
1 Chase Daniel  26       0   10.00 5.018754        5.30
2   Colt McCoy  25       0    4.74 3.898679        13.0
3 John Skelton  24       0    4.50 3.703542         DNP
4  Rusty Smith  25       0    6.80 3.607257         DNP
5 Sam Bradford  25       1    5.64 5.090951        6.10
~~~~

Yikes. Daniel actually played only five games, McCoy a single game, and Bradford seven games before a season-ending injury.

This simple model is built on only two measures of a quarterback: his
age and whether or not he is a starter. It has several limitations.
First, the data set only measures *outcomes*, which are biased towards
successful players anyway. Trying to disentangle outcomes from process
would be an important contribution. Second, it does not help in
evaluating potential draft candidates. It would need to incorporate
college data to do so, and college statistics are notoriously poor at
forecasting NFL performance. Third, it does not take into account that
the quarterback is not solely responsible for his performance. It does
not account for talented receivers, effective offensive lines, or a
heavy run game.

Clearly, the model isn't magic and has to be considered with other information and in context. However, I was able to build a model with very few features that made reasonable forecasts about the future and allowed us to use all available data rather than selecting arbitrary cutoff points. I look forward to updating it in the future.