the spread the (data) science of sports

Forecasting QB performance using multilevel regression

Sun 09 February 2014

Making the best use of all of the available data

Now that we've hit the offseason, I thought I'd publish some of my earlier football analytics work. The following was a spec project demonstrating the use of multilevel regression models to forecast quarterback performance. It's a very simple model and could be an excellent jumping-off point for someone to take up as a project. The model was built during the break between the 2012 and 2013 seasons, so now I have a chance to go back and look at how it performed in projecting 2013 numbers. You can find all of the code on GitHub.

Projecting the passing productivity for quarterbacks is particularly important for an NFL team, as quarterbacks are frequently the highest-paid or second-highest-paid player on the roster. Important decision points include the draft and the end of the third year of the player's contract, when a decision is usually made about exercising a fifth-year contract option.

Lisk 2008, Stuart 2013, and Burke 2011 each approach the question in separate ways, but all are focused on the question of quarterback age curves. Each tries to identify the age at which quarterbacks are likely to peak, coming to slightly different conclusions.

While identifying the average peaking age is useful for GMs, the richness of individual player histories is lost when the data are condensed like this. I instead approached this as a question appropriate for a statistical technique called "multilevel modeling." This technique uses all of the available information about each player, takes into account that performance is correlated from year to year, and creates both an overall estimate for the "average QB" and an individualized model for each QB.

Many existing approaches make arbitrary decisions about which players to include and exclude in their models, along the lines of "I only selected players who played more than four seasons, started at least 12 games, and had at least 100 attempts in a season." Doing this biases your model because you are doing what is known as selecting on the dependent variable. You are trying to model some measure of success, and you are deciding who is in your sample by the values of that same variable. If you only model successful people, you will get a biased view of what drives success.
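To make the bias concrete, here is a minimal simulation sketch. It uses purely synthetic data, not football data: the predictor, the true slope of 1.0, the noise level, and the cutoff are all assumptions chosen for illustration. It fits an ordinary least-squares slope twice, once on everyone and once only on the cases whose outcome cleared the cutoff.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=5000, true_slope=1.0, noise_sd=3.0, cutoff=6.0, seed=42):
    """Return (slope on the full sample, slope on the outcome-selected sample)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [true_slope * x + rng.gauss(0, noise_sd) for x in xs]
    # Selecting on the dependent variable: keep only the "successful" cases.
    kept = [(x, y) for x, y in zip(xs, ys) if y > cutoff]
    full = ols_slope(xs, ys)
    selected = ols_slope([x for x, _ in kept], [y for _, y in kept])
    return full, selected


full_slope, selected_slope = simulate()
print(f"slope, everyone:          {full_slope:.2f}")
print(f"slope, 'successful' only: {selected_slope:.2f}")
```

In this setup the slope estimated from the selected sample comes out well below the one estimated from everyone, even though both samples were generated by the same process.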

Think about it this way. You're trying to build a model that will help you forecast how players will do because you're deciding whether or not to spend millions of dollars on them. You hope that this model will generalize and you will be able to use it in many decisions. You need the best, most unbiased information possible. If you let information from outside of the model leak into the model, it will decrease the ability of the model to generalize. 

Multilevel modeling allows you to include all of your observations. The model creates two components, a "population-level" component which tries to model a typical player in the absence of any specific information, and then player-specific adjustments (effects) that move that player's projections up or down.

The interesting thing about the problem of projecting quarterback performance is how much it benefits from a standard approach to modeling nested and correlated data, rather than trying to reinvent the wheel. Using an overall aging curve for a random player is less useful from a management perspective, as it is almost never the case that a player is signed or re-signed without any existing information. Rather than using a general rule-of-thumb about the player's contract, it makes much more sense to use all of the available information for both the player in question and all other players in that position.

Using the XML package for the R statistical language, I scraped names and draft dates for all players drafted at quarterback from 1980 - 2012 (the "modern era") from Pro Football Reference. I then scraped the passing information for each player for each year (example using 2012) and matched the draft information to passing statistics.

Following previous analyses, I used adjusted net yards per attempt (ANY/A) as the measurable outcome of quarterback performance. The formula for computing this is:

(pass yards + 20*(pass TD) - 45*(interceptions thrown) - sack yards)/(passing attempts + sacks)
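As a quick sketch, the formula translates directly into a small function. The function name, argument names, and the stat line in the usage example are hypothetical and chosen only for illustration.

```python
def adjusted_net_yards_per_attempt(pass_yards, pass_tds, interceptions,
                                   sack_yards, attempts, sacks):
    """Compute ANY/A from a season (or game) stat line."""
    dropbacks = attempts + sacks
    if dropbacks == 0:
        raise ValueError("ANY/A is undefined with no attempts and no sacks")
    adjusted_yards = pass_yards + 20 * pass_tds - 45 * interceptions - sack_yards
    return adjusted_yards / dropbacks


# Hypothetical stat line: 4000 yards, 30 TD, 10 INT, 200 sack yards,
# 550 attempts, 30 sacks -> 3950 / 580, roughly 6.81.
example_anya = adjusted_net_yards_per_attempt(4000, 30, 10, 200, 550, 30)
print(round(example_anya, 2))
```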

I then estimated a multilevel regression model using the player's age at each measurement of ANY/A, the player's age squared (to capture the non-linear rate at which age affects performance), and a variable that was set equal to 1 if the player was a starter (generously defined as starting more than 8 games in a season) and 0 otherwise.
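The three predictors can be sketched as a small helper that turns one player-season into a model row. The function and key names are hypothetical; only the definitions (age, age squared, and a starter flag for more than 8 starts) come from the description above.

```python
def model_features(age, games_started):
    """Build the predictors for one player-season."""
    return {
        "age": age,
        "age_sq": age ** 2,  # lets the age curve rise and then fall
        "starter": 1 if games_started > 8 else 0,  # more than 8 starts
    }


print(model_features(23, 16))
print(model_features(25, 8))  # exactly 8 starts does not count as a starter
```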

The model produces an intercept for the overall model, an intercept for each player, an overall slope for each variable, and a slope for each variable for each player. The intercept can be thought of as the starting estimate of ANY/A in the absence of any information about age or player. The slopes can be thought of as the effect of each additional year of age (or year squared) on ANY/A plus a fixed amount of influence for being a starter.
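One way to write this down, assuming a standard linear mixed model with normally distributed player effects and residuals (the symbols and the distributional assumptions are mine for illustration, not taken from the fitted model), with $i$ indexing seasons and $j$ indexing players:

$$
\text{ANY/A}_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\,\text{age}_{ij} + (\beta_2 + u_{2j})\,\text{age}_{ij}^2 + (\beta_3 + u_{3j})\,\text{starter}_{ij} + \varepsilon_{ij}
$$

$$
(u_{0j}, u_{1j}, u_{2j}, u_{3j}) \sim \mathcal{N}(0, \Sigma), \qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

Here the $\beta$ terms are the overall intercept and slopes, and the $u_{\cdot j}$ terms are the per-player adjustments to each of them.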

Because the model is interactive, it is not as easy to report the findings in a short summary of "quarterback age peaks." However, here are the so-called "fixed effects", the coefficients that are stable from player to player and season to season. Note that I don't really like the language of "fixed" and "random" effects, as they tend to vary in definition from discipline to discipline.

~~~~ {tabindex="0"}
Fixed effects:
            Estimate Std. Error t value
(Intercept) -9.13057    5.82462  -1.568
age          0.82139    0.39566   2.076
I(age^2)    -0.01299    0.00654  -1.986
starter      1.81190    0.22154   8.179
~~~~
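As a rough worked example, the fixed-effect estimates above can be plugged into the quadratic to get the population-level curve and the age at which it peaks, which is -b_age / (2 * b_age_sq). This sketch ignores all player-specific effects, and the function names are hypothetical.

```python
# Fixed-effect estimates copied from the table above.
FIXED = {
    "intercept": -9.13057,
    "age": 0.82139,
    "age_sq": -0.01299,
    "starter": 1.81190,
}


def population_anya(age, starter):
    """Population-level ANY/A for a given age and starter status."""
    return (FIXED["intercept"]
            + FIXED["age"] * age
            + FIXED["age_sq"] * age ** 2
            + FIXED["starter"] * int(starter))


def peak_age():
    """Age at which the population-level quadratic reaches its maximum."""
    return -FIXED["age"] / (2 * FIXED["age_sq"])


print(f"peak age of the average curve: {peak_age():.1f}")
print(f"average 23-year-old starter:   {population_anya(23, True):.2f}")
```

The peak of the average curve lands between ages 31 and 32, but the standard errors on the intercept and age terms are large, so it should be read as a loose estimate rather than a precise one.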

Then, each player has individual coefficients (called "random effects")
that add or subtract from these coefficients to adjust to the
information given about each player.
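A minimal sketch of how the two pieces combine: each player's coefficient is the fixed effect plus that player's offset. The offsets used below are made-up numbers for a hypothetical player, not values from the fitted model.

```python
# Fixed-effect estimates reported in the post.
FIXED = {
    "intercept": -9.13057,
    "age": 0.82139,
    "age_sq": -0.01299,
    "starter": 1.81190,
}


def player_anya(age, starter, offsets=None):
    """Predict ANY/A using fixed effects plus optional per-player offsets."""
    offsets = offsets or {}
    # A missing offset means "no adjustment", i.e. the average player.
    coef = {name: value + offsets.get(name, 0.0) for name, value in FIXED.items()}
    return (coef["intercept"]
            + coef["age"] * age
            + coef["age_sq"] * age ** 2
            + coef["starter"] * int(starter))


# Hypothetical offsets for an above-average player (illustrative only).
hypothetical_offsets = {"intercept": 0.5, "starter": 0.3}

print(round(player_anya(23, True), 2))                        # average player
print(round(player_anya(23, True, hypothetical_offsets), 2))  # adjusted player
```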

Because there are coefficients for each player, I can't really just spit
out a list of them all here. However, we can look and see how well the
model fits the training data (both good and bad):

            Player            Actual Y4 Estimated Y4 Actual Y5 Estimated Y5
            Tom Brady         5.94      5.49         6.92      6.47
            Rex Grossman      5.21      5.49         3.91      5.00
            Byron Leftwich    5.34      5.62         2.69      3.86
            David Carr        3.77      5.54         4.57      5.49

Now that the 2013 season's over, we can take a look and see how well the
model did in predicting new data. Let's take a look at players who were
rookies in the 2012 season first.

~~~~ {tabindex="0"}
               name age starter  anya_12 pred_2013 actual_2013
47      Andrew Luck  23       1     5.66  5.155733        6.06
195  Brandon Weeden  29       1     4.98  5.480730        4.51
245  Brock Osweiler  22       0     3.00  2.856016        4.83
1142   Kirk Cousins  24       0     7.53  4.351131        3.67
1434     Nick Foles  23       0     5.13  3.839286        9.18
1590 Russell Wilson  24       1     7.01  5.476200        7.10
1610   Ryan Lindley  23       0     1.89  2.466979         DNP
1612 Ryan Tannehill  24       1     5.23  4.998627        5.00
~~~~

Some hits and some misses here. We would expect the forecasts for rookies to be rough, as they have only one year of data and the overall league average will pull strongly on their predictions. The model predicted that Luck would slide a little in 2013, but he actually improved. The predictions for Weeden and Cousins were overly optimistic, while Osweiler's was too low, though none of them played many games. Nick Foles obviously exceeded everyone's expectations, and Russell Wilson continued to impress. Tannehill is on the money.
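The size of the hits and misses can be tallied directly from the rookie table. This is a small sketch using the predicted and actual values reported above, leaving out Ryan Lindley, who did not play in 2013. The variable names are mine.

```python
# (name, predicted 2013 ANY/A, actual 2013 ANY/A) from the rookie table.
rookies = [
    ("Andrew Luck", 5.155733, 6.06),
    ("Brandon Weeden", 5.480730, 4.51),
    ("Brock Osweiler", 2.856016, 4.83),
    ("Kirk Cousins", 4.351131, 3.67),
    ("Nick Foles", 3.839286, 9.18),
    ("Russell Wilson", 5.476200, 7.10),
    ("Ryan Tannehill", 4.998627, 5.00),
]

# Positive error: the model was too optimistic. Negative: too pessimistic.
errors = {name: pred - actual for name, pred, actual in rookies}
mae = sum(abs(e) for e in errors.values()) / len(errors)
too_high = [name for name, e in errors.items() if e > 0]
too_low = [name for name, e in errors.items() if e < 0]

for name, e in errors.items():
    print(f"{name:<15} {e:+.2f}")
print(f"mean absolute error: {mae:.2f}")
```

Most of the mean absolute error comes from Nick Foles. With only seven players, a single outlier like that dominates the average.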

Next, let's look at the players who finished their third year in 2012 and compare their 2013 predictions with their actual performance. I limited this to players who had not already left the league.

~~~~ {tabindex="0"}
          name age starter anya_12  pred_13 actual_2013
1 Chase Daniel  26       0   10.00 5.018754        5.30
2   Colt McCoy  25       0    4.74 3.898679        13.0
3 John Skelton  24       0    4.50 3.703542         DNP
4  Rusty Smith  25       0    6.80 3.607257         DNP
5 Sam Bradford  25       1    5.64 5.090951        6.10
~~~~

Yikes. Daniel actually only played 5 games, McCoy a single game, and Bradford played seven games before a season-ending injury.

This simple model is built on only two measures of a quarterback: his age and whether or not he is a starter. It has several limitations. First, the data set only measures outcomes, which are biased towards successful players anyway. Trying to disentangle outcomes from process would be an important contribution. Second, it does not help in evaluating potential draft candidates. It would need to incorporate college data to do so, and college statistics are notoriously poor at forecasting NFL performance. Third, it does not take into account that the quarterback is not solely responsible for his performance. It does not account for talented receivers, effective offensive lines, or a heavy run game.

Clearly, the model isn't magic and has to be considered with other information and in context. However, I was able to produce a model with very few features that produced reasonable forecasts about the future and allowed us to use all available data rather than selecting arbitrary cutoff points. I look forward to updating it in the future.
