Sun 08 December 2013
This begins a series of posts on building a win probability model. In actuality, we're going to be building a lot of models that will be combined into one, a technique known as ensemble learning. There are several advantages to doing this. First, we can start very simply and measure how our win probability model does against existing models and, second, it will allow us to iterate and improve in small steps.
Tackling any data science problem requires (wait for it...) data. I'll be using play-by-play data from Armchair Analysis, combined with data from NFL Savant and Advanced NFL Stats. The Armchair Analysis data costs $25, but it's been cleaned quite well and comes separated into various tables to be loaded into a SQL database. It does not include 2013 data, so in order to make predictions and do out-of-sample testing, I'll supplement using the NFL Savant and ANS data.
We're going to approach this particular data science problem as an example of a classification problem (or, more specifically, a class probability estimation problem). In plain English, we're trying to figure out which class, or group, each of our plays belongs to -- are they winners or losers? What our classification model will do is estimate the probability that each play belongs to either the winners class or the losers class.
To begin to estimate the probability that a team wins before a given play situation, we need a few things. We need our variables, or features as they are called in machine learning, and we need an outcome, or target in machine learning terms. If you have a more traditional science or experimental background, you might recognize these as the independent and dependent variables, respectively. We'll try to figure out what combination of our features best predicts the target and with what probability. What are our features and target in this case?
We'll add more variables, like time outs remaining, as we build the model, but this is a good start. For now, I've excluded kickoffs, no plays, onside kicks, two-point conversions, punts, and field goal/extra point plays from the data, as well as all post-season games.
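Filtering these play types out is a one-liner in pandas. Here's a minimal sketch; the `type` column and its codes are illustrative stand-ins, since the real Armchair Analysis schema may use different names and values.

```python
import pandas as pd

# Hypothetical play-by-play frame; real column names and play-type codes
# come from the Armchair Analysis schema and may differ.
plays = pd.DataFrame({
    "type": ["RUSH", "PASS", "KOFF", "PUNT", "FGXP", "NOPL", "CONV"],
    "seas": [2001] * 7,
})

# Drop kickoffs, no-plays, punts, FG/XP attempts, and two-point conversions,
# keeping only ordinary rushing and passing plays.
excluded = {"KOFF", "NOPL", "PUNT", "FGXP", "CONV"}
plays = plays[~plays["type"].isin(excluded)]
```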
Why exclude the post-season? When data scientists create models, they're often operating under the assumption that all of the data were produced by the same data generating process. This is just a way of saying that the same basic decision-making processes were used by coaches to generate the data we have here. We can't know what the coaches were thinking or what they meant to do, only what they did. The post-season is a different monster. Since one loss will end your season and send your team on vacation, coaches may employ different strategies and attempt things they might not normally do during the season. This implies that the post-season data might be generated via a different process.
Now, notice that I just made an assumption about the data generating process. I don't know if it's true. One of the things we can do later is test it. That's one of the foundations of doing good data science -- make your assumptions explicit and test to see whether or not the data support them.
Here are the first five rows of my data, which I've stored in a Postgres database (note I start using data from 2001, even though Armchair Analysis begins with 2000, because of a data issue): [table id=2 /]
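Pulling rows out of the database and into a DataFrame looks something like the following. I'm using an in-memory SQLite table here as a stand-in for the Postgres instance so the snippet is self-contained; against Postgres you'd pass a psycopg2 or SQLAlchemy connection instead, and the column names are again illustrative.

```python
import sqlite3
import pandas as pd

# Stand-in for the Postgres database: an in-memory SQLite table with a few
# hypothetical columns (the real Armchair Analysis schema differs).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (gid INTEGER, seas INTEGER, secs INTEGER)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?)",
    [(1, 2001, 3600), (1, 2001, 3570), (2, 2002, 1800)],
)

# Start at 2001 to sidestep the data issue in the 2000 season; the same
# read_sql call works unchanged against a Postgres connection.
df = pd.read_sql("SELECT * FROM plays WHERE seas >= 2001", conn)
```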
Inspecting the data
The next step is to check the quality of the data and see if there are any obvious extreme values (e.g., a field position of greater than 100 yards, more than 3600 seconds remaining). A few tables and plots when you first get your hands on a new data set can go a long way toward avoiding headaches later on.
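A couple of range assertions plus a summary table will catch most impossible values. This is a sketch with made-up column names (`yfog` for yards from own goal, `secs` for seconds remaining); the real schema may differ.

```python
import pandas as pd

# Hypothetical cleaned play-by-play data; column names are illustrative.
plays = pd.DataFrame({
    "yfog": [20, 35, 80, 99],     # yards from own goal line
    "secs": [3600, 2700, 900, 0], # seconds remaining in the game
})

# Quick range checks: field position must lie in 1-99, the clock in 0-3600.
assert plays["yfog"].between(1, 99).all()
assert plays["secs"].between(0, 3600).all()

# A summary table is often enough to spot wild values at a glance.
summary = plays.describe()
```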
To do this, I'll be using my language of choice, Python, but you can do any of this in any language. All of my Python code will be submitted to my Github account. If you're completely new to this, you may want to check out Python for Data Analysis to get the basics of the PyData stack and/or Machine Learning for Hackers for an overview of some of the methods used here (although in R).
Just looking at the distribution of seconds remaining across all of the plays, we already see a few interesting things. The beginnings and ends of quarters and halves immediately jump out -- more plays will have 3600, 2700, 1800, or 900 seconds remaining. We also see a bunching up around halftime and the end of the game (0 on the x-axis), presumably because more time outs are called here. Everything looks good so far.
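A plot like this is just a histogram over the game clock. Here's a sketch using simulated data (the real values come from the play table), with extra mass piled onto the quarter boundaries to mimic the spikes described above; `plt.hist(secs, bins=60)` would draw the same distribution as a chart.

```python
import numpy as np

# Simulated seconds-remaining values standing in for the real play data.
rng = np.random.default_rng(0)
secs = rng.integers(0, 3601, size=10_000)

# Quarter boundaries (3600, 2700, 1800, 900) and the final gun (0) are
# over-represented in real play-by-play data; mimic that here.
secs = np.concatenate([secs, np.repeat([3600, 2700, 1800, 900, 0], 200)])

# 60-second bins over the full game clock.
counts, edges = np.histogram(secs, bins=60, range=(0, 3600))
```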
This looks good too. Let's look at two more, field position and final score differential, before we wrap up this post.
Again, nothing too surprising here. Most teams don't spend a whole lot of time backed up near their own end zone, but we see a big spike at 20 yards due to touchbacks. This then decreases steadily from about the halfway mark.
Finally, let's look at the final score differential.
Looks like the most common score differential is pretty low with a few extreme values at the other end of the distribution from rare blowouts. Let's take a more granular look at the same data, increasing the number of bins in the histogram.
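The effect of bin width is easy to demonstrate. With coarse bins, the spikes at 3 and 7 blur into one low bump; with one-point bins they separate cleanly. The margins below are simulated stand-ins for the real final-score differentials.

```python
import numpy as np

# Hypothetical final-margin sample with the familiar spikes at 3 and 7
# layered over a broad background of other outcomes.
rng = np.random.default_rng(1)
margins = np.concatenate([
    np.repeat(3, 300),
    np.repeat(7, 250),
    rng.integers(1, 40, size=1000),
])

# Coarse bins smear the 3- and 7-point spikes together; one-point bins
# (the "more granular look") make each visible on its own.
coarse, _ = np.histogram(margins, bins=8, range=(0, 40))
fine, _ = np.histogram(margins, bins=40, range=(0, 40))
```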
Now we see that the most common score differential is 3 points, with the next bump at 7. This makes perfect sense.
Wrapping up, we haven't found any crazy surprises in simple plots of our data. This is great news! Our next step is to start building the model. If you want a sneak preview of how we'll do that initially, check out Yhat on building random forests in Python.
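As a teaser for next time, class probability estimation with a random forest fits in a few lines of scikit-learn. Everything here is toy data: the three feature columns stand in for things like seconds remaining, score differential, and field position, and the labels are synthetic, so treat this as a shape-of-the-API sketch rather than the actual model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the real features; the label marks whether the
# offense's team went on to win (synthetic here, tied to one feature).
rng = np.random.default_rng(0)
X = rng.random((500, 3))
y = (X[:, 1] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# predict_proba returns class probabilities -- column 1 is the estimated
# probability of the "winners" class, i.e. our win probability.
win_prob = clf.predict_proba(X[:5])[:, 1]
```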