Wed 18 December 2013
Our first model
Now that we have play-by-play data in a format ready for analysis, have selected our features and target, and have given some thought to why the world needs another win probability model, it's time to start modeling. The first technique we'll try is called a random forest (or, if you want to use the non-trademarked name, forests of randomized decision trees). This is a really popular method these days, and if you take a look at data science competitions on Kaggle you'll see that many of the winners use random forests as part of their modeling toolkit.
Random forests are popular for a number of reasons. First, they don't overfit as easily as many other methods. Second, they handle non-linear interactions among your features really well. Third, they're surprisingly accurate in lots of modeling situations. And, fourth, they're easy to run in parallel, which means you can estimate random forests on really honking big data sets across lots of computers with lots of CPUs.
So... what exactly are they, and how do they help us estimate win probability? A full discussion of random forests would take way too long and, besides, there are plenty of good ones out there already. I'll try to give you the high-level view, though. Random forests are a kind of ensemble model. Instead of building one model, you build lots of small (usually simple, individually weak) models and combine their output, which often leads to more accurate predictions than a single, highly tuned model would give you.
One way that random forests do this is through a process known as bagging, which is just short for bootstrap aggregating. Still with me? OK. What's going on under the hood is that many small, random samples are taken from your original data with replacement. Sampling with replacement is a way of simulating many random draws from the entire population when all you actually have is a single sample.
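To make "sampling with replacement" concrete, here's a tiny NumPy illustration (not from the model code itself -- the play data here is just a stand-in array):

```python
import numpy as np

rng = np.random.RandomState(0)
plays = np.arange(1000)  # pretend these are 1,000 plays

# One bootstrap sample: same size as the original, drawn with replacement,
# so some plays show up several times and others not at all
sample = rng.choice(plays, size=len(plays), replace=True)

# On average, a bootstrap sample leaves out about 1 - 1/e (~37%) of the plays;
# those left-out plays are what make each little model's data look different
frac_left_out = np.mean([
    len(set(plays) - set(rng.choice(plays, size=len(plays), replace=True)))
    for _ in range(200)
]) / len(plays)
```

Those left-out observations are also what make it possible to check each small model against data it never saw.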
So, while you only have one dataset, you're simulating many datasets that were (hopefully) produced by the same data-generating process. You then build a model on each of these little samples. Random forests take this one step further and do the same thing with your features as well: each of these little models also considers only a random subset of the features. Each little model then makes a prediction about the observations in its subsample -- for our purposes, that means each small model predicts, for each play, whether the play belonged to the winning team or not (and with what probability).
Each of these predictions from the subsamples is sometimes called a vote. We can then tally up all of those votes using some predetermined rule (often just a majority) and get more accurate predictions. Each of the small models is a decision tree classifier. These are basically flowcharts used to make predictions. The random forest algorithm keeps splitting the subsamples up based on values of the features until it reaches a predetermined stopping point. There are a number of ways to decide where to make these splits, but suffice it to say that they come from information theory. Here's an example of a decision tree from Wikipedia.
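Putting the pieces together -- bootstrap samples, random feature subsets, decision trees, vote tallying -- a minimal sketch with scikit-learn might look like the following. The features here are invented stand-ins for illustration, not the model's actual feature set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(42)

# Fake play-by-play features: score differential, seconds remaining, yard line
# (illustrative stand-ins only)
n_plays = 2000
X = np.column_stack([
    rng.randint(-28, 29, n_plays),   # score differential for the offense
    rng.randint(0, 3600, n_plays),   # seconds remaining in the game
    rng.randint(1, 100, n_plays),    # yard line
])
# Fake target: did the offense's team go on to win? (driven by score diff)
y = (X[:, 0] + rng.normal(0, 10, n_plays) > 0).astype(int)

# Each tree is fit on a bootstrap sample of plays and considers a random
# subset of features when splitting; the trees' votes are averaged into
# a single predicted probability
model = RandomForestClassifier(n_estimators=150, max_features=3,
                               min_samples_leaf=100, random_state=42)
model.fit(X, y)

# Predicted win probability for a team up 7 at halftime on its own 20
win_prob = model.predict_proba([[7, 1800, 20]])[0, 1]
```

The tuning values mirror the ones used later in this post (150 trees, 3 features, 100 samples per leaf), but on this toy data they're arbitrary.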
A nice side effect of bagging and the random sampling of features between models is that we can estimate how important the various features are for predicting win probability. If we can remove a feature from our model and our errors in predicting wins don't get much worse, that feature probably isn't very important to the model. In the end, this means that all of your observations and all of your features are used in building the model, just not all at the same time. This helps with overfitting and lets you use as much of your data as possible when building the model.
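One common way to act on that idea -- a sketch of the general technique, not the code behind this post -- is to shuffle one feature at a time and watch how much the model's accuracy degrades:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Toy data: the first feature drives the outcome, the second is pure noise
n = 2000
X = rng.normal(size=(n, 2))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
baseline = model.score(X, y)

importances = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    # Shuffling a column breaks its link to the outcome while keeping
    # its distribution intact
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importances.append(baseline - model.score(X_perm, y))

# Shuffling the informative feature hurts accuracy a lot;
# shuffling the noise feature barely matters
```

(Scikit-learn's forests also expose a built-in `feature_importances_` attribute computed from the splits themselves.)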
Importantly, each of these votes also provides a predicted probability of winning, which will let us quantify the uncertainty surrounding our final win probability estimates. We'll take the predicted probabilities for each observation, sort them in ascending order, and take the 2.5th and 97.5th percentiles to give us a 95% interval. Hopefully, this will mean that game situations that usually belong to a winning team will have more precise estimates of win probability than situations that are less associated with winning. We can also examine individual game situations visually. Here's an example.
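Concretely, that percentile trick can be sketched like this, assuming (as scikit-learn's `estimators_` attribute allows) that we can query each tree in the forest individually. The data and game situation are again invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(1)
X = rng.normal(size=(1500, 3))
y = (X[:, 0] + rng.normal(0, 0.5, 1500) > 0).astype(int)

model = RandomForestClassifier(n_estimators=150, random_state=1).fit(X, y)

# A single game situation (made-up feature values)
situation = np.array([[0.5, 0.0, 0.0]])

# One predicted win probability ('vote') per tree in the forest
votes = np.array([tree.predict_proba(situation)[0, 1]
                  for tree in model.estimators_])

# The 2.5th and 97.5th percentiles of the sorted votes give a 95% interval;
# the forest's overall estimate is the mean of the votes
lo, hi = np.percentile(votes, [2.5, 97.5])
point = votes.mean()
```

The spread of `votes` for a given situation is exactly the kind of distribution the violin plot below visualizes.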
This violin plot shows the distribution of win probabilities estimated by the random forest for a team receiving the ball at the beginning of the game on their own 20, via a touchback on the opening kick. The plot gets wider where there is more 'density' -- that is, more votes at that probability. The dashed line marks the median vote (the 50th percentile), and the two dotted lines mark the 25th and 75th percentiles. Interestingly, the model says that teams who receive the ball first have about a 47% win probability (.474, to be exact). One thing we can ask -- does this make sense? We'll explore that soon.
[Technical sidenote: Random forests require some tuning to find the optimal number of observations to include in each subsample, how many features to use in each model, how many trees to 'grow', and so on. There are different ways to do this. I used what's known as grid search, which tries many different parameter combinations and selects the best one based on cross-validation error. More technical posts on this later. My model uses 150 trees, with a max of 3 features per tree and a minimum of 100 samples per leaf.]
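A grid search over those knobs can be sketched with scikit-learn's `GridSearchCV` -- the candidate values and data below are illustrative, not the grid actually used for the model:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(7)
X = rng.normal(size=(600, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Candidate values for each tuning knob; the search tries every combination
param_grid = {
    "n_estimators": [50, 150],
    "max_features": [2, 3],
    "min_samples_leaf": [50, 100],
}

# Every combination is scored by cross-validation, and best_params_
# keeps the winner
search = GridSearchCV(RandomForestClassifier(random_state=7),
                      param_grid, cv=3)
search.fit(X, y)
best = search.best_params_
```

Grid search is exhaustive, so its cost grows multiplicatively with each knob you add -- one reason to keep the candidate lists short.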
In this post, I'll show you what plotting win probability looks like for a single game. In a following post, I'll discuss different ways to validate a model; i.e., how do we know if we have a good model? For now, suffice it to say that I'm testing the model on 2013 data, which were not used in building the model.
Modeling a single game
Let's start examining what the model can do by picking a single game, this season's opening game. After a slow start and trailing at the half, Denver defeated Baltimore by a score of 49-27. Here's what the win probability plot looks like without any kind of uncertainty estimates. The red lines indicate the start of each quarter.
Some important things to notice right off the bat. Roughly halfway through the first quarter, the model is giving Baltimore roughly a 70% probability of winning the game. Denver's only down by a touchdown and it's not even the second quarter yet! Crazy. Clearly, this leads us to question what it means to have a 'win probability' in the first quarter (hat tip to Joel Grus for that insight).
Let's take a look at a first pass at uncertainty. I stored the votes from each of the 150 trees of the random forest for each play and used them to construct what I'll call a 95% uncertainty interval. This isn't really a confidence interval or a prediction interval, but it does give us some idea of how stable the probability estimates are.
OK, clearly this needs some work -- what are those stalactites in the 3rd and 4th quarters all about? Is this even a good way to quantify uncertainty? One thing is clear, though: the 70% win probability attributed to Baltimore in the first quarter comes with a 95% interval stretching from 52% all the way up to 84%. That's a pretty wide range, and it conveys a lot more information than the single number 70% does. The uncertainty estimates get narrower as the game goes on, as we would hope, with those obvious exceptions.
In the coming posts, we'll test out some other models -- gradient boosted decision trees and good old-fashioned logistic regression -- and see how they fare. We'll also look at feature engineering: the process of selecting and combining features to increase the accuracy of our model.
Code for this post is available on GitHub.