the **spread**
the (data) science of sports

Wed 18 December 2013

**Our first model**

Now that we have play-by-play data in a format ready for analysis, have
selected our features and target, and have given some thought to why the
world needs another win probability model, it's time to start modeling.
The first modeling technique we'll be trying is called a
*random forest* (or, if you want to use the non-trademarked name,
forests of randomized decision trees). This is a really popular method
these days, and if you take a look at data science competitions on
Kaggle you'll see that many of the winners use random forests as part of
their modeling toolkit.

Random forests are popular for a number of reasons. First, they
don't *overfit*
as easily as many other methods. Second, they're really robust to
non-linear interactions among your features. Third, they're surprisingly
accurate in lots of modeling situations. And, fourth, they're easy to
run in parallel, which means that you can estimate random forests on
really honking big data sets across lots of computers with lots of CPUs.

So... what exactly *are* they and how do they help us estimate win
probability? A full discussion of random forests would take way too long
and, besides, there are lots of them out there. I'll try and give you
the high-level view, though. Random forests are a kind of ensemble
model. Instead of building one model, you build lots of small (usually
simple, bad) models and combine their output, which often leads to more
accurate predictions than using a single, highly tuned model.

One way that random forests do this is through a process known
as *bagging*, which is just short for *bootstrap aggregating*. Still
with me? OK. What's going on under the hood here is that many small,
random samples are taken from your original data *with
replacement*. Taking samples with replacement is a way of simulating
taking many random samples from the entire population when you only have
a sample.
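To make that concrete, here's a tiny sketch of drawing one bootstrap sample with NumPy; the ten "plays" are just index placeholders, not real data:

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy "dataset" of 10 plays (just their indices here).
plays = np.arange(10)

# A bootstrap sample: draw n observations *with replacement*,
# so some plays appear multiple times and others not at all.
bootstrap = rng.choice(plays, size=plays.size, replace=True)

print(sorted(bootstrap.tolist()))
```

On average, roughly 63% of the original observations show up at least once in any given bootstrap sample; the rest are "out of bag."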

So, while you only have one dataset, you're simulating many datasets that were (hopefully) produced by the same data generating process. You then build a model for each of these little samples. Random forests take this one step further and do the same thing with all of your features as well. So, for each of these little models, you have a random subset of your features. Each of these little models then makes a prediction about the observations in its subsample -- for our purposes, that means for each play, the small models each make a prediction about whether the play belonged to the winning team or not (and with what probability).

Each of these predictions from the subsamples is sometimes
called a *vote*. We can then tally up all of those votes using some
predetermined rule (often just a majority) and get more accurate
predictions. Each of the small models is a *decision tree
classifier* -- basically a flowchart used to make predictions.
The random forest algorithm keeps splitting the subsamples up based on
values of the features until it reaches a predetermined stopping point.
There are a number of ways to determine how to make these splits, but
suffice it to say that they come from information theory. Here's an example
of a decision tree from Wikipedia.

A nice side effect of this and the random sampling of the features between models is that we can estimate how important various features are for predicting win probability. If we can remove a feature from our model and our errors in predicting wins don't get much worse, that feature probably isn't that important to our model. In the end, this means that all of your observations and all of your features are used in building the model, but not at the same time. This helps with overfitting and allows you to use as much of your data as possible when building the model.
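As a sketch of what that importance measure looks like in practice, scikit-learn exposes it as `feature_importances_`. In this made-up example (not my play-by-play data), one feature drives the label and the other is pure noise, so the forest should rank the first feature far higher:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented toy data: feature 0 determines the label, feature 1
# is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

# Importances sum to 1; a higher value means the feature
# contributed more to the splits across all the trees.
print(forest.feature_importances_)
```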

Importantly, each of these votes also provides a predicted probability of winning, which will enable us to quantify the uncertainty surrounding our final win probability estimates. We'll just take each of the predicted probabilities for each observation, sort them in ascending order, and take the 2.5th and 97.5th percentiles to give us a 95% uncertainty interval. Hopefully, this will mean that game situations that usually belong to a winning team will have more precise estimates of win probability than those situations that are less associated with winning. We can also examine individual game situations visually. Here's an example.
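Here's roughly how you can pull the per-tree votes out of a fitted scikit-learn forest and compute that percentile interval. The data below are stand-ins, not the actual play-by-play features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for play-by-play features and win labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=150, random_state=0)
forest.fit(X, y)

# One game situation to evaluate.
situation = X[:1]

# Each tree casts a "vote": its own predicted win probability.
votes = np.array([t.predict_proba(situation)[0, 1]
                  for t in forest.estimators_])

# The 2.5th and 97.5th percentiles of the votes bound the interval.
lo, hi = np.percentile(votes, [2.5, 97.5])
print(lo, hi)
```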

This violin plot represents the distribution of estimated win probabilities by the random forest model for a team receiving the ball at the beginning of the game on their own 20 via a touchback from the opening kick. The plot gets wider where there is more 'density' -- more votes at that probability. The dashed line represents the median vote, the 50th percentile of the votes, and the two dotted lines are the 25th and 75th percentiles. Interestingly, the model says that teams who receive the ball first have about a 47% win probability (.474, actually). One thing we can ask -- does this make sense? We'll explore that soon.

I've used scikit-learn, a machine learning library for Python, to create my model, but you can use just about any statistics package to build a random forest. You can even use Excel!

[Technical sidenote: Random forests require some *tuning* to find the
optimal number of observations to include in each subsample, how many
features to use in each model, how many 'trees to grow', and so on.
There are different ways to do this. I used what's known as grid search,
which tries many different parameters and selects the best one based on
cross-validation error. More technical posts on this later. My model
uses 150 trees, with a max of 3 features per tree, and a minimum of 100
samples per leaf.]
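For reference, a grid search like the one described above can be sketched with scikit-learn's modern `GridSearchCV` API. The toy data and candidate grid here are my own stand-ins, though the grid includes the final settings mentioned in the sidenote:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the play-by-play features and win labels.
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 5))
y = (X[:, 0] > 0).astype(int)

# Candidate settings; the post's final model landed on 150 trees,
# max_features=3, and min_samples_leaf=100.
param_grid = {
    "n_estimators": [50, 150],
    "max_features": [2, 3],
    "min_samples_leaf": [50, 100],
}

# Every combination is fit and scored by cross-validation;
# the best-scoring one is kept.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```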

In this post, I'll show you what plotting the win probability looks like
for a single game. In a following post, I'll discuss different ways
to *validate* a model; i.e., how do we know if we have a **good** model?
For now, suffice it to say that I'm testing the model on 2013 data, which
were not used in building the model.

**Modeling a single game**

Let's start examining what the model can do by picking a single game, this season's opening game. After a slow start and trailing at the half, Denver defeated Baltimore by a score of 49-27. Here's what the win probability plot looks like without any kind of uncertainty estimates. The red lines indicate the start of each quarter.

Some important things to notice right off the bat. Roughly halfway through the first quarter, we see that the model is giving Baltimore roughly a 70% probability of winning the game. Denver's only down by a touchdown and it's not even the second quarter yet! Crazy. Clearly, this leads us to question what it means to have a 'win probability' in the first quarter (hat tip to Joel Grus for that insight).

Let's take a look at a first pass at uncertainty. I stored the votes
from each of the 150 trees of the random forest for each play and used
it to construct what I'll call a 95% uncertainty interval. This isn't
really a confidence interval *or* a prediction interval, but it does
give us some idea of how stable the probability estimates are.

OK, clearly this needs some work -- what are those stalactites in the 3rd and 4th quarters all about? Is this a good way to quantify uncertainty? One thing is clear: the roughly 70% win probability attributed to Baltimore in the first quarter came with a 95% interval stretching from 52% to 84%. That's a pretty wide range, and it conveys a lot more information than the 70% point estimate alone. The uncertainty estimates get narrower as the game goes on, as we would hope, with those obvious exceptions.

In the coming posts, we'll test out some other models -- gradient
boosted decision trees and good old fashioned logistic regression and
see how they fare. We'll also look at *feature engineering* -- the
process of selecting features and combining features to increase the
accuracy of our model.

Code for this post is available on GitHub.