
Win probability, uncertainty, and overfitting

Sun 01 December 2013

Uncertainty estimates

As a first exercise, I'm building a per-play win probability calculator. In an effort to be more transparent and make this more of a teaching exercise, I'll walk you through my thought process on why there's room for another win probability calculator as I show you how I build it.

A few already exist, the most well-known being the Advanced NFL Stats WP model. Pro Football Reference has also implemented one this year based on Wayne Winston's model in Mathletics. One of the areas that I see as ripe for improvement with these models is the explicit incorporation of uncertainty estimates.

Most models we encounter produce what are called point estimates; a point estimate is the single number that represents our best guess at a particular outcome (like the probability of winning a game). Most of these guesses are probabilistic, though, which means we can be more or less certain about how close our guess is likely to be to the real outcome.

If you think in terms of probability, you're already used to incorporating uncertainty into your estimates whether or not you realize it. For instance, if you say there's a 10% chance of a team winning a game, you're saying that your best guess (your point estimate) is that they'll lose the game, but that there's a small (but non-zero) chance that they'll actually win.

We can take this a step further. When we say that a team has a 10% chance to win, are we saying that the team has exactly a 10% chance of winning? No, 10% is a point estimate. Depending on how much information we have about teams in this particular position, we can be more or less certain about that 10%. If it is a situation with gobs of data, we may actually think that the team has between a 5% and 15% chance of winning, with 10% being the middle of that uncertainty interval. However, maybe only a few teams have ever been in this situation, so it's actually the case that the team has between a 1% and a 19% chance of winning. In the second scenario, the team may have almost twice the probability of winning as in the first scenario, but both are reported as a 10% win probability.
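To make that concrete, here's a minimal Python sketch (not the model I'm building, and the win counts are made up) showing how the same 10% point estimate comes with a much wider interval when it's backed by less data. It uses a 95% Wilson score interval for a proportion:

```python
import math

def wilson_interval(wins, n, z=1.96):
    """95% Wilson score interval for an observed win proportion."""
    p_hat = wins / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Same 10% point estimate, very different amounts of data behind it
for wins, n in [(100, 1000), (2, 20)]:
    lo, hi = wilson_interval(wins, n)
    print(f"{wins}/{n}: point estimate 10%, interval ({lo:.1%}, {hi:.1%})")
```

With 1,000 observations the interval is a few percentage points wide; with 20 it stretches from roughly 3% to 30%. Same point estimate, very different certainty.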

I think about this a lot when I see articles talking about wild swings in win probability or teams with extremely high win probabilities before tanking in the end. There's no question that these things happen; I just wonder how wild the swings are. Small (and sometimes bigger) changes in win probability from one play can fall comfortably within the uncertainty estimate of the win probability on the next play. Not only does providing uncertainty estimates allow for a more accurate representation of the likelihood of an outcome, it also helps in decision-making.

Overfitting

This brings me to my second point: overfitting. This is a topic that deserves its own series of posts, but I'll hit the high points here before discussing it at more length later. When data scientists build predictive models, we usually build the models on a set of data known as the training set (the model-building step is also called model training). But what we really want, in many cases, is to be able to make predictions about the future.

There are two obvious problems with that goal. First, we can't check how good we are at predicting the future until it actually happens. Second, doing that could be tremendously costly. So what we normally do is reserve a portion of the data we have, called the test set. We don't use this set of data when we build the model. Once the model is built, we have it make predictions for every observation in the test set without telling it the outcomes, then reveal the actual outcomes and see how well we predicted them. How well the model does on this held-out data is known as out-of-sample performance.
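Here's a rough sketch of that workflow in Python with scikit-learn. The data is fake (randomly generated stand-ins for features like score differential and time remaining), and logistic regression is just a placeholder model; the point is the split and the out-of-sample evaluation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Made-up per-play features (score differential, seconds remaining) and outcomes
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 10, 5000), rng.uniform(0, 3600, 5000)])
y = (X[:, 0] + rng.normal(0, 8, 5000) > 0).astype(int)  # 1 = home team wins

# Hold out a test set the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Train only on the training set
model = LogisticRegression().fit(X_train, y_train)

# Out-of-sample performance: predict on the held-out plays, then compare to the truth
test_probs = model.predict_proba(X_test)[:, 1]
print("out-of-sample log loss:", log_loss(y_test, test_probs))
```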

Why is this important? Why wouldn't we want to use all of our data to build the absolute best model? We do -- and more posts are to come about this. The problem, though, is that when we gauge the model's performance only on the data used to build it, we are at risk of overfitting. The goal of the model is to find the underlying patterns that do the best job of predicting outcomes in as many situations as possible; that is, the goal is for the model to generalize. If we have only one set of data and we use all of it in the model, we could keep making the model more and more complicated until it perfectly predicted every outcome in the training set. The only problem is that in doing so we would have built a complicated description of the data we already have, one that will probably fall apart when faced with new data.
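A quick illustration of the trap, again on made-up noisy data: an unconstrained decision tree can nearly memorize the training set while doing worse out of sample than a much simpler tree.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data with noise: a model that memorizes the training set ends up chasing that noise
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + rng.normal(0, 1.5, 2000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

# max_depth=None lets the tree grow until it fits the training set (almost) perfectly
for depth in [2, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_train, y_train)
    print(f"max_depth={depth}: train acc {tree.score(X_train, y_train):.2f}, "
          f"test acc {tree.score(X_test, y_test):.2f}")
```

The deep tree's training accuracy looks fantastic; its test accuracy is the number that actually tells you how it will do on data it hasn't seen.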

This is a really important concept, and it's taught much more in computer science/machine learning than in traditional statistics courses. I'll be returning to it frequently.

Summing up

So far we know our model needs two things to build upon the win probability models that already exist -- quantification of uncertainty and a measure of out-of-sample performance. Luckily, there's a class of models that let us do these things easily. They're called ensemble methods; I'll discuss them in the next post.

Update: Here's a great post on Kaggle detailing the dangers of overfitting.
