the **spread**: the (data) science of sports

Sun 01 December 2013

**Uncertainty estimates**

As a first exercise, I'm building a per-play win probability calculator. In an effort to be more transparent and make this more of a teaching exercise, I'll walk you through my thought process on why there's room for another win probability calculator as I show you how I build it.

A few already exist, the most well-known being the Advanced NFL Stats WP model. Pro Football Reference has also implemented one this year based on Wayne Winston's model in *Mathletics*. One of the areas I see as ripe for improvement in these models is the explicit incorporation of uncertainty estimates.

Most models we encounter produce what are called *point estimates*; these are single numbers representing our best guess at a particular outcome (like the probability of winning a game). These guesses are probabilistic, though, which means we can be more or less certain about how close our guess is to the real outcome.

If you think in terms of probability, you're already used to incorporating uncertainty into your estimates, whether or not you realize it. For instance, if you say there's a 10% chance of a team winning a game, you're saying that your best guess (your point estimate) is that they'll lose the game, but that there's a small (non-zero) chance that they'll actually win.

We can take this a step further. When we say that a team has a 10% chance to win, are we saying the team has **exactly** a 10% chance of winning? No; 10% is a point estimate. Depending on how much information we have about teams in this particular position, we can be more or less certain about that 10%. In a situation with gobs of data, we might conclude the team has between a 5% and a 15% chance of winning, with 10% being the middle of that uncertainty interval. But maybe only a few teams have ever been in this situation, in which case the data might only support saying the team has between a 1% and a 19% chance of winning. The plausible range in the second scenario is almost twice as wide as in the first, yet both are reported as a 10% win probability.
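As a rough sketch of how sample size drives the width of that interval, here's a simple normal-approximation (Wald) interval for an estimated probability. All of the numbers are invented, chosen only to mirror the 5%–15% and 1%–19% scenarios above:

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """95% normal-approximation interval for an estimated probability.

    p_hat: the point estimate (e.g. a 10% win probability)
    n: how many comparable situations the estimate is based on
    """
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error shrinks as n grows
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)

# Same 10% point estimate, very different certainty:
print(wald_interval(0.10, 1500))  # gobs of data: roughly 8.5% to 11.5%
print(wald_interval(0.10, 40))    # few comparable plays: roughly 1% to 19%
```

The point estimate is identical in both calls; only the amount of data behind it changes, and the interval widens accordingly.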

I think about this a lot when I see articles about wild swings in win probability, or teams with extremely high win probabilities that tank in the end. There's no question that these things happen; I just wonder how wild the swings really are. Small (and sometimes not-so-small) changes in win probability from one play can fall comfortably within the uncertainty interval of the win probability on the next play. Providing uncertainty estimates not only gives a more accurate representation of the likelihood of an outcome, it also helps with decision-making.

**Overfitting**

This brings me to my second point: overfitting. This is a topic that deserves its own series of posts, but I'll hit the high points here before discussing it at more length later. When data scientists build predictive models, we usually build them on a set of data known as the *training set* (the model-building step is also called model training). But what we really want, in many cases, is to be able to make predictions about the future.

There are two obvious problems with that goal. First, we can't check how good we are at predicting the future until it actually happens. Second, doing that could be tremendously costly. So what we normally do is reserve a portion of the data we have, called the *test set*. We don't touch this data while building the model. Once the model is built, we have it make predictions for every observation in the test set without telling it the outcomes. We then expose the actual outcomes and see how well the predictions match them. How well the model does on this held-out data is known as its *out-of-sample performance*.
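The mechanics above can be sketched in a few lines of Python. The play fields, the 80/20 split, and the majority-class "model" here are all hypothetical, chosen only to show the held-out evaluation:

```python
import random

random.seed(7)

# Toy data: each play is (yards_to_go, did_team_win) -- invented fields,
# just to illustrate the train/test mechanics.
plays = [(random.randint(1, 20), random.random() < 0.5) for _ in range(1000)]

random.shuffle(plays)
split = int(0.8 * len(plays))
train, test = plays[:split], plays[split:]   # the test set is held out entirely

# "Train" the simplest possible model: always predict the majority outcome.
wins = sum(1 for _, won in train if won)
predict_win = wins > len(train) / 2

# Out-of-sample performance: score the model on the held-out test set.
correct = sum(1 for _, won in test if won == predict_win)
print(f"out-of-sample accuracy: {correct / len(test):.2f}")
```

Nothing in the model-building step ever sees the test rows, which is the whole point: the accuracy printed at the end is an honest estimate of how the model would do on plays it hasn't encountered.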

Why is this important? Why wouldn't we want to use all of our data to build the absolute best model? We do -- and more posts are to come about this. The problem is that when we gauge the model's performance only on the data used to build it, we risk *overfitting*. The goal of a model is to find the underlying patterns that do the best job of predicting outcomes in as many situations as possible; that is, the goal is for the model to *generalize*. If we have only one set of data and use all of it in the model, we could keep making the model more and more complicated until it perfectly predicted every outcome in the training set. The problem is that in doing so we would have built a complicated description of the data we already have, one that will probably fall apart when faced with new data.
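To make that failure concrete, here's a toy sketch (all numbers invented) of the most extreme overfit model possible: one that simply memorizes every training outcome. It predicts the training set perfectly, then falls back to the base rate on data it has never seen:

```python
import random

random.seed(1)

def make_data(n):
    # Each row: (game_state, did_team_win); the true win rate is 60%.
    # States are sampled without replacement, so none repeats within a set.
    states = random.sample(range(10**6), n)
    return [(s, random.random() < 0.6) for s in states]

train, test = make_data(300), make_data(300)

# A deliberately overfit "model": memorize every training outcome exactly.
memorized = dict(train)

def memorizing_model(state):
    return memorized.get(state, True)  # guess the majority class when unseen

def accuracy(model, data):
    return sum(model(s) == won for s, won in data) / len(data)

print("train accuracy:", accuracy(memorizing_model, train))  # perfect: 1.0
print("test accuracy:", accuracy(memorizing_model, test))    # much worse
```

In-sample, the model looks flawless; out-of-sample, all that memorized detail buys nothing, and performance collapses to roughly the base rate. Only the test-set number tells you that.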

This is a really important concept, and it's taught much more in computer science/machine learning than in traditional statistics courses. I'll be returning to it frequently.

**Summing up**

So far we know our model needs two things to build upon the win
probability models that already exist -- quantification of uncertainty
and a measure of out-of-sample performance. Luckily, there's a class of
models that let us do these things easily. They're called *ensemble
methods*; I'll discuss them in the next post.

**Update**: Here's a great post on Kaggle detailing the dangers of overfitting.