the spread the (data) science of sports

What we talk about when we talk about win probability

Sat 03 October 2015

Win probability has been a popular topic for this blog. I've walked through some of the mechanics of building a win probability model and how to evaluate those models. In this post, I'm going to push a little deeper on the underlying questions that win probability models are actually trying to answer and what we really mean when we talk about win probability.

I've been thinking about these topics a lot lately as I've been building the model that backs the New York Times 4th Down Bot. End of game situations are particularly tricky to model but are the most scrutinized, the highest leverage, and the ones in which the model must perform in order to be taken seriously and used as a serious decision-making tool.

Win probability models drive all sorts of arguments within football analytics. They're the backbone of most existing fourth down models and feature prominently in discussions of clock management, two-point conversion strategy, and more. The esteemed Brian Burke has been quoted as saying that win probability models are basically the 'holy grail' of sports analytics.

I'll walk through a few of these issues below and then show you how they actually matter by working through an example.


Let's start with the most basic but often overlooked issue -- probabilities are inherently unobservable. The very thing we're trying to model is not something that can ever be directly observed. What does it mean to have a win probability of 0.75? We could take a frequentist approach and say that means that, in the situation we're currently in, if the game were played an infinite number of times, we would observe teams in this situation to win 75% of the time.

We could also treat the probability of winning as a latent variable that has some unknown value but can never be directly observed. We then fit a model that makes the observed pattern of wins and losses as likely as possible given the predictors.

There are additional approaches, but the take-home message is that we can't observe a probability, which makes arguing about probabilities inherently difficult. Teams with a WP of 0.01 will still win (hopefully 1% of the time, if you have a well-calibrated model). Does that mean the model was wrong? No. But we can certainly talk about which models are better instead of which models are right.
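To make "well-calibrated" concrete, here is a minimal sketch of a calibration check. The data are simulated (an assumption for illustration, not real play-by-play), and `calibration_table` is a hypothetical helper name. The simulated "model" reports the true probability, so predicted and observed win rates should line up in every bin, even though individual long shots still win.

```python
import random


def calibration_table(predicted, outcomes, n_bins=10):
    """Bin predictions into equal-width bins and compare the mean
    predicted probability with the observed win rate in each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, won in zip(predicted, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the top bin
        bins[idx].append((p, won))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        win_rate = sum(won for _, won in members) / len(members)
        rows.append((len(members), mean_pred, win_rate))
    return rows


# Simulated data: each outcome is drawn using the predicted probability,
# so this "model" is perfectly calibrated by construction.
rng = random.Random(0)
predicted = [rng.random() for _ in range(200_000)]
outcomes = [1 if rng.random() < p else 0 for p in predicted]

for n, mean_pred, win_rate in calibration_table(predicted, outcomes):
    print(f"n={n:6d}  mean predicted={mean_pred:.3f}  observed={win_rate:.3f}")

# Teams given less than a 1% chance: count how many of them won anyway.
long_shots = [won for p, won in zip(predicted, outcomes) if p < 0.01]
print(sum(long_shots), "wins out of", len(long_shots), "long shots")
```

A single long-shot win says nothing about whether the model is wrong; only the aggregate comparison across many predictions does.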

Picking the right model

In previous posts, I walked through how to evaluate win probability models. There are lots of different metrics you can use: out-of-sample prediction error, cross-validation error, precision, recall, F1-scores, log-loss, how well calibrated the model is, etc.
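As a rough illustration of a couple of these metrics, here is a sketch that scores a logistic regression and a random forest on held-out data with scikit-learn. The data are synthetic and the two features (`score_diff`, `minutes_left`) and the formula generating wins are assumptions made up for this example, not the features or data behind any real model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

# Synthetic game states (illustrative only).
rng = np.random.default_rng(0)
n = 20_000
score_diff = rng.normal(0, 10, n)      # points, positive when leading
minutes_left = rng.uniform(0, 60, n)   # game minutes remaining
# Made-up truth: a lead matters more as the clock runs down.
true_p = 1 / (1 + np.exp(-score_diff / (1 + minutes_left / 10)))
won = (rng.random(n) < true_p).astype(int)

X = np.column_stack([score_diff, minutes_left])
X_train, X_test, y_train, y_test = train_test_split(
    X, won, test_size=0.25, random_state=0
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(
        n_estimators=100, min_samples_leaf=50, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    p_win = model.predict_proba(X_test)[:, 1]  # column 1 is P(win)
    scores[name] = {
        "log_loss": log_loss(y_test, p_win),
        "brier": brier_score_loss(y_test, p_win),
    }
    print(name, scores[name])
```

Both metrics are computed on plays the models never saw during fitting, which is what "out-of-sample" means here. Lower is better for each.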

The dirty secret is it's pretty easy to build a win probability model that does well on all of these metrics and, beyond a few important features, different kinds of models perform about equally well on all of them. Whether you choose a logistic regression or a random forest, you're likely to see roughly equivalent performance. However, there are tons of corner cases and non-linearities (like the end of game situations mentioned above) in which the models don't seem to pass the smell test (see the example below).

There are 'do or die' situations in which a win hangs entirely on the success of a single play. We all recognize these situations, but models have a very hard time with them. In these cases, we know that we can't observe the probability, but we have a vague sense that the probability produced by the model is 'too low.' The normal solution for something like this is to introduce interaction terms in the model or use a model that better handles lots of non-linearities, such as a tree-based model. But, as I'll show below, that still doesn't always work. (And, in the case of the 4th Down Bot, we need the model to be fast at both training and prediction time, which is not a strength of random forests.)


Uncertainty is a topic I've certainly said a good deal about, but I've got more to say. Because probabilities are inherently unobservable, and we're fitting models that merely estimate these probabilities based on the observed outcomes (wins and losses), estimating the uncertainty of those probability estimates is a tricky business. There are plenty of arguments in the statistical literature about what it means to have a confidence interval on a predicted probability. Tools even differ in their ability to try to do this. While R's predict.glm function will estimate these confidence intervals, Python's statsmodels will not (though if this is interesting to you, please check out Tom Augspurger's notebook on doing so!). scikit-learn doesn't produce standard errors or a variance-covariance matrix, which means we have no measure of uncertainty for these probabilities.
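One generic workaround when a library gives no standard errors is the bootstrap: refit the model on resampled data many times and look at the spread of the predicted probability. The sketch below is only an illustration of that idea under stated assumptions, not the approach used for any model in this post. The data are synthetic, and `bootstrap_win_prob` and the two features are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def bootstrap_win_prob(X, y, situation, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the predicted P(win) in one situation."""
    rng = np.random.default_rng(seed)
    n = len(y)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        estimates[b] = model.predict_proba(situation)[0, 1]
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lower, upper


# Synthetic data (illustrative only): score difference and minutes remaining.
rng = np.random.default_rng(1)
n = 5_000
score_diff = rng.normal(0, 10, n)
minutes_left = rng.uniform(0, 60, n)
y = (rng.random(n) < 1 / (1 + np.exp(-0.15 * score_diff))).astype(int)
X = np.column_stack([score_diff, minutes_left])

situation = np.array([[-2.0, 0.7]])  # trailing by 2 with about 40 seconds left
point = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(situation)[0, 1]
lower, upper = bootstrap_win_prob(X, y, situation)
print(f"P(win) = {point:.3f}, 95% bootstrap interval = ({lower:.3f}, {upper:.3f})")
```

One caveat: this resamples individual rows. Plays from the same game share a single outcome, so with real play-by-play data it would be safer to resample whole games, which will generally widen the interval.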

Unfortunately, if we're going to use win probability models to make important decisions, we have to have some kind of idea about uncertainty -- if we don't, we won't have any idea what a meaningful change in win probability means. If going for it on 4th down produces an expected win probability of .65 and attempting a field goal produces an expected win probability of .63, that's a difference of two percentage points. Is that enough of a difference to go for it? How do we know? Both of those probabilities are estimates with some amount of (unknown!) uncertainty surrounding them.

An example

Let's work through an example to make all of this concrete. Here's a realistic situation we're all familiar with. There are 40 seconds left in the 4th quarter. Your team trails by 2 points, and your opponent has no timeouts remaining. You're on your opponent's 15-yard line and it's 4th down with 2 yards to go.

Parsing that situation, we basically see that a field goal wins the game because your opponent can't stop the clock and you only need 3 points to win. That means your kicker is facing a roughly 32-yard field goal, a kick that succeeds between 93% and 95% of the time.

Most fans look at this situation and it's pretty obvious that the probability of winning the game is equal to the probability of kicking that field goal successfully. If you kick the field goal as time expires, you win. If you miss the field goal, your opponent can kneel and end the game.
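The arithmetic behind that reasoning is short enough to write down. This is a sketch: the 7-yard snap-to-hold distance is the usual rule of thumb rather than an exact figure, and the constant and function names are made up for illustration.

```python
END_ZONE_DEPTH_YARDS = 10  # the goal posts sit at the back of the end zone
SNAP_TO_HOLD_YARDS = 7     # rule-of-thumb distance from the line of scrimmage to the hold


def kick_distance_yards(yards_to_goal_line):
    """Approximate field goal distance from a given line of scrimmage."""
    return yards_to_goal_line + END_ZONE_DEPTH_YARDS + SNAP_TO_HOLD_YARDS


distance = kick_distance_yards(15)
print(distance)  # 32

# If a make wins the game and a miss loses it, then
# WP = P(make) * 1 + P(miss) * 0 = P(make).
fg_low, fg_high = 0.93, 0.95  # success range quoted above
wp_low = fg_low * 1.0 + (1 - fg_low) * 0.0
wp_high = fg_high * 1.0 + (1 - fg_high) * 0.0
print(wp_low, wp_high)
```

Under those two assumptions (a make always wins, a miss always loses), the win probability of kicking collapses to the kicker's success rate.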

Let's see what a normal win probability model says, one that takes into account down, yards to go, field position, time remaining, score difference, and timeouts. We even have some custom features in here, like an interaction term between the quarter and the score difference and an indicator for whether a team can kneel down to end the game.

In [32]: logit.predict_proba(situation)
Out[32]: array([[ 0.39417384,  0.60582616]])

Hmm, a 60% chance of winning. That seems way too low. Let's try a random forest, a tree-based model with lots of deep trees that should be able to pick up all of these non-linearities.

In [33]: rf.predict_proba(situation)
Out[33]: array([[ 0.41979832,  0.58020168]])

About the same. What gives? We 'know' that these probabilities are not correct. The game is over once the field goal is kicked. The 'model' knows this, too.

What to do?

We have a couple of options -- we can continue to add more and more features to the model to try and capture all of these corner cases and non-linearities, but we run up against the bias-variance tradeoff as we continue to do this.

We can add post-processing steps that identify these situations and adjust the probability accordingly (NB: this is what we currently do with the 4th Down Bot). However, this introduces the uncomfortable question of "what does that mean for the probabilities that we produced via the model?" If we were plotting these probabilities throughout the game, this would probably produce some fairly large jumps in the win probability graph, a good sign that you may be overfitting the model.
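To show what a post-processing step of this kind might look like, here is a purely hypothetical sketch. It is not the 4th Down Bot's actual code: the `Situation` fields, the 40-second threshold, and the rule itself are assumptions chosen to match the example above.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    seconds_remaining: int
    score_diff: int          # negative when the offense trails
    opponent_timeouts: int
    yards_to_goal_line: int


def is_do_or_die_field_goal(s, max_seconds=40):
    """Hypothetical rule: a made field goal takes the lead and the
    opponent has no realistic chance to answer."""
    trails_by_one_or_two = -3 < s.score_diff < 0
    return (
        trails_by_one_or_two
        and s.opponent_timeouts == 0
        and s.seconds_remaining <= max_seconds
    )


def adjusted_win_prob(model_wp, s, fg_success_prob):
    """Replace the model's estimate with the kick's success rate in
    do-or-die field goal situations; otherwise pass it through."""
    if is_do_or_die_field_goal(s):
        return fg_success_prob
    return model_wp


example = Situation(seconds_remaining=40, score_diff=-2,
                    opponent_timeouts=0, yards_to_goal_line=15)
# 0.606 is the logistic model's output above; 0.94 is an assumed
# midpoint of the 93% to 95% range for a 32-yard kick.
print(adjusted_win_prob(0.606, example, fg_success_prob=0.94))  # 0.94
```

The jump from roughly 0.61 to 0.94 in this one situation is exactly the kind of discontinuity that would show up in a win probability graph.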

This isn't purely navel-gazing or a "what does it all mean" academic exercise. These issues are at the very core of what it means to make decisions based on win probabilities -- and I don't have good answers right now.
