
Building a win probability model part 4: Feature engineering and model evaluation

Wed 01 January 2014

How do we continue to improve the model?

So far we've used a fairly simple set of features in the win probability model. We saw that it performed pretty well on the training set and performed slightly less well, but still much better than chance, on the test set. Now it's time to delve deeper into increasing the accuracy of the model and assessing attempts to do so. This involves two processes: feature engineering and model evaluation.

I already covered some of the details of model evaluation in the previous post in this series. We'll use the metrics I introduced in that post to assess how good our feature engineering attempts are. Feature engineering is the process of selecting the features that you think will predict your target, transforming them to reflect their relationship with the target, and potentially creating new features out of the information you have to get the information you want.

We want to find the features that best classify plays as belonging to winning or losing teams, so we want to try a variety of them. However, every time you add new features to the model, you increase its complexity, which can, in turn, lead to overfitting. This means you'll need to test how well your model does out of sample each time you add new features to see whether they improve or degrade the model's accuracy.

In doing so, though, you run the risk of overfitting to the testing set. By repeatedly checking how well the model predicts out-of-sample on the same sample, you're essentially using the testing set as an extension of the training set. One way to work around this is to hold out a small dataset called a validation set that you use for all of this model tuning before you finally move on to evaluating accuracy on the test set. For simplicity, I'm not going to do that right now, but we'll return to it in a later post. For a good explanation of the differences between the training, validation, and test sets, see this Stack Overflow question.
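For reference, here's a minimal sketch of what carving out a validation set could look like with scikit-learn. The split proportions and variable names are purely illustrative, not what this project actually uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder play-level features and win/loss labels just to make this runnable;
# in the real project these would come from the play-by-play data.
X = np.random.rand(1000, 4)
y = np.random.randint(0, 2, size=1000)

# Illustrative three-way split: 60% train, 20% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# Tune features and hyperparameters against (X_val, y_val);
# touch (X_test, y_test) only once, at the very end.
```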

Feature engineering

We've already engaged in some feature engineering -- by transforming the minutes and seconds to a 'seconds adjusted' feature. We're going to add three more features to the model: the Vegas line, the Vegas total line, and a new transformed time feature. The first two are fairly obvious -- including the spread will allow us to incorporate (perceived) team strength into our estimates and the total line will allow for situations when various amounts of offense are expected.

The third feature is a way to account for the fact that time has an unevenly distributed effect on win probability. Being down by 9 points in the first quarter isn't great, but it's not the kiss of death; being down by 9 points with less than a minute remaining in the game is a very different situation. So we need a number that stays relatively constant for most of the game but starts changing more rapidly as the number of seconds remaining decreases. For this feature, we'll use 1/sqrt(seconds remaining + .01).
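As a rough sketch of what building these features could look like in pandas -- the column names 'minutes_remaining', 'seconds', 'spread', and 'over_under' are my placeholders, not necessarily what the source data uses:

```python
import numpy as np
import pandas as pd

def add_features(plays: pd.DataFrame) -> pd.DataFrame:
    """Add the seconds-adjusted and transformed-time features to a
    play-by-play DataFrame; the Vegas columns pass through unchanged."""
    out = plays.copy()
    # The existing 'seconds adjusted' feature: total seconds left in the game.
    out['seconds_remaining'] = out['minutes_remaining'] * 60 + out['seconds']
    # Nearly flat for most of the game, then changes rapidly as the clock
    # approaches zero; the +0.01 keeps the final play from dividing by zero.
    out['time_transform'] = 1 / np.sqrt(out['seconds_remaining'] + 0.01)
    # 'spread' and 'over_under' are used directly as features.
    return out
```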

The training data, from Armchair Analysis, already conveniently has the spread and over/under in the data. The test data does not, so some data munging is required. I'll post the code shortly. I obtained all of the Vegas data here.

Model evaluation

Let's look at two models: the random forest model we've been using up to now, which we'll call the 'limited model,' and a new random forest model that includes our new features, which we'll call the 'full model.' You can click on any image to see a larger version.

[Figure: ROC curves for the full vs. limited models]

It looks like the full model, with the blue line, does do slightly better than the limited model, the purple line. To verify, let's look at some evaluation statistics.

[Table: evaluation metrics for the limited and full models]

WOW! That's not much of an improvement. Surprising, given how much new information we've included. Only a percent improvement in accuracy here and there. While there were some non-trivial gains on the training set, the out-of-sample performance didn't change much (although it didn't worsen, which is always a good thing). This is definitely something we'll want to revisit over time. Feature engineering and model selection are ongoing processes.
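For anyone who wants to reproduce a comparison like the one above, here's a sketch of how two fitted classifiers can be overlaid on a ROC plot. It assumes scikit-learn-style models with predict_proba; all of the names are illustrative, not the project's actual code.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def plot_roc_comparison(models, features_by_model, y_test):
    """Overlay ROC curves for several fitted classifiers.

    `models` maps a label to a fitted model exposing predict_proba;
    `features_by_model` maps the same label to that model's test features.
    """
    for name, model in models.items():
        probs = model.predict_proba(features_by_model[name])[:, 1]  # P(win) per play
        fpr, tpr, _ = roc_curve(y_test, probs)
        plt.plot(fpr, tpr, label='%s (AUC = %.3f)' % (name, roc_auc_score(y_test, probs)))
    plt.plot([0, 1], [0, 1], linestyle='--', label='chance')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.legend(loc='lower right')
    plt.show()

# e.g. plot_roc_comparison({'limited': limited_rf, 'full': full_rf},
#                          {'limited': X_test_limited, 'full': X_test_full}, y_test)
```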

More model evaluation

Since we're trying to make predictions about future events, we want to know the model's weak spots. We've gotten a bird's-eye view with the above metrics, but is that accuracy evenly distributed across game situations? Let's find out. One way to check is to look at the average accuracy of the model at various points in the game. To do this, I've tested the model for each minute from 0 to 60 and plotted the accuracy score.
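That per-minute check can be sketched roughly as follows, assuming array-like labels and a seconds-remaining value for each test play (the function and variable names are mine, not the project's actual code):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def accuracy_by_minute(model, X_test, y_test, seconds_remaining):
    """Return a dict mapping minutes remaining (0-60) to the model's
    accuracy on the test plays that fall in that minute."""
    minutes = np.floor(np.asarray(seconds_remaining) / 60).astype(int)
    preds = model.predict(X_test)
    y_test = np.asarray(y_test)
    scores = {}
    for minute in range(61):
        mask = minutes == minute
        if mask.any():
            scores[minute] = accuracy_score(y_test[mask], preds[mask])
    return scores  # plot scores.keys() vs. scores.values()
```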

[Figure: model accuracy by minute remaining]

Unsurprisingly, the model gets more accurate as the end of a game approaches. Notice that the accuracy increase in the test set isn't as smooth as in the training set. This is to be expected. Interestingly enough, there are actually points where the limited model is more accurate out-of-sample than the full model, though not consistently. This just underscores the idea that the answer to a modeling problem often isn't 'a model,' but rather 'several models.'

Now for something really interesting. Since we're making predictions about wins, one metric we might be interested in is how calibrated our estimated probabilities are. Plays that a well-calibrated model estimates as having a win probability of 0.5 should be wins about 50% of the time; a win probability of 0.75 should belong to winning teams about 75% of the time, and so on.
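Concretely, a calibration check like the one plotted below can be built by binning the predicted probabilities and comparing each bin's average prediction to its observed win rate. Here's a rough sketch; the function and its names are mine, not the code behind the plot.

```python
import numpy as np

def calibration_points(predicted_probs, actual_wins, n_bins=10):
    """For each probability bin, return (mean predicted P(win),
    observed win rate, number of plays in the bin)."""
    predicted_probs = np.asarray(predicted_probs)
    actual_wins = np.asarray(actual_wins)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_index = np.digitize(predicted_probs, edges[1:-1])  # 0 .. n_bins - 1
    points = []
    for b in range(n_bins):
        mask = bin_index == b
        if mask.any():
            points.append((predicted_probs[mask].mean(),  # average predicted P(win)
                           actual_wins[mask].mean(),      # fraction that actually won
                           int(mask.sum())))              # bin size
    return points
```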

[Figure: estimated win probability vs. observed win rate]

This is super interesting. Essentially what we see here is that the model is wrong at the extremes, but in different ways. Teams with very, very low win probabilities still end up winning the game about 20% of the time. Teams with win probabilities as high as 80% may only win the game 60% of the time. I've also included the 95% confidence interval here because this is based on a relatively small test set.

It's important to think about this, especially when you see games with wild win probability graphs; it's one of the reasons I started this project in the first place. This is a really important plot, and not one you'll see on a lot of sites. I tweeted about this, and Tempo Free Gridiron, which produces win probabilities for college football, helpfully produced one as well.

Brenton Kenkel asked me on Twitter why the errors in the above plot aren't symmetric -- i.e., why isn't the model wrong the same amount of time around 0 and around 1? I don't have an immediate answer for this, but it's a good question! I'd love to hear your thoughts.

Closing

Let's revisit the Baltimore-Denver game that opened up the 2013 season and that I used as an example in a previous post. Here's the win probability for that game using the full model. I've altered the uncertainty estimates, though: I'm now using a 95% confidence interval based on bootstrapped samples of the estimated probabilities from each tree of the random forest.
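Here's roughly what that bootstrapping could look like with a scikit-learn random forest, resampling the per-tree probability estimates with replacement. This is a sketch of the idea, not the exact code used for the plot, and it assumes the positive class (index 1) is a win.

```python
import numpy as np

def bootstrap_win_prob_interval(forest, X, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap a (1 - alpha) confidence interval for P(win) from a fitted
    RandomForestClassifier by resampling its per-tree estimates.

    Returns (lower, mean, upper), each with one value per row of X.
    """
    rng = np.random.default_rng(seed)
    # One P(win) estimate per tree per play: shape (n_trees, n_plays).
    tree_probs = np.array([tree.predict_proba(X)[:, 1] for tree in forest.estimators_])
    n_trees = tree_probs.shape[0]
    boot_means = np.empty((n_boot, tree_probs.shape[1]))
    for i in range(n_boot):
        sample = rng.integers(0, n_trees, size=n_trees)  # resample trees with replacement
        boot_means[i] = tree_probs[sample].mean(axis=0)
    lower = np.percentile(boot_means, 100 * alpha / 2, axis=0)
    upper = np.percentile(boot_means, 100 * (1 - alpha / 2), axis=0)
    return lower, tree_probs.mean(axis=0), upper
```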

[Figure: BAL-DEN win probability with bootstrapped 95% confidence interval]

The bootstrapped confidence interval got rid of some of the wild swings in uncertainty in the final portions of the game, but the interval also appears to be wider in general than the previous one. That being said, I think I trust these estimates more than the previous hack-y ones.

Also of note is that Denver, which was a 7.5-point favorite in this game, started the game with approximately a 75% win probability and, despite trailing at the half, never dropped below a 50% win probability (although the confidence interval at halftime stretches from 20% to 90%). That tells a much richer story than the ~58% mean win probability expressed by the single blue line.

Progress! Next up, I'm implementing a tool that will allow you to interactively compare and plot game situations with all of the uncertainty and evaluation metrics included. I'm still working on live, in-game win probability estimates for the playoffs. I don't know if that will happen or not.

The code for this post will be posted ASAP!
