
Building a win probability model, part 3: What's a good model?

Sun 22 December 2013

How do we know when we have a good model?

The win probability model is well on its way and we can now produce probabilities for any given down, distance, score differential, field position, and time remaining. Yet we haven't really evaluated our model. How do we know if it's any good? This ends up being a more complicated question than one might think at first.

There are some standard techniques for evaluating the performance of a classifier (remember -- the model is trying to predict whether a given play belongs to the 'winning' class or not). I'll walk through some of these and discuss how they give us a better understanding of how much trust we can put in the model. After all, if you want to base decision-making on the model, you'd like to have some idea of how well it performs. As far as I know, most existing win probability models out there don't present any kind of model-checking diagnostics or let you see behind the curtain.

I've already mentioned that we're primarily concerned with how well the model generalizes out of our training set of plays. In other words, we want to have confidence that the model will perform reasonably well when it's presented with new data. We need more than a model that does a good job of explaining past games -- we already know who won those games. This is where the learning part of machine learning comes in. We want our model to learn about the things that identify when a team is likely to win a game and, once it's learned to identify those things, to apply this knowledge in new places and situations.

But in order to do that, our model first needs to learn about the training data. This means we need to know how well the model performs on the 2002-2012 play-by-play data. There are a few ways to do this. The easiest first stop is to ask our model to make predictions on every play in the training set and then compare that against the truth. This will give us an overall accuracy score. In the case of our random forest model, our accuracy score is 0.76.

Put another way, when asked to guess if each play in the training data belongs to a winning team or a losing team, the model is right 76% of the time. Pretty good, right? Well, maybe. When you're evaluating how well a prediction is doing, you need to know the base rate -- how often each class occurs in your data set. Maybe it's just that winning teams run a lot more plays than losing teams. If winning teams run 76% of the plays, our model isn't really doing any better than chance.

Luckily, winning teams run 50.82% of the plays in the training data. So, we're doing better than chance on that front.
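If you're working with a scikit-learn-style classifier, this check takes only a few lines. Here's a minimal sketch; `model`, `X_train`, and `y_train` are placeholders for however you've stored the fitted classifier and the 2002-2012 plays and labels:

```python
# Minimal sketch: training-set accuracy versus the base rate.
# Assumes a fitted scikit-learn-style classifier `model` and training arrays
# X_train, y_train (1 = the play belongs to the eventual winner, 0 = it doesn't).
import numpy as np
from sklearn.metrics import accuracy_score

train_preds = model.predict(X_train)
print("training accuracy: %.2f" % accuracy_score(y_train, train_preds))  # ~0.76 in our case

# The base rate is just the share of plays belonging to winning teams.
print("base rate: %.4f" % np.mean(y_train))  # ~0.5082 in our case
```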

Out of sample testing

So, how does the model do out of sample? Its average accuracy score on 2013 games is 0.63, or it gets the class right about 63% of the time. Uh oh. That's 13 percentage points worse than the training set! Is this a problem? No. We'd love for the model to do better on the test set if possible, but you would expect performance to drop. The 2013 games might contain plays the model has never seen before, or the same plays the model has seen before but with different outcomes, or any combination of the two.
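The out-of-sample check looks just like the in-sample one, only on plays the model never saw during training. A sketch, assuming the 2013 plays and labels live in placeholder arrays `X_2013` and `y_2013`:

```python
# Out-of-sample accuracy on the held-out 2013 plays (names are placeholders).
from sklearn.metrics import accuracy_score

test_preds = model.predict(X_2013)
print("2013 accuracy: %.2f" % accuracy_score(y_2013, test_preds))  # ~0.63 in our case
```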

So far, we've only talked about 'accuracy.' But there are different ways to be right and wrong. Each prediction the model makes has one of four possible characteristics: a true positive (the model correctly predicts a winning team), a false positive (the model predicts a winning team that is actually a losing team), a true negative (the model correctly predicts a losing team), or a false negative (the model predicts a losing team that is actually a winning team). Building a predictive model means figuring out the right tradeoffs between the correct and the incorrect predictions. You can imagine there are cases when a false positive is more costly than a false negative, so we might pick the model that prioritizes one over the other.

[Figure: precision and recall, from Wikipedia]

We can put a number to these rates. Precision is the proportion of our predicted wins that were actually wins. The random forest model has a precision score of .63, meaning that of all the plays it predicted to belong to winning teams, it was right 63% of the time. Recall is another common metric: out of all the actual plays belonging to winning teams, what proportion of them did the model catch? In the random forest model's case, it's .63 again. This is pretty good news: it means our model is not biased toward positive or negative classification. For a visual depiction, see the graphic from Wikipedia above.
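Both numbers fall out of the same set of predictions. A sketch, using the same placeholder names as above (swap in the training arrays if you want the in-sample numbers):

```python
# Precision and recall for the 'winning team' class.
from sklearn.metrics import precision_score, recall_score

test_preds = model.predict(X_2013)
print("precision: %.2f" % precision_score(y_2013, test_preds))  # about .63 for our random forest
print("recall:    %.2f" % recall_score(y_2013, test_preds))     # about .63 for our random forest
```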

One common way of presenting how the model does on true/false positives and negatives is a confusion matrix. It looks like this:

[Table: confusion matrix for the random forest model]
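scikit-learn will produce the raw counts behind a table like that directly. A sketch with the same placeholder names as before:

```python
# Confusion matrix of true/false positives and negatives.
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_2013, model.predict(X_2013))
# Rows are the true classes, columns the predicted classes:
# [[true negatives,  false positives],
#  [false negatives, true positives ]]
print(cm)
```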

Thresholds

So far, I've been talking about the model making predictions about winning and losing, but we also know that the model is actually producing probabilities. This was a little sleight of hand on my part. Because we have probabilities, we can select a probability above which we're willing to believe the play belongs to a winning team. Most software packages use .5, or 50/50, as the threshold, but this is often not a good choice. In particular, in noisy situations where we can't know the 'true' probability of an event, we may just be willing to accept relative probabilities, as long as they're ranked in the right order.
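Mechanically, using a different threshold just means classifying on the predicted probabilities yourself rather than calling `predict`. A sketch, with an arbitrary threshold purely for illustration:

```python
# Classify using a custom probability threshold instead of the default 0.5.
# Assumes a classifier with predict_proba, like scikit-learn's RandomForestClassifier.
win_probs = model.predict_proba(X_2013)[:, 1]  # probability of the 'winning' class
threshold = 0.45                               # placeholder; see below for how to pick one
preds = (win_probs >= threshold).astype(int)
```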

So, how to pick a threshold? One popular way is to find the threshold that maximizes the true positive rate and minimizes the false positive rate. We can ask the model to make predictions at many different thresholds, record the true and false positive rates at each one, and then plot them against each other. What this produces is something called a receiver operating characteristic curve (often abbreviated as ROC curve). It looks like this:

[Figure: ROC curve for the random forest model]

For our model, it turns out that the threshold that satisfies this condition is 0.492906.
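Here's a sketch of that calculation: `roc_curve` returns the true and false positive rates at every candidate threshold, and we take the threshold that maximizes the gap between them. The names are the same placeholders as before:

```python
# Pick the threshold that maximizes (true positive rate - false positive rate).
import numpy as np
from sklearn.metrics import roc_curve

win_probs = model.predict_proba(X_2013)[:, 1]
fpr, tpr, thresholds = roc_curve(y_2013, win_probs)
best = np.argmax(tpr - fpr)
print("best threshold: %f" % thresholds[best])  # roughly 0.49 for our random forest
```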

We can also use the ROC curve to compare different models against each other. Here's an example where I compared our random forest to a couple of other models, one using gradient boosted decision trees and one using logistic regression:

[Figure: ROC curves for the random forest, gradient boosted trees, and logistic regression models]

In this case, we want to select the model that maximizes the area under the curve (AUC). Unfortunately, it looks like all three of these models perform almost equally well. That's a potentially troubling sign: it might mean there's so much noise in the play-by-play data that producing a better model will be hard work. Then again, we're using a relatively small number of features right now.
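For reference, here's how that comparison might look in code; `rf`, `gbt`, and `logit` are placeholders for the three fitted models:

```python
# Compare models by area under the ROC curve (AUC).
from sklearn.metrics import roc_auc_score

for name, clf in [("random forest", rf),
                  ("gradient boosted trees", gbt),
                  ("logistic regression", logit)]:
    probs = clf.predict_proba(X_2013)[:, 1]
    print("%s AUC: %.3f" % (name, roc_auc_score(y_2013, probs)))
```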

You've made it this far and you might be asking yourself -- so, do we have a good model here or not? And the answer to this question is, unfortunately, another question -- compared to what? That's the question data scientists are always asking themselves. Does this model perform as well as any other model we've tried so far? Yes. Is it doing better than chance? Yes. Is it doing a lot better than chance? Eh, we're getting there. Can we do better? We're going to try.

Checking out how these models compare to one another with some new features is the next step. We'll also examine how to decide how often to retrain the model with new data. Until next time.
