**the spread**: the (data) science of sports

Sun 22 December 2013

How do we know when we have a good model?

The win probability model is well on its way and we can now produce probabilities for any given down, distance, score differential, field position, and time remaining. Yet we haven't really evaluated our model. How do we know if it's any good? This ends up being a more complicated question than one might think at first.

There are some standard techniques for evaluating the performance of a classifier (remember -- the model is trying to predict if a given play belongs to the 'winning' class or not). I'll walk through some of these and discuss how they give us a better understanding of how much trust we can put in the model. After all, if you want to base decision-making on the model, you'd like to have some idea of how well it performs. As far as I know, most existing win probability models out there don't present any kind of model-checking diagnostics or let you see behind the curtain.

I've already mentioned that we're primarily concerned with how well the
model *generalizes* out of our training set of plays. In other words, we
want to have confidence that the model will perform reasonably well when
it's presented with new data. We need more than a model that does a good
job of explaining past games -- we already know who won those games.
This is where the *learning* part of *machine learning* comes in. We
want our model to learn about the things that identify when a team is
likely to win a game and, once it's learned to identify those things, to
apply this knowledge in new places and situations.

But in order to do that, our model first needs to learn about
the *training data*. This means we need to know how well the model
performs on the 2002-2012 play-by-play data. There are a few ways to do
this. The easiest first stop is to ask our model to make predictions on
every play in the training set and then compare that against the truth.
This will give us an overall accuracy score. In the case of our random
forest model, our accuracy score is 0.76.
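As a concrete sketch of what that score measures (plain Python with made-up toy labels, not the real play-by-play data), accuracy is just the fraction of predictions that match the truth:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Toy data: 1 = play belongs to the eventual winner, 0 = to the loser.
y_true = [1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 0, 1, 1, 1, 1]
print(accuracy(y_true, y_pred))  # 0.75
```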

Put another way, when asked to guess if each play in the training data
belongs to a winning team or a losing team, the model is right 76% of
the time. Pretty good, right? Well, maybe. When you're evaluating how
well a prediction is doing, you need to know the *base rate* -- how
often each class occurs in your data set. Maybe it's just that winning
teams run a lot more plays than losing teams. If winning teams run 76%
of the plays, our model isn't really doing any better than chance.

Luckily, winning teams run 50.82% of the plays in the training data. So, we're doing better than chance on that front.
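The base rate itself is even simpler to check: it's just the proportion of one class in the labels (again a toy sketch; the 50.82% figure comes from the real training data):

```python
def base_rate(y_true, positive=1):
    """Proportion of examples that belong to the positive class."""
    return sum(1 for y in y_true if y == positive) / len(y_true)

# Toy labels standing in for the real 2002-2012 play-by-play classes.
y_true = [1, 1, 0, 0, 1, 0, 1, 1]
print(base_rate(y_true))  # 0.625
```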

**Out of sample testing**

So, how does the model do out of sample? Its average accuracy score on 2013 games is 0.63, meaning it gets the class right about 63% of the time. Uh oh. That's 13 percentage points worse than on the training set! Is this a problem? Not necessarily. We'd love for the model to do better on the test set, but you should expect some drop in performance. The 2013 games might contain plays the model has never seen before, the same plays it has seen before but with different outcomes, or any combination of the two.

So far, we've only talked about 'accuracy.' But there are different ways
to be right and wrong. Each prediction the model makes has one of four
possible characteristics: a **true positive** (the model correctly
predicts a winning team), a **false positive** (the model predicts a
winning team that is actually a losing team), a **true negative** (the
model correctly predicts a losing team), or a **false negative** (the
model predicts a losing team that is actually a winning team). Building
a predictive model means figuring out the right tradeoffs between the
correct and the incorrect predictions. You can imagine there are cases
when a false positive is more costly than a false negative, so we might
pick the model that prioritizes one over the other.

We can put a number to these rates. *Precision* is the proportion of our
predicted wins that were actually wins. The random forest model has a
precision score of .63, meaning of all the plays it predicted to belong
to winning teams, it was right 63% of the time. *Recall* is another
common metric: out of all the *actual* plays belonging to winning teams,
what proportion of them did the model catch? In the random forest model
case, it's .63 again. This is pretty good news: it means our model is
not biased toward positive or negative classification. For a visual
depiction, see the graphic from
Wikipedia above.
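A toy sketch of these metrics in plain Python (the counts here are invented for illustration, not the model's real numbers): tally the four outcome types, and precision and recall fall out directly.

```python
def outcome_counts(y_true, y_pred):
    """Count true/false positives and negatives (1 = winning team)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
tp, fp, tn, fn = outcome_counts(y_true, y_pred)
precision = tp / (tp + fp)  # predicted wins that really were wins
recall = tp / (tp + fn)     # actual wins the model caught
print(precision, recall)  # 0.75 0.75
```

The same four counts, arranged in a 2x2 grid, are exactly the confusion matrix below.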

One common way of presenting how the model does on true/false positives
and negatives is a *confusion matrix*. It looks like this:

[table id=3 /]

**Thresholds**

So far, I've been talking about the model making predictions about winning and losing, but we also know that the model is actually producing probabilities. That was a little sleight of hand on my part. Because the model produces probabilities, we can select a probability above which we're willing to believe the play belongs to a winning team. Most software packages use .5, or 50/50, as the threshold, but this is often not a good choice. In many noisy situations where we can't know the 'true' probability of an event, we may just be willing to accept relative probabilities that are ranked in the correct order.
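To make the threshold idea concrete, here is a minimal sketch (toy probabilities, not the model's real output) of turning probabilities into class calls at different cutoffs:

```python
def classify(probs, threshold=0.5):
    """Label a play 'winning team' (1) when its probability clears the threshold."""
    return [1 if p >= threshold else 0 for p in probs]

probs = [0.9, 0.55, 0.48, 0.2]
print(classify(probs))        # the default 0.5 cutoff -> [1, 1, 0, 0]
print(classify(probs, 0.45))  # a lower cutoff flips the 0.48 play -> [1, 1, 1, 0]
```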

So, how to pick a threshold? One popular way is to find the threshold
that maximizes the true positive rate and minimizes the false positive
rate. We can ask the model to make predictions at many different
thresholds and record these numbers for each threshold and then plot
them against each other. What this produces is something called
a *receiver operating characteristic* curve (often abbreviated as ROC
curve). It looks like this:

For our model, it turns out that the threshold that satisfies this condition is 0.492906.
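A sketch of the search itself, with invented labels and probabilities (the criterion of maximizing true positive rate minus false positive rate is sometimes called Youden's J statistic):

```python
def rates(y_true, probs, threshold):
    """True positive rate and false positive rate at a given threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    pos = sum(y_true)
    neg = len(y_true) - pos
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    return tp / pos, fp / neg

def best_threshold(y_true, probs, thresholds):
    """Pick the candidate threshold that maximizes TPR - FPR."""
    return max(thresholds,
               key=lambda t: rates(y_true, probs, t)[0] - rates(y_true, probs, t)[1])

y_true = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.7, 0.65, 0.55, 0.3, 0.1]
print(best_threshold(y_true, probs, [0.2, 0.4, 0.6, 0.8]))  # 0.6
```

Plotting the (FPR, TPR) pairs from `rates` over a sweep of thresholds is exactly what traces out the ROC curve.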

We can also use the ROC curve to compare different models against each other. Here's an example where I compared our random forest to two other models, one using gradient boosted decision trees and one using logistic regression:

In this case, we want to select the model that maximizes the *area under
the curve* (AUC). Unfortunately, it looks like all three of these models
perform almost equally well. That's a potentially troubling sign: it
might mean there's so much noise in the play-by-play data that
producing a better model will be hard work. Then again, we're using a
relatively small number of features right now.
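AUC has a nice interpretation we can compute directly in a toy sketch (invented labels and scores, not the real models): it's the probability that a randomly chosen winning-team play gets a higher score than a randomly chosen losing-team play, so comparing models is just comparing this number.

```python
def auc(y_true, probs):
    """Probability that a random positive example outranks a random negative one."""
    pos = [p for t, p in zip(y_true, probs) if t == 1]
    neg = [p for t, p in zip(y_true, probs) if t == 0]
    better = sum(1 for pp in pos for pn in neg if pp > pn)
    ties = sum(1 for pp in pos for pn in neg if pp == pn)
    return (better + 0.5 * ties) / (len(pos) * len(neg))

y_true = [1, 1, 0, 0]
model_a = [0.8, 0.4, 0.6, 0.2]  # hypothetical scores from two competing models
model_b = [0.9, 0.7, 0.3, 0.1]
print(auc(y_true, model_a))  # 0.75
print(auc(y_true, model_b))  # 1.0 (model_b ranks every win/loss pair correctly)
```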

You've made it this far and you might be asking yourself -- so, **do we
have a good model here or not?** And the answer to this question is,
unfortunately, another question -- **compared to what?** That's the
question data scientists are always asking themselves. Does this model
perform as well as any other model we've tried so far? Yes. Is it doing
better than chance? Yes. Is it doing a lot better than chance? Eh, we're
getting there. Can we do better? We're going to try.

Checking out how these models compare to one another with some new features is the next step. We'll also examine how to decide how often to retrain the model with new data. Until next time.