Wed 23 September 2015
One of the points I've made over and over is about making sure your model performs well out-of-sample. I've argued against overfitting and for holding out a portion of your data to test how well it will do at predicting the future. This is all well and good, but what if you don't have a lot of data? What if you want to use all of your data? Holding some data back to test the model is a luxury that we often have in machine learning because we're dealing with big data sets. Football? Not so much.
Luckily, there's an answer that allows us to use all of our data, estimate our model's predictive accuracy, and not overfit to our training data! It's not magic, it's called cross-validation. In this post, I'll walk through what cross-validation is and the logic of why it works.
Cross-validation (which I'll abbreviate as CV for parts of this post) starts with the same intuition as using a test set: estimate how well your model will perform when it sees new data that wasn't used when you built the model. Technically what we're trying to do is balance the amount of bias in our model -- that's error from a model too simple to capture the pattern, which shows up as a poor fit even on the training data -- with the amount of variance in our model -- that's how sensitive the model is to the particular sample it was fit on, which shows up as worse performance on new data.
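There are a few different ways to do this, but the most common way is called k-fold cross-validation. It's very simple and follows these steps:
1. Randomly split your data into k groups ("folds") of roughly equal size.
2. Set one fold aside and fit your model on the other k-1 folds.
3. Use that model to predict the fold you set aside, and record the prediction error.
4. Repeat steps 2 and 3 until every fold has been set aside exactly once.
5. Average the k prediction errors. That average is your cross-validation error.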
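Here's a minimal sketch of k-fold cross-validation in plain Python: shuffle the rows, deal them into k folds, fit on k-1 folds, score on the fold left out, and average. Everything in it is an assumption for illustration: the data are made up, the model is a one-predictor least-squares line, and `fit_line`, `make_folds`, and `k_fold_cv` are hypothetical helper names, not anything from a library.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x. Returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def make_folds(n, k, seed=0):
    """Shuffle the row indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def k_fold_cv(xs, ys, k=5, seed=0):
    """Return (average held-out MSE, list of per-fold MSEs)."""
    fold_mse = []
    for held_out in make_folds(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the other k-1 folds only.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Score on the fold the model has not seen.
        sq_errs = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        fold_mse.append(sum(sq_errs) / len(sq_errs))
    return sum(fold_mse) / k, fold_mse


# Made-up data: y is roughly 2 + 0.5*x plus noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(50)]
ys = [2 + 0.5 * x + rng.gauss(0, 1) for x in xs]

cv_error, per_fold = k_fold_cv(xs, ys, k=5)
print("CV error:", round(cv_error, 3))
print("per fold:", [round(m, 3) for m in per_fold])
```

Every observation ends up in exactly one held-out fold, so each one gets predicted once by a model that never saw it.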
If you're deciding between several models -- maybe you want to include seconds remaining squared, or you are trying to decide if you want to include an interaction between seconds and yards to go, or whatever -- the model with the lowest cross-validation error is the model that will most likely perform the best on new data. After using cross-validation to select a final model, it is common to refit that model on the entire dataset.
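To make the model-selection idea concrete, here's a rough sketch that scores three candidate models with the same folds, picks the one with the lowest cross-validation error, and refits it on all of the data. The variable names `seconds` and `outcome`, the simulated data, and the three candidates are all made up for the example; a real comparison would use your own models and error measure.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x. Returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m


def make_model(transform):
    """Build a fit function for y = a + b*transform(x)."""
    def fit(xs, ys):
        a, b = fit_line([transform(x) for x in xs], ys)
        return lambda x: a + b * transform(x)
    return fit


def cv_mse(fit, xs, ys, k=5, seed=0):
    """Average held-out mean squared error over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    total = 0.0
    for f in range(k):
        held_out = idx[f::k]
        held = set(held_out)
        train = [i for i in idx if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - predict(xs[i])) ** 2 for i in held_out) / len(held_out)
    return total / k


# Hypothetical candidates, loosely echoing the "seconds squared" question.
candidates = {
    "intercept only": fit_mean,
    "seconds": make_model(lambda s: s),
    "seconds squared": make_model(lambda s: s * s),
}

# Made-up data in which the outcome really does depend on seconds squared.
rng = random.Random(2)
seconds = [rng.uniform(0, 60) for _ in range(60)]
outcome = [1 + 0.01 * s * s + rng.gauss(0, 1) for s in seconds]

# Same seed for every candidate, so they are all scored on the same folds.
scores = {name: cv_mse(fit, seconds, outcome) for name, fit in candidates.items()}
best = min(scores, key=scores.get)

# Refit the winning model on the entire dataset.
final_model = candidates[best](seconds, outcome)
print(scores)
print("selected:", best)
```

Scoring every candidate on identical folds keeps the comparison fair: differences in error come from the models, not from a lucky split.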
The really nice thing about all of this is we've used all of our data, but we didn't overfit to it because we never used all of the data at once to make decisions.
Of course, there are some caveats and some warning signs to watch out for. Namely, the more folds you split your dataset into, the more variance you're going to see in the prediction errors from fold to fold (because each held-out fold, and so the sample size for the predictions, will be smaller). In fact, you can take this to the extreme and split your dataset into N folds (where N is the number of observations in your data set), making predictions on a single observation each time. This is called leave-one-out cross-validation, which has the lovely initialism LOOCV.
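As a small illustration, leave-one-out is just the k-fold loop with k set to the number of observations. The sketch below runs the same loop for a few values of k on made-up data and reports the mean and spread of the per-fold errors; the helper names and the data are assumptions for the example.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x. Returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def fold_errors(xs, ys, k, seed=0):
    """Held-out mean squared error for each of the k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    errors = []
    for f in range(k):
        held_out = idx[f::k]
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_errs = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        errors.append(sum(sq_errs) / len(sq_errs))
    return errors


# Made-up data: 40 observations around the line y = 2 + 0.5*x.
rng = random.Random(3)
xs = [rng.uniform(0, 10) for _ in range(40)]
ys = [2 + 0.5 * x + rng.gauss(0, 1) for x in xs]
n = len(xs)

summary = {}
for k in (5, 10, n):  # k == n is leave-one-out (LOOCV)
    errs = fold_errors(xs, ys, k)
    summary[k] = (statistics.mean(errs), statistics.pstdev(errs))
    print(f"k={k:>2}  mean error={summary[k][0]:.3f}  spread across folds={summary[k][1]:.3f}")
```

With k equal to N, each "fold" error is the squared error of a single prediction, which is why the individual values tend to be much noisier than errors averaged over larger folds.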
You should also keep an eye on how much variance there is in your predictions for each fold -- if they're bouncing around all over the place, you might have one of a few problems: you might have some "influential" cases (some people call these outliers) or you might have a bad model (sometimes there's just noise, and no signal).
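One rough way to keep that eye on things is to look at the per-fold errors side by side rather than only their average. In the sketch below, one extreme point is planted in made-up data on purpose, and any fold whose error is more than five times the median fold error gets flagged. The five-times threshold is an arbitrary assumption for the example, not a standard rule, and the helper names are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x. Returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def per_fold_results(xs, ys, k=5, seed=0):
    """Return a list of (held-out indices, held-out MSE), one per fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    results = []
    for f in range(k):
        held_out = idx[f::k]
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_errs = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        results.append((held_out, sum(sq_errs) / len(sq_errs)))
    return results


# Made-up data with one deliberately planted extreme case (the last row).
rng = random.Random(4)
xs = [rng.uniform(0, 10) for _ in range(49)] + [10.0]
ys = [2 + 0.5 * x + rng.gauss(0, 1) for x in xs]
ys[-1] += 30

results = per_fold_results(xs, ys, k=5)
typical = statistics.median(mse for _, mse in results)

# Arbitrary rule of thumb for this example: flag folds far above the median.
suspects = [(held_out, mse) for held_out, mse in results if mse > 5 * typical]

for held_out, mse in results:
    print(f"fold MSE = {mse:8.3f}")
for held_out, mse in suspects:
    print("worth a closer look, fold containing rows:", sorted(held_out))
```

A flagged fold only tells you where to look; whether the culprit is an influential case or a model that has no real signal is something you still have to check by hand.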
Cross-validation is really popular in both statistics and machine learning these days -- so much so that the statistics section of the popular question and answer site StackExchange is called CrossValidated. It works well with lots of different kinds of models. Need to choose some parameters for a random forest? A lambda parameter for a regularized regression? You can use cross-validation to help.
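For the lambda case, here's a minimal sketch: try a grid of lambda values, compute the cross-validation error for each, keep the lowest, and refit on all of the data. To keep it self-contained it uses a one-predictor ridge regression with an unpenalized intercept, where the slope is simply shrunk to sxy / (sxx + lambda). The data, the grid of lambdas, and the helper names are all assumptions for illustration.

```python
import random


def fit_ridge(xs, ys, lam):
    """Fit y = a + b*x, penalizing lam * b**2 (intercept not penalized)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / (sxx + lam)  # lam = 0 gives ordinary least squares
    return mean_y - b * mean_x, b


def cv_mse(xs, ys, lam, k=5, seed=0):
    """Average held-out mean squared error over k folds for one lambda."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    total = 0.0
    for f in range(k):
        held_out = idx[f::k]
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_ridge([xs[i] for i in train], [ys[i] for i in train], lam)
        total += sum((ys[i] - (a + b * xs[i])) ** 2 for i in held_out) / len(held_out)
    return total / k


# Made-up small, noisy sample, the kind of setting where shrinkage can help.
rng = random.Random(5)
xs = [rng.uniform(0, 1) for _ in range(20)]
ys = [0.3 * x + rng.gauss(0, 1) for x in xs]

# Hypothetical grid; in practice you would pick a range suited to your data.
lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(xs, ys, lam) for lam in lambdas}
best_lambda = min(scores, key=scores.get)

# Refit on the entire dataset with the chosen lambda.
a, b = fit_ridge(xs, ys, best_lambda)
for lam in lambdas:
    print(f"lambda={lam:>6}  CV error={scores[lam]:.3f}")
print("selected lambda:", best_lambda)
```

The same pattern carries over to other tuning parameters: swap the fitting function and the grid, and keep the folds fixed across the values being compared.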
Now you have no excuse to ignore out-of-sample performance, even with small samples.