Sat 01 February 2014
What's the problem with prediction?
So far I've emphasized constructing a win probability model that generalizes as well as possible and maximizes out-of-sample predictive accuracy. The argument for this has been that we want to create a model that best captures the relationship between variance in the features (variables like seconds remaining, score difference, etc.) and variance in the outcome (does this play belong to a winning team or not?).
I've done a little feature engineering by hand. Using domain knowledge that plays at the end of the game are more likely to be important to the outcome than plays near the beginning, I created a feature that increases in a non-linear fashion as game time dwindles. This seemed to improve the performance of the model, but not by a tremendous amount.
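As a concrete sketch of the kind of time feature I mean, here's one simple way to make a feature grow non-linearly as the clock runs out. The function name and the inverse transform are my own illustration, not the exact feature used in the model.

```python
def time_pressure(seconds_remaining):
    """Hypothetical engineered feature: grows non-linearly as
    game time dwindles, so late-game plays carry far more weight
    than early ones. An inverse transform is one simple choice.
    """
    # The +1 avoids division by zero at the final whistle
    return 1.0 / (seconds_remaining + 1)

# A play with 10 seconds left dwarfs one with 30 minutes left
print(time_pressure(10), time_pressure(1800))
```

Any monotone convex transform of remaining time (inverse, exponential decay, etc.) has the same qualitative effect; which one helps most is an empirical question answered by out-of-sample performance.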
Thinking about features like this is an important part of building a predictive model. However, sometimes it only gets you so far. Sometimes you have many more features than you know what to do with (known as the curse of dimensionality), or your features are highly correlated, or maybe you don't even know what your features are. Maybe you find that taking the cube root of one of your features produces a big jump in model performance but you can't explain why.
Reducing the number of features can be accomplished in many ways. You can combine features to create an index, or you can reduce the dimensionality by using a procedure like principal components analysis (PCA). PCA takes a set of features and transforms them into a new set of uncorrelated features called principal components. This is extremely useful if you suspect you have many correlations between your features (called collinearity) and if you are using a modeling technique that has a hard time with this state of affairs (like many linear models).
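Here's a toy illustration of that decorrelation using scikit-learn. The data is made up: three play-by-play-style columns, one deliberately built to be collinear with another, as a stand-in for the kind of redundancy PCA cleans up.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 500
# Invented stand-ins for play-level features
seconds = rng.uniform(0, 3600, size=n)          # seconds remaining
score_diff = rng.normal(0, 10, size=n)          # score difference
redundant = 0.9 * seconds + rng.normal(0, 50, size=n)  # collinear with seconds
X = np.column_stack([seconds, score_diff, redundant])

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# The new columns are uncorrelated, unlike the originals
print(np.corrcoef(X_pca.T)[0, 1])
print(pca.explained_variance_ratio_)
```

The correlation between the transformed columns is zero up to floating-point noise, and `explained_variance_ratio_` tells you how much of the original variance each component keeps.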
Oftentimes preprocessing your data using something like PCA will produce pretty significant model performance gains. The problem is that instead of a set of features like [down, distance, yards from own goal, etc.] you now have a set of features like [principal component 1, principal component 2]. The units on the new features don't tend to make a lot of sense, and it's not totally clear what they mean. Each feature can contribute to multiple components to different degrees (we say that they "load" on components), so we're no longer operating in a world where we can say "each second that ticks off the clock is worth .001 win probability points."
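You can see the interpretability problem directly by inspecting the loadings. The feature names and values below are invented for illustration; the point is just that each original feature spreads across both components rather than mapping cleanly to one.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up plays: columns are down, distance, yards from own goal
X = np.array([
    [1, 10, 25],
    [2,  7, 40],
    [3,  2, 60],
    [4,  1, 75],
    [1, 10, 50],
    [2,  5, 30],
], dtype=float)

pca = PCA(n_components=2).fit(X)

# Each row of components_ is a principal component; each entry is
# how strongly an original feature "loads" on that component.
names = ["down", "distance", "yards"]
for i, comp in enumerate(pca.components_, start=1):
    print(f"PC{i} loadings:", dict(zip(names, comp.round(2))))
```

Nothing in that printout translates to "one more yard is worth X win probability points," which is exactly the communication problem the next paragraph is about.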
This highlights one of the fundamental challenges that data scientists face -- communicating models to people who use them to make decisions. One of the greatest things about data science is the amount of time you can spend tuning your model's parameters and hyper-parameters, doing feature engineering, and eking out every last little bit of prediction accuracy that you can. However, if the model can't be used to make decisions, it's not that useful. To me, this is what separates data science in industry from data science in the academy.
Decision-making with statistics often occurs in situations where a model must be used under real constraints. In the NFL, computers are not allowed on the sidelines or in the booth. Thus, it's important that any predictive model that's going to be used for decision-making be interpretable without one. A coach or coordinator wants to know if he should go for it on fourth down in a particular situation. He needs to know what the impact of picking up a first down will be. Chances are, he's not going to be pleased (and you probably won't have a job for long) if you tell him, "hold on, let me figure out how these features load on my principal components."
That means that while you may be building a sub-optimal model in the short term, you will be maximizing its utility. However, suppose you have built a vastly superior but hard-to-interpret model. As a data scientist, your next step is figuring out a way, possibly via a visualization, to communicate the results in a way that is easily acted upon. Of course, if you're trying to build the best possible model for betting or general prediction purposes, you're not as limited by such constraints.
In a nutshell, building a good model requires not only the "data janitor" work needed to get started, and not only the latest algorithms, but also an understanding of how the model's output will actually be used.
On that note, happy Super Bowl weekend! Looking forward to a great game and getting back to writing about data science and football in the coming weeks.