the spread: the (data) science of sports

Selecting features vs. selecting samples: Making smart decisions

Sat 22 February 2014

Modeling information rather than excluding it

A few days ago, Bill Mill asked me an excellent question that deserves an extended response. If you'll recall, I recently posted a simple multilevel regression model to forecast quarterback performance as a (partial) function of their age. Bill caught some imprecision in my language and asked if I wasn't contradicting myself.

In that post, I made the following claim:

Many existing approaches make arbitrary decisions about which players to include and exclude in their models along the lines of “I only selected players who played more than four seasons, started at least 12 games, and had at least 100 attempts in a season.” Doing this biases your model because you are doing what is known as selecting on the dependent variable.

Then, when I described the features that I used in the model, I stated that I used "a variable that was set equal to 1 if the player was a starter (generously defined as starting more than 8 games in a season) and 0 otherwise."

Bill asks -- what's the difference between these two things? The answer lies in the idea of selecting your sample vs. selecting your features.

When we engage in feature engineering, we want our features to improve how our model fits our data. In Silverian (tm) terms, we want to include features that increase the signal more than they increase the noise. The features we include in a model should be associated with the outcome in some systematic way. The more information we can include in our model while still having a model that generalizes out of sample, the better our model will be.

Contrast this with how we select our samples or how we collect our data. We want our data to be as representative as possible of the population we're trying to model. This is a little bit like solving a word problem. You have a question you're trying to answer and you need to collect some data to answer it. You need to make sure that the data you collect is appropriate for answering that question. Otherwise, if you collect data in a way that systematically restricts the way you can answer that question, you're going to get a biased answer.

Let's think about this in more concrete terms. The question I was trying to answer was, "How does age predict quarterback performance?" Notice I said "quarterback performance" and not "starting quarterback performance." Since the question I am trying to answer has to do with all quarterbacks, I need to make sure that my data represent all quarterbacks. If I arbitrarily decide to include only starting quarterbacks in my sample, I'm still asking the first question, but I'm actually answering the second.

This doesn't seem like that big of a deal until you try to use a model that you built with starting quarterbacks to forecast the performance of a non-starting quarterback. At this point you'll probably have an overly rosy view of how the non-starter will do and you won't realize it. Then, you'll make a bad signing decision that could cost you either your job or your fantasy league (depending on who you are).
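To make the "overly rosy view" concrete, here is a rough simulation sketch. Everything in it is an assumption made up for illustration: the data are synthetic, the age curve, the `skill` variable, and the rule for who becomes a starter are invented, and the fit is plain least squares rather than the multilevel model from the original post. It fits an age curve on starters only and then uses it to forecast non-starters.

```python
import numpy as np

# Synthetic, made-up data: none of these numbers come from real NFL seasons.
rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(22, 38, size=n)
skill = rng.normal(0.0, 1.0, size=n)  # talent we never observe directly

# Assumption: more talented players are more likely to be starters.
starter = skill + rng.normal(0.0, 0.5, size=n) > 0.5

# Assumption: performance peaks near age 29 and rises with talent.
performance = 6.0 - 0.02 * (age - 29.0) ** 2 + skill + rng.normal(0.0, 0.5, size=n)


def age_design(age):
    """Intercept plus linear and quadratic terms in centered age."""
    centered = age - 29.0
    return np.column_stack([np.ones_like(centered), centered, centered ** 2])


# Fit the age curve on starters only (the restricted sample).
coef, *_ = np.linalg.lstsq(age_design(age[starter]), performance[starter], rcond=None)

# Use that model to forecast the players it never saw: the non-starters.
predicted = age_design(age[~starter]) @ coef
optimism = float(np.mean(predicted - performance[~starter]))

print(f"Average over-prediction for non-starters: {optimism:.2f}")
```

Under these assumptions the starters-only model systematically over-predicts for non-starters, because the intercept it learned reflects the higher average talent of the players who made it into the sample.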

OK, you might still be asking why this is any different than including a feature that indicates if the player was a starter or not. The difference is that we're now answering the question we set out to answer, but we're also including information in the model that acknowledges a simple fact: some quarterbacks are better than others. What ends up happening in a linear model like the one in the post is that players who were starters get a fixed "boost" to their projections that takes into account that they are probably better players. Players who were not starters don't get this boost.
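A minimal sketch of that fixed "boost", again on invented data and with ordinary least squares standing in for the multilevel model in the post (the variable names and the data-generating rules are assumptions, not the original code). Every quarterback stays in the sample, and the starter indicator enters as one more column of the design matrix.

```python
import numpy as np

# Synthetic, made-up data: same hypothetical setup as a simple illustration.
rng = np.random.default_rng(1)
n = 5000
age = rng.uniform(22, 38, size=n)
skill = rng.normal(0.0, 1.0, size=n)  # unobserved talent
starter = skill + rng.normal(0.0, 0.5, size=n) > 0.5
performance = 6.0 - 0.02 * (age - 29.0) ** 2 + skill + rng.normal(0.0, 0.5, size=n)

# Design matrix: intercept, centered age, centered age squared, starter dummy.
centered = age - 29.0
X = np.column_stack([np.ones(n), centered, centered ** 2, starter.astype(float)])

# Fit on ALL quarterbacks; nobody is excluded from the sample.
coef, *_ = np.linalg.lstsq(X, performance, rcond=None)

# The coefficient on the dummy is the fixed boost that starters receive.
starter_boost = float(coef[3])

# Average forecast error (actual minus predicted) among non-starters.
residuals = performance - X @ coef
nonstarter_error = float(np.mean(residuals[~starter]))

print(f"Starter boost: {starter_boost:.2f}")
print(f"Mean error for non-starters: {nonstarter_error:.2e}")
```

The design choice here is that the indicator shifts the whole age curve up by a constant for starters, so non-starters are forecast from their own baseline instead of inheriting the starters' baseline.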

Is this the best way to do this? Probably not; it's a very rough indicator. But it helps improve the model's performance out of sample. I could have included, for instance, the number of games the player started in a season, the number of snaps they took, or something completely different. The key point is that I answered the question using as much data as I could, didn't make arbitrary decisions about inclusion in the model that could bias my results, and tried to build player differences into the model rather than exclude them.
