Sun 13 April 2014
... or extreme values?
We've all heard the word (thanks in no small part to Malcolm Gladwell). What exactly are they and what do we do about them?
[Note: I'm aware that the plots have been cropped and pixelated. I'm walking out the door now, but will repair ASAP.]
Many machine learning and data analysis tutorials include some version of the following phrase as one of the preliminary steps to building a model: "Identify outliers in your data and remove them." Sounds simple, right? Unfortunately, almost none of these tutorials spend any time talking about what an outlier actually is, or about what removing data that fairly or unfairly gets labeled as an outlier does to your model.
I'll try to correct this and walk through a contrived example using football data to show you what you can do with your data points that may or may not be outliers.
As it turns out, there are lots of definitions of outliers and there's no strong agreement on what it means to be an outlier. Many people generally take an outlier to mean a data point that is unlike the other data points in your sample. More formally, this may mean that the offending data was produced by a different data generating process, or less formally, that it belongs to a different distribution than the other data points.
Let's walk through an example. The plot below is a (very ugly) scatter plot of field goal and extra point attempts. The x-axis is the distance in yards to the end zone and the y-axis is the number of seconds remaining in the game.
Notice that one attempt that sticks out from the rest? It's 58 yards from the end zone (which means an even longer field goal attempt). It's probably not that surprising to learn that it was a field goal attempt by the Raiders. Let's say we're trying to build a simple model to determine if there is a relationship between the quarter of the game and the distance of the attempt (it's very clear there isn't really one, but I did say the example was contrived). Do we throw that data point out as an outlier? Is it an outlier?
Easy first steps
The first few things we can do are easy. We can look at the distribution of field goal attempts.
From this we can see that the data are not really close to being normally distributed (in fact, they're almost uniformly distributed!). The mean of this distribution is 18.54 yards and the standard deviation is 9.98 yards. The "more than two standard deviations above the mean" rule of thumb leans on the shape of the normal distribution, so since this distribution isn't normal, it wouldn't make sense to exclude the Raiders' attempt on that basis alone.
In this case, we have a big sample (11,329 attempts), so we don't have to worry too much about one potential outlier skewing our measure of the center of the distribution. The mean is 18.54 yards, and the median is 19 yards. We could also look at the interquartile range (IQR), the data contained between the 25th and 75th percentiles. That puts us between 10 yards and 27 yards. If we excluded everything outside the IQR, we'd be throwing away roughly half of our data!
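Here's a minimal sketch of those first checks. I don't reproduce the real play-by-play data here, so the `distances` array is a synthetic stand-in (roughly uniform distances plus one 58-yard attempt), and `summarize_distances` is a hypothetical helper name. The Tukey-style upper fence (1.5 IQRs above the 75th percentile) is an extra rule of thumb I've added for comparison; it isn't one of the checks described above.

```python
import numpy as np

def summarize_distances(distances):
    """Return center, spread, and two candidate cutoffs for a 1-D sample."""
    x = np.asarray(distances, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mean = x.mean()
    std = x.std(ddof=1)  # sample standard deviation
    return {
        "mean": mean,
        "median": np.median(x),
        "std": std,
        "q1": q1,
        "q3": q3,
        # Rule of thumb that implicitly assumes roughly normal data.
        "two_sd_cutoff": mean + 2 * std,
        # Tukey-style upper fence: 1.5 IQRs above the 75th percentile.
        "upper_fence": q3 + 1.5 * iqr,
    }

# Synthetic stand-in for the real attempts (an assumption, not the actual data).
rng = np.random.default_rng(0)
distances = np.append(rng.integers(2, 36, size=11328), 58)

summary = summarize_distances(distances)
flagged_by_sd = distances[distances > summary["two_sd_cutoff"]]
flagged_by_fence = distances[distances > summary["upper_fence"]]
inside_iqr = np.mean((distances >= summary["q1"]) & (distances <= summary["q3"]))

print(summary)
print("share of attempts inside the IQR:", round(float(inside_iqr), 2))
```

Either cutoff only tells you which points are far from the bulk of the data. Neither one tells you whether a flagged point came from a different data generating process.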
So far, there isn't a really strong case for excluding this value. Before we get more complicated, let's ask ourselves a more existential question.
What do we really want from outliers?
Why do we worry about outliers at all? We worry about data points skewing our models and our conclusions and giving us the wrong answer when we try to generalize beyond our sample. But when we throw away perfectly good data toward that end, we're actually guaranteeing that we'll do the very thing we're trying to avoid.
So, you have to ask yourself. Do you want to be better at predicting events that are closer to the average event, knowing that you might get blindsided by a rare event? Or do you want to include those rare events, account for their rarity somehow, and build a more robust model? You probably won't be surprised to learn that I favor the latter option. That's what multilevel models are for.
More complicated tactics
I estimated a linear regression of distance from the end zone on seconds remaining in the game. As expected, the effect is not significant. When working in a regression framework, a popular test for outliers is Cook's distance (commonly referred to as Cook's d). Essentially, Cook's distance measures how influential each data point in a regression model is. Conceptually, you estimate the same regression many times, each time omitting one data point, and see how much your model's estimates change when that point is left out.
Statsmodels, a popular linear model package for Python, can estimate Cook's d for us easily. Here's a plot of Cook's d for each of the field goal and extra point attempts.
The majority of these values are vanishingly small. There's no hard-and-fast cutoff for how big is too big, but a value of more than one is often used as a quick heuristic to take another look at that case. That doesn't mean you automatically exclude it, just that you might give it some attention.
Machine learning approaches
This post is already growing too long, but we need to give some attention to machine learning, which has developed its own approaches to outlier detection. One approach is to use a one-class support vector machine. Technically this is known as novelty detection, but since we're just trying to figure out which of our cases are unusual, it might be a good first step. Other options rely on distance functions, like the jauntily named Mahalanobis distance. Scikit-learn, the best-supported machine learning package for Python, has implementations of each of these.
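To make that a little more concrete, here's a hedged sketch of both ideas in scikit-learn, again on synthetic stand-in data with one planted extreme attempt in row 0. The choices of `nu=0.01`, the RBF kernel, and standardizing the features first are my assumptions for illustration, not tuned recommendations. Note that `EmpiricalCovariance.mahalanobis` returns squared distances.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Synthetic stand-in data: columns are distance (yards) and seconds remaining.
rng = np.random.default_rng(2)
n = 1000
X = np.column_stack([
    rng.uniform(2, 35, size=n),
    rng.uniform(0, 3600, size=n),
])
X[0] = [58.0, 5.0]  # planted extreme attempt

# Put both features on a comparable scale before using distance-based methods.
X_scaled = StandardScaler().fit_transform(X)

# One-class SVM: predict() gives +1 for points it treats as typical, -1 otherwise.
svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.01).fit(X_scaled)
svm_labels = svm.predict(X_scaled)
svm_flagged = np.flatnonzero(svm_labels == -1)

# Squared Mahalanobis distance of each point from the sample mean.
cov = EmpiricalCovariance().fit(X_scaled)
mahal_sq = cov.mahalanobis(X_scaled)
most_extreme = int(np.argmax(mahal_sq))

print("rows flagged by the one-class SVM:", svm_flagged)
print("row with the largest Mahalanobis distance:", most_extreme)
```

Both methods produce a ranking of unusualness, not a decision. What you do with the flagged rows is still up to you.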
In general, we want to throw away as little good data as possible. Obviously, if the data we have is corrupted in some way (e.g., someone entered 999 yards to go instead of 99), we want to get rid of that. However, in many cases it's just not that clear if the data is "bad" or not. I prefer to err on the side of keeping the data in and trying to build in my uncertainty around it.
Especially in a sport with such small sample sizes as football, it's important to make the most of your data whenever you can. If your ultimate goal is (and it should be) to minimize prediction error on new data, you need the best, most representative sample you can get. It's often a good idea to build your models with and without the suspect data points and see how different they are. Computation is cheap. Inferential mistakes are expensive.
Finally, I would like to make a pitch to eliminate the word 'outlier' from most people's vocabulary. As I've tried to drive home, it's often not clear what it even means to be an outlier. Instead, I propose that we use the phrase 'extreme value' to indicate that we are aware that a particular data point is far from the mean/median/mode, but that we don't know for sure if it's been produced by a different data generating process or not.