the **spread**: the (data) science of sports

Sun 13 April 2014

Outliers... or extreme values?

We've all heard the word (thanks in no small part to Malcolm Gladwell). What exactly are they and what do we do about them?


Many machine learning and data analysis tutorials contain some version of the following phrase as one of the preliminary steps to building a model: "Identify outliers in your data and remove them." Sounds simple, right? Unfortunately, almost none of these tutorials spend any time talking about what an outlier *actually is*, or about what removing data that fairly or unfairly gets labeled an outlier does to your model.

I'll try to correct this and walk through a contrived example using football data to show you what you can do with your data points that may or may not be outliers.

As it turns out, there are lots of definitions of outliers and there's no strong agreement on what it means to be one. Many people take an outlier to mean a data point that is unlike the other data points in your sample. More formally, this may mean that the offending data point was produced by a different *data generating process*, or, less formally, that it belongs to a different distribution than the other data points.

Let's walk through an example. The plot below is a (very ugly) scatter plot of field goal and extra point attempts. The x-axis is the distance in yards to the end zone and the y-axis is the number of seconds remaining in the game.

Notice that one attempt that sticks out from the rest? It's 58 yards from the end zone (which means an even longer field goal attempt). It's probably not **that** surprising to learn that it was a field goal attempt by the Raiders. Let's say we're trying to build a simple model and determine if there is a relationship between the quarter of the game and the distance of the attempt (it's very clear there isn't really one, but I did say the example was contrived). Do we throw that data point out as an outlier? **Is** it an outlier?

*Easy first steps*

The first few things we can do are easy. We can look at the distribution of field goal attempts.

From this we can see that the data are not really close to being normally distributed (in fact, they're almost uniformly distributed!). The mean of this distribution is 18.54 yards and the standard deviation is 9.98 yards. Since the distribution isn't normal, it wouldn't make sense to just say that we'll exclude the Raiders' attempt because it's more than two standard deviations above the mean.

In this case, we have a big sample (11,329 attempts), so we don't have
to worry too much about one potential outlier skewing our measure of the
center of the distribution. The mean is 18.54 yards, and the median is
19 yards. We could also look at the *interquartile range*, the data
contained between the 25th and 75th percentiles. That puts us between 10
yards and 27 yards. If we excluded everything outside the IQR, we'd be
throwing away a lot of data!
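A quick sketch of these summary checks, using made-up stand-in data rather than the real play-by-play sample (roughly uniform attempt distances plus one long 58-yard attempt):

```python
import numpy as np

# Hypothetical stand-in for the real attempt data: roughly uniform
# distances between 1 and 36 yards, plus one 58-yard attempt,
# for 11,329 attempts total (matching the sample size in the post).
rng = np.random.default_rng(0)
distances = np.append(rng.integers(1, 37, size=11_328), 58)

mean = distances.mean()
median = np.median(distances)
std = distances.std()

# The naive two-standard-deviation rule...
z_score = (58 - mean) / std

# ...versus the interquartile range.
q25, q75 = np.percentile(distances, [25, 75])
outside_iqr = ((distances < q25) | (distances > q75)).mean()

print(f"mean={mean:.2f}, median={median:.0f}, sd={std:.2f}")
print(f"z-score of the 58-yard attempt: {z_score:.2f}")
print(f"fraction of data outside the IQR: {outside_iqr:.0%}")
```

As the post notes, excluding everything outside the IQR would discard nearly half the sample, which is a lot to pay for tidiness.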

So far, there isn't a really strong case for excluding this value. Before we get more complicated, let's ask ourselves a more existential question.

*What do we really want from outliers?*

Why do we worry about outliers at all? We worry about data points
skewing our models and our conclusions and giving us the wrong answer
when we try to generalize beyond our sample. But when we throw away
perfectly good data toward that end, we're actually **guaranteeing**
that we'll do the very thing we're trying to avoid.

So, you have to ask yourself: do you want to be better at predicting events that are closer to the average event, knowing that you might get blindsided by a rare event? Or do you want to include those rare events, account for their rarity somehow, and build a more robust model? You probably won't be surprised to learn that I favor the latter option. That's what multilevel models are for.

*More complicated tactics*

I estimated a linear regression of distance from the end zone on seconds remaining in the game. You won't be surprised to learn that the effect is not significant. When working in a regression framework, a popular test for outliers is to estimate Cook's distance (commonly referred to as Cook's *d*). Essentially, Cook's distance tries to find out just how *influential* data points in a regression model are. By estimating the same regression many times, each time omitting a data point, you find out how much your model's estimates change when a given data point is omitted.

Statsmodels, a popular linear model package for Python, can estimate Cook's *d* for us easily. Here's a plot of Cook's *d* for each of the field goal and extra point attempts.

The majority of these values are vanishingly small. There's no
hard-and-fast cutoff for how big is too big, but a value of more than
one is often used as a quick heuristic to take another look at that
case. That **doesn't** mean you automatically exclude it, just that you
might give it some attention.

*Machine learning approaches*

This post is already growing too long, but we need to give some attention to machine learning, which has developed its own approaches to outlier detection. One approach is to use a one-class support vector machine. Technically this is known as *novelty detection*, but since we're just trying to figure out which of our cases are unusual, it might be a good first step. Other options rely on *distance functions*, like the jauntily named Mahalanobis distance. Scikit-learn, the most well-supported machine learning package for Python, has implementations of each of these.
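A small sketch of both approaches with scikit-learn, again on invented two-column data (distance, seconds remaining) rather than the real attempts:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM

# Made-up data: one row per attempt, columns are
# (distance to end zone, seconds remaining).
rng = np.random.default_rng(2)
X = np.column_stack([
    rng.integers(1, 37, size=500).astype(float),
    rng.uniform(0, 3600, size=500),
])
X[0] = [58.0, 120.0]  # one unusual attempt

# One-class SVM: learns a boundary around the bulk of the data;
# predict() returns +1 for inliers and -1 for "novel" points.
svm = OneClassSVM(nu=0.01, gamma="scale").fit(X)
svm_flags = svm.predict(X)

# Robust (squared) Mahalanobis distances via a minimum covariance
# determinant fit, so the outlier can't distort its own yardstick.
env = EllipticEnvelope(contamination=0.01, random_state=0).fit(X)
maha = env.mahalanobis(X)

print("SVM flagged:", (svm_flags == -1).sum(), "points")
print("largest Mahalanobis distance:", maha.max().round(1))
```

One practical caveat: the two columns here live on very different scales, and in real use you'd want to standardize the features before fitting the SVM; the Mahalanobis distance handles the scaling itself through the covariance estimate.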

*Takeaways*

In general, we want to throw away as little good data as possible. Obviously, if the data we have are corrupted in some way (e.g., someone entered 999 yards to go instead of 99), we want to get rid of that. However, in many cases it's just not clear whether the data are "bad" or not. I prefer to err on the side of keeping the data in and trying to build in my uncertainty around it.

Especially in a sport with sample sizes as small as football's, it's important to make the most of your data whenever you can. If your ultimate goal is, as it should be, to minimize prediction error on new data, you need the best, most representative sample you can get. A good idea is often to build your models with and without the suspect data points and see how different they are. Computation is cheap. Inferential mistakes are expensive.

Finally, I would like to make a pitch to eliminate the word 'outlier' from most people's vocabulary. As I've tried to drive home, it's often not clear what it even means to be an outlier. Instead, I propose that we use the phrase 'extreme value' to indicate that we are aware that a particular data point is far from the mean/median/mode, but that we don't know for sure if it's been produced by a different data generating process or not.