Wed 23 September 2015
Most of the posts so far have focused on what data scientists call supervised methods -- you have some outcome you're trying to predict and you use a combination of predictor variables to do so. Another important class of methods is called unsupervised. In this case, you might not know exactly what you're looking for or what metric you want to optimize, but you want to explore the data and identify similarities among cases. For example, you might want to identify a list of "similar" players for your fantasy draft. (This is a little late for the start of fantasy season, but with the rise of daily fantasy sports, perhaps not.) However, maybe you don't know what "similar" means in this case, or you don't have a single number or index that you want to match on. Perhaps you just want to find players with similar production to hedge against bye weeks or injuries.
This is where unsupervised methods come in. We'll be focusing on a popular unsupervised method called clustering. You'll see these kinds of methods used on a number of sports sites. Boris Chen, a data scientist at the New York Times, uses a kind of clustering to produce his fantasy football player tiers. Krishna Narsu recently used a kind of clustering to redefine the defensive positions in the NBA.
One popular method is called k-means clustering. (Note: this isn't the same k as in k-fold cross-validation; k is just a common stand-in for an unknown integer value.) I'll be working through an example clustering wide receivers using their 2013 statistics. K-means is really beautifully simple. The basic idea is that we want to divide our entire data set into k clusters so that the observations within each cluster are as similar to each other as possible (and, ideally, as dissimilar as possible from the observations in every other cluster). Each cluster has what's known as a 'center' or 'centroid', which is the point against which all of the observations in that cluster are compared. You can think of it as the "ideal" or "prototypical" observation that typifies each cluster.
EDIT: As always, code for this example is up on GitHub.
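To do so, we'll have to define what we mean by "similarity." Most implementations of k-means clustering use what's called Euclidean distance, which is the square root of the sum of the squared differences between an observation's values and the cluster center's values on each variable. (K-means itself works with the squared version, which gives the same nearest center.) The steps look a little bit like this: pick k starting centers at random, assign each observation to its nearest center, move each center to the mean of the observations assigned to it, and repeat the last two steps until the assignments stop changing.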
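Those steps can be sketched in a few lines of NumPy. This is a bare-bones version of the standard k-means loop (often called Lloyd's algorithm), not the code behind this post; the function name kmeans, its parameters, and the toy data are all made up for illustration.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations at random as the starting centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each observation to its nearest center,
        # using squared Euclidean distance.
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned observations
        # (an empty cluster simply keeps its old center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Toy data: two well-separated blobs in two dimensions (hypothetical values).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
labels, centers = kmeans(X, k=2)
print(centers)
```

In practice you would reach for a library implementation such as scikit-learn's KMeans, which adds smarter initialization and multiple restarts on top of this loop.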
I'll do this with all of the wide receivers who played in 2013 using the following variables: targets, receptions, receiving yards, receiving touchdowns, fumbles, and fantasy points.
This is easy, right? Of course, there are a couple of gotchas. There's always a catch.
First, how do you pick k, the number of clusters? Good question -- this is an active area of research (eyes glaze over), but there are some commonly used rules of thumb. One way is to pick the number of clusters that maximizes what's known as a silhouette score, which essentially compares how far each observation is from the other members of its own cluster (the within-cluster distance) with how far it is from the members of the nearest other cluster (the between-cluster distance). We want to minimize the former and maximize the latter. By running our k-means algorithm multiple times, once for each candidate k, we can pick the k that maximizes the silhouette score (which is bounded on the interval [-1, 1]). I did this for each k between 3 and 11, and it looks like this:
We see that the silhouette score is maximized at k = 4, meaning 4 clusters of wide receivers, so we'll go with that.
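A rough sketch of that search in scikit-learn might look like the following. It is not the code behind the numbers above: the data is a synthetic stand-in generated with make_blobs, so the scores and the winning k it prints say nothing about real wide receivers.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in: 197 rows and 6 columns, the same shape as the receiver
# table described in the post, with 4 planted groups. NOT the real statistics.
X, _ = make_blobs(n_samples=197, n_features=6, centers=4, random_state=0)

scores = {}
for k in range(3, 12):  # k = 3, 4, ..., 11
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette over all observations; always between -1 and 1.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k = {k:2d}  silhouette = {score:.3f}")
print("k with the highest silhouette score:", best_k)
```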
Second, notice that the first step of the algorithm is to randomly pick the centers for each cluster. This means that the results you get can be highly dependent on this initial position. So, you'll need to re-run the algorithm multiple times with different start points to see if your results are robust. Scikit-learn takes care of this for you, but it's important to be aware of.
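To make the restart behavior concrete, here is a small sketch using scikit-learn's n_init parameter, which controls how many different starting positions are tried (the fit with the lowest within-cluster sum of squares, exposed as inertia_, is the one kept). The data is synthetic, and init="random" is set explicitly to mirror the random start described above; scikit-learn's own default is the "k-means++" initialization, and the default value of n_init has changed between releases, so it is worth setting both yourself.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (not the receiver statistics).
X, _ = make_blobs(n_samples=197, n_features=6, centers=4, random_state=0)

# A single random start per seed: the final inertia can vary from run to run.
single_runs = [
    KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(5)
]
print("one start each:", [round(v, 1) for v in single_runs])

# Ten random starts inside one call: scikit-learn keeps the best of the ten.
multi = KMeans(n_clusters=4, init="random", n_init=10, random_state=0).fit(X)
print("best of ten starts:", round(multi.inertia_, 1))
```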
Third, you'll need to center & scale your data if the different variables aren't in the same or comparable units. Centering means subtracting the mean of each variable from each observation and scaling means dividing by the standard deviation of that variable. You're left with standardized scores that are basically interpretable as "how many standard deviations above or below the mean is this observation on this variable."
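As a quick illustration, the snippet below standardizes a tiny table two ways: by hand with NumPy and with scikit-learn's StandardScaler. The four rows are invented numbers in the spirit of targets, receptions, receiving yards, and receiving touchdowns, not real player lines.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: targets, receptions, receiving yards, receiving touchdowns.
raw = np.array([
    [150.0, 95.0, 1300.0, 10.0],
    [80.0, 50.0, 650.0, 4.0],
    [30.0, 18.0, 210.0, 1.0],
    [110.0, 70.0, 900.0, 6.0],
])

# By hand: subtract each column's mean, then divide by its standard deviation.
z_manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Same thing with scikit-learn (it also uses the population standard deviation).
z_sklearn = StandardScaler().fit_transform(raw)

print(np.round(z_manual, 2))
print("identical:", np.allclose(z_manual, z_sklearn))
```

Without this step, receiving yards (in the hundreds or thousands) would swamp touchdowns (single digits) in the distance calculation.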
Fourth, and probably most importantly, how do you know if you even have good clusters? This is trickier than it seems. There are some technical solutions like model-based clustering as well as the less rigorous "eyeball data analysis." Ultimately, you want to be wary of counterintuitive results that you get from unsupervised methods. The clusters should make sense using the knowledge that you already have.
I didn't use any out-of-sample validation here, and I don't know how predictive these clusters are of future performance. One thing I could do is look at the cluster that each player is assigned to and see how well it predicts his performance the following season. It is entirely possible to "overfit" your clusters to historical data.
Finally, your clusters are highly dependent on what variables you use in your clustering! If I added a new variable, say yards after catch, we might see a number of players switch cluster assignments. You need to be wary of this and careful about treating your newly found clusters as the absolute truth.
Let's take a look at how we did clustering our wide receivers. Using k = 4 clusters, we can look at the centers of each cluster and try to interpret them. Remember, these are in standardized form, not raw numbers.
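For reference, this is roughly how the centers can be pulled out of a fitted scikit-learn model, and how to convert them back to raw units if the standardized values are hard to read. Everything here is a stand-in: the column names are hypothetical labels for the six variables listed earlier, and the data is random, so the printed centers are meaningless.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical column labels and random stand-in data (197 players, 6 stats).
columns = ["targets", "receptions", "rec_yards", "rec_tds", "fumbles", "fantasy_pts"]
rng = np.random.default_rng(0)
raw = rng.gamma(shape=2.0, scale=20.0, size=(197, 6))

scaler = StandardScaler().fit(raw)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaler.transform(raw))

centers_std = km.cluster_centers_                    # standard-deviation units
centers_raw = scaler.inverse_transform(centers_std)  # back to original units
sizes = np.bincount(km.labels_, minlength=4)         # players per cluster

for j in range(4):
    row = "  ".join(f"{c}={v:+.2f}" for c, v in zip(columns, centers_std[j]))
    print(f"cluster {j} (n = {sizes[j]}): {row}")
```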
With higher numbers being better on all of these metrics (except fumbles), we see that cluster 2 probably contains our highest-performing wide receivers. The ideal player in this cluster is targeted a lot (almost 1.75 standard deviations above the mean), catches a lot of passes for a lot of yards, scores a lot of touchdowns, and doesn't fumble a ton. Assigned to this cluster are players like Larry Fitzgerald, Reggie Wayne, Randall Cobb, and Julian Edelman. Only 24 out of 197 WRs were assigned to this cluster (so, not even one per team).
On the flip side, cluster 0 looks to be pretty terrible. The typical player in this cluster is below average on every metric, although he doesn't fumble that much. 91 out of 197 players were assigned to this cluster.
Here's a random sample of 20 receivers and the cluster to which they were assigned:
And that's k-means clustering, in a nutshell.