the **spread**: the (data) science of sports

Sun 08 December 2013

Data preparation

This begins a series of posts on building a win probability model. In
actuality, we're going to be building a lot of models that will be
combined into one, a technique known as *ensemble learning*. There are
several advantages to doing this. First, we can start very simply and
measure how our win probability model does against existing models and,
second, it will allow us to iterate and improve in small steps.

Tackling any data science problem requires (wait for it...) data. I'll be using play-by-play data from Armchair Analysis, combined with data from NFL Savant and Advanced NFL Stats. The Armchair Analysis data costs \$25, but it's been cleaned quite well and comes separated into various tables to be loaded into a SQL database. It does not include 2013 data, so in order to make predictions and do out-of-sample testing, I'll supplement using the NFL Savant and ANS data.

We're going to approach this particular data science problem as an
example of a *classification* problem (or, more specifically, a *class
probability estimation* problem). In plain English, we're trying to
figure out which *class*, or group, each of our plays belongs to -- are
they winners or losers? What our classification model will do is
estimate the probability that each play belongs to either the winners
class or the losers class.

To begin to estimate the probability that a team wins before a given
play situation, we need a few things. We need our variables,
or *features* as they are called in machine learning, and we need an
outcome, or *target* in machine language terms. If you have a more
traditional science or experimental background, you might recognize
these as the independent and dependent variables, respectively. We'll
try and figure out what combination of our features best predicts the
target and with what probability. What are our features and target in
this case?

- Score. We need the score as it was before each play was run. In order to simplify things, we're actually going to use the *score differential* with respect to the offense. This is, at its heart, an offensive model, so it makes sense to frame things offensively. So, each row of our dataset should have the point differential. If it's positive, the offense is leading by that many points; if negative, the offense is trailing by that many points; and if it's zero, the teams are tied.
- Down and distance. Pretty self-explanatory, but we'll need separate columns for down and yards to go until the next first down.
- Field position. Similar to score differential, we'll code this in terms of the offense. Rather than use 50-yard increments, we'll convert field position into 100-yard increments and code it as distance from own end zone. A team that is receiving the ball for the first time following a touchback would be scored as on the 20; the same position in the other team's half (the beginning of the 'red zone') would be the 80.
- Time remaining. Again, we could break this up a bunch of different ways: quarters, minutes, seconds, etc. If we begin with an assumption that win probability estimates become more certain as the amount of time in the game decreases, we should use a constant unit. I'll use seconds remaining as my unit.
- Outcome. If we're going to build the model, we need to know if teams actually ended up winning from a given position or not. This might require a little data fu to compile for some. Again, we'll code this in terms of the offense, coding this variable as a 1 if the offense ended up winning and a 0 otherwise.
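The features and target above can be sketched as columns of a pandas DataFrame. The column names here are hypothetical stand-ins, not the actual Armchair Analysis schema, and the rows are toy data:

```python
import pandas as pd

# Toy play-by-play rows; column names are illustrative placeholders,
# not the real Armchair Analysis field names.
plays = pd.DataFrame({
    "off_score": [7, 14, 0],
    "def_score": [3, 17, 0],
    "down": [1, 3, 2],
    "yards_to_go": [10, 4, 7],
    "yards_from_own_goal": [20, 65, 80],  # 100-yard scale, offense's view
    "seconds_left": [3600, 1250, 120],
    "off_won_game": [1, 0, 1],            # 1 if the offense won
})

# Score differential with respect to the offense: positive means the
# offense is leading, negative trailing, zero tied.
plays["score_diff"] = plays["off_score"] - plays["def_score"]

features = ["score_diff", "down", "yards_to_go",
            "yards_from_own_goal", "seconds_left"]
target = "off_won_game"
```

Framing everything from the offense's perspective means one row per play, with no need for separate home/away columns.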

We'll add more variables, like time outs remaining, as we build the model, but this is a good start. For now, I've excluded kickoffs, no plays, onside kicks, two-point conversions, punts, and field goal/extra point plays from the data, as well as all post-season games.
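Filtering those plays out is a one-liner with pandas. The play-type labels below are made up for illustration; the real codes depend on how the Armchair Analysis tables label each play:

```python
import pandas as pd

plays = pd.DataFrame({
    "play_type": ["RUSH", "KICKOFF", "PASS", "PUNT", "FGXP", "PASS"],
    "postseason": [False, False, False, False, False, True],
})

# Play types to drop; these labels are illustrative stand-ins.
excluded = {"KICKOFF", "ONSIDE", "NOPL", "CONV", "PUNT", "FGXP"}

# Keep regular-season rushes and passes only.
kept = plays[~plays["play_type"].isin(excluded) & ~plays["postseason"]]
```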

Why exclude the post-season? When data scientists create models, they're
often operating under the assumption that all of the data were produced
by the same *data generating process*. This is just a way of saying that
the same basic decision-making processes were used by coaches to
generate the data we have here. We can't know what the coaches were
thinking or what they meant to do, only what they did. The post-season
is a different monster. Since one loss will end your season and send
your team on vacation, coaches may employ different strategies and
attempt things they might not normally do during the season. This
implies that the post-season data might be generated via a different
process.

Now, notice that I just made an assumption about the data generating process. I don't know if it's true. One of the things we can do later is test it. That's one of the foundations of doing good data science -- make your assumptions explicit and test to see whether or not the data support them.

Here are the first five rows of my data, which I've stored in a Postgres database (note I start using data from 2001, even though Armchair Analysis begins with 2000, because of a data issue):

(table: first five rows of the play-by-play data)
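Pulling rows like these into pandas is a single `read_sql` call. Against Postgres you'd open the connection with a driver such as psycopg2; here an in-memory SQLite database stands in so the snippet is self-contained, and the table and column names are placeholders rather than my actual schema:

```python
import sqlite3
import pandas as pd

# Stand-in for a Postgres connection (psycopg2.connect(...) in practice).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (score_diff INTEGER, seconds_left INTEGER)")
conn.executemany("INSERT INTO plays VALUES (?, ?)",
                 [(0, 3600), (7, 1800), (-3, 120)])

# read_sql runs the query and returns the result as a DataFrame.
df = pd.read_sql("SELECT * FROM plays WHERE seconds_left <= 3600", conn)
```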

Inspecting the data

The next step is to check the quality of the data and see if there are any obvious extreme values (e.g., a field position of greater than 100 yards, more than 3600 seconds remaining). A few tables and plots when you first get your hands on a new data set can go a long way toward avoiding headaches later on.
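One quick check is to flag any row that falls outside the ranges the rules of the game allow. A minimal sketch, using the same hypothetical column names as before:

```python
import pandas as pd

plays = pd.DataFrame({
    "yards_from_own_goal": [20, 80, 50],
    "seconds_left": [3600, 900, 0],
    "down": [1, 4, 2],
})

# Any row outside these ranges is impossible and needs investigating:
# field position 1-99, clock 0-3600 seconds, downs 1-4.
bad = plays[
    ~plays["yards_from_own_goal"].between(1, 99)
    | ~plays["seconds_left"].between(0, 3600)
    | ~plays["down"].between(1, 4)
]
```

An empty `bad` frame means the basic ranges check out; anything else points at rows to inspect by hand.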

To do this, I'll be using my language of choice, Python, but you can do any of this in any language. All of my Python code will be posted to my Github account. If you're completely new to this, you may want to check out Python for Data Analysis to get the basics of the PyData stack and/or Machine Learning for Hackers for an overview of some of the methods used here (although in R).

Just looking at the distribution of seconds remaining across all of the plays, we already see a few interesting things. The beginnings and ends of quarters and halves immediately jump out -- more plays will have 3600, 2700, 1800, or 900 seconds remaining. We also see a bunching up around halftime and the end of the game (0 on the x-axis), presumably because more time outs are called here. Everything looks good so far.
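Those quarter boundaries show up as spikes in a simple histogram of the seconds-remaining column. With numpy (matplotlib draws the same bins for the plots shown here), and synthetic values standing in for the real column:

```python
import numpy as np

# Stand-in seconds-remaining values; the real column has ~500k plays.
seconds = np.array([3600, 3600, 2700, 1800, 1800, 900, 45, 0])

# Four bins of 900 seconds each, i.e. one bin per quarter.
counts, edges = np.histogram(seconds, bins=4, range=(0, 3600))
```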

This looks good too. Let's look at two more, field position and final score differential, before we wrap up this post.

Again, nothing too surprising here. Most teams don't spend a whole lot of time backed up against their own end zone, but we see a big spike at 20 yards due to touchbacks. This then decreases steadily from about the halfway mark.

Finally, let's look at the final score differential.

Looks like the most common score differential is pretty low with a few extreme values at the other end of the distribution from rare blowouts. Let's take a more granular look at the same data, increasing the number of bins in the histogram.

Now we see that the most common score differential is 3 points, with the next bump at 7. This makes perfect sense.
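Finding that modal margin doesn't even need a plot; a value count does it. The margins below are made-up illustrative values, not the real 2001-2012 data:

```python
import pandas as pd

# Illustrative final-margin values (absolute point differentials).
margins = pd.Series([3, 3, 7, 3, 10, 7, 14, 3])

# The most frequent value is the first index after sorting by count.
most_common = margins.value_counts().idxmax()
```

The bumps at 3 and 7 reflect how NFL scoring works: field goals and touchdowns with the extra point.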

Wrapping up, we haven't found any crazy surprises in simple plots of our data. This is great news! Our next step is to start building the model. If you want a sneak preview of how we'll do that initially, check out Yhat on building random forests in Python.
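As a taste of that next step, here's the shape a random forest class-probability model takes in scikit-learn. The features and target are synthetic stand-ins (a score differential and seconds remaining, with leading teams made more likely to win), not the real play-by-play data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic features: score differential and seconds remaining.
X = np.column_stack([
    rng.integers(-28, 29, n),
    rng.integers(0, 3601, n),
])

# Toy target: the offense tends to win when it's ahead.
y = (X[:, 0] + rng.normal(0, 7, n) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# predict_proba gives the class-probability estimates we're after:
# column 0 is P(lose), column 1 is P(win).
probs = clf.predict_proba([[14, 120]])  # up 14 with 2:00 left
```

The key piece is `predict_proba`: rather than a hard winner/loser label, it returns the probability for each class, which is exactly the win probability estimate we want from each play situation.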