<p><em>the spread -- http://thespread.us/</em></p>
<h1>Win probability plots -- useful tool?</h1>
<p><em>Trey Causey -- 2015-10-18</em></p>
<p>It's been an interesting weekend for win probability models. In case you missed
it, on Saturday, Michigan State improbably returned a fumble for a touchdown
to win a game in which they never held a lead. This morning, ESPN Stats and
Info tweeted the following, indicating that MSU had, in a single play,
transitioned from a 0.2% win probability to 100%.</p>
<p><center><a href="https://twitter.com/ESPNStatsInfo/status/655760329708695552"><img src='images/mich-msu-wp.png' width=480></a></center></p>
<p>Following up on this, we saw wild swings in the Broncos - Browns game, with
a number of football analytics accounts tweeting pictures of the wildly oscillating and
criss-crossing win probabilities. I jokingly tweeted that '[e]ventually all of
football analytics Twitter will just be hundreds of very slightly different
win probability graphs.'</p>
<p>A bit snarky, to be sure, but I think it underscores an important point I made
more forcefully about the <a href="http://thespread.us/sorry-state.html">state of football analytics</a>.</p>
<h3>What are these plots telling us?</h3>
<p>In the MSU example above, the plot is essentially telling us that MSU lodged a
very unlikely victory on the last play of the game -- but we already knew that!
The plot doesn't tell us much, other than drawing a straight line between 0.2%
and 100%. In terms of data-ink ratio, it's probably more informative to just
read the previous sentence than to plot it.</p>
<p>What is the lesson from the win probability plot above? Don't fumble a punt? Don't
lose a game on the last play?</p>
<p>Win probability plots provide an intuitive and important way to summarize
the events of the game. It's easy to see if one team dominated, if the game
had a bunch of lead changes, and so on. But it feels as if in order to be
"in football analytics" these days, you need to produce win probability plots
in real-time.</p>
<h3>A lack of innovation</h3>
<p>To me, this feels like stagnation. <a href="http://thespread.us/talking-win-probability.html">Win probability</a>
underpins a large amount of work in our field, but how we <em>use</em> win probability
to drive decision-making and understanding of the game is what's interesting
and innovative. Win probability in and of itself is a mostly descriptive tool
that many will (fairly) argue just gives us a single-number summary of a complicated
game. Win probability plots are mostly backward-looking, telling us what has happened
and if it was 'important' or not.</p>
<p>One of the reasons I've enjoyed working on the <a href="http://nyt4thdownbot.com">NYT 4th Down Bot</a>
so much is that it's <em>proactive</em> and makes live calls about what the optimal
decision is. Does it do it perfectly every time? Of course not, but we're making
an effort to shift from "that was a dumb call" to "here's the call you should make."</p>
<p>Do I want win probability plots that incorporate some measure of uncertainty about
our estimates? Of course. That's one of the first things I investigated when I
started this blog. But I don't think that this will be a game-changer for many
readers. And, to be clear, I'm <strong>not</strong> arguing we should do away with win
probability plots -- just that they constitute well-trodden territory.</p>
<h3>How we got here</h3>
<p>There's a significant amount of isomorphism in sports analytics. This makes sense,
and I'm certainly guilty of it. Important problems have been identified by people
we respect. One of the ways we learn is to try and replicate and improve upon
their work. However, we shouldn't stop there. We should be asking what kinds
of interesting questions we can answer with win probability models, how we can
advance our knowledge about the game, and how we can communicate that information
to people that are interested and invested.</p>
<p>For those of us who are interested in changing the product that we watch
every Sunday (and Monday and Thursday and Saturday), I have a proposal.</p>
<p>Let's do original and open research that builds upon existing win
probability models rather than treating win probability itself as the goal.
Let's work on research that's both interesting to us and
helps move the needle towards a more analytical and rational NFL. Let's figure
out how to model the interactive and complex nature of individual player
contribution. Let's tackle player fit and figure out how to model and
predict how players will perform when they change teams.</p>
<p>Let's stop ignoring statistics and machine learning and start doing careful
and rigorous work that stands up to both football and analytical challenges.</p>
<p>I'll do my best to practice what I preach.</p>
<h1>What we talk about when we talk about win probability</h1>
<p><em>Trey Causey -- 2015-10-03</em></p>
<p>Win probability has been a popular topic for this blog. I've walked through some of the mechanics
of building a win probability model and how to evaluate those models. In this post, I'm going to
push a little deeper on the underlying questions that win probability models are actually trying
to answer and what we really mean when we talk about win probability. </p>
<p>I've been thinking about these topics a lot lately as I've been building the model that backs the
<a href="http://nyt4thdownbot.com">New York Times 4th Down Bot</a>. End of game situations are particularly
tricky to model but are the most scrutinized, the highest leverage, and the ones in which the model
must perform well in order to be taken seriously and used as a real decision-making tool. </p>
<p>Win probability models drive all sorts of arguments within football analytics; they're the very
backbone of most fourth down models that exist, and feature prominently in discussions of
clock management, two-point conversion strategy, and more. The esteemed Brian Burke has been
<a href="http://www.nessis.org/nessis13/lock.pdf">quoted</a> as saying that win probability models are
basically the 'holy grail' of sports analytics. </p>
<p>I'll walk through a few of these issues below and then show you how they actually matter
by working through an example.</p>
<h3>Observability</h3>
<p>Let's start with the most basic but often overlooked issue -- probabilities are inherently
<em>unobservable</em>. The very thing we're trying to model is not something that can ever be directly
observed. What does it <em>mean</em> to have a win probability of 0.75? We could take a frequentist
approach and say that means that, in the situation we're currently in, if the game were played
an infinite number of times, we would observe teams in this situation to win 75% of the time.</p>
<p>We could also treat the probability of winning as a <em>latent variable</em> that can never be truly
observed but has some unknown value. We then fit a model to our data that makes the observed
patterns of wins and losses, and their correlations with the predictors, as likely as possible.</p>
<p>There are additional approaches, but the take home message is that we can't observe a probability,
which makes arguing about probabilities inherently difficult. Teams with a WP of 0.01
will still sometimes win (about 1% of the time, if you have a well-calibrated model). Does that mean
the model was wrong? No. But we can certainly talk about which models are <em>better</em> instead of
which models are <em>right</em>.</p>
<h3>Picking the right model</h3>
<p>In previous posts, I walked through how to evaluate win probability models. There are lots of
different metrics you can use: out-of-sample prediction error, cross-validation error,
precision, recall, F1-scores, log-loss, how well calibrated the model is, etc. </p>
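<p>To make a couple of these metrics concrete, here's a minimal, hand-rolled sketch of log-loss and a binned calibration check. The predictions and outcomes below are made up purely for illustration.</p>

```python
import math

# Hypothetical predicted win probabilities and observed outcomes
# (1 = win, 0 = loss) -- illustrative numbers, not real game data.
preds = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
wins  = [1,   1,   1,   0,   1,   0,   0,   0]

def log_loss(y_true, y_prob):
    """Mean negative log-likelihood of the outcomes; lower is better."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_prob)) / len(y_true)

def calibration(y_true, y_prob, n_bins=2):
    """Compare mean predicted probability to observed win rate, per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        bins[min(int(p * n_bins), n_bins - 1)].append((y, p))
    return [(sum(p for _, p in b) / len(b),   # mean predicted probability
             sum(y for y, _ in b) / len(b))   # observed win rate
            for b in bins if b]

print(log_loss(wins, preds))   # ~0.40
print(calibration(wins, preds))
```

<p>A well-calibrated model's binned predictions track the observed win rates, as they (conveniently) do in this toy example.</p>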
<p>The dirty secret is it's pretty easy to build a win probability model that does well on all of
these metrics and, beyond a few important features, different kinds of models perform about
equally well on all of them. Whether you choose a logistic regression or a random forest, you're
likely to see roughly equivalent performance. However, there are tons of corner cases and non-linearities
(like the end of game situations mentioned above) in which the models don't seem to pass
the smell test (see the example below). </p>
<p>There are 'do or die' situations in which a win hangs
entirely on the success of a single play. We all recognize these situations, but models have a
very hard time with them. In these cases, we know that we can't observe the probability,
but we have a vague sense that the probability produced by the model is 'too low.' The normal
solution for something like this is to introduce interaction terms in the model or use a model
that better handles lots of non-linearities, such as a tree-based model. But, as I'll show
below, that still doesn't always work. (And, in the case of the 4th Down Bot, we need to make
sure the model is fast at both training and prediction time -- something random forests are
not always great at.) </p>
<h3>Uncertainty</h3>
<p>Uncertainty is a topic I've certainly said a good deal about, but I've got more to say. Because
probabilities are inherently unobservable, and we're fitting models that merely estimate these
probabilities based on the observed outcomes (wins and losses), estimating the uncertainty
of those probability estimates is a tricky business. There are plenty of arguments in the
statistical literature about what it <em>means</em> to have a confidence interval on a predicted
probability. Tools even differ in their ability to try to do this. While R's <code>predict.glm</code>
function will estimate these confidence intervals, Python's <code>statsmodels</code> will not (though if
this is interesting to you, please check out Tom Augspurger's <a href="http://nbviewer.ipython.org/gist/TomAugspurger/0168441381f4b2d21f90">notebook</a>
on doing so!). <code>scikit-learn</code> doesn't produce standard errors or a variance-covariance
matrix, which means we have no measure of uncertainty for these probabilities.</p>
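<p>If your tool of choice won't report uncertainty, one common workaround is the delta method: propagate the coefficient covariance matrix through the logistic function to get an approximate standard error for the predicted probability. Here's a hand-rolled sketch; the coefficients and covariance below are made up for illustration, not from any real fitted model.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predicted_prob_ci(x, beta, cov, z_crit=1.96):
    """Approximate 95% CI for a logistic model's predicted probability,
    via the delta method.

    x    : feature vector (including a 1 for the intercept)
    beta : fitted coefficients
    cov  : variance-covariance matrix of beta
    """
    eta = sum(b * xi for b, xi in zip(beta, x))   # linear predictor
    p = sigmoid(eta)
    grad = [p * (1 - p) * xi for xi in x]         # dp/dbeta
    var = sum(grad[i] * cov[i][j] * grad[j]
              for i in range(len(x)) for j in range(len(x)))
    se = math.sqrt(var)
    return p, max(0.0, p - z_crit * se), min(1.0, p + z_crit * se)

# Made-up fitted values: intercept plus one score-differential feature.
beta = [-1.0, 0.05]
cov = [[0.04, -0.001], [-0.001, 0.0004]]
p, lo, hi = predicted_prob_ci([1.0, 10.0], beta, cov)
# p is about 0.38, with a 95% interval of roughly (0.26, 0.49)
```

<p>An alternative that works with any model, including scikit-learn's, is to bootstrap: refit on resampled data many times and take the spread of the resulting predicted probabilities.</p>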
<p>Unfortunately, if we're going to use win probability models to make important decisions,
we have to have some kind of idea about uncertainty -- if we don't, we won't have any
idea what a <em>meaningful</em> change in win probability means. If going for it on 4th down
produces an expected win probability of .65 and attempting a field goal produces an
expected win probability of .63, that's a difference of two percentage points. Is that
enough of a difference to go for it? How do we know? Both of those probabilities are <em>estimates</em>
with some amount of (unknown!) uncertainty surrounding them.</p>
<h3>An example</h3>
<p>Let's work through an example to make all of this concrete. Here's a realistic situation we're all
familiar with. There are 40 seconds left in the 4th quarter. Your team trails by 2 points,
your opponent has no timeouts remaining. You're on your opponent's 15-yard line and it's 4th down
with 2 yards to go. </p>
<p>Parsing that situation, we basically see that a field goal wins the game because your opponent can't
stop the clock and you only need 3 points to win. That means your kicker is facing a roughly
32-yard field goal, which kickers make between 93% and 95% of the time. </p>
<p>Most fans look at this situation and it's pretty obvious that the probability of winning the game
is equal to the probability of kicking that field goal successfully. If you kick the field goal
as time expires, you win. If you miss the field goal, your opponent can kneel and end the game.</p>
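<p>Spelled out as arithmetic, the fan's calculation reduces the whole game to the kick (using 94% as a rough midpoint of the make-rate range above):</p>

```python
# The opponent can't get the ball back with time to score, so the
# game reduces to the field goal attempt.
p_fg_make = 0.94      # rough make rate for a ~32-yard field goal
p_win_if_make = 1.0   # kick as time expires -> win
p_win_if_miss = 0.0   # miss -> opponent kneels -> loss

win_prob = p_fg_make * p_win_if_make + (1 - p_fg_make) * p_win_if_miss
# win_prob == 0.94
```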
<p>Let's see what a normal win probability model says -- one that takes into account down, yards to go,
field position, time remaining, score difference, and timeouts. We even have some custom features in here,
like an interaction term between the quarter and the score difference and whether a team can kneel down
to end the game.</p>
<div class="highlight"><pre><span class="n">In</span> <span class="p">[</span><span class="mi">32</span><span class="p">]:</span> <span class="n">logit</span><span class="o">.</span><span class="n">predict_proba</span><span class="p">(</span><span class="n">situation</span><span class="p">)</span>
<span class="n">Out</span><span class="p">[</span><span class="mi">32</span><span class="p">]:</span> <span class="n">array</span><span class="p">([[</span> <span class="mf">0.39417384</span><span class="p">,</span> <span class="mf">0.60582616</span><span class="p">]])</span>
</pre></div>
<p>Hmm, a 60% chance of winning. That seems <em>way</em> too low. Let's try a random forest -- a
tree-based model with lots of trees and lots of depth that should capture all of these non-linearities.</p>
<div class="highlight"><pre><span class="n">In</span> <span class="p">[</span><span class="mi">33</span><span class="p">]:</span> <span class="n">rf</span><span class="o">.</span><span class="n">predict_proba</span><span class="p">(</span><span class="n">situation</span><span class="p">)</span>
<span class="n">Out</span><span class="p">[</span><span class="mi">33</span><span class="p">]:</span> <span class="n">array</span><span class="p">([[</span> <span class="mf">0.41979832</span><span class="p">,</span> <span class="mf">0.58020168</span><span class="p">]])</span>
</pre></div>
<p>About the same. What gives? We 'know' that these probabilities are not correct. The game is over
once the field goal is kicked. The model, however, doesn't seem to know this. </p>
<h3>What to do?</h3>
<p>We have a couple of options -- we can continue to add more and more features to the model to try
and capture all of these corner cases and non-linearities, but we run up against the
<a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">bias-variance tradeoff</a> as we continue to do
this. </p>
<p>We can add post-processing steps that identify these situations and adjust the probability accordingly
(<em>NB</em>, this is what we currently do with the 4th down bot). However, this introduces the uncomfortable
question of "what does that mean for the probabilities that we produced via the model?" If we were
plotting these probabilities throughout the game, this would probably produce some fairly large jumps
in the win probability graph, a good sign that you may be overfitting the model. </p>
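<p>To make the post-processing idea concrete, here's a sketch of what such an override might look like. To be clear, this is a hypothetical rule invented for illustration, not the 4th Down Bot's actual logic.</p>

```python
def adjust_win_prob(model_wp, state, p_fg_make=0.94):
    """Override the model's win probability when the game effectively
    reduces to a single kick. Hypothetical rule for illustration only.
    """
    kick_wins_it = (
        state["seconds_left"] <= 40
        and state["opponent_timeouts"] == 0
        and -3 <= state["score_diff"] < 0   # trailing by a field goal or less
        and state["yardline"] <= 20         # comfortably in field goal range
    )
    if kick_wins_it:
        return p_fg_make   # the game is the kick
    return model_wp

state = {"seconds_left": 40, "opponent_timeouts": 0,
         "score_diff": -2, "yardline": 15}
adjust_win_prob(0.60, state)   # returns 0.94 instead of the model's 0.60
```

<p>The discontinuity this creates is exactly the "fairly large jump" problem described above: the plotted probability lurches whenever a game state crosses the rule's threshold.</p>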
<p>This isn't purely navel-gazing or a <em>what does it all mean</em> academic exercise. These issues are
at the very core of what it means to make decisions based on win probabilities -- and I don't
have good answers right now.</p>
<h1>The sorry state of football analytics</h1>
<p><em>Trey Causey -- 2015-09-23</em></p>
<blockquote>
<p>"I got a lot of respect for analytics and numbers, but I'm not going to make
judgments based on those numbers. The game is the game. It's an emotional one
played by emotional and driven men. That's an element of the game you can't
measure. Often times decisions such as that weight heavily into the equation."</p>
</blockquote>
<p>That's Pittsburgh head coach Mike Tomlin, quoted on <a href="http://www.steelersdepot.com/2015/09/tomlin-prefers-feel-over-analytics/">Steelers Depot</a>,
brought to my attention by <a href="https://twitter.com/StatsbyLopez/status/646388478813540352">Mike Lopez</a>.
This comes after it was announced that the same Pittsburgh team <a href="http://espn.go.com/blog/pittsburgh-steelers/post/_/id/14521/why-the-steelers-hired-a-carnegie-mellon-professor-for-advanced-analytics">hired Karim Kassam</a>,
a former Carnegie Mellon professor, to head up their analytics effort full time.
The cognitive dissonance, it is strong. So is the irony, as Tomlin appears
to be describing some sort of optimization procedure by which one assigns
weights to factors that contribute to some outcome. <a href="https://en.wikipedia.org/wiki/Regression_analysis">If only such a procedure
existed outside of magical thinking.</a></p>
<p>NOTE: I have met Karim Kassam, he's very intelligent and a great analyst. None
of this applies to him or his work and, for all I know, he is a respected and
valued voice in the Pittsburgh front office, but the juxtaposition of these
two stories was too much to resist as a motivation for this post.</p>
<p>I wish that this were an isolated example, but it's a frankly unsurprising occurrence.
The sad truth of the matter is that <strong>the state of football analytics in 2015 is not good
and isn't showing signs of improving.</strong> This is especially true in the NFL, though I think
a lot of this applies to college football as well.</p>
<p>The body of football research is not advancing at the same rate and is not of the same
quality as in basketball, baseball, or hockey. At least publicly, teams are not
generally investing in analytics talent in the same way that other sports are. Even
when they are, as evidenced by the Steelers above, there is little evidence that
teams are incorporating many of the most basic quantitative
lessons from the analytics community either on or off the field.</p>
<h3>Conference presentations: where are they?</h3>
<p>The New England Symposium on Statistics in Sports (NESSiS) is this weekend and
looks to feature, as always, work that is not only interesting but methodologically
sound. As I was reading the <a href="http://www.nessis.org/program.html">program</a>, though,
I was struck that there were only two football-related papers. Further, one was
on the topic of Deflategate and the other was actually about ticket prices. I did
some digging and, in the history of NESSiS, which has happened every other year
since 2007, there have only ever been two other NFL papers. One of them is from
2009 (6 years ago!) from Ben Alamar, the current head of ESPN's Analytics department,
on how NFL coaches do not act rationally -- something that seems to be about
as constant as death and taxes.</p>
<p>Sloan doesn't present a rosier picture. Football-related research papers in the past
5 years or so have addressed scheduling of games, the inefficiency of the draft,
and predicting field goal success. This is alongside many more high-profile panels
with heart-warming titles such as "Gut vs. Data -- How Do Coaches Make Decisions?" and
"In-Game Innovations: Genius or Gimmick?"</p>
<p>Nor is football analytics work being showcased at other conferences, the way baseball's
work is at SABR conferences. The work just isn't being done.</p>
<h3>Empirical disappointments: 4th downs and the draft</h3>
<p>If there's innovation occurring in football, it's not being presented at conferences.
But maybe it's just the case that the work is being done behind closed doors?
If so, we certainly aren't seeing evidence of it. </p>
<p>There aren't that many quantitative "truths" in football, but we know of at least
two: teams should go for it on fourth down and they overvalue picks early in the
draft. We've known the former since at least 2002 when David Romer wrote his famous
paper and we've known the latter since at least 2005 when Massey & Thaler wrote their
seminal work on the draft. And yet, based on my calculations, the rate of going
for it on 4th down has remained mostly stable (it was 12.3% last year, compared to
12.3% in 2001). There was a brief period in 2007-2009 when the rate crept up to 14%,
but it has since come back down to 2001 levels. </p>
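<p>For what it's worth, the go rate itself is a simple calculation. Here's a sketch on toy play-by-play rows; the field names and values are hypothetical, just to show the shape of the computation:</p>

```python
# Toy play-by-play data -- hypothetical schema for illustration.
plays = [
    {"down": 4, "play_type": "punt"},
    {"down": 4, "play_type": "field_goal"},
    {"down": 4, "play_type": "run"},
    {"down": 4, "play_type": "pass"},
    {"down": 1, "play_type": "run"},
]

fourth_downs = [p for p in plays if p["down"] == 4]
go_attempts = [p for p in fourth_downs if p["play_type"] in ("run", "pass")]
go_rate = len(go_attempts) / len(fourth_downs)   # 2 / 4 = 0.5 on this toy data
```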
<p>We also know that teams routinely trade up in the draft. In this year's 2015 NFL
Draft, San Diego traded a 4th round pick and a future 5th round pick to move up
two spots in the 1st round.</p>
<p>The fact that we're still arguing over things that have been known for more than
a decade is absurd.</p>
<h3>New kinds of data: good for what?</h3>
<p>We're hearing a lot about the use of <a href="http://www.buzzfeed.com/brendanklinkenberg/heres-the-nfls-blisteringly-accurate-new-way-to-track-player">Zebra</a>
data to track players' locations on the field to the millisecond level. I was
initially quite excited for this, as I've seen how much interesting work has come
out of the SportVU technology used in the NBA. Hard and previously unanswerable questions like
<a href="http://grantland.com/features/department-of-defense/">quantifying defense</a> are being
tackled using this data and sophisticated methods. However, I am highly skeptical
that many NFL teams a) have analysts capable of dealing with and extracting
useful information from multi-terabyte files, b) are willing to invest in
hiring people that can do so, and c) will actually use that information once they do.
We are still arguing over very, very basic questions about expected value on play types
with perfectly acceptable sample sizes. </p>
<p>If you can't convince a coach or front office that there are wins left lying
on the table by not going for it on 4th down, or that trading up for a player who
has a significant probability of being a draft bust is a waste of resources, how are you
supposed to convince that coach that your model found something scouts didn't see?</p>
<h3>Communication red herrings</h3>
<p>The answer to the above question, of course, is always that the burden is on the
analyst to make their work approachable and digestible by a coach or GM. Of course that's true,
but it's also a red herring. Dozens of people have made the case for why going for it on
4th down is a good idea in many situations (or at least more situations than currently
observed) -- it's not a crazy idea. Yet, there's very little buy-in. Coaches have a million
special reasons why it wasn't the right time to go for it or why the models can't account
for whatever micro-climate existed on the field that day.</p>
<p>It's a convenient
way of using a tired stereotype (the ivory tower academic who doesn't know football and
can't talk to football people) to justify continuing to move the goalposts. Notice, too,
that the onus is never on the front office to try to understand analytics; it's a one-way
street.</p>
<h3>Brain drain on the horizon</h3>
<p>We continue to see people make arguments that football is more complicated than baseball,
basketball, or hockey. That may be true. We also hear that the sample sizes are smaller.
Also true! But nearly <em>every other industry</em> in the world is pursuing the use of data science
and quantitative methods to gain a competitive advantage. Do you think organizing the world's
information is difficult? It is, but Google seems to be doing OK. Uber seems like it's solving
complex optimization problems successfully. Video games seem to manage to produce hyper-realistic
versions of the same game we say is too complex to model with statistics.</p>
<p>Football is not a special
snowflake that has somehow miraculously produced the most unique phenomenon on the planet
that can't be studied quantitatively. Ironically, we see extreme faddishness in the league
around other topics, just not analytics (Wildcat offense, anyone?).</p>
<p>I fear this is going to lead (or continue) a "great stagnation" in football. Teams won't
compete with other industries on pay or benefits -- that much is clear -- but they
claim people will line up for the privilege of working in sports. Yet, when coaches and
GMs routinely and publicly throw analytical work and analysts under the bus, why would
a rational person stick around
the sports business for long? </p>
<p>I know one of the most important things for me in any
job is the feeling that I'm being heard, my work is important, and I'm having an impact.
We're hearing the exact opposite from coaches and GMs on a regular basis, and that simply
isn't tenable if you want to attract and retain talented people to work on hard problems. You
can't remove both extrinsic and intrinsic rewards and expect success.</p>
<h3>Conclusion</h3>
<p>I am not optimistic for the future of football analytics, which is truly sad. Innovation
keeps games interesting and innovation from multiple sources encourages the evolution of the
game. As it stands right now, football appears to have walled itself off from meaningful
quantitative innovation without any signs of lowering the gate.</p>
<h1>How to ask for (and receive) help from strangers on the internet</h1>
<p><em>Trey Causey -- 2015-09-23</em></p>
<p>One of the purposes of this site is to help people learn more about data science and how they can think more rigorously about sports analytics. I've made a conscious effort to be approachable and offer up my time and energy to teach others. I even wrote a <a href="http://treycausey.com/getting_started.html">guide to getting started in data science</a> that received and continues to receive a decent amount of traffic. I get a <em>lot</em> of emails asking for help. Sometimes this is more rewarding than other times. </p>
<p>With this in mind, I thought I'd offer up some suggestions on how to ask strangers for help on the internet and how to maximize the chances that those strangers will respond favorably. Pretty much no one that you're writing to is getting paid to answer your questions, and they're using their valuable free time to do so. </p>
<ol>
<li><strong>RTMP</strong>.</li>
<li><strong>Be nice.</strong> </li>
<li><strong>Say thank you. More than once if necessary.</strong></li>
<li><strong>Don't ask lots of unsolicited follow-ups.</strong></li>
</ol>
<h2>RTMP</h2>
<p>This is my own kinder, gentler version of the more common <a href="http://en.wikipedia.org/wiki/RTFM">RTFM</a>. It stands for <strong>r</strong>ead <strong>t</strong>he <strong>m</strong>anual <strong>p</strong>lease. I've written a lot of posts about data science both relating to and not relating to football. Did you check and make sure I haven't covered this topic already? You can do a custom Google search. It would save us both a lot of time.</p>
<p>That being said, the reason I chose the more forgiving RTMP instead of RTFM is that I remember being a beginner and the frustrating feeling of not knowing what to search for because I didn't know the right words yet. That's OK. The R-help mailing list is <a href="http://badhessian.org/2013/04/has-r-help-gotten-meaner-over-time-and-what-does-mancur-olson-have-to-say-about-it/">famously mean</a>, and it discourages a lot of people just getting started. Documentation is often written for people who already know how to use the software or method. Help files are really hard to write. Have you ever looked at the Wikipedia entries for most statistical topics? Yeah.</p>
<h2>Be nice.</h2>
<p>The fact that I have to even say this makes me a little depressed. When you ask for help, make sure you ask nicely. If your email signals "probably not going to say thanks" or "didn't make an effort to write a decent email" to me, I probably won't answer it.</p>
<h2>Say thank you. More than once if necessary.</h2>
<p>You would be surprised (or maybe you wouldn't be) how many people I write detailed responses to, only to have the message disappear into the ether with nary a response. Are you freaking kidding me? I just spent ten minutes of my Saturday morning responding to a stranger's question and they can't respond with a simple thank you? This really burns me up and has, more than once, led me to consider not answering emailed questions at all.</p>
<p><strong>Don't be an asshole.</strong> Say thank you. Do it initially with your question ("thanks in advance for any help you can provide!") and once the question has been answered.</p>
<h2>Don't ask lots of unsolicited follow-ups.</h2>
<p>So you read the manual, you were nice, and you got a response! Congratulations. You just got asked for and received help from a stranger on the internet. But wait, you have follow-up questions or something in the answer wasn't clear. What to do? It's fine to ask a follow-up question. However, don't start down new lines of questioning or send five follow-up questions or assume that because we now have an open channel of communication, I'm on call for your technical support. Often I will end an email with "please let me know if this isn't clear or you have any more questions." This is an invitation to write back if you need it. If I end an email with "best of luck", I'm letting you know this conversation is over for me.</p>
<h2>Conclusion</h2>
<p><strong>Of course</strong> most of this is common sense. I just don't want to turn into one of those embittered people that doesn't answer emails or respond to questions because I've been mistreated so often. People that answer your questions and people that teach are providing a public, free service. Respond in kind.</p>
<h1>Why is it so hard to know if changing coaches has any effect?</h1>
<p><em>Trey Causey -- 2015-09-23</em></p>
<p>Most football fans know the pain of suffering through seasons with what is obviously a terrible coach. If only your team would move on from that terrible coach, your losing days would be over (or at least fewer in number)! You look wistfully at teams with stable coaching situations, who plan for the long term and don't make stupid decisions, both strategic and tactical. Once the season's over and you get that new coach in, your problems will be solved.</p>
<p>Of course, your problems probably won't be solved. Most of the <a href="http://fivethirtyeight.com/datalab/theres-not-much-evidence-a-new-coach-will-help-the-jets-49ers-or-falcons/">people</a> <a href="http://freakonomics.com/2011/12/24/%E2%80%9Cfootball-freakonomics%E2%80%9D-does-firing-your-head-coach-fix-anything/">who have</a> <a href="http://archive.advancedfootballanalytics.com/2009/02/fighter-pilots-and-firing-coaches.html">asked</a> "how much will changing the coach help the team" have found the answer to be somewhere between "a little" and "it won't." Yet, it seems so incredibly <strong>obvious</strong> that some coaches are <em>bad</em> and getting rid of them will help the team improve. So why can't we demonstrate that this is true?</p>
<p>I propose that this is essentially a case of what economists call an <a href="http://en.wikipedia.org/wiki/Parameter_identification_problem">identification problem</a>. We're trying to estimate the effect of a coaching change on success, but we're having a hard time doing it for a number of reasons. I'll outline them below, giving both "practical" explanations and "statistical" explanations.</p>
<h2>Measurement error</h2>
<p>Notice above I said we were interested in the effects of coaches on "success." Well, what does that mean? In football, we usually mean wins, but we all know that the NFL is characterized by small sample sizes and high variance. A coach who's given two years to prove himself (I wish I could say or herself here) only has 32 games to do so, barring playoff appearances. So, obviously wins aren't a great metric. </p>
<p>There are adjusted metrics out there, like Pythagorean wins, strength-adjusted wins, etc., but ultimately we're facing a problem where there is a lot of <em>measurement error</em> in the outcome. The noisier the outcome of interest is (i.e., the higher the variance), the larger the sample we will need to establish a correlation between our treatment (coaching) and our outcome (wins). We're facing the worst of both worlds here. </p>
<p>Additionally, we're facing low <em>statistical power</em> due to our small sample size and (likely) small effect size. Even if there is a true but small causal effect of coaching changes, it would take a much larger sample size to detect it. Underpowered studies are a real problem, especially given how many people misinterpret non-significant statistical tests as "accepting the null hypothesis."</p>
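<p>A back-of-the-envelope power calculation shows just how hopeless 32 games is. Suppose the old coach "truly" wins 40% of games and the new coach 50% -- a big effect by NFL standards, and made-up numbers for illustration. Using the standard normal approximation for a two-sample difference in proportions:</p>

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power_two_proportions(p1, p2, n, z_crit=1.96):
    """Approximate power of a two-sided test for a difference in win
    rates, with n games observed under each coach (normal approximation)."""
    p_bar = (p1 + p2) / 2
    se = math.sqrt(2 * p_bar * (1 - p_bar) / n)
    z = abs(p2 - p1) / se
    return 1 - normal_cdf(z_crit - z) + normal_cdf(-z_crit - z)

power_two_proportions(0.40, 0.50, n=32)    # ~0.13 -- we'd miss it 7 times in 8
power_two_proportions(0.40, 0.50, n=1000)  # >0.99 -- but no coach gets 1,000 games
```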
<h2>Collinearity in the predictors</h2>
<p>The vast majority of teams change coaches because they have either fired their coach or their coach has been recruited to replace a recently fired coach. All of the things that cause a team to be successful or unsuccessful co-occur with the coaching changes. </p>
<p>When you have a lot of collinearity in the predictors, it's very difficult to precisely estimate the effects of each individual predictor. Most regression techniques require you to be able to identify the independent impact of each predictor on the outcome. If the variables mostly vary together, you can't identify this individual impact, because you've never observed what happens when one thing changes and the other doesn't! Unfortunately, once again, the solution is usually a larger sample.</p>
<p>In other words, <em>variables</em> have to <em>vary</em> -- which sounds ridiculous when you put it that way, but it's a fact often overlooked. </p>
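<p>To see what collinearity does to our estimates, here's a small simulation with invented numbers: the true coefficients never change, but as the two predictors become more correlated, the same sample size pins them down far less precisely.</p>
<pre><code>import numpy as np

rng = np.random.default_rng(1)
n = 200

def coef_spread(rho, trials=500):
    """Spread (std. dev.) of the estimated x1 coefficient across simulated
    datasets in which x1 and x2 are correlated at `rho`."""
    estimates = []
    for _ in range(trials):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)  # true coefficients are 1
        X = np.column_stack([np.ones(n), x1, x2])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates.append(beta[1])
    return float(np.std(estimates))

print(coef_spread(0.1))   # nearly independent predictors: tight estimates
print(coef_spread(0.99))  # near-collinear: far noisier, same sample size
</code></pre>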
<h2>Lack of variation</h2>
<p>This time I'm referring to variation in the underlying abilities of coaches. It's clear that there are some truly special coaches out there. Bill Belichick obviously comes to mind. It's also clear that there are some truly bad coaches out there (insert your favorite loser here). Yet, I'm proposing that the vast majority of coaches are probably roughly equal in skill / talent / ability / whatever you want to call the latent characteristic that predicts success. Essentially, I'm arguing that most coaches are replacement level. </p>
<p>I'm definitely <em>not</em> saying that you could just stick anyone in a head coach job and expect similar results to Norv Turner. I'm merely saying that most coaching changes are probably like-for-like exchanges. Combine this with the fact that very often when teams make coaching changes, they pull from the same rotating pool of coaches and coordinators that are always mentioned for job openings. Why would you expect a coach to be wildly successful at the Raiders if he wasn't wildly successful at the Jets? </p>
<p>Making matters even worse, we don't actually have a measure of talent / skill / whatever (that's what we're trying to get at here!), but even if we did, we probably would observe a lot of the <a href="http://en.wikipedia.org/wiki/Peter_Principle">Peter Principle</a> at work. When successful offensive or defensive coordinators are recruited to be head coaches at (likely losing) teams, they are required to use an entirely different set of skills than the ones that made them successful in their previous positions. </p>
<h2>Endogeneity</h2>
<p>It's pretty obvious that teams that change coaches don't do so randomly. Many of the factors surrounding a change are completely unrelated to the coach as a person, but may have been related to his hiring and firing. These are often lumped together as the "culture" of an organization, a sponge that soaks up a lot of variance. In reality, it could be many things -- the way the owner treats the GM, the way contracts are structured, the quality of scouting, etc.</p>
<p>In other words, there is some unobserved variable that is causing both the team's success and the coach's hiring and firing. Social scientists call these "omitted variables" or "confounding variables," and economists call this problem endogeneity. This introduces <em>bias</em> into our estimates of the effects of coaching because we think we've controlled for all relevant variables related to the coaching change and success, but we haven't. </p>
<p>Of course, there are lots of other, more mundane explanations -- reversion to the mean being a notable one. Teams that have especially bad years may fire their coach, but teams having especially bad years are more likely to have slightly less bad years the next year, coaching change or not. Don't forget the <a href="http://en.wikipedia.org/wiki/Hawthorne_effect">Hawthorne Effect</a> either.</p>
<h2>What to do about it?</h2>
<p>So what do we do about it? There aren't easy answers, but more simple regressions are definitely not the solution. Thinking hard about causal inference is the bread and butter of econometrics. I see a few possible approaches here (and definitely read <a href="http://chrisblattman.com/2010/10/27/the-cardinal-sin-of-matching/">Chris Blattman</a> for thinking about when each is appropriate).</p>
<p><em>Experiments</em>. The gold standard for establishing causation is a randomized experiment, meaning randomly change coaches in a way that is uncorrelated with the measure of success. Obviously this is a non-starter. </p>
<p><em>Instrumental variables</em>. The first solution is to find an <a href="http://en.wikipedia.org/wiki/Instrumental_variable"><em>instrument</em></a> for the effect of coaching changes. An instrument is some variable that affects our outcome <em>only</em> through its effects on the causal variable of interest. This is a really tricky topic to think about, and I recommend reading through some examples to get a grip on it. Much of the "Freakonomics" movement in economics revolved around finding clever instruments (such as rainfall) to solve tricky causal problems.</p>
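<p>A toy two-stage least squares sketch on entirely simulated data (the variable names and effect sizes are invented) shows the idea: the instrument moves the treatment but not the outcome directly, so it can recover a causal effect that naive regression gets wrong.</p>
<pre><code>import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Unobserved confounder (think "organizational dysfunction") drives both
# the treatment (a coaching change, x) and the outcome (wins, y).
u = rng.normal(size=n)
z = rng.normal(size=n)                       # instrument: moves x only
x = 0.8 * z - 1.0 * u + rng.normal(size=n)   # endogenous treatment
y = 0.5 * x + 2.0 * u + rng.normal(size=n)   # true causal effect of x: 0.5

# Naive OLS is badly biased because u is omitted:
ols = np.polyfit(x, y, 1)[0]

# Two-stage least squares: regress x on z, then y on the fitted values.
slope, intercept = np.polyfit(z, x, 1)
x_hat = slope * z + intercept
iv = np.polyfit(x_hat, y, 1)[0]

print(f"OLS estimate: {ols:.2f}")  # pulled negative by the confounder
print(f"IV estimate:  {iv:.2f}")   # close to the true 0.5
</code></pre>
<p>The hard part in practice, of course, is finding a believable instrument -- the code is the easy bit.</p>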
<p><em>Matching</em>. Another approach is to use <a href="http://en.wikipedia.org/wiki/Matching_%28statistics%29">matching methods</a>, which involves pairing all of your observations up to find the most similar pairs on all of your other independent variables except the one of interest (a coaching change). You then estimate the causal impact of that change by looking at mean differences in success between the pairs. This does <strong>not</strong> solve endogeneity, however.</p>
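<p>Here's a minimal nearest-neighbor matching sketch on invented team-season data where the confounder (prior wins) is observed and the true coaching effect is zero. Note that matching only removes bias from confounders we can actually measure; an unobserved one would still poison both estimates.</p>
<pre><code>import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Hypothetical team-seasons: prior wins (the confounder), a coaching-change
# flag, and next-season wins. The true effect of a change is zero.
prior = rng.normal(8, 2, n)
changed = (prior + rng.normal(0, 1, n)) < 7  # worse teams change coaches more
wins = 0.6 * prior + 3.2 + rng.normal(0, 2, n)

treated = np.where(changed)[0]
control = np.where(~changed)[0]

# Match each treated team to the control team with the closest prior
# record (one-to-one, with replacement).
diffs = []
for i in treated:
    j = control[np.argmin(np.abs(prior[control] - prior[i]))]
    diffs.append(wins[i] - wins[j])

naive = wins[treated].mean() - wins[control].mean()
matched = float(np.mean(diffs))
print(f"Naive difference:   {naive:.2f}")    # biased well below zero
print(f"Matched difference: {matched:.2f}")  # close to the true effect of 0
</code></pre>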
<p><em>Natural experiments</em>. Sometimes "nature" presents us with an experiment such as expansion, unexpected retirements or deaths, and so on. In these cases, we observe a coaching change (the treatment) in a way that is (hopefully) uncorrelated with the outcome. Unfortunately, these are few and far between, which puts us back in the underpowered scenario due to small sample sizes. So, when your grandkids are analyzing this question, maybe we'll have enough data (doubtful).</p>
<h2>Conclusion</h2>
<p>As you can see, the cards are basically stacked against being able to precisely estimate the "true" effect of a coaching change. This <strong>doesn't</strong> mean we should stop trying! It just means we need to be more creative and perhaps a bit more sophisticated in how we try to answer the question.</p>Blueprint for an Analytical NFL Franchise, version 0.12015-09-23T18:13:26-07:00Trey Causeytag:thespread.us,2015-09-23:analytical-blueprint.html<p>What should an analytically driven NFL franchise look like? </p>
<p>I've been having versions of this conversation a lot lately, with folks both inside and outside of the industry. Then, this morning I was catching up on <a href="http://threeconedrill.com/2014/10/31/three-cone-drill-podcast-episode-8-1-game-theory-coordinators/">Three Cone Drill</a>, where Rivers and Danny were having a very similar discussion. It's almost certainly <a href="http://en.wikipedia.org/wiki/Confirmation_bias">confirmation bias</a> (more on that below), but it feels like we're having another one of those moments where some fans are pining for a more analytical NFL. Here are my very nascent thoughts on this topic. <a href="http://www.alamarsportsanalytics.com/Book--Sports-Analytics.html">Ben Alamar</a>, the newly appointed Director of Production Analytics at ESPN, has doubtlessly covered much of this material in his popular book. However, I must sheepishly admit to not having read it yet and therefore any duplication of concepts is coincidental.</p>
<p>Analytics is not just a matter of hiring someone who's good with spreadsheets (in fact, being good with spreadsheets may be a weak, but not deterministic, signal of an inability to manage and analyze larger data sources). Analytics is <strong>not</strong> just about number-crunching. It's about decision-making. Being an analytical organization necessitates hiring people who are knowledgeable in the statistical analysis of data but also in decision-making, game theory, behavioral economics, and (last but certainly not least) football. This may seem like a broad range of topics, but fundamentally so much of what an organization does is about making decisions with data under conditions of imperfect information and very limited time. Thankfully, this is an <a href="http://en.wikipedia.org/wiki/Decision_theory">extremely well-studied area</a> with lots of robust empirical findings. Chances are you're not going to find a single person to fill this role, but rather a team of people with complementary skillsets. </p>
<h2>The Analytics Coordinator</h2>
<p>The analytics division should hold coordinator-level authority and be on the same horizontal as the other coordinators in the org chart. This means not treating the position like an internship or an entry-level position. It means hiring both a leader and an individual contributor. Being a coordinator is as much about management as it is about expertise. The team must be prepared to look outside of the current football hiring pool for this person/persons and be prepared to spend the required money to secure talented individuals who are highly sought after for non-football positions. Organizational diversity is good and leads to better decision-making. Hiring exclusively people with close personal connections to the game leads to groupthink and conventional wisdom.</p>
<p>The analytics coordinator should be comfortable not only conducting original research but relating and defending the findings of his or her subordinates. He or she should be familiar with all or most of the topics that I've covered on this blog and not fall prey to the common traps of overfitting, star-gazing/p-hacking, and overinterpreting spurious relationships. This person needs to be able to stand on equal footing in a senior coaches' meeting. Data-driven decision-making means not deferring to "football people" when the stakes are high. It also means being honest about how certain or uncertain one is about one's findings. Data-driven decision-making means not bending the results of the analysis or selectively interpreting findings. It means overruling traditional football people when the data are competently and thoroughly analyzed and differ with the conventional football wisdom. The analytics coordinator needs to be humble but firm and be adept at communicating complicated analyses to a skeptical non-expert audience. </p>
<p><strong>EDIT</strong>: As the always insightful <a href="https://twitter.com/SethPartnow">Seth Partnow</a> pointed out, I want to stress that the analytics coordinator should absolutely be able to "talk football" with the coaches. The coordinator should have broad and deep football knowledge, not be an overzealous quant who thinks, with the right models, he or she can show up the football people. As a slight counterpoint to Seth, I also firmly believe that the coaches should be expected to speak analytics with the coordinator. The burden to be an expert in everything shouldn't be placed solely on the analytics coordinator.</p>
<h2>Institutional Support</h2>
<p>This position will be unsuccessful without significant institutional support. Everyone, starting with ownership and working down the org chart, must treat an analytical and data-driven culture as vital to the organization's survival and success. This means that the analytics coordinator has an equal voice in team discussions, that the team adopts a process-driven rather than outcome-driven style of decision-making, and that employee evaluation reflects this. Coaches who maximize their chances of winning by going for it on fourth down in the appropriate situations but fail to convert are not punished. Conversely, coaches who overrule the analytics coordinator for a suboptimal decision but are still successful are subject to criticism. The latter is probably the most difficult hurdle for an organization, but not as much as one might think. Players who miss their assignments on plays are routinely criticized by coaches even if the play turned out successfully. Execution is about process and making decisions that put you in a winning position -- <strong>not</strong> rolling dice that are weighted against you and getting lucky. </p>
<p>A crucial part of institutional support is extending the <a href="http://en.wikipedia.org/wiki/Time_horizon">time horizons</a> of the people involved. So much of the decision-making in football is suboptimal due to the rotating door of management. Tying employee evaluation to cooperation with an analytical strategy helps with this. But the opposite needs to be true -- the analytics coordinator and his or her subordinates need to have assurances that they are working in a stable, future-oriented organization rather than one that will sweep away their jobs with the next general manager. </p>
<p>This further means that the analytics coordinator needs to have equal footing in how their input is implemented, including in-game decision-making, player evaluation, draft research, and game planning. This does not mean that the analytics coordinator trumps the football people, only that their voice is an equal one. </p>
<p>In order to do so, the analytics coordinator will need a budget that allows him or her to recruit and retain talented employees and purchase and train on the necessary technology. It's quite remarkable how often a few hundred thousand dollars are seen as essentially rounding error on a replacement level player's contract, but could employ an analyst with every available database for several years. </p>
<h2>Hiring</h2>
<p>As referenced above, hiring is a tricky practice, especially for an organization that is currently not analytically driven. I've talked about this on Twitter before, but I'll reiterate my thoughts here. There is a serious <a href="http://en.wikipedia.org/wiki/Cold_start">cold start</a> problem here -- how does an organization with no analytics talent successfully evaluate candidates for an analytics position? In the absence of the ability to competently evaluate talent, most organizations will tend to favor easily observable signals for candidate quality -- an Ivy League degree, an MBA from a prominent business school, a career in finance. These may be useful heuristics, but they're also highly noisy. </p>
<p>Teams hire consultants for a variety of tasks all the time. I'd recommend doing the same for the initial analytics hires. There are plenty of highly visible people in the analytics world who don't or can't work for teams but are still involved with them and have their respect. Hire these people as consultants to help in the search. Consider looking to the tech world and talking to people who have hired data scientists or advanced analysts before. Also consider talking to academics in statistics and decision sciences departments. Have a review system in place where the resumes and interviews are reviewed independently (i.e., without interaction between the reviewers) to get a semi-unbiased view of the candidates. Hire a consultant to help you draft up some homework / audition problems for people. Don't just go for a flashy degree or connections in the sport. </p>
<h2>Behavioral Economics and Decision-Making</h2>
<p>An analytically driven NFL franchise invests in learning about the core <a href="http://en.wikipedia.org/wiki/Cognitive_bias">cognitive biases</a> and implements <em>organizational</em> and <em>institutional</em> safeguards for counteracting them. Every member of the analytics organization should, at the very least, read <em><a href="http://www.amazon.com/Thinking-Fast-Slow-Daniel-Kahneman/dp/0374533555">Thinking, Fast and Slow</a></em> and understand how decision-making is systematically affected by the shortcuts our brains take to make our lives easier. Further, the non-analytics organization needs to be coached on these things in a way that connects with their daily duties. This is extremely important material and communicating it in a way that makes it compelling for other coaches is vital. Coaches <em>need</em> to understand loss aversion. General managers <em>need</em> to understand recency bias in player evaluation. <strong>For God's sake</strong>, coaches should understand confirmation bias.</p>
<p>Most teams spend a significant amount of time drilling on highly specific situations -- the ball has been fumbled, when is it appropriate to knock the ball out of bounds? Players and coaches spend a great deal of time making these decisions become automatic. The same rigor should be demonstrated on the coaching end. When is it worth it to take a five-yard delay of game penalty instead of burning a timeout? How can we avoid burning a timeout and then <em>still</em> punting the ball? The analytics coordinator should be in the booth with the other coaches providing information about this to the head coach in real time. When this information is provided in-game, the analytics coordinator must have buy-in and authority in order for this system to be effective.</p>
<h2>Roadmap</h2>
<p>This is a multi-year project. The organization cannot expect magic from the analytics team ever, but certainly not in the first year. A team embracing this approach will have to evolve in the way that it hires coordinators, evaluates employees, and conducts daily business. This might mean pulling from less well-known talent pools, recruiting more heavily from college, and taking some risks. The first year should be about establishing the personnel and technological infrastructure needed to move forward. The analytics coordinator and his or her team should take on an advisory and educational role during this time period. Subsequent seasons should see an increase in involvement until the ideal of an equal voice is met. </p>
<p>Along the way, the analytics organization must adopt a different stance than I often see in the analytics world. They must not be condescending; they must make themselves part of the team. Senior leadership and ownership must reinforce this. Doing so will mean being a little less secretive than many teams are currently comfortable with -- acknowledging the importance of their analytics organization, allowing the coordinator to speak with the press and be seen as an integral part of leadership. The culture should be one of <strong>collaboration</strong> between analytics and football operations.</p>
<h2>Conclusion</h2>
<p>Various teams in the league are doing some of these things to various degrees already, but to my knowledge, no one is "all in" on being an analytically driven franchise. There's a tremendous amount of low-hanging fruit for an enterprising team to grab. Teams are making <strong>demonstrably</strong> bad decisions every week -- this isn't even debatable or "lying with statistics." </p>
<p>These are not just pie-in-the-sky ideas. Some team is going to really latch onto these ideas and exploit the hell out of them for a while. It's simply a matter of which team and when. Playing catchup is going to be a lot more difficult at that point. All I can say is:</p>
<ul>
<li>Hire the right people.</li>
<li>Get outside help in hiring these people.</li>
<li>Pay them well and give them the financial and institutional resources they need to succeed.</li>
<li>Give them an equal voice.</li>
<li>Put yourself in a position to win.</li>
</ul>Situational thinking in football - How can data help?2015-09-23T18:13:26-07:00Trey Causeytag:thespread.us,2015-09-23:areas-of-research.html<p>What is the current state of data-driven football research? Where can we improve? I've written before about <a href="win-probability-uncertainty-and-overfitting.html">smugness and overconfidence</a> in sports analytics. It's a real problem. But we also know quite a bit. As an exercise, I thought I'd break down open areas of research into categories and identify where we have room to grow (and I'm sure a lot of this could apply to other sports as well).</p>
<p>As I see it, there are three major categories of open research. Obviously, these categories are not totally independent, nor are they exhaustive, but they give us a potentially useful rubric for thinking about how to frame our questions.</p>
<ul>
<li>Team-level </li>
<li>Player-level</li>
<li>Situational decision-making</li>
</ul>
<p><strong>Team-level</strong> Possibly the most well-developed and mature, research in this area involves questions like 'who is more likely to win', 'who has the best defense', and 'who is better or worse at drafting.' The prevalence of this kind of research is driven by widely available (and easily manageable) data aggregated to the team level and sports betting. We're pretty good at predicting the winners of games, and we can argue about the nuances of the other questions, but these seem like questions with essentially "knowable" answers given the current state of data. </p>
<p><strong>Player-level</strong> All of those team-level questions could probably be better answered with improved player-level data. After all, what are teams except for aggregated and interacting individuals? Most research in this area is still somewhat rudimentary. There are serious data availability and data quality issues. We don't have a good way to compare players across positions (some will argue about this point). We don't have reliable data for college players. We are notoriously bad at predicting how the careers of draft picks will turn out.</p>
<p>Pro-level evaluation is going to improve some as data from motion-tracking systems like <a href="http://techcrunch.com/2014/07/31/the-nfl-gets-quantified-intelligence-courtesy-of-shoulder-pad-mounted-motion-trackers/">Zebra</a> are more widely adopted, but it's almost certain that the data will remain proprietary and owned by teams and the league. Further, making use of motion data has its own unique challenges, as many SportVU analysts can tell you. However, having more data doesn't tell you what questions to ask, and Zebra can't solve the problem of evaluating college players.</p>
<p><strong>Situational decision-making</strong> By far the most immature area of research involves in-game decision-making. Unfortunately, most of the yet-to-be-conducted research in this area will have effects that cascade through both team-level and player-level research. We think we know a lot about <a href="http://nyt4thdownbot.com/">fourth downs</a>, but most of our knowledge about fourth downs is biased by the fact that a) teams don't go for it on fourth very much, b) teams that do go for it on fourth don't go for it randomly, and c) we don't know much about the plays being called, the defense being faced, the specific personnel on the field, and so on. In fact, the data get very sparse the more specific you want to make the game situation. Fourth and three from the eleven facing Cover 2 with 21 personnel? Without looking, I'm guessing that situation has occurred fewer than 15 times in the modern game. Restrict it to run or pass? You're probably looking at fewer than 7 plays. With less than a minute remaining? Oops, you're probably looking at a single play (if that) now. </p>
<p>This is compounded by the fact that some formations are only played once or twice a game (a small sample problem), and that many teams have similar plays with different names. Charting organizations like Pro Football Focus and TruMedia are collecting data on these things, but most of it won't be available publicly. Motion-tracking data will help with a lot of this, too, as we'll be able to build models that group plays together, regardless of team, based on the motion of position players, but we're a long way from that.</p>
<p>Unfortunately, situational awareness and the nitty-gritty specifics are what so-called "football people" often like to use to criticize the analytics crowd. "Your model can't account for Peyton Manning," they'll say. And they'll be <strong>half</strong> right. Most models can't account for Peyton Manning or the fact that the other team's strong safety injured his hamstring last week. These things probably do matter to some extent. However, they're also half wrong. We can't predict specific, rare outcomes with a high degree of certainty. It's not possible in most sciences, and certainly not in football. </p>
<p>It also bears thinking about how this would help teams make decisions. Let's say we know a lot more about specific situations. In order to build accurate predictive models, we need a training data set with "true" labels/measurements. For instance, if a coach is interested in going for a two-point conversion, he might be interested in more than the league-wide average. He might be more interested in how that conversion rate is affected by certain personnel packages.</p>
<p>This is only knowable to an extent, though. Predictive models that are going to be used in non-contrived settings should only be trained using data that the model "would" know about if it were trying to predict in real-time. This is the problem of information <em>leakage</em>, a close cousin of overfitting. For instance, if we were to hypothetically know that two-point conversions succeed 40% of the time, but can adjust that to 35% of the time if facing an all-out blitz, we're basically no better off than not knowing that, because we can't know in a real-time game situation if the other team is going to all-out blitz or not. The models should mirror the information available to the human decision-makers as much as possible.</p>
<p>This is not to say that we shouldn't be studying these questions -- we should! There are lots of unanswered questions surrounding strategy and situations. I think there's been something of a divide between football stats people and football "tape" people, and that we could overcome a lot of this with more constructive dialog. But that's going to require some humility on both sides and a shared desire to know more about the sport rather than to prove the other side wrong.</p>Using k-means clustering to find similar players2015-09-23T18:13:26-07:00Trey Causeytag:thespread.us,2015-09-23:clustering.html<p>Most of the posts so far have focused on what data scientists call <em>supervised</em> methods -- you have some outcome you're trying to predict and you use a combination of predictor variables to do so. Another important class of methods are called <em>unsupervised</em>. In this case, you might not know what exactly you're looking for or what metric you want to optimize for, but you want to explore the data and identify similarities among cases. For example, you might want to identify a list of "similar" players for your fantasy draft. This is a little late for the start of fantasy season, but with the <a href="https://sports.vice.com/article/the-daily-fantasy-sports-takeover">rise of daily fantasy sports</a>, perhaps not. However, maybe you don't know what "similar" means in this case or you don't have a single number or index that you want to match on. Perhaps you just want to find players with similar production to hedge against bye weeks or injuries.</p>
<p>This is where unsupervised methods come in. We'll be focusing on a popular unsupervised method called <em>clustering</em>. You'll see these kinds of methods used on a number of sports sites. <a href="https://twitter.com/borisachen">Boris Chen</a>, a data scientist at the New York Times, uses a kind of clustering to produce his fantasy football player tiers. <a href="http://www.vantagesports.com/#documents/VEBh5ykAACoATu8_/new-defensive-positions-rethinking-the-standard-terms">Krishna Narsu</a> recently used a kind of clustering to redefine the defensive positions in the NBA. </p>
<p>One popular method is called <em>k-means clustering.</em> (Note, this isn't the same <em>k</em> as in <em>k</em>-fold cross-validation, <em>k</em> is just a common stand-in for an unknown integer value.) I'll be working through an example clustering wide receivers using their 2013 statistics. K-means is really beautifully simple. The basic idea is that we want to take our entire data set and divide the observations into <em>k</em> clusters, such that the observations within each cluster are as similar to each other as possible (and as dissimilar from the other clusters as possible). Each cluster has what's known as a 'center' or 'centroid', which is the point against which all of the observations in that cluster are compared. You can think of it as the "ideal" or "prototypical" observation that typifies each cluster.</p>
<p><em>EDIT</em>: As always, <a href="https://github.com/treycausey/thespread/blob/master/clustering_wrs.py">code for this example</a> is up on GitHub.</p>
<h2>Instructions</h2>
<p>To do so, we'll have to define what we mean by "similarity." Most implementations of k-means clustering use what's called <em>Euclidean distance</em>, which is the square root of the sum of the squared differences between each observation's values and the center of the cluster. (In practice, k-means minimizes the <em>squared</em> distance, which yields the same assignments.) The steps look a little bit like this:</p>
<ol>
<li>Randomly pick <em>k</em> points in space and call them your cluster centers. </li>
<li>Assign each observation in your data set to the cluster that minimizes the Euclidean distance.</li>
<li>Recompute the center of each cluster by taking the means of each of the observations in each cluster.</li>
<li>Repeat steps 2 and 3 until either the centers don't change or your maximum number of iterations has been reached.</li>
</ol>
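<p>The steps above can be sketched in a few lines of NumPy. This is a bare-bones version of Lloyd's algorithm for illustration, not a replacement for scikit-learn's tuned implementation:</p>
<pre><code>import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Bare-bones Lloyd's algorithm; assumes X is an (n_samples, n_features)
    array that has already been centered and scaled."""
    rng = np.random.default_rng(seed)
    # 1. Pick k observations at random as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each observation to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each center as the mean of its members.
        # (A production version would also guard against empty clusters.)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Sanity check on two well-separated blobs of fake data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers = kmeans(X, k=2)
print(sorted(np.bincount(labels).tolist()))  # the two blobs are recovered: [50, 50]
</code></pre>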
<p>I'll do this with all of the wide receivers who played in 2013 using the following variables: targets, receptions, receiving yards, receiving touchdowns, fumbles, and fantasy points. </p>
<h2>Gotchas</h2>
<p>This is easy, right? Of course, there are a couple of gotchas. There's always a catch. </p>
<p><strong>First</strong>, how do you pick <em>k</em>, the number of clusters? Good question -- this is an active area of research (<em>eyes glaze over</em>), but there are some commonly used rules-of-thumb. One way is to pick the number of clusters that maximizes what's known as a <em>silhouette score</em>, which essentially compares the average within-cluster distance to the average distance to the nearest neighboring cluster. We want to minimize the former and maximize the latter. By running our k-means algorithm multiple times, we can pick the <em>k</em> that maximizes the silhouette score (which is bounded on the interval [-1, 1]). I did this for each <em>k</em> between 3 and 11, and it looks like this:</p>
<p><img alt="Silhouette scores" src="images/silhouette.png" style="float:center" /></p>
<p>We see that the silhouette score is maximized at <em>k</em> = 4, meaning 4 clusters of wide receivers, so we'll go with that.</p>
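<p>In scikit-learn, the selection loop looks something like this. I'm using planted synthetic "receiver stats" here rather than the real 2013 data, so the best <em>k</em> is known in advance:</p>
<pre><code>import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Stand-in for the real receiver stats: four planted clusters in 3 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.6, (40, 3)) for loc in (0, 3, 6, 9)])
X = StandardScaler().fit_transform(X)

scores = {}
for k in range(3, 12):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the planted k = 4 should win
</code></pre>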
<p><strong>Second</strong>, notice that the first step of the algorithm is to <strong>randomly</strong> pick the centers for each cluster. This means that the results you get can be highly dependent on this initial position. So, you'll need to re-run the algorithm multiple times with different start points to see if your results are robust. Scikit-learn takes care of this for you, but it's important to be aware of.</p>
<p><strong>Third</strong>, you'll need to center & scale your data if the different variables aren't in the same or comparable units. Centering means subtracting the mean of each variable from each observation and scaling means dividing by the standard deviation of that variable. You're left with standardized scores that are basically interpretable as "how many standard deviations above or below the mean is this observation on this variable." </p>
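<p>The centering-and-scaling step is one line of NumPy (scikit-learn's <code>StandardScaler</code> does the same thing); the toy stat lines below are invented:</p>
<pre><code>import numpy as np

# Receiving yards and touchdowns live on very different scales, so raw
# yards would dominate any distance computation.
stats = np.array([[1200.0, 9.0],
                  [ 800.0, 4.0],
                  [1100.0, 10.0]])

scaled = (stats - stats.mean(axis=0)) / stats.std(axis=0)
print(scaled.round(2))  # each column now has mean 0 and std 1
</code></pre>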
<p><strong>Fourth</strong>, and probably most importantly, how do you know if you even have good clusters? This is trickier than it seems. There are some technical solutions like <a href="http://www.stat.washington.edu/raftery/Research/mbc.html">model-based clustering</a> as well as the less rigorous "eyeball data analysis." Ultimately, you want to be wary of counterintuitive results that you get from unsupervised methods. The clusters should <strong>make sense</strong> using the knowledge that you already have. </p>
<p>I didn't use any out-of-sample validation here and I don't know how predictive these clusters are of future performance. One thing I could do is look at the cluster that each player is assigned to and see how predictive it is of future performance. It is <strong>entirely</strong> possible to "overfit" your clusters to historical data. </p>
<p><strong>Finally</strong>, your clusters are highly dependent on what variables you use in your clustering! If I added a new variable in, say yards after catch, we might see a number of players switch cluster assignments. You need to be wary of this and be careful of treating your newly found clusters as the absolute truth.</p>
<h2>Clustering wide receivers</h2>
<p>Let's take a look at how we did clustering our wide receivers. Using <em>k</em> = 4 clusters, we can look at the centers of each cluster and try to interpret them. Remember, these are in <em>standardized</em> form, not raw numbers.</p>
<table>
<thead>
<tr>
<th align="center">Cluster</th>
<th align="right">Targets</th>
<th align="right">Receptions</th>
<th align="right">Yards</th>
<th align="right">TDs</th>
<th align="right">Fumbles</th>
<th align="right">Fantasy Points</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">0</td>
<td align="right">-0.84</td>
<td align="right">-0.82</td>
<td align="right">-0.81</td>
<td align="right">-0.72</td>
<td align="right">-0.4</td>
<td align="right">-0.82</td>
</tr>
<tr>
<td align="center">1</td>
<td align="right">0.49</td>
<td align="right">0.47</td>
<td align="right">0.43</td>
<td align="right">0.41</td>
<td align="right">-0.38</td>
<td align="right">0.45</td>
</tr>
<tr>
<td align="center">2</td>
<td align="right">1.74</td>
<td align="right">1.82</td>
<td align="right">1.9</td>
<td align="right">1.72</td>
<td align="right">0.47</td>
<td align="right">1.93</td>
</tr>
<tr>
<td align="center">3</td>
<td align="right">0.21</td>
<td align="right">0.12</td>
<td align="right">0.08</td>
<td align="right">-0.07</td>
<td align="right">2.46</td>
<td align="right">0.01</td>
</tr>
</tbody>
</table>
<p>With higher numbers being better on all of these metrics (except fumbles), we see that cluster 2 is probably our highest-performing wide receivers. The prototypical player in this cluster is targeted a lot (almost 1.75 standard deviations above the mean), catches a lot of passes for a lot of yards, scores a lot of touchdowns, and doesn't fumble a ton. Assigned to this cluster are players like Larry Fitzgerald, Reggie Wayne, Randall Cobb, and Julian Edelman. Only 24 out of 197 WRs were assigned to this cluster (so, not even one per team).</p>
<p>On the flip side, cluster 0 looks to be pretty terrible. These players are all below average on every metric, although they don't fumble that much. 91 out of 197 players were assigned to this cluster. </p>
<p>Here's a random sample of receivers and the cluster to which each was assigned:</p>
<table>
<thead>
<tr>
<th align="left">Player</th>
<th align="right">cluster</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">A.Hawkins</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">A.Robinson</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">B.Golden</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">C.Johnson</td>
<td align="right">2</td>
</tr>
<tr>
<td align="left">E.Bennett</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">E.Weems</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">J.Boyce</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">J.Criner</td>
<td align="right">1</td>
</tr>
<tr>
<td align="left">J.Ebert</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">K.Allen</td>
<td align="right">2</td>
</tr>
<tr>
<td align="left">K.Martin</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">L.Brazill</td>
<td align="right">1</td>
</tr>
<tr>
<td align="left">L.Hankerson</td>
<td align="right">1</td>
</tr>
<tr>
<td align="left">L.Moore</td>
<td align="right">1</td>
</tr>
<tr>
<td align="left">R.Shepard</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">R.Woods</td>
<td align="right">1</td>
</tr>
<tr>
<td align="left">S.Smith</td>
<td align="right">1</td>
</tr>
<tr>
<td align="left">V.Jackson</td>
<td align="right">2</td>
</tr>
</tbody>
</table>
<p>And that's k-means clustering, in a nutshell.</p>
<h1>Using Continuous-Time Markov Chains to Rank College Football Teams</h1>
<p><em>Matt Mills, 2015-09-23</em></p>
<p>I normally don't write about college football on this blog, but I do write about ranking algorithms. In that tradition, I'm extremely excited to have <a href="http://www.sbnation.com/users/millsGT49">Matt Mills</a> contribute a new post on using continuous-time Markov chains to rank college football teams. Matt blogs for SB Nation about college football, statistics, and Georgia Tech. </p>
<hr />
<p>College football has tried for years to find the best way to rank teams at the end of the year. From the AP Poll to the BCS Rankings and now the College Football Playoff Committee, everyone has tried their hand at ranking college football teams. Currently there are 60 college football rating systems listed in the <a href="http://www.masseyratings.com/cf/compare.htm">Massey Ratings Composite</a> and over 120 ratings tracked in the <a href="http://sports.vaporia.com/fb-fwin.html">Sports Vaporia Comparison</a>. However, that isn’t going to stop me from adding my own to the list. In this post I’ll describe how a continuous-time Markov chain (CTMC) is structured, how we can apply the mathematics of CTMCs to develop a new rating system, and how the CTMC ratings compare to the better-known <a href="http://thespread.us/ranking-algorithms-and-the-nfl-part-1-of-a-series.html">Massey Ratings</a>. I’ve provided all the code used to perform this analysis in R at <a href="https://github.com/mattmills49/CTMC_Rating">my GitHub page</a>. </p>
<p>Continuous-time Markov chains (CTMCs) are mathematical models that are used across many areas of business and academia. If you learned about them in school, you probably used them to help with queuing theory. If you didn’t learn about them in school, don’t fret. The basic structure is easy to visualize. CTMCs consist of a set of states that a system can be in at any given time. For example, the set of states, or state space, of the system can consist of the weather (e.g. raining or not raining), or the number of people in a line (0, 1, 2, etc.). There are also transition rates that tell you how quickly the system can move from any one state to another. These transition rates can be probabilities, in which case the Markov chain would be modeled as a discrete-time Markov chain. In addition to the basic structure of CTMCs, there are more technical requirements for a system to be a true CTMC. The system has to be irreducible, which basically means that you have to be able to, eventually, get from one state to any other in the system in some finite amount of time. What makes the CTMC a Markov Chain is that any future behavior of the model only depends on the current state of the model and is independent of how long the system has been running. This last property is called the Markov Property. There are many great descriptions of Markov Chains online so if you wish to learn any more about these systems I would advise a quick Google search. The point of all these conditions and definitions is that if we can model our system as a CTMC, we can use linear algebra to solve for a steady-state distribution. The steady-state distribution of a CTMC tells you the long-run average fraction of time the system will spend in each state. The goal of this analysis is to model a CTMC with each team in college football representing a state and then determine the quality of each team by computing the steady-state distributions for each state. </p>
<p>Now we need to apply the properties of continuous-time Markov chains to college football. The first step is to define our state space, which is the list of all possible states for the system. In our case this is just the list of all teams in college football that played against at least 2 FBS opponents. I chose this condition so that teams that only played one game were left out of the analysis to avoid weird results. Next we need to define the transition rates between the states. Normally the rates would be exponentially distributed random variables, and in queuing theory they would represent arrival rates or service times. In this example, we will be using the number of points scored by one team against another as the rate of going from one state to the next. I don’t think this should cause any issues, but I admit I may be wrong, so feel free to discuss this with me in the comments. If we model our CTMC like this we will get a transition diagram that looks like this:</p>
<p><center><img src='images/Transition_Diagram.png' width="600px"></center></p>
<p>In this example each node represents a state and each arrow represents the transition rate of going from one state to the next. For example, Georgia Tech scored 28 points against Miami, while Miami scored only 17 against Georgia Tech, so the rate at which a person in state “Georgia Tech” would go to state “Miami” is higher than going from Miami to Georgia Tech. The steady-state distribution of a CTMC can be thought of as the amount of time someone would spend in a state if they were walking along this graph structure. If they were in the state “Georgia Tech” they would be more likely to move to North Carolina than either Duke or Miami, because Georgia Tech scored more points against UNC than against Duke or Miami. This means the rate at which someone leaves Georgia Tech to go to UNC will be higher than that of going to Duke or Miami. The person won’t always go to UNC but will do so more often than going to any other state. Therefore, the system will quickly leave the state of any team that scores a lot of points, because the transition rates will be higher than the average. A team that doesn’t allow many points will have few transitions into their state because the rate at which teams scored against them was low. Because of this behavior the system will spend less time in good teams’ states and more time in bad teams’. The steady-state distributions are now a way to rate college football teams. </p>
<p>Now we have to go about actually generating and solving the CTMC. To do this, I used R’s XML package to scrape the scores from all games in the last four seasons, including this year’s, from www.sports-reference.com/cfb. In order to make the scores location-neutral, I adjusted them by subtracting 1.5 points from the home team and adding 1.5 points to the away team. I also removed any games where both teams didn’t play at least two games against FBS foes in the same season. This was as close as I could get to including only FBS-FBS games because Sports-Reference.com didn’t list the division or conference of each team in their schedule page. Once I had this full list of games, I used the tidyr and dplyr packages in R to generate the transition rate diagram for all games in a season. To actually solve for the steady-state distributions of our system, we need to transform the transition rate diagram into the proper format. Here are the actual equations we need to solve to find the steady-state distributions:</p>
<p><center><img src='images/ctmcequation.png' width=600></center></p>
<p>What those equations actually mean is that the rate at which you leave a state, weighted by how often you are in that state, has to equal the rate at which you come into that state, weighted by how often you are in the state from which you are coming. Basically, rate in equals rate out. We must force all the probabilities to sum to one and replace one equation from any state with this equation:</p>
<p><center><img src='images/probequations.png' width=200></center></p>
<p>We can now solve the equations by putting everything on one side of the equals sign and using linear algebra to solve the system of equations. I used R’s solve() function to generate the long-run average fraction of time the system spends in each state. As I said earlier, we can simply equate this result to the strength of a team to get a rating for every team in college football. However, this result isn’t very intuitive; the better teams have smaller numbers. I changed the format of the transition rate diagram so that the system will spend more time in the better teams’ states, instead of less. The correlation between the natural logs of the two versions of the ratings is -.99. The ratings seem to fit a power law distribution, so the log scale makes them linear. With this change, here are the top ten for this season:</p>
<p><center><img src='images/2014CTMCratings.png' width=500></center></p>
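<p>The steady-state computation described above can be sketched in a few lines of NumPy (the author's full implementation is the R code linked on GitHub). The points matrix below is made up for illustration, and note that in this basic form better teams get <em>smaller</em> steady-state values, before the inversion described above:</p>

```python
import numpy as np

# Hypothetical location-adjusted points: points[i, j] = points team i
# scored against team j, i.e. the rate of leaving state i for state j.
teams = ["Georgia Tech", "Miami", "Duke"]
points = np.array([
    [0.0, 28.0, 21.0],
    [17.0, 0.0, 24.0],
    [10.0, 13.0, 0.0],
])

# Generator matrix Q: off-diagonal entries are the transition rates,
# and each diagonal entry makes its row sum to zero ("rate in equals
# rate out" in matrix form).
Q = points.copy()
np.fill_diagonal(Q, -Q.sum(axis=1))

# Solve pi @ Q = 0 subject to sum(pi) = 1 by replacing one balance
# equation with the normalization constraint, as described above.
A = Q.T.copy()
A[-1, :] = 1.0
b = np.zeros(len(teams))
b[-1] = 1.0
pi = np.linalg.solve(A, b)  # long-run fraction of time in each state
```

<p>Because the generator matrix is irreducible, the solution satisfies every balance equation, including the one we replaced with the sum-to-one constraint.</p>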
<p>A continuous-time Markov chain is a good rating system for college football because of its ability to take a team’s schedule into account. The system naturally spends more time in stronger teams’ states. Teams that play these strong teams build connections between the two states by scoring points. So by building a bridge between your team’s state and a strong team’s state, the system will spend more time in your state, even if you got beat by the strong team. This builds up the ratings of teams with tougher schedules. Even if Georgia Tech, for example, gets blown out, there is some probability that the system will transition from another really good team’s state to Georgia Tech’s, thereby boosting Georgia Tech’s rating because it played a tougher schedule. If a team’s opponents aren’t very good, the system won’t spend much time in its state because it won’t spend much time in the states of its opponents. I think this is a good explanation for why Arkansas is so high in these ratings: they have played an incredibly tough schedule. This rating system is not perfect, but I’m okay with this top ten. Six of the teams in the CTMC top ten are in the top ten of the Massey Composite Ratings. </p>
<p>Now that we have a way to rank teams, we also need to check how accurate the ratings are, beyond an eye test. I’ll do this by testing the predictive capability of the CTMC ratings and measure it against the Massey Ratings performance. The Massey Ratings also use linear algebra and the number of points scored by each team in each game to develop a rating for teams that represents the number of points per game above or below an average team. Let’s compare how the CTMC ranking of each team this season compares to the Massey Rating of each team. I’ve highlighted the playoff teams (Alabama, Oregon, Ohio State, and FSU) by their respective team colors. </p>
<p><center><img src='images/seasoncomp.png' width=900></center></p>
<p>The two ratings are very similar, but the CTMC rating seems to have a non-linear scale. The correlation between the Massey Ratings and the natural logarithm of the CTMC ratings is .97, so they largely agree. </p>
<p>The real test between the ratings will be how accurately they predict games that haven’t happened yet. To test this, I calculated the ratings each week after week four of the last four seasons of college football. The ratings after each week were used to predict a winner of the games in the upcoming week. The predicted winner was chosen by which team’s rating was higher, not accounting for home field. I’m not trying to make the best prediction but want to see how they compare on a level playing field. The results of the weekly prediction can be seen in the graph below. The dashed horizontal lines are the yearly accuracy rate for each method. </p>
<p><center><img src='images/acccomp.png' width=900></center></p>
<p>In two of the years, the Massey method was the more accurate predictive rating system, and in the other two years there was virtually no difference. The methods fluctuated in their weekly accuracy, but neither consistently outshone the other. The prediction accuracy percentage hovers in the high 60s and low 70s, which is in line with other rating methods on the Sports Vaporia Predicted Wins Summary. </p>
<p>If the CTMC rating has a .97 correlation with the Massey rating, then why use it? Well, besides the fact that I think using continuous-time Markov chains to rank college football teams is pretty cool, the CTMC rating has flexibility for future metrics beyond points scored or points allowed. If you think yards per play is a better metric than points scored, you can easily switch the transition rates from points scored to yards per play gained. If you just wanted to measure a team’s passing strength, you could use yards per pass only. There are some limitations to the CTMC rating system. I haven’t thought of a way to separate the ratings into offense and defense components like the Massey ratings. There also isn’t a clear link between the ratings and margin of victory like with the Massey rating. And lastly, if a team gets shut out, the transition rate for that game would be zero, which may not be the best way to take that into account. In the spirit of keeping this post from becoming a novel, I’ll save that for another time. </p>
<h1>Counterintuitive findings are not (necessarily) better findings</h1>
<p><em>Trey Causey, 2015-09-23</em></p>
<p>It's a common scenario -- you've got some methods under your belt, you've got some data, and you're out to prove to the rest of the world just how <em>wrong</em> they are about everything. You massage the data, you build dozens of models, and you finally find a way to prove that Andy Dalton really <em>is</em> better than Tom Brady. You take to Twitter and point out that <em>actually</em> any idiot can see this using your new metric, Boosted Estimated Net Guard-Adjusted Losses (BENGAL). </p>
<p><a href="https://lh4.googleusercontent.com/_fw7iF68JR8k/TX-l6KVkcRI/AAAAAAABn2o/Us27jux7S0M/LeeCorsoSombrero1.jpg">Not so fast</a>. You've fallen for the trap of the counterintuitive finding. Don't feel bad, it's an alluring trap, and one that people much smarter and more accomplished than you and I fall into regularly. In fact, there's an entire subfield of <a href="http://freakonomics.com/">economics</a> based around it. Unfortunately, in a lot of academic research, this has also led to a desire to produce clever, counterintuitive findings that grab headlines and convince you that "everything you know about ___ is wrong." Taking down a sacred cow is an express ticket to attention. The reality, however, is that many of these findings are statistical artifacts and false positives, and are not reproducible. </p>
<p>How does this relate to sports? One of the fundamental principles of the original Moneyball movement was to identify market inefficiencies and exploit them. Obviously, in a highly competitive and mostly efficient market, finding these small inefficiencies is hugely valuable. But they're just not that common, and the chances that you've just stumbled upon one are pretty low (but not zero). Let's be clear, I'm <strong>not</strong> arguing that we know everything there is to know (see my recent post on <a href="http://thespread.us/areas-of-research.html">where I see football research headed</a>) or that Phil Simms has fourth-down logic nailed, or that team-employed analysts have cornered the market and identified all the inefficiencies -- I'd argue the opposite, that there's still quite a bit of immature research being conducted, even on teams. </p>
<p>What I <em>am</em> arguing is that you need to check your findings for robustness -- do small changes in how you construct your new metric (BENGAL) significantly change the results? Does omitting a small number of cases change your conclusions? Is it stable season-over-season? Or is it essentially random? Is your new measure or model the only one that reaches the conclusions that you've reached? These are the questions you need to ask before you decide you've found something new that upsets our conventional wisdom. Building a bunch of models and testing hypotheses over and over on the same data set will <em>always</em> produce false positives and often quicker than you might think. </p>
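<p>It's easy to see how quickly repeated testing on the same data produces false positives with a small simulation. Everything below is pure noise by construction, so any "significant" correlation is a false positive:</p>

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Test 50 made-up "metrics" against a pure-noise outcome for 100 players.
rng = random.Random(0)
n_players, n_metrics = 100, 50
outcome = [rng.gauss(0, 1) for _ in range(n_players)]
significant = 0
for _ in range(n_metrics):
    metric = [rng.gauss(0, 1) for _ in range(n_players)]
    # |r| > ~0.197 is roughly the two-tailed p < .05 cutoff at n = 100,
    # so about 5% of these meaningless metrics should clear it by chance.
    if abs(pearson_r(metric, outcome)) > 0.197:
        significant += 1
```

<p>Run enough of these tests and some will look "significant" every time, which is exactly the trap the BENGAL example above falls into.</p>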
<p>We're often interested in measuring unobservable things like "talent" or "skill", but all we have are measures of outcomes like "touchdowns" and "rushing yards." Error is introduced when we move between the measure and the latent thing we're interested in. Sometimes that error is exacerbated by the way we build models or measures and we confuse it for signal when, in fact, it's just noise.</p>
<p>Does your research simply confirm what we thought we already knew about something? That's OK! As sociologist Duncan Watts <a href="http://thespread.us/areas-of-research.html">says</a>, everything is obvious -- once you know the answer. Conventional wisdom is sometimes the conventional wisdom because it's <em>correct</em>, but we wouldn't know it without repeatedly testing it with data. There is great value in replicating findings and confirming that an age-old saw still holds with modern data. </p>
<h1>Building more accurate predictive models with cross-validation</h1>
<p><em>Trey Causey, 2015-09-23</em></p>
<p>One of the points I've made over and over is about making sure your model performs well out-of-sample. I've argued against overfitting and for holding out a portion of your data to test how well it will do at predicting the future. This is all well and good, but what if you don't have a lot of data? What if you want to use <em>all</em> of your data? Holding some data back to test the model is a luxury that we often have in machine learning because we're dealing with big data sets. Football? Not so much. </p>
<p>Luckily, there's an answer that allows us to use all of our data, estimate our model's predictive accuracy, and not overfit to our training data! It's not magic -- it's called <strong>cross-validation</strong>. In this post, I'll walk through what cross-validation is and the logic of why it works.</p>
<p>Cross-validation (which I'll abbreviate as CV for parts of this post) starts with the same intuition as using a test set to estimate how well your model will perform when it sees new data that wasn't used when you built the model. Technically, what we're trying to do is balance the amount of <em>bias</em> in our model -- how far off the model is even on the training data -- with the amount of <em>variance</em> in our model -- how much the model's fit changes when trained on different samples of data. </p>
<p>There are a few different ways to do this, but the most common is called <em>k-fold cross-validation</em>. It follows these simple steps:</p>
<ul>
<li>Randomly split your entire dataset into <em>k</em> "folds". Choose a reasonable number here, because you're going to be building your model <em>k</em> times. A common choice is 5.</li>
<li>For each of the <em>k</em> folds of your dataset, build your model on <em>k - 1</em> folds of the dataset and then make predictions for the <em>k</em>th fold.</li>
<li>Record the amount of error you see on each of the predictions -- common error metrics include mean squared error. </li>
<li>Repeat this until each of the <em>k</em> folds has served as the test set (the set of data you make predictions for).</li>
<li>The average of your <em>k</em> recorded errors is called the <em>cross-validation error</em> and will serve as your performance metric for the model.</li>
</ul>
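<p>The steps above can be sketched in a few lines of Python. The "model" here is deliberately trivial (it just predicts the mean of the training targets) to keep the focus on the fold mechanics:</p>

```python
import random
import statistics

def k_fold_cv_error(data, fit, predict, k=5, seed=0):
    """Estimate out-of-sample error with k-fold cross-validation.

    data: list of (x, y) pairs; fit: training list -> model;
    predict: (model, x) -> prediction. Returns the mean squared error
    averaged across the k folds (the cross-validation error).
    """
    data = data[:]
    random.Random(seed).shuffle(data)       # step 1: random split into folds
    folds = [data[i::k] for i in range(k)]
    fold_errors = []
    for i in range(k):                      # each fold is the test set once
        test = folds[i]
        train = [pair for j, fold in enumerate(folds) if j != i
                 for pair in fold]
        model = fit(train)                  # fit on the other k - 1 folds
        mse = statistics.mean((predict(model, x) - y) ** 2 for x, y in test)
        fold_errors.append(mse)             # record the error for this fold
    return statistics.mean(fold_errors)     # the cross-validation error

# Toy example: the "model" is just the mean of the training targets.
data = [(x, 2.0 * x) for x in range(20)]
cv_err = k_fold_cv_error(
    data,
    fit=lambda train: statistics.mean(y for _, y in train),
    predict=lambda model, x: model,
)
```

<p>In practice you'd swap in a real model; scikit-learn's KFold and cross_val_score utilities do this bookkeeping for you.</p>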
<p>If you're deciding between several models -- maybe you want to include seconds remaining squared, or you're trying to decide whether to include an interaction between seconds and yards to go, or whatever -- the model with the lowest cross-validation error is the model that will most likely perform the best on new data. It is common, after selecting a final model through cross-validation, to then refit that model on the entire dataset.</p>
<p>The really nice thing about all of this is we've used all of our data, but we didn't overfit to it because we never used all of the data at once to make decisions. </p>
<p>Of course, there are some <strong>caveats</strong> and some warning signs to watch out for. Namely, the more folds you split your dataset into, the more variance you're going to see in the prediction errors (because the sample size for the predictions will be smaller). In fact, you can take this to the extreme and split your dataset into <em>N</em> folds (where <em>N</em> is your entire data set's size), making predictions on a single observation each time. This is called <em>leave-one-out cross-validation</em>, which has the lovely initialism <em>LOOCV</em>. </p>
<p>You should also keep an eye on how much variance there is in your predictions for each fold -- if they're bouncing around all over the place, you might have one of a few problems: you might have some "influential" cases (some people call these outliers) or you might have a bad model (sometimes there's just noise, and no signal).</p>
<p>Cross-validation is really popular in both statistics and machine learning these days -- so much so that the statistics section of the popular question and answer site <a href="http://stackexchange.net">StackExchange</a> is called <a href="http://stats.stackexchange.net">CrossValidated</a>. It works well with lots of different kinds of models. Need to choose some parameters for a random forest? A lambda parameter for a regularized regression? You can use cross-validation to help. </p>
<p>Now you have no excuse to ignore out-of-sample performance, even with small samples.</p>
<h1>Expected Points Part 1: Building a Model and Estimating Uncertainty</h1>
<p><em>Trey Causey, 2015-09-23</em></p>
<p>Super Bowl Sunday is finally here, and discussion of {Ballghazi, Deflategate} has dominated much of the sports analytics world for the past two weeks. So, I thought I'd totally ignore that topic and talk about something else: building an expected points (EP) model. Expected points did get some renewed attention lately, in this <a href="http://fivethirtyeight.com/features/kickers-are-forever/">FiveThirtyEight article</a> on the rise of kicking accuracy over time and how fourth-down decision-making could be affected.</p>
<p>However, lots of expected points models already exist, so my goals are to accomplish the following:</p>
<ol>
<li>Provide code examples for building an expected points model.</li>
<li>Interrogate the assumptions that go into such a model.</li>
<li>Show how to incorporate uncertainty into the model using the bootstrap (<a href="http://thespread.us/building-a-win-probability-model-part-5-feature-engineering-and-model-evaluation.html">previous discussion</a>)</li>
</ol>
<p>If you're not familiar with expected points, I encourage you to read <a href="http://archive.advancedfootballanalytics.com/2010/01/expected-points-ep-and-expected-points.html">excellent descriptions</a> from Brian Burke at Advanced Football Analytics for a more in-depth overview. In the spirit of educating sports analytics newcomers, Brian has also created two YouTube tutorials (<a href="https://www.youtube.com/watch?v=JclgcQgPOcE">1</a>, <a href="https://www.youtube.com/watch?v=IDLCulWNGyk">2</a>) on building an expected points model. This is seriously a great service to the community. These tutorials and Brian's explanation were instrumental in writing this post. </p>
<p>Here's the basic idea behind expected points. Given any combination of down, yards to go, and distance from the end zone, the expected value of the points from that position is equal to the average of every <em>next score</em> from that position. That next score could come on that play via a field goal or touchdown; it could come several to many plays later through a successful drive. It could also be negative -- the next points are scored by the other team. </p>
<p>So, you can imagine that the expected points from one's own one-yard line are probably negative, because even if you punt the ball away, your opponent will probably have very good field position to start their next drive and will likely get at least a field goal out of that possession. </p>
<p>Similarly, you can imagine that the expected points on 1st and goal from your opponent's one-yard line are somewhere between 3 and 7 because you'll have nearly four tries (barring fumbles and interceptions) to score a touchdown or kick a field goal. </p>
<p>You get the idea. The reason we build these kinds of models is to place a value on every position on the field to allow for in-game decision-making. By being able to compare the expected points from a variety of possible outcomes, we can choose the play call that maximizes expected points. There may be game scenarios when you're more interested in maximizing expected points than win probability (for instance, early in the game, when an individual play may not have much impact on overall win probability). </p>
<h2>Building the model</h2>
<p>Building the model itself is just a bit of Python, made easier by the indexing and grouping capabilities of <a href="http://pandas.pydata.org">pandas</a>. It's just data manipulation and the only statistical procedure involved is taking the mean. You can find all of the code in an IPython notebook on <a href="http://github.com/treycausey/thespread/tree/master/notebooks/expected_points.ipynb">Github</a> (<a href="http://nbviewer.ipython.org/github/treycausey/thespread/blob/master/notebooks/expected_points.ipynb">NBViewer</a>).</p>
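<p>At its core, the model really is just grouping and averaging. Here's a toy pandas sketch with a handful of made-up plays, where <code>next_score</code> is the next score observed after each play from the offense's perspective (a real model would use thousands of plays and also condition on yards to go):</p>

```python
import pandas as pd

# Hypothetical play-by-play rows, each tagged with the *next* score
# (offense's perspective) that followed it in that half.
plays = pd.DataFrame({
    "down":       [1, 1, 1, 2, 2, 3],
    "yardline":   [20, 20, 80, 20, 80, 50],  # yards from own goal line
    "next_score": [-3, 7, 7, 0, 3, 7],
})

# Expected points = mean of next_score within each (down, yardline) state.
ep = plays.groupby(["down", "yardline"])["next_score"].mean()
print(ep.loc[(1, 20)])  # 1st down at your own 20: mean of -3 and 7 = 2.0
```

<p>The only statistical procedure here is the mean, exactly as described above; the work is in the data manipulation that produces the <code>next_score</code> column.</p>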
<h2>Exploring the assumptions</h2>
<p>This is where the fun starts. There are a number of assumptions that go into building this kind of model. For a start, Burke recommends throwing out plays where the score difference is greater than 10, as well as plays from the 2nd and 4th quarters. The reasoning behind this is that teams operate differently when facing or delivering a blowout or when the half is about to end. For instance, a winning team may just run their RB into the wall repeatedly towards the end of the game, not really trying to gain yards or score more points. This could distort the effects of these plays on points scored. Seems like a logical assumption.</p>
<p>However, I'm always a fan of presenting how assumptions change analyses, so I'll present it both ways. This is one way of measuring the effects of your assumptions, but it's also a good way to see how <em>robust</em> your conclusions are to changes in the data. Let's take a look at expected points as a function of field position on first down with and without these plays removed.</p>
<p><center><img src='images/first_downs.png' width="100%"></center></p>
<p>Surprisingly, not much of a difference! Looks like the trimmed data produces slightly higher estimates of expected points than the complete data in the opponent's half of the field. But <em>how much</em> of a difference is "not much"? Great question.</p>
<p>Burke uses a smoother, a kind of local regression known as <a href="http://en.wikipedia.org/wiki/Local_regression">LOESS</a>. This is definitely one approach to smoothing out those bumps and getting a better sense of the 'true' expected points contained in those noisy lines. I'm going to take a slightly different approach and use a statistical technique known as the <a href="http://stats.stackexchange.com/questions/26088/explaining-to-laypeople-why-bootstrapping-works"><em>bootstrap</em></a> to build confidence intervals around those expected point values. Why do this?</p>
<p>The expected points we've plotted above only represent the plays we've actually seen happen. But they are just an estimate. We want to make some inferences about the range of possible outcomes we didn't see. We assume that the plays we saw are drawn from some distribution of outcomes from alternate universes or whatever. We can simulate what this distribution looks like by taking repeated samples with replacement from the plays we actually saw. This procedure has some nice properties that I won't get into in too much depth, but one of the nicest things is that it doesn't assume anything about the distribution of the statistic we're interested in.</p>
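<p>The resampling procedure is only a few lines of code. Here's a standard-library sketch of a percentile bootstrap confidence interval; the <code>next_scores</code> values are made up for illustration:</p>

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.mean, n_boot=10_000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    boot_stats = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    # Take the alpha/2 and 1 - alpha/2 percentiles of the bootstrap
    # distribution as the interval's endpoints.
    lo_idx = int(n_boot * (alpha / 2))
    hi_idx = int(n_boot * (1 - alpha / 2))
    return boot_stats[lo_idx], boot_stats[hi_idx]

# e.g. next-score outcomes observed from some field position (made up):
next_scores = [7, 7, 3, 0, -3, 7, 0, 3, -7, 7]
lo, hi = bootstrap_ci(next_scores)
```

<p>Nothing here assumes a particular distribution for the statistic, which is one of the bootstrap's nicest properties.</p>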
<p>The confidence interval that we build up here will give us some idea of how much variation we might expect in our estimator (expected points) if we were to keep sampling from the distribution that generated the observations we already have. Let's take a look at the 95% confidence interval for the original expected points.</p>
<p><center><img src='images/first_downs_ci.png' width='100%'></center></p>
<p>As it turns out, the expected points estimated using only 1st & 3rd quarters and close games fall outside of our confidence interval quite often in the opponent's half of the field! This is very interesting. Note also that the uncertainty around expected points is at its greatest near your own end zone and at its smallest near your opponent's end zone. This is intuitive, but it's always good to know whether or not your estimator has constant variance. </p>
<h2>Next up: how has this changed over time and what does that mean?</h2>
<p>This post is already growing too long, so I'll split it into two posts. Next up, I'll look at FiveThirtyEight's comments that kicking accuracy has changed expected points over time. I'll also discuss the pros and cons of inventing one's own measure (such as expected points).</p>Expected Points Part 2: Why Does Uncertainty Matter?2015-09-23T18:13:26-07:00Trey Causeytag:thespread.us,2015-09-23:expected-points-2.html<p>In the <a href="http://thespread.us/expected-points.html">last post</a>, we built a basic expected points model and showed how we can estimate uncertainty using a statistical procedure called the <em>bootstrap</em>. Now I want to push our assumptions a little further, look at how expected points have changed over time, and talk about why we want to estimate uncertainty in the first place. </p>
<p>FiveThirtyEight recently published a very interesting <a href="http://fivethirtyeight.com/features/kickers-are-forever/">article</a> demonstrating that kickers have continually improved in accuracy over time, and that this is likely not taken into account in expected points models that are used in many fourth-down decision arguments. My initial reaction is that this probably is an overreaction -- the more sophisticated fourth-down models out there often have a more rigorous kicking input than just historical averages that don't adjust over time. My second reaction is to take a look and see how expected points have changed over time. </p>
<p>Let's take a look at how first downs (across all yards to go) have changed over the years.</p>
<p><img alt="Expected points for first downs over time" src="images/first_downs_over_time.png" /></p>
<p>Well, now we see why people use smoothers for these kinds of things. That's a noisy mess. Let's take another look, using the same kind of <a href="http://en.wikipedia.org/wiki/Local_regression">smoother</a> that Burke uses in his expected points model. To help look at how expected points are changing over time, I've set the blue to get darker as the data becomes more recent (i.e., the darkest year is the 2013 season). The overall average expected points is the line in red.</p>
<p><img src='images/first_downs_loess_years.png'></p>
<p>It definitely appears that expected points have risen over time, at least for first downs. But the whole point is to look at uncertainty in these estimates, so let's bootstrap a confidence interval and see how this changes our perception. This adds a new dimension of complexity, though, so let's take a specific game situation. For illustrative purposes, I'll show first and ten from the opponent's 35-yard line. </p>
<p><img src='images/single_season_ep.png'></p>
<p>It definitely looks like expected points have risen <em>some</em>, but the expected points for a first down in 2013 are still within the 95% confidence interval for 2000. That doesn't seem to mesh with the earlier statements at all! </p>
<p>There's a catch. Each of the points in that plot is only using a single season's worth of data. This is an important fact to learn about confidence intervals! As your sample size <em>increases</em>, your confidence interval gets more <em>narrow</em>. In other words, we can more precisely estimate the statistic we're interested in as we have more observations. This is statistics 101 stuff, but it's easy to forget.</p>
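<p>A quick simulation makes the point concrete (the distribution and sample sizes below are made up purely for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in "population" of play outcomes; the parameters are invented.
population = rng.normal(loc=3.2, scale=2.0, size=100_000)

def ci_width(sample, n_boot=2000):
    """Width of a 95% percentile-bootstrap CI around the sample mean."""
    means = [rng.choice(sample, size=len(sample), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [2.5, 97.5])
    return hi - lo

one_season = rng.choice(population, size=250)     # roughly one season of a situation
four_seasons = rng.choice(population, size=1000)  # a four-year window
w_small, w_large = ci_width(one_season), ci_width(four_seasons)
# Quadrupling the sample should roughly halve the interval width.
```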
<p>To better compare, I computed a four-year rolling average for expected points and looked at how the same game situation, first and ten at the opponent's 35-yard line, has changed in expected points over time, and bootstrapped a 95% confidence interval. This allows us to observe if the value is changing over time, gives us a better sample size for estimating uncertainty, and doesn't let earlier seasons' data affect later seasons' data.</p>
<p><img src='images/ep_moving_average.png'></p>
<p>That definitely looks like an increase, and it certainly appears that expected points are on the rise! Let's compare the expected points from 2004-2007 with 2010-2013.</p>
<p><img src='images/ep_hist.png'></p>
<p>Looking at the distribution of the expected points from the same point on the field, there's very little overlap between the two (the purple area reflects where the densities overlap).</p>
<h2>What's the point?</h2>
<p>OK, so we've looked at expected points from a variety of angles and have found that they do, indeed, appear to be rising over time. We've also found that this has appeared to be the case more for the opponent's half of the field than one's own half of the field. Potential explanations for this include:</p>
<ul>
<li>More accurate kickers</li>
<li>Play-calling has gotten more aggressive closer to the opponent's end zone</li>
<li>As passing has risen, so has scoring</li>
</ul>
<p>The answer isn't immediately clear from this analysis. One thing we <em>do</em> know, however, is that by estimating the uncertainty associated with the expected points statistic, we're in a much better position to say if that change is meaningful or not. </p>
<h2>Why care about uncertainty at all?</h2>
<p>Why have I spent so much time banging on about uncertainty? Because we're often making arguments about which play calls are the better play calls based on <em>differences</em> in expected points (expected points added) from the plausible range of outcomes. For instance, going for it on fourth down vs. kicking a field goal, vs. punting. If we don't know how variable the statistic is, we're not really doing better than random guessing.</p>
<p>Take the above example. Using data from 2010-2013, the 95% confidence interval for expected points for a 1st and 10 at the opponent's 35 ranges from 3.0 to 3.68. We think that plausible values for the 'true' expected points from that scenario lie in that range, based on the data we've collected. Suppose the 'true' value is closer to 3.0, say 3.15, and we make a decision that is supposed to net us half a point in expected points. It's entirely possible that we haven't really made any positive gains at all! </p>
<p>Simply put, without stating uncertainty, it's hard to know when we're making progress or losing ground.</p>
<h2>What's next?</h2>
<p>We haven't really discussed whether expected points is a 'good' statistic. I think it's an entirely reasonable statistic and an entirely reasonable approach to a difficult problem. However, it's worth noting that there are some problems with it. For instance, the number of plays and possessions between scores is highly variable. Are the expected points on a drive that ultimately results in a touchdown really the same as those from the same field position where the 'next score' comes only after four changes of possession? It's hard to say.</p>
<p>An alternative exists, but it's more methodologically complicated. David Romer, an economist, wrote a <a href="http://eml.berkeley.edu/~dromer/papers/nber9024.pdf">famous paper</a> [PDF] on fourth downs using a method called <em>dynamic programming.</em> This paper has grown a little long in the tooth, so perhaps it is time to revisit it with modern data! A project for another day.</p>A new season, a new (lean) design2015-09-23T18:13:26-07:00Trey Causeytag:thespread.us,2015-09-23:new-design.html<p>Things look a little different. Sorry for the long absence -- I'm in the midst of changing jobs and things have been quite hectic. With that in mind, I've made some changes to make it easier for me to get my thoughts out of my head and into posts as quickly as possible. I've exported the site from Wordpress and am now using <a href="http://pelican.readthedocs.org/en/3.4.0/">Pelican</a>, a Python-based static site generator. Now I can write all of my posts in Markdown and quickly sync them to my host. Hopefully this will lower the barrier to posting and make iteration quicker.</p>
<p>Most of the images and plots made the migration without any problem, however it looks like I've lost most of the tables and comments (and maybe RSS). I'll work on migrating them over, but I didn't want to waste any more time. It's good to be back and looking forward to talking football and data in the coming days. </p>
<p>I've changed the tagline as well, with the hopes that I'll write about some non-football sports in the coming months.</p>
<p>Until I get comments sorted out, please feel free to let me know what you think on <a href="http://twitter.com/treycausey">Twitter</a>.</p>Want to work in sports?2014-08-04T20:03:00-07:00treycauseytag:thespread.us,2014-08-04:want-to-work-in-sports.html<p>Get in line.</p>
<p>People interested in sports analytics, particularly those wanting to
become professionals, often wonder why teams don't invest more heavily
in analytics. Teams, they argue, routinely throw good money after bad
into garbage contracts and make unnecessary concessions of a million
here and seven hundred thousand there. What gives?</p>
<!--more-->
<p>This post was inspired by
a <a href="https://twitter.com/AustinClemens2/status/496391013716725761">tweet</a>
from <a href="http://austinclemens.com/blog/">Austin Clemens</a>, the creator of
the absolutely fantastic <a href="http://nyloncalculus.com/shotchart/">shot charts at Nylon
Calculus</a>, which then sparked a
lively NBA-centric discussion. Smart people with training in statistics,
machine learning, and operations research are willing to work for a
fraction of what they could earn outside of sports. To top it off,
people working in sports are routinely expected to work <em>insane</em> hours
(sleeping at the office, working past midnight, etc.). If analytics is
so valuable to these organizations, why aren't they paying market value?
I'm going to argue that this claim is only partially true, but the
true parts have some logical underpinnings. I only have experience with
NFL teams, but I assume a lot of my thoughts are sport-independent. I'll
offer them in sections.</p>
<p><em>Supply. </em>The truth is that there is a glut of qualified people who want
to work in sports. Depending on the sport you want to work in, there are
usually only 30-32 teams at the top level. Most organizations only
employ one or two analytics people. You do the math. An oversupply of
overly qualified, highly interested applicants will drive salaries down,
no matter what the industry. Employers aren't going to pay (much) over
market value -- and their analytics people wouldn't be any good if they
advised them to. One only has to look at the reaction to Andy Dalton's
contract announcement today to see how much people have internalized the
idea of market value.</p>
<p>You're giving up salary and job security to work for a team -- let's
face it, most teams clean house every few years, and you're just one
anti-numbers GM away from job hunting. On the other hand, you're getting
to eat, sleep (sometimes), and breathe the game you love. That doesn't
make it "right" -- many professions justify paying low wages because the
profession is supposed to be a 'calling.' One only has to look at
teachers' wages to see this. However, you are getting paid in many other
ways. You have the ear (hopefully) of a coach or a GM. You get to see
games from the booth or the sideline or the first row. You probably get
to eat the majority of your meals for free, often with coaches and
players. You probably don't pay for a lot of athletic gear or gym fees.
There are perks.</p>
<p>It's not just analytics that faces this problem -- look at the entry
level positions in coaching. An NFL quality control coach doesn't make
much, and is expected to put in longer hours than many of the other
assistant coaches. Of course, quality control coaches have the hope of
climbing the career ladder. This path isn't so clear in analytics -- we
can't all be Daryl Morey.</p>
<p><em>Assessment. </em>It's extremely difficult for teams with little advanced
analytics capacity to assess the quality of applicants. I spoke about
this some previously in regards to the Sacramento Kings' contest to find
draft analytics experts. An organization that has no advanced analytics
capacity faces a cold start problem. How do they know who's good and
who's not? The most likely outcome here is that an organization hires
someone who is slightly more advanced than the most advanced person
currently on staff. In the absence of some objective arbiter of talent,
teams won't know if an applicant's work is mindblowingly complex and
good -- or just garbage.</p>
<p>In these kinds of situations, signaling and credentials end up mattering
a lot. This is why you see so many Ivy League grads and MBAs on analytic
staffs. In the absence of information, you go with what seems to be an
objective measure of quality. It's a rational, if potentially
suboptimal, strategy. Unfortunately, most undergraduate and MBA programs
don't actually prepare their students to think really hard about
probability and statistics, so you get a lot of ad hoc analysis in
Excel.</p>
<p><em>Results. </em>The link between analytics and results is so tenuous, it's
extremely difficult to measure, and it takes time. Even if you could
somehow hold all else equal, most analytics exercises would produce
small marginal gains (positive expected value). Sports are still noisy
and stochastic affairs, and you're not going to be holding all else
equal. It's just so difficult to demonstrate to a VP or GM the value
added of paying someone six figures. Every hundred thousand dollars they
pay their analytics staff is that much less wiggle room they have at the
negotiating table with the players.</p>
<p>Then there's the fundamental problem of causal inference. My coworkers
will know how much I love to bang on about this one. The problem in many
cases is that history only runs itself once. You can't know what would
have happened if you had done something differently. This is a very,
very hard concept for people to understand (even highly trained
statisticians forget it sometimes). Combine this with loss aversion,
hindsight bias, and a host of other cognitive heuristics, and it's
pretty easy to write analytics out of the picture.</p>
<p>This is a little easier to overcome in baseball than in other sports,
and I suspect that's one reason there's been a 'revolution' there. In a
sixteen-game season, how do you demonstrate that you added a third of a
win? It's tough without buy-in from (high in) the front office.</p>
<p><em>Difficulty.</em> Honestly, and it pains me to say this, a lot of what teams
want from analytics just isn't that difficult. Just because you know all
about convolutional neural networks or support vector machines doesn't
mean those skills are necessarily going to be useful in sports. Many
decisions come down to historical averages, expected value, and
breakeven points. You don't need a PhD to figure those out, just a
foundation in probability. That doesn't mean we can't improve; it just
means the skills don't dictate the salary -- the outputs do.</p>
<p><em>Variety. </em>Related to the difficulty point, analytics means a lot of
things to a lot of different people. There are people out there, who
work for professional teams, that are doing absolutely garbage work.
Other teams see the things that these professionals produce, and think
"That must be analytics. Not worth much." It probably doesn't help that
teams with losing records are more likely to invest in analytics to gain
an edge, thus producing a real but spurious correlation between
analytics talent and losing. This is why having a portfolio of work is
as important as your ability to talk to people about what it means and
what they should do differently because of it.</p>
<p><em>Canard.</em> All of this said, I want to push back a little bit against the
crux of the argument. Some teams <em>are</em> investing in analytics -- and
they're paying near-market rates. It's not every team, and the teams
doing so aren't advertising, but they're out there. We're in the infancy
of the analytics movement in many sports. Give it time.</p>Crowdsourced Team Strength Rankings and Ratings2014-07-31T08:00:00-07:00treycauseytag:thespread.us,2014-07-31:crowdsourced-team-rankings-and-ratings.html<p>Vote early and often!</p>
<p><a href="http://dartthrowingchimp.wordpress.com/">Jay Ulfelder</a>, who has
<a href="http://thespread.us/?p=210" title="Guest post: How’d Those Football Forecasts Turn Out?">appeared here
before</a>,
is back with his annual crowdsourced football rankings. Go vote to your
heart's content and help us build ratings and rankings for the upcoming
season. The poll uses <a href="http://www.allourideas.org/">All Our Ideas</a>, a
platform that allows anonymous users to make pairwise decisions about
any collection of things.</p>
<p>The really nice thing about this format is that it provides head-to-head
matchups for every team -- even teams that won't play each other this
year -- and provides many, many matchups (depending on how many people
vote and how often). This aligns nicely with many of the rankings and
ratings models we've been covering here. We'll turn voting off before
the season starts, and post the results (as well as how they change
during training camp and the preseason). Thanks for your help! <a href="http://www.allourideas.org/nflstrength2014">Now go
vote!</a></p>Elo ratings (part 4)2014-07-05T18:07:00-07:00treycauseytag:thespread.us,2014-07-05:elo-ratings-part-4.html<p>From chess to football</p>
<p>We've almost arrived at the end of the <a href="http://www.amazon.com/Whos-The-Science-Rating-Ranking/dp/0691154228">ratings and
rankings</a>
tutorials. I'll do one more post on Markov ratings, then a couple of
posts on ensemble ratings, and then it'll almost be time for
season. This week I'll be talking about <a href="http://en.wikipedia.org/wiki/Elo_rating_system">Elo
ratings</a>. Originally
used to rate and rank chess players, Elo ratings are now used in a
number of sports, including by <a href="http://usatoday30.usatoday.com/sports/sagarin.htm">Jeff
Sagarin</a> for USA
Today. They're a very simple and elegant way to create ratings.</p>
<!--more-->
<p>Elo ratings are built on the idea that each team has an underlying
level of quality, <em>mu</em>, around which its observable performance varies
randomly. The only way that this measure can change is if the team
consistently performs well above or below its expected level. If a team
plays a team with roughly the same skill level, its rating won't change
much, regardless of the result. Similarly, if a heavily favored team
wins as expected, it shouldn't lead to a big increase in the perceived
quality of that team.</p>
<p>On the other hand, if the underdog pulls off the upset, it should be
rewarded more substantially. The idea is that we have this unknown
variable mu and we're continually calibrating our estimates of it based
on the performance of each team. Elo is all about strength of schedule
in this way.</p>
<p>We can make predictions about future outcomes using the mu values, where
mu~ij~ is the expected number of points that team i will score when it
plays team j. Or, since we've been using <a href="http://thespread.us/?p=279" title="Points are great, but what about win percentage? (Ranking, part 2)">point proportions with a
smoother</a>,
this value will be the expected proportion of points that team i will
score in that game. If mu~ij~ = 0.5, we predict a tie; if mu~ij~ is
greater than 0.5, we predict a win for team i; and if mu~ij~ is less
than 0.5, we predict a loss for team i.</p>
<p>We start off by setting each team's rating before any games are played
to zero. By doing this, we have a "memory-less" rating system, i.e., one
that does not take into account any past events and considers all teams
equally skillful. Obviously we could change this by including preseason
odds of some kind. One nice side effect of taking the zero approach,
though, is that all of the ratings will sum to zero and the mean of the
ratings will always be zero. This means teams with a positive rating are
above average and teams with a negative rating are below average.</p>
<p>Elo ratings are then calculated on a week by week basis as such:</p>
<p><em>r~new~ = r~old~ + K(S - mu)</em></p>
<p>where r~new~ is the new rating for a team, r~old~ is the previous rating
for that team, <em>K</em> is some constant (called the K-factor), and mu is the
team's expected score for that game. K can take on any value, but it's meant to ensure
that rankings are not too volatile and that teams don't get
rewarded/penalized unduly for beating/losing to teams of lesser quality.</p>
<p>To update ratings once two teams have played, there are only a couple
more steps. We make the assumption that mu~ij~ is the output of a logistic
function of the pre-game ratings differential between the two teams.
Logistic functions have the form:</p>
<p><em>f(x)</em> = 1 / (1 + 10^(-<em>d~ij~</em> / 1000))</p>
<p>where d~ij~ is equal to r~iold~ - r~jold~, or the pre-game ratings of
the two teams. Note that for the first games of the season, f(x) = 1 /
(1 + 10\^0) = 1 / 2 = 0.5, meaning that, absent other information, we
expect each team to score 50% of the points.</p>
<p>Where did that 1000 come from? It's empirically derived (i.e., derived
from using past results) and it's called the logistic parameter
<em>xi.</em> The value is set so that for every <em>xi</em> rating points difference
between two teams, the higher ranked team should have roughly ten times
the probability of winning than the lower ranked player. It can be
tweaked to account for how much parity there is in a league. Many chess
ratings use <em>xi</em> = 400.</p>
<p>Finally, we have all of the information we need to rate and rank the
2013 teams after each game is played using the following formula:</p>
<p><em>r~i,new~ = r~i,old~ + 32(S~ij~ - mu~ij~)</em></p>
<p>where <em>S~ij~</em> is simply the proportion of points (plus the Laplacian
correction) scored by team i when playing team j.</p>
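<p>Putting the pieces together, the weekly update can be sketched as follows. The function names are mine, and S~ij~ would be the Laplace-corrected point proportion described above:</p>

```python
def expected_share(r_i, r_j, xi=1000):
    """mu_ij: expected proportion of points for team i against team j,
    from the logistic function with logistic parameter xi = 1000."""
    return 1.0 / (1.0 + 10 ** (-(r_i - r_j) / xi))

def elo_update(r_i, r_j, s_ij, K=32):
    """Update both ratings after team i scores proportion s_ij of the points."""
    mu_ij = expected_share(r_i, r_j)
    r_i_new = r_i + K * (s_ij - mu_ij)
    # Team j's update is the mirror image, so ratings stay zero-sum.
    r_j_new = r_j + K * ((1.0 - s_ij) - (1.0 - mu_ij))
    return r_i_new, r_j_new

# Two previously unrated teams: mu_ij = 0.5, so a 75/25 points split
# moves the winner up by 32 * (0.75 - 0.5) = 8 rating points.
r_i, r_j = elo_update(0.0, 0.0, 0.75)
```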
<p>Here's how the season shakes out.</p>
<p><a href="http://thespread.us/images/elo_nonadjusted.png"><img alt="elo_nonadjusted" src="http://thespread.us/images/elo_nonadjusted.png" /></a>Our
top four is pretty consistent, with Seattle, Denver, San Francisco, and
Carolina claiming those spots. The Chiefs do surprisingly well here, and
the Bengals also do well (see the <a href="http://thespread.us/?p=307" title="Power ratings (Ranking, part 3)">power rankings
post</a> for
more on this). Jacksonville is basically in free fall until week 8 of
the season, when they won their first game of the year, and recovered
slightly by year's end. Houston and Washington round out the bottom
three. The league's most average team award belongs to the Ravens,
almost exactly average with an Elo rating of 0.27.</p>
<p><strong>Prediction.</strong> How predictive are these ratings? Using the previous
week's rating for each head to head matchup, I projected each game using
the metric listed above. Each home team is provided with a home field
advantage boost of 15 points in the projection (but this isn't added to
their actual rating). Without this HFA boost, the Elo ratings predict
the winner of each game (straight up) in 60% of 2013 games. If you
factor in HFA, it improves to 62.5% straight up. <a href="http://www.amazon.com/Whos-The-Science-Rating-Ranking/dp/0691154228">Langville and
Meyer</a> suggest
changing the value of <em>K </em>for the last weeks of the season to account
for meaningless games. I tried doing so, setting K = 16 instead of 32
for the final two weeks of the season, but this actually decreased
predictive accuracy to 62.08%.</p>
<p><strong>Road to the Super Bowl. </strong>How do things look for the teams that
ultimately made it to the Super Bowl?</p>
<p><a href="http://thespread.us/images/den_sea.png"><img alt="den_sea" src="http://thespread.us/images/den_sea.png" /></a></p>
<p>Seattle is ranked higher than Denver every week after week 1, and really
pulls away starting in week 9. However, because both teams are so good,
our week 17 prediction for the score differential of the eventual Super
Bowl matchup would have been very close. Using the logistic formula
above with Seattle's and Denver's week 17 rankings, Seattle is predicted
to score 50.91% of the points -- essentially a toss up.</p>
<p>Although we know Seattle ended up dominating in that game, there is also
an interesting methodological question. If the teams are so far apart in
ratings, why is the predicted outcome so close? The answer lies in the
logistic function used to make the prediction. Because we set the
logistic factor to be 1000, we know most games will be pretty close. The
difference in ratings has to be 1000 in order to make it 10 times more
likely for team i to beat team j. We could always experiment with
setting this to a different constant.</p>
<p>Next up, Markov ratings!</p>Power ratings (Ranking, part 3)2014-06-29T18:08:00-07:00treycauseytag:thespread.us,2014-06-29:power-ratings-ranking-part-3.html<p>Eigenvalues and eigenvectors</p>
<p>I <a href="http://treycausey.com/getting_started.html">recently posted some
thoughts</a> on what it takes
to get started in data science. Interestingly, one of what I thought was
among my least controversial claims raised many questions and a great
deal of doubt from readers. "How," they asked, "do you use linear
algebra?" I explained that many statistical problems have a compact
matrix representation, and many systems of linear equations can be
represented and solved using linear algebra. What I should have said
was, "to calculate power ratings."</p>
<!--more-->
<p>Let's take a break from World Cup excitement and return to the science
of ranking and rating NFL teams. We're going to get a little more
technical this time and introduce what are often called power ratings
and power rankings. Unlike many of the popular 'power ratings' that you
see in the sports media, these ratings are so-called because of a
mathematical algorithm involving exponents (powers) used in creating
them. However, because I'm using a <a href="http://python.org">scientific programming
language</a>, I'm going to use linear algebra and
estimate the ratings directly rather than through approximation. If
you're interested in finding out how to do the following in Excel, you
should definitely <a href="http://www.amazon.com/Whos-The-Science-Rating-Ranking/dp/0691154228/ref=sr_1_1?ie=UTF8&qid=1404087324&sr=8-1&keywords=who%27s+%231">buy the
book</a>.
What follows is an overview of this highly customizable technique.</p>
<p>This is a more math-intensive post than some, but I hope that reading
through it will provide you with some things to think about when you're
creating your own metrics.</p>
<p><strong>Strength. </strong>The basic tenet of <a href="http://www.math.utah.edu/~keener/">James
Keener's</a> method for producing
ratings and rankings is as follows: every team has some measure
of <em>strength</em> that we can observe. We get to choose what this metric is,
as long as it is <em>relational</em>-- that is, it must describe events between
two teams, <em>i </em>and <em>j. </em>Our measure of strength could be wins, it could
be points scored, it could be number of first downs, passing yards, etc.
It all boils down to the idea that <em>s~ij~</em> is a single <em>nonnegative</em>
number (this is important) that describes what happened when
team <em>i </em>played team <em>j</em>. Going forward, I'm going to be using <strong>points
scored</strong> as my measure of strength. When team <em>i</em> plays team <em>j </em>and
wins 45-30, <em>s~ij~</em> = 45 and <em>s~ji~</em> = 30. These are additive, so if team
<em>i </em> and <em>j </em>play again and the score is 10-14, we update our two
strength scores to 55 and 44, respectively.</p>
<p><strong>Ratings. </strong>Keener proposes that there exists some unknown but knowable
vector of ratings <em>r</em> that relate this measure of strength to all other
teams (relative strength). Relative strength must take into account not
only how good team <em>i</em> is, but how good team <em>j</em> is and how good all the
teams that team <em>j</em> has played are, and so forth. Further, this method
asserts that there is a proportionality constant lambda that
describes the relationship between our observed measure of strength <em>s</em>
and our unknown ratings <em>r</em>. Finding <em>r</em> and lambda are the goal.</p>
<p>How we do that involves making a number of decisions -- this is both an
upside and a downside to this method. If you like to tweak and
experiment, this is a great method for rating and ranking. If you just
want to know who's the best team, you have some work ahead of you.</p>
<p><strong>Smoothing. </strong>Recall in a <a href="http://thespread.us/?p=279" title="Points are great, but what about win percentage? (Ranking, part 2)">previous
post</a>,
we used Laplace's Rule of Succession to "smooth" out our estimates.
We'll do the same here, meaning that our strength scores will actually
be <em>a~ij~</em> = (points scored by team <em>i</em> + 1) / (points scored by both
teams + 2). These new scores, <em>a</em>, can be roughly interpreted as the
probability that team <em>i</em> will beat team <em>j</em> in the future. These values
are by definition between 0 and 1 and are directly comparable.</p>
<p><strong>Skewing</strong>. One of the dangers of using points scored is that we
can give too much credit to teams that Run Up The Score or teams that
constantly find themselves locked in low-scoring dogfights. We can apply
a non-linear transformation function to our strength scores and try and
adjust for this. Langville and Meyer call this "skewing", but I prefer
"skew-adjusting." You can use pretty much any function here that
constrains values between 0 and 1, but an easy one to use is:
<p><a href="http://thespread.us/images/Screenshot-from-2014-06-29-173629.png"><img alt="Skewing function from Who's #1" src="http://thespread.us/images/Screenshot-from-2014-06-29-173629.png" /></a></p>
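<p>For reference, the function in the image above -- assuming I've transcribed the formula from <em>Who's #1</em> correctly -- is h(x) = 1/2 + sgn(x - 1/2) * sqrt(|2x - 1|) / 2, which in code is:</p>

```python
import numpy as np

def skew(x):
    """Skew-adjust a strength score in [0, 1]: pushes middling scores
    apart and lessens the impact of being near the extremes.
    Assumes the formula from Who's #1 is transcribed correctly."""
    x = np.asarray(x, dtype=float)
    return 0.5 + 0.5 * np.sign(x - 0.5) * np.sqrt(np.abs(2 * x - 1))
```

<p>Note that h(0) = 0, h(0.5) = 0.5, and h(1) = 1, so the endpoints and the midpoint are fixed while everything in between gets spread out.</p>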
<p>What does this mean in practice? Here's the function plotted over the
interval [0, 1]. The dotted black line is the linear function x = y,
whereas the orange line is the above function. The x-axis is the
untransformed strength score and the y-axis is the skew-adjusted score.
You can see that scores that hover around 0.5 are pushed further apart,
and the impact of being close to one of the extremes (0 or 1) is
lessened in the transformed function.</p>
<p><a href="http://thespread.us/images/skew_adjusted_strength.png"><img alt="skew_adjusted_strength" src="http://thespread.us/images/skew_adjusted_strength.png" /></a></p>
<p>Let's take a look at the effect that this has. Using the 2013 NFL season
and points scored (smoothed), this is the distribution of non-zero
strength scores.</p>
<p><a href="http://thespread.us/images/non_zero_strength.png"><img alt="non_zero_strength" src="http://thespread.us/images/non_zero_strength.png" /></a></p>
<p>As we'd expect in an "Any Given Sunday" league with lots of parity, most
of the scores are centered around 0.5, meaning that most teams have a
roughly even chance of beating most other teams. What does this
distribution look like after skew-adjusting?</p>
<p><a href="http://thespread.us/images/skewed_strength.png"><img alt="skewed_strength" src="http://thespread.us/images/skewed_strength.png" /></a></p>
<p>Whoa! Now almost no one's around 0.5, and we have a more uniformly
distributed set of scores. If we have reason to believe that the NFL is
a highly unequal league, we may consider using this distribution. I
don't necessarily think that it is, but it's good to explore your
options.</p>
<p><strong>Eigenvalues and eigenvectors. </strong>Now let's actually rate and rank some
teams. A full explanation of eigenvalues and eigenvectors is well beyond
the scope of this post, but they play a vital role across a broad range
of statistical and mathematical problems. And they end up being really
useful quantities for describing matrices of data. In fact, the
eigenvalue that we're interested in finding is the lambda described
above -- that proportionality constant that describes the relationship
between our observed strength scores and our goal, the ratings scores.
This one number, lambda, allows us to transform our matrix of offensive
performance into a 32-team eigenvector.</p>
<p>Confusing, I know. Luckily, there are functions in
<a href="http://scipy.org">SciPy</a> for finding eigenvalues and eigenvectors.
That's just what I've done and all of the code is on
<a href="https://github.com/treycausey/thespread/tree/master/ranking">GitHub</a>.
The tricky part is that eigenvalues can be complex numbers (remember,
those numbers that can have both real and imaginary components? Of
course you do). For our purposes, we're looking for the largest positive
eigenvalue without an imaginary component. It ends up being 6.415 for
the 2013 season. What does that mean? It's not important. What <strong>is</strong>
important is that this constant is the same for every team, because
strength of schedule is already factored into our calculations.</p>
<p>We can then use this constant to produce our ratings vector <em>r</em> (the
eigenvector). We normalize this vector so that the values sum to 1.0 and
are directly comparable no matter what measure of strength we use above.</p>
<p><strong>The ratings and rankings.</strong></p>
<p>[table id=14 /]</p>
<p>Well, once again, we see Seattle's on top of the rankings. They really
did have a historic season. San Francisco and Carolina also figure in
the top 3. Cincinnati makes its top 5 debut at number 4, beating out
Denver! This seems controversial to me. We also see the Giants third
from bottom, the lowest they've been ranked so far.</p>
<p>What happens if we use the skew-adjusted ratings?</p>
<p>[table id=15 /]</p>
<p>According to this, the Bengals were the second-best team in the NFL in
2013! I'm not sure about that. They did go 11-5 and beat the Patriots,
but... Anyway. I think the slightly differing opinions that each rating
algorithm has provided have underscored something I've talked about a
lot on this blog: the need for <a href="http://thespread.us/?p=30" title="Win probability, uncertainty, and overfitting">ensemble
models</a>.
We'll be returning to that as the rating and ranking series of posts
draws to a close.</p>How consistent are these ratings?2014-05-25T18:56:00-07:00treycauseytag:thespread.us,2014-05-25:how-consistent-are-these-ratings.html<p>Historical rankings and stability</p>
<p>It's not time yet for the next ranking algorithm (power rankings!), so I
wanted to take time for a quick digression. This was prompted when I was
<a href="https://twitter.com/shanemcr/status/470461133074075649">asked</a> where
Baltimore ranked in 2012. After all, no one picked them to make it to,
much less win, the Super Bowl.</p>
<!--more-->
<p>Good question. Now, obviously, the best team in the league doesn't
always win the Super Bowl. 2013 was somewhat of an anomaly in that way.
Remember all of the "The #1 offense vs. the #1 defense" and "the best
two teams in the league, as it should be" narratives leading up to the
game?</p>
<p>To my surprise, Houston was the #1 team in the Colley ratings (using
adjusted win percentage) in 2012, and Baltimore didn't even check into
the top 10! Seattle, who many thought was the best team to exit the
playoffs, is a meager #7. However, if we include the margin of victory
vector in our calculation to produce the Colley-Massey ratings, Seattle
shoots up to #2, New England takes the #1 spot, and Houston tumbles to
#9. If you think back to 2012, Houston won a number of close games,
including two in overtime, and had two blowout losses.</p>
<p>This got me to thinking. How consistent are teams from year to year in
the ratings? To find out, I computed the Colley and the Colley-Massey
ratings over the 2002-2013 seasons (2002 to cover expansion). The new
code is on <a href="http://github.com/treycausey/thespread/ranking/">Github</a>.
Here are the Colley ratings over time, along with each team's biggest
year-to-year change. The presentation isn't ideal, sorry.</p>
<p>[table id=11 /]</p>
<p>The biggest hero-to-zero story, according to Colley's simple method, is
actually the 2012 Texans, dropping 0.51 from their all-time high in 2012
to 0.22 in 2013. In the other direction, the 2004 Steelers shot up 0.54
points from the previous year.</p>
<p>If we factor in margin of victory and use the combined Colley-Massey
ratings, we see a slightly different story.</p>
<p>[table id=12 /]</p>
<p>The 2013 Kansas City Chiefs are the biggest gainers, up 17.85 points
from their 2012 rating of -12.25. Interestingly, this still leaves them
with a 2013 rating of 5.59, which isn't super impressive. If you'll
recall, we can interpret this as their expected margin of victory
against an average team. Lack of offensive production strikes again.</p>
<p>In the other direction, the 2004 San Francisco 49ers plummeted 14.51
points from an already unimpressive rating of 2.73 in 2003. I'm sure
many 49ers fans won't be surprised to find out that they aren't in the
black again until 2009 (just barely, at 0.58), and don't have two
consecutive years with a positive rating until 2011 and 2012. Quite the
turnaround.</p>
<p><strong>Standings</strong></p>
<p>Over the entire time period, you won't be surprised to learn that New
England and Indianapolis are the top-ranked teams (who can forget the
endless stream of Brady-Manning matchups?) with average ratings of 0.74
and 0.67, respectively. Detroit is the bottom-rated team, with Oakland
and Cleveland following. The median team is somewhere between the
Bengals and the Falcons.</p>
<p>When factoring in margin of victory, New England is first, and it's not
even close. Their average rating is 8.43. The second place team is
Pittsburgh at 3.83. Something something RUTS.</p>
<p><strong>Volatility</strong></p>
<p>How volatile are teams' ratings? Depends on which system you use. To
find out, I took the standard deviation of the year-to-year difference
for each team 2003-2013. Using the win-percentage based Colley ratings,
Pittsburgh is the most volatile, with an average rating of 0.60 (3rd
overall!) but a standard deviation of 0.24. Baltimore is the most
volatile when using margin of victory, with a standard deviation of
7.64. Cleveland is, unfortunately for Browns fans, the most consistent,
with a standard deviation of 3.09.</p>
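<p>For the curious, the volatility computation itself is short. Here's a sketch using pandas with made-up ratings for two teams (hypothetical numbers, not the actual Colley or Colley-Massey values):</p>

```python
import pandas as pd

# Hypothetical ratings, one row per (team, season) -- not the real numbers.
ratings = pd.DataFrame({
    "team":   ["PIT"] * 4 + ["CLE"] * 4,
    "season": [2003, 2004, 2005, 2006] * 2,
    "rating": [0.30, 0.84, 0.55, 0.70,   # volatile
               0.40, 0.38, 0.42, 0.39],  # consistent
})

# Volatility = standard deviation of each team's year-to-year changes.
volatility = (ratings.sort_values(["team", "season"])
                     .groupby("team")["rating"]
                     .apply(lambda s: s.diff().std()))
```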
<p>That's a lot of numbers and lists to throw at you. Next up, we'll tackle
power rankings and try to figure out how we can get some predictive
power out of all these ratings.</p>
<p>Happy Memorial Day!</p>Points are great, but what about win percentage? (Ranking, part 2)2014-05-24T08:51:00-07:00treycauseytag:thespread.us,2014-05-24:points-are-great-but-what-about-win-percentage-ranking-part-2.html<p>Colley's Method</p>
<p>Last time I produced a set of rankings and ratings of NFL teams in 2013
using Massey's method, which was fundamentally just a least-squares
solution with margin of victory as the outcome variable. That's great,
you might say, but what about teams that run up the score in some games
but also lose a bunch of games? Or what about teams that win all of
their games by a small margin? For that, we turn to Colley's method.</p>
<p><small>[This is the 2nd part of a series of posts using linear algebra
to rate and rank NFL teams using <a href="http://www.amazon.com/Whos-1-Science-Rating-Ranking-ebook/dp/B0076LOSHC/ref=sr_1_1?ie=UTF8&qid=1400435551&sr=8-1&keywords=who%27s+number+1">Who's
#1?</a>.
I definitely encourage you to buy the book and follow along. All of the
code for this post can be found on
<a href="https://github.com/treycausey/thespread/blob/master/ranking/chapter3.py">Github</a>.]</small></p>
<!--more-->
<p>Note: <a href="#results">Click here</a> if you want to skip past all the gory math
details and get to the ratings and what this means for learning how to
do data science for sports.</p>
<p>Like Massey, Colley's work was incorporated into the BCS (RIP). Instead
of using margin of victory, Colley proposes that win percentage (number
of games won divided by the number of games played) is the best measure
of team quality. By making a slight modification to the win percentage
formula, Colley also contends that strength of schedule (SOS) is
factored in to the generated ratings.</p>
<p>Instead of just taking wins / games, we'll modify the win percentage
formula to be:</p>
<p><strong>(1 + wins) / (2 + games)</strong></p>
<p>This small modification buys us a couple of things. First, it means that
win percentage is always defined, even when no games have been played.
Without this, teams that have played zero games have an undefined win
percentage, and computers hate trying to divide by zero.</p>
<p>Second, it acts as a quasi-Bayesian prior. In the absence of any
information, the win percentage of all teams is equal to 0.5. As games
are played, we can watch the win percentage move up and down relative to
this number towards the team's "true" win percentage. This is also a
concept closely related to <a href="http://en.wikipedia.org/wiki/Rule_of_succession">Laplace's rule of
succession</a> and
<a href="http://en.wikipedia.org/wiki/Additive_smoothing">smoothing</a>, topics
that often emerge in data science.</p>
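<p>As a function, the modified win percentage is a one-liner:</p>

```python
def colley_win_pct(wins, games):
    """Colley's modified win percentage: (1 + wins) / (2 + games).
    Always defined -- a team with no games yet sits at exactly 0.5."""
    return (1 + wins) / (2 + games)

# A 12-4 team gets credited with slightly less than its raw 0.75.
smoothed = colley_win_pct(12, 16)  # 13/18, about 0.722
```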
<p>If you're skeptical that this allows us to factor in SOS, I encourage
you to read <a href="http://www.amazon.com/Whos-1-Science-Rating-Ranking-ebook/dp/B0076LOSHC/ref=sr_1_1?ie=UTF8&qid=1400435551&sr=8-1&keywords=who%27s+number+1">the
book</a>,
but for now you'll just have to trust me.</p>
<p>Just as before, what we're fundamentally trying to do is produce some
unknown (but real) vector of <strong>ratings</strong>, which we'll call <em>r</em>, that we
can then put into an ordered list to produce <strong>rankings</strong>. Our setup is
going to be very similar to Massey's method, and will be expressed as
the following equation:</p>
<p><strong>Cr = b</strong></p>
<p>The <em>C</em> in that equation is very similar to the <em>M</em> matrix that we set
up for <a href="http://thespread.us/?p=258" title="Ranking algorithms and the NFL (Part 1 of a series)">Massey's
method</a>.
The <em>b</em> vector is equal to:</p>
<p><strong>b_i = 1 + .5(wins_i - losses_i)</strong></p>
<p>where you substitute each team's information in for each team.</p>
<p><strong><a id="results"></a>Results</strong></p>
<p>All that's left then is to use linear algebra to solve for <em>r</em> to get
the team ratings. We do this using numpy's linalg.solve method, and get
the following ratings and rankings.</p>
<p>[table id=9 /]</p>
<p>No surprise, Seattle is still number one, but we have a new entrant at
number two -- the Panthers. This is quite interesting. Despite having a
lower win percentage than the Super Bowl also-rans Broncos, they have a
higher rating using Colley's method. Why could this be?</p>
<p>The simplest answer is strength of schedule. Carolina played games
against the toughest division in the NFL, the NFC West, and won games
against the Rams and 49ers. They also played the Saints and the
Patriots. <a href="http://www.amazon.com/Whos-1-Science-Rating-Ranking-ebook/dp/B0076LOSHC/ref=sr_1_1?ie=UTF8&qid=1400435551&sr=8-1&keywords=who%27s+number+1">NFL.com
argues</a>
that the Panthers had the toughest schedule in all of football in 2013.
Yet, they still produced 12 wins.</p>
<p>At the bottom of the list, we see Washington. Jacksonville, who were the
worst team according to Massey's method, are only the fifth-worst team
using Colley's method, better than Washington, Oakland, Cleveland, and
Houston. Like the Panthers, this is because of the tough schedule they
played.</p>
<p><strong>Combining Colley and Massey</strong></p>
<p>Most analytically minded fans will tell you that you can't just take
into account wins, you have to factor in points. Wins are affected too
much by luck. What if we could combine the margin of victory part of
Massey's method with the win percentage and SOS components of Colley's
method? We can!</p>
<p>All we need to do is change our outcome vector from modified win
percentage to the margin of victory vector from the Massey example. This
produces the formula:</p>
<p><strong>Cr = p</strong></p>
<p>This produces the following ratings and rankings:</p>
<p>[table id=10 /]</p>
<p>This pulls Denver back into the #2 spot, followed by San Francisco.
Carolina, the big mover, drops down to the #4 spot. Denver's vaunted
offensive production in 2013 is just too much and overwhelms Carolina's
gains from strength of schedule using "pure" Colley ratings.
Unfortunately for the Jaguars, this pulls them back into last place, I'm
guessing due to lack of offense.</p>
<p><strong>Take-aways</strong></p>
<p>There are more rating and ranking methods to come, but you're probably
wondering which one is <em>the right one</em>. There isn't one. It all depends
on what you're trying to measure and, if you're making decisions, what
you're trying to optimize for. Knowing the correct <em>metric</em> for your
data science problem is a huge part of doing good work. This can't be
overstated. Optimizing for the wrong metrics will burn you in the end.</p>
<p>If we take a pure machine learning perspective, the best rating method
is the one that has the best predictive power for some outcome like
wins. So, we'll need to validate these rating methods for predictive
accuracy. So far, we've only done retrospectives on a season that's
already been played. Many would argue it's no secret that the Seahawks
were the best team in football and any ranking method that doesn't put
them at the top of the pile is a bad one.</p>
<p>That brings me to a final point. The fact that these methods have so far
produced roughly similar rankings isn't necessarily a bad thing -- it's
probably a good thing. Social scientists call this <a href="http://en.wikipedia.org/wiki/Convergent_validity">convergent
validity</a>. In an ideal
world, where we know something about team quality, measures of team
quality should roughly correlate with one another. A counterintuitive
finding is nice and generates pageviews, but if a finding runs too
counter to common sense, there's probably something wrong with it.</p>Ranking algorithms and the NFL (Part 1 of a series)2014-05-18T11:56:00-07:00treycauseytag:thespread.us,2014-05-18:ranking-algorithms-and-the-nfl-part-1-of-a-series.html<p>Rating and ranking</p>
<p>I recently picked up <a href="http://www.amazon.com/Whos-1-Science-Rating-Ranking-ebook/dp/B0076LOSHC/ref=sr_1_1?ie=UTF8&qid=1400435551&sr=8-1&keywords=who%27s+number+1">Who's #1?: The Science of Rating and
Ranking</a>,
a really fun read on the many ways to take a list of items and order
them by some score. Obviously, rankings are a huge topic of interest in
sports, and my day job is working on <a href="http://en.wikipedia.org/wiki/Recommender_system">recommender
systems</a>, so I saw this
as the natural intersection of these things. The authors, Langville and
Meyer, use college football as a running example throughout the text and
I thought it would be a good exercise for myself and for readers if I
worked through the various algorithms in the book using NFL data. The
book requires a basic understanding of linear algebra, but don't let
that stop you from reading.</p>
<!--more-->
<p>I'll be using the recently released 2013 data from <a href="http://armchairanalysis.com">Armchair
Analysis</a>, but you could get this data just
about anywhere.</p>
<p>[Sidenote: If you want to jump immediately to the technical end of
things, I recommend you check out Sean Taylor's <a href="https://github.com/seanjtaylor/NFLRanking">work on NFL
rankings</a>. In the spirit of
this blog, I'm approaching this as an educational exercise instead of a
finished product.]</p>
<p>Ratings and rankings are hugely important and sports fans are often
obsessed with finding out who's <em>really</em> number one. I'm going to use
the shorthand "ranking" for the rest of these posts, but you should be
aware that "rankings" are ordered lists and "ratings" are numerical
scores attached to those lists. Rankings give us an idea of the order,
ratings give us an idea of the magnitude of that ordering (i.e., <em>how
much</em> better is number one than number two).</p>
<p><strong>Massey's Least Squares Method</strong></p>
<p>The first method we'll be visiting is Massey's Least Squares Method,
which originated as part of <a href="http://www.masseyratings.com/">Kenneth
Massey</a>'s undergraduate thesis. The basic
idea is that we can find some set of coefficients, <em>r, </em>that describe
the relative strength of a team. To compare two teams, <em>i</em> and <em>j</em>, you
simply subtract team <em>j</em>'s rating from team <em>i</em>'s rating to produce a
rough estimate of team <em>i</em>'s margin of victory. Massey's ranking work has
since moved on from this method (and he contributed to the BCS), but the
intuition behind this method is a good starting point.</p>
<p>I'll skip as many gory math details as possible, but this algorithm
boils down into a simple formula:</p>
<p><strong>Mr = p</strong></p>
<p>where <strong>M</strong> is a 32x32 matrix (there are 32 teams), r is a vector of
size 32, and p, a vector of size 32.</p>
<p><strong>M</strong>, being a square matrix, has the number of games played by each
team on the diagonal, and the negation of the number of times each team
played head-to-head on the off-diagonal spots.</p>
<p>For instance, if we sort the list of teams alphabetically by
abbreviation, we get the Arizona Cardinals in position 0 and the Atlanta
Falcons in position 1. That means the the (0, 0) cell of <strong>M</strong> would be
16 (the number of games played by the Cardinals) and the (0, 1 -- row 0,
column 1) cell of <strong>M</strong> would be a -1 because the Cardinals and the
Falcons played once in 2013.</p>
<p>Still with me? Good.</p>
<p>The <em>p</em> vector is just a list (in the same team order as the M matrix --
important!) of the end-of-regular-season point differentials for each
team (points scored - points allowed). For Arizona, this is 55 (for the
Denver Broncos, this is 207). So, our basic proxy for team strength is
point differential. We all know there are many problems with this, but
it's certainly part of the equation.</p>
<p>We need to solve for the last vector in that equation, <em>r, </em>which is the
set of coefficients that give us the mathematical relationship between
the teams and their score differentials. Anyone who's taken a statistics
class that uses linear algebra to teach regression might recognize the
above equation in a different form, <strong>Xb = y</strong> where X are our
covariates, b is the set of beta coefficients, and y is our dependent
variable. We're doing something very similar here, which is why this is
called the <strong>least squares method</strong>.</p>
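<p>Here's what that looks like on a made-up three-team schedule. One wrinkle the gory details include: <strong>M</strong> as defined is singular (every row sums to zero), so the usual fix, which Langville and Meyer use, is to replace one row with ones and the matching entry of <em>p</em> with zero, pinning the ratings to sum to zero.</p>

```python
import numpy as np

# Hypothetical 3-team schedule: A beat B by 10, A beat C by 3, B beat C by 7.
games = [(0, 1, 10), (0, 2, 3), (1, 2, 7)]  # (winner, loser, margin)
n = 3

M = np.zeros((n, n))
p = np.zeros(n)
for w, l, margin in games:
    M[w, w] += 1; M[l, l] += 1      # games played on the diagonal
    M[w, l] -= 1; M[l, w] -= 1      # negated head-to-head counts
    p[w] += margin; p[l] -= margin  # season point differentials

# M's rows sum to zero, so the system is singular; pin it down by
# forcing the ratings to sum to zero.
M[-1, :] = 1.0
p[-1] = 0.0
r = np.linalg.solve(M, p)  # r[i] - r[j] estimates i's margin over j
```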
<p><strong>Results - Overall Team Strength</strong></p>
<p>You can check
<a href="https://github.com/treycausey/thespread/tree/master/ranking">Github</a>
for the gory details, but doing all of this math produces the list
below. Nothing too counterintuitive emerges (though this is usually good
for first steps). Seattle and Denver are the #1 and #2 teams.
Jacksonville and Washington are the bottom two teams.</p>
<p>[table id=5 /]</p>
<p><strong>Offensive and Defensive Ratings</strong></p>
<p>We all know, however, that teams can be very different on opposite sides
of the ball. Massey quite rightly recognizes that point differentials
are composed of points scored and points allowed, so we should be able
to decompose our team strength scores into offensive and defensive
strength. In addition to decomposing the points, we'll need to
decompose our <strong>M</strong> matrix into its diagonal and off-diagonal elements. This
gives us two new matrices, <strong>T</strong> and <strong>P</strong> (again, check
<a href="https://github.com/treycausey/thespread/tree/master/ranking">Github</a>
for the gory details).</p>
<p>We can use the ratings <em>r</em> that we computed above to find two new
vectors, <em>d</em> and <em>o</em>, by solving the following equation, in
which <em>f</em> is the vector of points scored ("points for") by each team:</p>
<p><strong>(T + P) d = Tr - f</strong></p>
<p>This will produce <em>d</em>, the defensive ratings, and we can find <em>o</em> by
simply solving for it in:</p>
<p><strong>r = o + d</strong></p>
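<p>A self-contained sketch of the decomposition, on a hypothetical three-team season (A beat B 20-10, A beat C 17-14, B beat C 21-14). Here I'm taking <em>f</em> to be the vector of points scored by each team, and <strong>T</strong> and <strong>P</strong> to be the diagonal and (negated) off-diagonal parts of <strong>M</strong>:</p>

```python
import numpy as np

# Hypothetical 3-team season: A beat B 20-10, A beat C 17-14, B beat C 21-14.
M = np.array([[ 2., -1., -1.],
              [-1.,  2., -1.],
              [-1., -1.,  2.]])
p = np.array([13., -3., -10.])  # point differentials
f = np.array([37., 31., 28.])   # points scored ("points for")

# Overall Massey ratings (pin the singular system so ratings sum to zero).
A = M.copy(); A[-1, :] = 1.0
b = p.copy(); b[-1] = 0.0
r = np.linalg.solve(A, b)

# M = T - P: T holds games played (diagonal), P the head-to-head counts.
T = np.diag(np.diag(M))
P = T - M

d = np.linalg.solve(T + P, T @ r - f)  # defensive ratings
o = r - d                              # offensive ratings, since r = o + d
```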
<p><em>Offense</em></p>
<p>[table id=6 /]</p>
<p>On offense, things start to look slightly less intuitive, though it's
heartening to see that Denver is leaps and bounds better than the number
2 team, New England. Of course, these ratings are not really that useful
if they're not predictive. That's a post for later in the series.</p>
<p><em>Defense</em></p>
<p>[table id=7 /]</p>
<p>Things look relatively reasonable! We've just constructed our first set
of ratings and rankings. All of the code is up on
<a href="https://github.com/treycausey/thespread/tree/master/ranking">Github</a>.</p>Massey Least Squares Simple Ratings2014-05-18T11:25:00-07:00treycauseytag:thespread.us,2014-05-18:massey-least-squares-simple-ratings.html<p>[["ARI","42.67"],["ATL","33.37"],["BAL","32.69"],["BUF","33.02"],["CAR","45.53"],["CHI","32.1"],["CIN","41.58"],["CLE","28.54"],["DAL","35.54"],["DEN","47.6"],["DET","34.59"],["GB","31.37"],["HOU","28.64"],["IND","40.26"],["JAC","25.13"],["KC","42.31"],["MIA","35.39"],["MIN","31.37"],["NE","42.12"],["NO","45.0"],["NYG","30.84"],["NYJ","30.14"],["OAK","28.24"],["PHI","38.08"],["PIT","34.28"],["SD","38.89"],["SEA","49.37"],["SF","46.27"],["STL","38.45"],["TB","33.56"],["TEN","35.46"],["WAS","26.94"]]</p>Outliers2014-04-13T13:40:00-07:00treycauseytag:thespread.us,2014-04-13:outliers.html<p>... or extreme values?</p>
<p>We've all heard the word (thanks in no small part to Malcolm Gladwell).
What exactly are they and what do we do about them?</p>
<!--more-->
<p>[Note: I'm aware that the plots have been cropped and
pixelated. I'm walking out the door now, but will repair ASAP.]</p>
<p>Many machine learning and data analysis tutorials out there often
contain some version of the following phrase as one of the preliminary
steps to building a model: "Identify outliers in your data and remove
them." Sounds simple, right? Unfortunately, almost none of these
tutorials spend any time talking about what an outlier <em>actually is</em> and
what the consequences of removing data that fairly or unfairly gets
labeled as an outlier does to your model.</p>
<p>I'll try to correct this and walk through a contrived example using
football data to show you what you can do with your data points that may
or may not be outliers.</p>
<p>As it turns out, there are lots of definitions of outliers and there's
no strong agreement on what it means to be an outlier. Many people
generally take an outlier to mean a data point that is unlike the other
data points in your sample. More formally, this may mean that the
offending data was produced by a different <em>data generating process, </em>or
less formally, that it belongs to a different distribution than the
other data points.</p>
<p>Let's walk through an example. The plot below is a (very ugly) scatter
plot of field goal and extra point attempts. The x-axis is the
distance in yards to the end zone and the y-axis is the number of
seconds remaining in the game.</p>
<p><a href="http://thespread.us/images/field_goal_attemps_by_seconds.png"><img alt="field_goal_attemps_by_seconds" src="http://thespread.us/images/field_goal_attemps_by_seconds.png" /></a></p>
<p>Notice that one attempt that sticks out from the rest? It's 58 yards
from the end zone (which means an even longer field goal attempt). It's
probably not <strong>that</strong> surprising to learn that it was a field goal
attempt by the Raiders. Let's say we're trying to build a simple model
and determine if there is a relationship between field goal distance
and the time remaining in the game
(it's very clear there isn't really one, but I did say the example was
contrived). Do we throw that data point out as an outlier? <strong>Is</strong> it an
outlier?</p>
<p><em>Easy first steps</em></p>
<p>The first few things we can do are easy. We can look at the distribution
of field goal attempts.</p>
<p><a href="http://thespread.us/images/field_goal_distance_frequency2.png"><img alt="field_goal_distance_frequency" src="http://thespread.us/images/field_goal_distance_frequency2.png" /></a></p>
<p>From this we can see that the data are not really close to being
normally distributed (in fact, they're almost uniformly
distributed!) The mean of this distribution is 18.54 yards and the
standard deviation is 9.98 yards. Since the distribution isn't normal,
it wouldn't make sense to just say that we'll exclude the Raiders'
attempt because it's more than two standard deviations above the mean.</p>
<p>In this case, we have a big sample (11,329 attempts), so we don't have
to worry too much about one potential outlier skewing our measure of the
center of the distribution. The mean is 18.54 yards, and the median is
19 yards. We could also look at the <em>interquartile range</em>, the data
contained between the 25th and 75th percentiles. That puts us between 10
yards and 27 yards. If we excluded everything outside the IQR, we'd be
throwing away a lot of data!</p>
<p>So far, there isn't a really strong case for excluding this value.
Before we get more complicated, let's ask ourselves a more existential
question.</p>
<p><em>What do we really <strong>want</strong> from outliers?</em></p>
<p>Why do we worry about outliers at all? We worry about data points
skewing our models and our conclusions and giving us the wrong answer
when we try to generalize beyond our sample. But when we throw away
perfectly good data toward that end, we're actually <strong>guaranteeing</strong>
that we'll do the very thing we're trying to avoid.</p>
<p>So, you have to ask yourself. Do you want to be better at predicting
events that are closer to the average event, knowing that you might get
blindsided by a rare event? Or do you want to include those rare events,
account for their rarity somehow, and build a more robust model? You
probably won't be surprised to learn that I favor the latter option.
That's what <a href="http://thespread.us/?p=200" title="Forecasting QB performance using multilevel regression">multilevel
models</a>
are for.</p>
<p><em>More complicated tactics</em></p>
<p>I estimated a linear regression of distance from the end zone on seconds
remaining in the game. You won't be surprised to learn that the effect
is not significant. When working in a regression framework, a popular
test for outliers is to estimate <a href="http://en.wikipedia.org/wiki/Cook's_distance">Cook's
distance</a> (commonly
referred to as Cook's <em>d</em>). Essentially, Cook's distance tries to find
out just how <em>influential</em> data points in a regression model are. By
estimating the same regression many times, each time omitting a data
point, you find out how much your model's estimates change when a given
data point is omitted.</p>
<p>Statsmodels, a popular linear model package for Python, can estimate
Cook's <em>d </em>for us easily. Here's a plot of Cook's <em>d </em>for each of the
field goal and extra point attempts.</p>
<p><a href="http://thespread.us/images/cooks_distance.png"><img alt="cooks_distance" src="http://thespread.us/images/cooks_distance.png" /></a></p>
<p>The majority of these values are vanishingly small. There's no
hard-and-fast cutoff for how big is too big, but a value of more than
one is often used as a quick heuristic to take another look at that
case. That <strong>doesn't</strong> mean you automatically exclude it, just that you
might give it some attention.</p>
<p><em>Machine learning approaches</em></p>
<p>This post is already growing too long, but we need to give some
attention to machine learning, which has developed its own approaches to
outlier detection. One approach is to use a <a href="http://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html#example-svm-plot-oneclass-py">one-class support vector
machine</a>.
Technically this is known as <em>novelty detection</em>, but since we're trying
to just figure out which of our cases are unusual, that might be a good
first step. Other options rely on <em>distance functions</em>, like the
jauntily named <a href="http://scikit-learn.org/stable/auto_examples/covariance/plot_mahalanobis_distances.html#example-covariance-plot-mahalanobis-distances-py">Mahalanobis
distance</a>. Scikit-learn,
the most well-supported machine learning package for Python, has
functions implemented for each of these.</p>
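<p>A minimal scikit-learn sketch of the one-class SVM approach, again on simulated data (the parameter choices are illustrative, not tuned):</p>

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
# Simulated 2-D features standing in for (distance, seconds remaining).
X = rng.normal(size=(500, 2))

# nu is (roughly) the fraction of training points allowed outside the
# learned boundary -- set it near the share you suspect is unusual.
detector = OneClassSVM(nu=0.01, kernel="rbf", gamma="scale").fit(X)
labels = detector.predict(X)  # +1 = looks normal, -1 = flagged as unusual
```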
<p><em>Take aways</em></p>
<p>In general, we want to throw away as little good data as possible.
Obviously, if the data we have is corrupted in some way (e.g., someone
entered 999 yards to go instead of 99), we want to get rid of that.
However, in many cases it's just not that clear if the data is "bad" or
not. I prefer to err on the side of keeping the data in and trying to
build in my uncertainty around it.</p>
<p>Especially in a sport with such small sample sizes as football, it's
important to make the most of your data whenever you can. If your
ultimate goal is (and should be) to minimize prediction error on new
data, you need the best, most representative sample you can get. A good
idea is often to build your models with and without the suspect data
points and see how different they are. Computation is cheap. Inferential
mistakes are expensive.</p>
<p>Finally, I would like to make a pitch to eliminate the word 'outlier'
from most people's vocabulary. As I've tried to drive home, it's often
not clear what it even means to be an outlier. Instead, I propose that
we use the phrase 'extreme value' to indicate that we are aware that a
particular data point is far from the mean/median/mode, but that we
don't know for sure if it's been produced by a different data generating
process or not.</p>Sloan recap2014-03-02T05:03:00-08:00treycauseytag:thespread.us,2014-03-02:sloan-recap.html<p>Ups and downs</p>
<p>This weekend, I attended the MIT Sloan Sports Analytics Conference in
Boston. Sloan is part trade show, part research conference, and part
see-and-be-seen affair. Despite the focus of this blog, I attended
panels across a wide variety of sports. I'll offer what I saw as common
themes across panels and then offer some more specific comments.</p>
<!--more-->
<p><strong>Injuries</strong></p>
<p>Nearly every panel discussed the need to improve the state of injury
analytics: forecasting injuries, modeling injury recovery time, and
injury prevention. Scott Pioli of the Atlanta Falcons stressed that the
human body, not on-field decision-making, was the next really big
frontier in analytics. He described how players self-monitor their
hydration levels by comparing the color of their urine to a color-coded
chart provided by Gatorade that hangs over the urinals in the locker
rooms. Given the amount of attention being paid in the tech sector to
health and the body, I wouldn't be surprised to see quick movement here.</p>
<p>The fact that so many executives were asking for more analytics in this
area undercuts my next point about communication. Instead, as multiple
conversations I had with people in sports seemed to confirm, injury
analytics are attractive a) because they have immediate implications for
the bottom line and b) because there's often no conventional wisdom or a
certain way that things have always been done in the league. Thus,
analysts are not fighting to be heard.</p>
<p><strong>Communication</strong></p>
<p>Similarly, nearly every panel featured a front-office person that
stressed that analytics won't make any inroads until people are able to
communicate the results of their analyses. On the one hand, I completely
agree with this. Knowing how to effectively present quantitative
information to non-experts is a skill that few have mastered.</p>
<p>On the other hand, I think this is something of a canard, and a way for
skeptics of analytics to push off having to address uncomfortable
findings for the time being. Stan Van Gundy, who some thought stole the
show at the Basketball Analytics panel, argued that he'd believe
analytics as soon as someone would "show [him] the science." Those
familiar with anti-science rhetoric will recognize this tactic (as <a href="https://twitter.com/AndyGlockner/status/439510576176578560">Andy
Glockner</a>
said, "SVG is anti-vaccine, based on this panel discussion +140"). I'm
guessing that the general manager doesn't ask the offensive or defensive
coordinator to dumb it down for him or her.</p>
<p><strong>"Heart", "Determination", and "Chemistry"</strong></p>
<p>As much as those of us in the analytics community like to make fun of
the cliched "you can't measure heart!" throw-away comments that
frequently emerge in these discussions, they won't go away. Panelist
after panelist stressed that some of the most important unmeasured
things for them were how well a player was likely to fit in with a team,
whether the player would be able to perform under stress, and so on. As Aaron
Schatz of Football Outsiders <a href="https://twitter.com/FO_ASchatz/status/439784785649889281">tweeted in
frustration</a>,
"People, they're called intangibles because they are NOT TANGIBLE."</p>
<p>That being said, we do have scientific fields that study questions like
these -- psychology and sociology both study small group behavior,
decision-making under duress, and other related topics. If this is an
area in high demand, analysts might consider focusing on it more.
Successfully answering questions that decision-makers have specifically
asked to be answered is a good way to accrue reputational capital and
make it easier to tackle questions with more entrenched opposition.</p>
<p><strong>Random observations and summary</strong></p>
<p><em>SportVU</em>. SportVU and motion-tracking data from basketball were
<strong>everywhere</strong>. I saw many research papers using this data, attempts to
mirror the functionality of SportVU in other sports, and had several
discussions with NBA team analysts who use the data. It's clear that
some of the most advanced analytics are occurring in basketball. That
being said, I have such a hard time following the NBA. As Daryl Morey
pointed out, there are two things that drive fan interest: a) the
uncertainty of the outcome, and b) the importance of the game to the
final championship result. Basketball often fails on both fronts; it's a
long season with many essentially meaningless games.</p>
<p><em>Definitions</em>. What people mean by analytics differs wildly. I saw lots
of work that was conducted in Excel and I also saw computer vision work
using topic models to automatically recognize play types in the NFL.</p>
<p><em>Uniqueness</em>. There's no question that nearly everyone I talked to still
views their sport as a special snowflake with its own set of statistical
problems. While that's partially true, I found there was little buy-in
with the idea that, in fact, many of these problems are just substantive
examples of long-existing statistical problems and that there's no need
to reinvent the wheel to study them. That's the entire point of the
spread.</p>
<p><em>Summary</em>. As a whole, the conference was a fantastic way to meet people
who do this on a daily basis, have really interesting conversations with
people, and get a bird's-eye view of sports analytics. The research paper
sessions were obviously more informative than many of the panels, but
the former lacked the presentation polish of the latter. It's hard for
research to compete when the commissioner of the NBA is presenting a few
rooms down the hall.</p>
<p>The highlight for me was watching Kevin Kelley, the 'coach who never
punts' offer his views on things. The lowlight, unfortunately, was the
football analytics panel, which really only featured one 'analytics
person', Brian Burke of Advanced NFL Stats. That panel was moderated by
Suzy Kolber and was an unmitigated disaster. I think 95% of the panel
was spent setting up straw men of analytics and then robustly knocking
them down. Burke defended analytics, making smart points, such as
noting that analytics forces us to make our assumptions explicit, but he was a
minority voice.</p>
<p>Another lowlight was the ESPN panel on the upcoming college football
playoff selection committee. That's going to be a mess. Apparently, they
"hope" that the selection committee will watch as much football as they
can to make their decisions, but they're busy people. Similarly they
"hope" that the "football people" on the committee can sway the
decisions of the "non-football people" in close calls. Committee members
are free to use "whatever information" they see fit to make their
decisions. And so on.</p>
<p>I very much enjoyed meeting everyone that I could and my apologies to
those that I was unable to meet. Hope to make it again next year.</p>
<h3>Selecting features vs. selecting samples: Making smart decisions (2014-02-22)</h3>
<p>Modeling information rather than excluding it</p>
<p>A few days ago, Bill Mill <a href="https://twitter.com/llimllib/status/435775196683698176">asked
me</a> an excellent
question that deserves an extended response. If you'll recall, I
recently posted a simple multilevel regression model to forecast
quarterback performance as a (partial) function of their age. Bill
caught some imprecision in my language and asked if I wasn't
contradicting myself.</p>
<!--more-->
<p>In <a href="http://thespread.us/?p=200" title="Forecasting QB performance using multilevel regression">that
post</a>,
I made the following claim:</p>
<blockquote>
<p>Many existing approaches make arbitrary decisions about which players
to include and exclude in their models along the lines of “I only
selected players who played more than four seasons, started at least
12 games, and had at least 100 attempts in a season.” Doing this
biases your model because you are doing what is known as selecting on
the dependent variable.</p>
</blockquote>
<p>Then, when I described the features that I used in the model, I stated
that I used "a variable that was set equal to 1 if the player was a
starter (generously defined as starting more than 8 games in a season)
and 0 otherwise."</p>
<p>Bill asks -- what's the difference between these two things? The answer
lies in the idea of selecting your <strong>sample</strong> vs. selecting your
<strong>features</strong>.</p>
<p>When we engage in <a href="http://thespread.us/?p=150" title="Building a win probability model part 4: Feature engineering and model evaluation">feature
engineering</a>,
we want our features to improve how our model fits our data. In
Silverian (tm) terms, we want to include features that increase the
signal more than they increase the noise. The features we include in a
model should be associated with the outcome in some systematic way. The
more information we can include in our model while still having a model
that <em>generalizes</em> out of sample, the better our model will be.</p>
<p>Contrast this with how we select our samples or how we collect our data.
We want our data to be as <em>representative</em> as possible of the population
we're trying to model. This is a little bit like solving a word problem.
You have a question you're trying to answer and you need to collect some
data to answer it. You need to make sure that the data you collect is
appropriate for answering that question. Otherwise, if you collect data
in a way that systematically restricts the way you can answer that
question, you're going to get a biased answer.</p>
<p>Let's think about this in more concrete terms. The question I was trying
to answer was, "How does age predict quarterback performance?" Notice I
said "quarterback performance" and not "starting quarterback
performance." Since the question I am trying to answer has to do
with <strong>all</strong> quarterbacks, I need to make sure that my data
represent <strong>all</strong> quarterbacks. Otherwise, I'm asking the first
question, but if I arbitrarily decide to only include starting
quarterbacks in my sample, I'm actually answering the second question.</p>
<p>This doesn't seem like that big of a deal until you try and use a model
that you built with starting quarterbacks to forecast the performance of
a non-starting quarterback. At this point you'll probably have an overly
rosy view of how the non-starter will do and you won't realize it. Then,
you'll make a bad signing decision that could cost you either your job or your
fantasy league (depending on who you are).</p>
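<p>This selecting-on-the-dependent-variable trap is easy to demonstrate with a small simulation. The sketch below is my own toy illustration in pure Python with entirely synthetic data (no real quarterback numbers): it fits a least-squares slope on the full sample, then refits after keeping only the "successful" observations, and the truncated fit understates the true relationship.</p>

```python
import random

random.seed(0)

def ols_slope(xs, ys):
    """Ordinary least squares slope for a single predictor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    var = sum((xi - mx) ** 2 for xi in xs)
    return cov / var

# Synthetic "performance": true slope of 1.0 on some predictor, plus noise.
x = [random.gauss(0, 1) for _ in range(5000)]
y = [2.0 + 1.0 * xi + random.gauss(0, 0.5) for xi in x]

full_slope = ols_slope(x, y)  # close to the true 1.0

# Now "select on the dependent variable": keep only high performers.
kept = [(xi, yi) for xi, yi in zip(x, y) if yi > 2.5]
trunc_slope = ols_slope([xi for xi, _ in kept], [yi for _, yi in kept])
# trunc_slope is noticeably attenuated toward zero
```

<p>The truncated sample gives a biased, flattened estimate of the predictor's effect even though nothing about the model itself changed; only the sample-inclusion rule did.</p>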
<p>OK, you might still be asking why this is any different than including a
feature that indicates if the player was a starter or not. The
difference is that we're now answering the question we set out to
answer, but we're also including information in the model that
acknowledges a simple fact: some quarterbacks are better than others.
What ends up happening in a linear model like the one in the post is
that players who were starters get a fixed "boost" to their projections
that takes into account that they are probably better players. Players
who were not starters don't get this boost.</p>
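<p>A toy version of that fixed "boost" (synthetic numbers of my own; the 1.8-point boost below is an illustrative constant, not estimated from real data): with a single 0/1 feature, the least-squares coefficient on the dummy is just the starter/non-starter difference in mean outcomes, i.e., a constant shift in the projection.</p>

```python
import random

random.seed(1)

# Synthetic ANY/A-like outcomes: starters get a fixed additive boost on top
# of a common baseline -- exactly how a 0/1 feature enters a linear model.
TRUE_BOOST = 1.8  # illustrative constant, not estimated from real data
starters = [random.choice([0, 1]) for _ in range(4000)]
anya_vals = [4.0 + TRUE_BOOST * s + random.gauss(0, 1.5) for s in starters]

# With a single binary feature, the fitted dummy coefficient equals the
# difference in group means.
n_start = sum(starters)
mean_starters = sum(y for y, s in zip(anya_vals, starters) if s == 1) / n_start
mean_backups = sum(y for y, s in zip(anya_vals, starters) if s == 0) / (
    len(starters) - n_start
)
estimated_boost = mean_starters - mean_backups  # recovers roughly 1.8
```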
<p>Is this the best way to do this? Probably not, it's a very rough
indicator. But it helps improve the model performance out of sample. I
could have included, for instance, the number of games the player
started in a season, the number of snaps they took, or something
completely different. The key point is that I answered the question
using as much data as I could, didn't make arbitrary decisions about
inclusion in the model that could bias my results, and tried to build
player differences <strong>into</strong> the model rather than exclude them.</p>
<h3>Guest post: How’d Those Football Forecasts Turn Out? (2014-02-12)</h3>
<p>Crowdsourcing NFL rankings via pairwise comparisons</p>
<p>The following is a guest post from <a href="http://dartthrowingchimp.wordpress.com/">Jay
Ulfelder</a>, a political
scientist who specializes in forecasting political development and
instability. Even if you're not politically minded, Jay's blog is a
really fantastic source for anyone who's interested in statistical
forecasting. Jay has gracefully agreed to cross-post his post
summarizing his experience crowdsourcing pre-season NFL rankings to
predict the Super Bowl.</p>
<p>Bonus edit: If you want to play with the data that Jay used for this
post, <a href="https://drive.google.com/file/d/0B5wyt4eDq98GdVFyZ2xpaFFwVVE/edit?usp=sharing">here you
go</a>.</p>
<!--more-->
<h4>How’d Those Football Forecasts Turn Out?</h4>
<div>
<p>Yes, it’s February, and yes, the Winter Olympics are on, but it’s a cold
Sunday so I’ve got football on the brain. Here’s where that led today:
Last August, I used a crowdsourcing technique called a wiki survey to
generate a set of preseason predictions on who would win Super Bowl 48
(see <a href="http://dartthrowingchimp.wordpress.com/2013/08/11/using-wiki-surveys-to-forecast-rare-events/" title="Using Wiki Surveys to Forecast Rare Events">here</a>).
I did this <a href="http://dartthrowingchimp.wordpress.com/2013/03/20/in-praise-of-fun-projects/" title="In Praise of Fun Projects">fun project</a> to
get a better feel for how wiki surveys work so I could start applying
them to <a href="http://dartthrowingchimp.wordpress.com/2014/01/01/relative-risks-of-state-led-mass-killing-onset-in-2014-results-from-a-wiki-survey/" title="Relative Risks of State-Led Mass Killing Onset in 2014: Results from a Wiki Survey">more serious things</a>,
but I’m also a pro football fan who wanted to know what the season
portended.</p>
<p>Now that Super Bowl 48’s in the books, I thought I would see how those
forecasts fared. One way to do that is to take the question and results
at face value and see if the crowd picked the right winner. The short
answer to that is “no,” but it didn’t miss by a lot. The dot plot below
shows teams in descending order by their final score on the preseason
survey. My crowd picked New England to win, but Seattle was second by
just a whisker, and the four teams that made the conference championship
games occupied the top four slots.</p>
<p><center><img src="http://dartthrowingchimp.files.wordpress.com/2014/02/nflpostmortem-dotplot.png?w=600&h=800" width=600></center></p>
<p>So the survey did great, right? Well, maybe not if you look a little
further down the list. The Atlanta Falcons, who finished the season
4-12, ranked fifth in the wiki survey, and the Houston Texans—widely
regarded as the worst team in the league this year—also landed in the
top 10. Meanwhile, the 12-4 Carolina Panthers and 11-5 KC Chiefs got
stuck in the basement. Poke around a bit more, and I’m sure you can find
a few other chuckles.</p>
<p>Still, the results didn’t look crazy, and I was intrigued enough to want
to push it further. To get a fuller picture of how well this survey
worked as a forecasting tool, I decided to treat the results as power
rankings and compare them across the board to postseason rankings. In
other words, instead of treating this as a classification problem (find
the Super Bowl winner), I thought I’d treat it as a calibration problem,
where the latent variable I was trying to observe before and after is
relative team strength.</p>
<p>That turned out to be surprisingly difficult—not because it’s hard to
compare preseason and postseason scores, but because it’s hard to
measure team strength, even after the season’s over. I asked <a href="http://treycausey.com/">Trey Causey</a> and <a href="http://seanjtaylor.com/">Sean J. Taylor</a>, a couple of professional acquaintances
who know football analytics, to point me toward an off-the-shelf “ground
truth,” and neither one could. Lots of people publish ordered lists, but
those lists don’t give us any information about the distance between
rungs on the ladder, a critical piece of any calibration question. (Sean
later produced and emailed me a set of postseason <a href="http://en.wikipedia.org/wiki/Pairwise_comparison">Bradley-Terry rankings</a> that look
excellent, but I’m going to leave the presentation of that work to him.)</p>
<p>About ready to give up on the task, it occurred to me that I could use
the same instrument, a wiki survey, to convert those ordered lists into
a set of scores that would meet my criteria. Instead of pinging the
crowd, I would put myself in the shoes of those lists’ authors for a
while, using their rankings to guide my answers to the pairwise
comparisons the wiki survey requires. Basically, I would kluge my way to
a set of rankings that amalgamated the postseason judgments of several
supposed experts and covered distance as well as rank order. The results
would have the added advantage of being on the same scale as my
preseason assessments, so the two series could be directly compared.</p>
<p>To get started, I Googled “nfl postseason power rankings” and found four
lists that showed up high in the search results and had been updated
since the Super Bowl
(<a href="http://fansided.com/2014/02/03/nfl-power-rankings-post-super-bowl/#!uWLy6">here</a>,
<a href="http://www.cbssports.com/nfl/powerrankings">here</a>, <a href="http://www1.skysports.com/american-football/news/13283/9152393/nfl-power-rankings-post-season-simon-veness-ranks-the-gridiron-teams">here</a>,
and <a href="http://www.mercurynews.com/raiders/ci_25053949/steve-corkrans-final-2013-nfl-rankings">here</a>).
Then I set up a wiki survey and started voting as List Author #1. My
initial thought was to give each list 100 votes, but when I got to 100,
the results of the survey in progress didn’t look as much like the
original list as I’d expected. Things were a little better at 200 but
still not terrific. In the end, I decided to give each survey 320 votes,
or the equivalent of 10 votes for each item (team) on the list. When I
got to 320 with List 1, the survey results were nearly identical to the
original, so I declared victory and stuck with that strategy. That meant
1,280 votes in all, with equal weight for each of the four list-makers.</p>
<p>The plot below compares my preseason wiki survey’s ratings with the
results of this Mechanical Turk-style amalgamation of postseason
rankings. Teams in blue scored higher than the preseason survey
anticipated (i.e., over-performed), while teams in red scored lower
(i.e., under-performed).</p>
<p><center><img src="http://dartthrowingchimp.files.wordpress.com/2014/02/nflpostmortemplot.jpg?w=600" width=600></center></p>
<p>Looking at the data this way, it’s even clearer that the preseason
survey did well at the extremes and less well in the messy middle. The
only stinkers the survey badly overlooked were Houston and Atlanta, and
I think it’s fair to say that a lot of people were surprised by how
dismal their seasons were. Ditto the Washington [bleep]s and Minnesota
Vikings, albeit to a lesser extent. On the flip side, Carolina stands
out as a big miss, and KC, Philly, Arizona, and the Colts can also thumb
their noses at me and my crowd. Statistically minded readers might want
to know that the root mean squared error (RMSE) here is about 27—better
than random guessing, but certainly not stellar.</p>
<p>A single season doesn’t offer a robust test of a forecasting technique.
Still, as a proof of concept, I think this exercise was a success. My
survey only drew about 1,800 votes from a few hundred respondents whom I
recruited casually through my blog and Twitter feed, which focuses on
international affairs and features very little sports talk. When that
crowd was voting, the only information they really had was the previous
season’s performance and whatever they knew about off-season injuries
and personnel changes. Under the circumstances, I’d say a RMSE of 27
ain’t terrible.</p>
<p>It’d be fun to try this again in August 2014 with a bigger crowd and see
how that turns out. Before and during the season, it would also be neat
to routinely rerun that Mechanical Turk exercise to produce up-to-date
“wisdom of the (expert) crowd” power rankings and see if they can help
improve predictions about the coming week’s games. Better yet, we could
write some code to automate the ingestion of those lists, simulate their
pairwise voting, and apply <a href="http://www.allourideas.org/">All Our Ideas</a>’
hierarchical model to the output. In theory, this approach could scale
to incorporate as many published lists as we can find, culling the
purported wisdom of a certain crowd without the hassle of all that
recruiting and voting.</p>
<p>Unfortunately, that crystal palace was a bit too much for me to build on
this dim and chilly Sunday. And now, back to our regularly scheduled
programming…</p>
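<p>The pairwise-comparison idea behind both the wiki survey and Sean's Bradley-Terry rankings can be sketched in a few lines. This is a minimal illustration of my own, not Jay's or Sean's actual code: it fits Bradley-Terry strengths from a made-up table of pairwise vote counts using the standard iterative (MM) update.</p>

```python
def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[(i, j)] = number of times item i beat item j.
    Returns strengths normalized to sum to 1, via the standard
    minorization-maximization (MM) update.
    """
    items = sorted({k for pair in wins for k in pair})
    p = {i: 1.0 for i in items}
    for _ in range(n_iter):
        new_p = {}
        for i in items:
            total_wins = sum(wins.get((i, j), 0) for j in items if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in items
                if j != i
            )
            new_p[i] = total_wins / denom if denom > 0 else p[i]
        s = sum(new_p.values())
        p = {i: v / s for i, v in new_p.items()}
    return p

# Hypothetical vote counts, not real survey data: SEA beats DEN more
# often than the reverse, and both beat HOU.
votes = {
    ("SEA", "DEN"): 8, ("DEN", "SEA"): 2,
    ("SEA", "HOU"): 9, ("HOU", "SEA"): 1,
    ("DEN", "HOU"): 7, ("HOU", "DEN"): 3,
}
strengths = bradley_terry(votes)  # SEA ranked above DEN, DEN above HOU
```

<p>Unlike an ordered list, the fitted strengths carry distance information: how far apart two teams sit, not just who is ahead.</p>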
<p>Thanks, Jay! You can check out the original post
<a href="http://dartthrowingchimp.wordpress.com/2014/02/09/howd-those-football-forecasts-turn-out/">here</a>.</p>
</div>
<h3>Forecasting QB performance using multilevel regression (2014-02-09)</h3>
<p>Making the best use of all of the available data</p>
<p>Now that we've hit the offseason, I thought I'd publish some of my
earlier football analytics work. The following was a spec project
demonstrating the use of multilevel regression models to forecast
quarterback performance. It's a very simple model and could be an
excellent jumping-off point for someone to take up as a project. The
model was built during the break between the 2012 and 2013 seasons, so
now I have a chance to go back and look at how it performed in
projecting 2013 numbers. You can find <a href="https://github.com/treycausey/thespread/tree/master/qb_forecasting">all of the
code</a>
on GitHub.</p>
<p>Projecting the passing productivity for quarterbacks is particularly
important for an NFL team, as quarterbacks are frequently the
highest-paid or second-highest-paid player on the roster. Important
decision points include the draft and the end of the third year
of the player's contract, when a decision is usually made about
exercising a fifth-year contract option.</p>
<!--more-->
<p><a href="http://www.pro-football-reference.com/blog/?p=591">Lisk 2008</a>, <a href="http://www.footballperspective.com/quarterback-age-curves/">Stuart
2013</a>, and
<a href="http://www.advancednflstats.com/2011/08/how-quarterbacks-age.html">Burke
2011</a>
each approach the question in separate ways, but all are focused on the
question of quarterback age curves. Each tries to identify the age at
which quarterbacks are likely to peak, coming to slightly different
conclusions.</p>
<p>While identifying the average peaking age is useful for GMs, the
richness of individual player histories is not represented when
condensing the data like this. I instead approached this as a question
appropriate for a statistical technique called "<a href="http://www.stat.columbia.edu/~gelman/arm/">multilevel
modeling</a>." This technique
uses all of the available information about each player, takes into
account that performance is correlated year-upon-year, and creates both
an overall estimate for the "average QB" as well as an individualized
model for each QB.</p>
<p>Many existing approaches make arbitrary decisions about which players to
include and exclude in their models along the lines of "I only selected
players who played more than four seasons, started at least 12 games,
and had at least 100 attempts in a season." Doing this biases your model
because you are doing what is known as <strong>selecting on the dependent
variable</strong>. You are trying to model some measure of success and you are
deciding who is in your sample by the values on this variable. If you
only model successful people, you will get a biased view of what drives
success.</p>
<p>Think about it this way. You're trying to build a model that will help
you forecast how players will do because you're deciding whether or not
to spend millions of dollars on them. You hope that this model will
generalize and you will be able to use it in many decisions. You need
the best, most unbiased information possible. If you let information
from outside of the model leak into the model, it will decrease the
ability of the model to generalize. <strong><br />
</strong></p>
<p>Multilevel modeling allows you to include all of your observations. The
model creates two components, a "population-level" component which tries
to model a typical player in the absence of any specific information,
and then player-specific adjustments (effects) that move that player's
projections up or down. This approach has two main advantages.</p>
<ul>
<li>First, there is no need to disregard players with very
short careers. Their impact on the overall model is weighted less
than that of players with longer careers, but they still contribute
information to the model of the "average" player.</li>
<li>Second, it is interactive and allows for projections for both a)
unobserved years for existing players, and b) unobserved years for
players not in the model. Obviously, uncertainty for the latter
projections is higher than for existing players, but it is a useful
starting point.</li>
</ul>
<p>The interesting thing about the problem of projecting quarterback
performance is how much it benefits from a standard approach to modeling
nested and correlated data, rather than trying to reinvent the wheel.
Using an overall aging curve for a random player is less useful from a
management perspective, as it is almost never the case that a player is
signed or re-signed without any existing information. Rather than using
a general rule-of-thumb about the player's contract, it makes much more
sense to use all of the available information for both the player in
question and all other players in that position.</p>
<p>Using the <code>XML</code> package for the R statistical language, I scraped <a href="http://www.pro-football-reference.com/play-index/psl_finder.cgi?request=1&match=single&year_min=1980&year_max=2012&season_start=1&season_end=5&age_min=0&age_max=99&league_id=&team_id=&is_active=&is_hof=&pos_is_qb=Y&c1stat=&c1comp=gt&c1val=&c2stat=&c2comp=gt&c2val=&c3stat=&c3comp=gt&c3val=&c4stat=&c4comp=gt&c4val=&order_by=pass_td&draft=0&draft_year_min=1936&draft_year_max=2012&type=&draft_round_min=0&draft_round_max=99&draft_slot_min=1&draft_slot_max=500&draft_pick_in_round=0&draft_league_id=&draft_team_id=&college_id=all&conference=any&draft_pos_is_qb=Y&draft_pos_is_rb=Y&draft_pos_is_wr=Y&draft_pos_is_te=Y&draft_pos_is_rec=Y&draft_pos_is_t=Y&draft_pos_is_g=Y&draft_pos_is_c=Y&draft_pos_is_ol=Y&draft_pos_is_dt=Y&draft_pos_is_de=Y&draft_pos_is_dl=Y&draft_pos_is_ilb=Y&draft_pos_is_olb=Y&draft_pos_is_lb=Y&draft_pos_is_cb=Y&draft_pos_is_s=Y&draft_pos_is_db=Y&draft_pos_is_k=Y&draft_pos_is_p=Y">names
and draft
dates</a>
for all players drafted at quarterback from 1980 - 2012 (the "modern
era") from Pro Football Reference. I then scraped the passing
information for each player for each year (<a href="http://www.pro-football-reference.com/years/2012/passing.htm">example using
2012</a>) and
matched the draft information to passing statistics.</p>
<p>Following previous analyses, I used <a href="http://www.pro-football-reference.com/about/glossary.htm">adjusted net yards per
attempt</a>
(<strong>ANY/A</strong>) as the measurable outcome of quarterback performance. The
formula for computing this is:</p>
<blockquote>
<p>(pass yards + 20*(pass TD) - 45*(interceptions thrown) - sack
yards)/(passing attempts + sacks)</p>
</blockquote>
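<p>The formula translates directly into code. Here is a quick helper I'm adding purely for illustration; the season line in the example is made up, not drawn from the data set:</p>

```python
def anya(pass_yards, pass_td, interceptions, sack_yards, attempts, sacks):
    """Adjusted net yards per attempt (ANY/A)."""
    return (pass_yards + 20 * pass_td - 45 * interceptions - sack_yards) / (
        attempts + sacks
    )

# Hypothetical season: 4,000 yards, 30 TD, 10 INT, 100 sack yards,
# 500 attempts, 20 sacks.
print(round(anya(4000, 30, 10, 100, 500, 20), 2))  # 7.79
```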
<p>I then estimated a multilevel regression model using the player's age at
each measurement of ANY/A, the player's age squared (to capture the
non-linear rate at which age affects performance), and a variable that
was set equal to 1 if the player was a starter (generously defined as
starting more than 8 games in a season) and 0 otherwise.</p>
<p>The model produces an intercept for the overall model, an intercept for
each player, an overall slope for each variable, and a slope for each
variable for each player. The intercept can be thought of as the
starting estimate of ANY/A in the absence of any information about age
or player. The slopes can be thought of as the effect of each additional
year of age (or year squared) on ANY/A plus a fixed amount of influence
for being a starter.</p>
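<p>To build intuition for how the player-specific pieces behave, here is a deliberately simplified shrinkage sketch, my own toy rather than the model actually fit in this post: each player's raw average is pulled toward the league average, and players with fewer seasons get pulled hardest.</p>

```python
def pooled_estimates(player_seasons, shrink=4.0):
    """Shrink each player's mean ANY/A toward the overall mean.

    player_seasons: dict mapping player -> list of seasonal ANY/A values.
    shrink: pseudo-observations of the league mean (larger = more pooling).
    """
    all_values = [v for vals in player_seasons.values() for v in vals]
    grand_mean = sum(all_values) / len(all_values)
    estimates = {}
    for player, vals in player_seasons.items():
        n = len(vals)
        player_mean = sum(vals) / n
        # Weighted average: short careers lean harder on the league mean.
        estimates[player] = (n * player_mean + shrink * grand_mean) / (n + shrink)
    return estimates

# Made-up numbers: a six-season veteran and a one-season rookie.
seasons = {
    "veteran": [7.1, 6.8, 7.4, 6.9, 7.2, 7.0],
    "rookie": [7.5],
}
est = pooled_estimates(seasons)
# The rookie's single hot season is shrunk much more than the veteran's
# well-established average.
```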
<p>Because the model is interactive, it is not as easy to report the
findings in a short summary of "quarterback age peaks." However, here
are the so-called "fixed effects", the coefficients that are stable from
player to player and season to season. Note that I don't really like the
language of "fixed" and "random" effects, as they tend to vary in
definition from discipline to discipline.</p>
<div class="highlight"><pre>Fixed effects:
            Estimate Std. Error t value
(Intercept) -9.13057    5.82462  -1.568
age          0.82139    0.39566   2.076
I(age^2)    -0.01299    0.00654  -1.986
starter      1.81190    0.22154   8.179
</pre></div>
<p>Then, each player has individual coefficients (called "random effects")
that add or subtract from these coefficients to adjust to the
information given about each player.</p>
<p>Because there are coefficients for each player, I can't really just spit
out a list of them all here. However, we can look and see how well the
model fits the training data (both good and bad):</p>
<div class="highlight"><pre>Player          Actual Y4  Estimated Y4  Actual Y5  Estimated Y5
Tom Brady            5.94          5.49       6.92          6.47
Rex Grossman         5.21          5.49       3.91          5.00
Byron Leftwich       5.34          5.62       2.69          3.86
David Carr           3.77          5.54       4.57          5.49
</pre></div>
<p>Now that the 2013 season's over, we can take a look and see how well the
model did in predicting new data. Let's take a look at players who were
rookies in the 2012 season first.</p>
<div class="highlight"><pre>     name            age starter anya_12 pred_2013 actual_2013
47   Andrew Luck      23       1    5.66  5.155733        6.06
195  Brandon Weeden   29       1    4.98  5.480730        4.51
245  Brock Osweiler   22       0    3.00  2.856016        4.83
1142 Kirk Cousins     24       0    7.53  4.351131        3.67
1434 Nick Foles       23       0    5.13  3.839286        9.18
1590 Russell Wilson   24       1    7.01  5.476200        7.10
1610 Ryan Lindley     23       0    1.89  2.466979         DNP
1612 Ryan Tannehill   24       1    5.23  4.998627        5.00
</pre></div>
<p>Some hits and some misses here. We would expect these to not be great
for rookies, as they only have one year of data and the effect of the
overall league average will be strong on their predictions. We can see
that the model predicted Luck to take a little bit of a slide in 2013,
but he actually improved. Weeden, Osweiler, and Cousins all have overly
optimistic predictions, though none of them played many games. Nick
Foles obviously exceeded everyone's expectations, and Russell Wilson
continued to impress. Tannehill is on the money.</p>
<p>If we look at the players who ended their third year in 2012, let's
check their 2013 predictions and their actual performance. I limited
this to players who had not already left the league.</p>
<div class="highlight"><pre>  name         age starter anya_12  pred_13 actual_2013
1 Chase Daniel  26       0   10.00 5.018754        5.30
2 Colt McCoy    25       0    4.74 3.898679        13.0
3 John Skelton  24       0    4.50 3.703542         DNP
4 Rusty Smith   25       0    6.80 3.607257         DNP
5 Sam Bradford  25       1    5.64 5.090951        6.10
</pre></div>
<p>Yikes. Daniel actually only played 5 games, McCoy a single game, and
Bradford played seven games before a season-ending injury.</p>
<p>This simple model is built on only two measures of a quarterback: his
age and whether or not he is a starter. One of the issues with the data
set is that it only measures <em>outcomes</em>, which are biased towards
successful players anyway. Trying to disentangle outcomes from process
would be an important contribution. Second, it does not help in
evaluating potential draft candidates. It would need to incorporate
college data to do so, and college statistics are notoriously poor at
forecasting NFL performance. Third, it does not take into account that
the quarterback is not solely responsible for his performance. It does
not account for talented receivers, effective offensive lines, or a
heavy run game.</p>
<p>Clearly, the model isn't magic and has to be considered with other
information and in context. However, I was able to produce a model with
very few features that produced reasonable forecasts about the future
and allowed us to use all available data rather than selecting arbitrary
cutoff points. I look forward to updating it in the future.</p>
<h3>Reproducible research and sports (2014-02-08)</h3>
<p>Good science requires transparency</p>
<p>Allow me to digress from models and plots for a minute to address an
important topic. If you read the spread, chances are you're familiar
with the <a href="http://www.sloansportsconference.com/">Sloan Sports Analytics
Conference</a>. It's a yearly
gathering of people from sports franchises, academia, and industry. It's
part research conference, part vendor gathering, and part social
networking. Sloan is also seen as the premier venue for discussing and
presenting sports analytics work. I'm tentatively planning to be there
this year myself, pending the resolution of some ticket snafus. What
could be wrong with that?</p>
<!--more-->
<p>I was therefore distressed to see <a href="http://statsbylopez.wordpress.com/2014/02/06/ssac_2014/#comments">this
post</a>
by Michael Lopez, a graduate student in biostatistics and contributor to
<a href="http://regressing.deadspin.com/">Regressing</a>, Deadspin's sports science
blog. To summarize, Michael submitted a research paper to the Sloan
paper competition, which boasts a top prize of $20,000, free admission
to the conference (roughly $500), and a chance to present your work to
some of the best and brightest in sports.</p>
<p>Through a series of confusing emails, Michael eventually works out that
a) his paper was not accepted, but that he can present a poster, b)
there is no prize money for this, c) there is no free admission for
this, and d) there are no more tickets remaining to the conference.</p>
<p>He digs further in and finds that the authors of the winning papers have
an unusually high rate of association with either the conference's
sponsors (ESPN and its parent company Disney) or the conference's host
school (MIT). He also notes that <strong>six of the eight</strong> winning papers use
proprietary data.</p>
<p>In his post, Michael notes what a problem this is. When research is not
reproducible, it is difficult to verify its veracity. How many models
did the authors estimate before arriving at the one in the paper? Are
the results robust to different model specifications? What happens when
you include or exclude various variables? These are unanswerable
questions.</p>
<p>As I pointed out in the <a href="http://thespread.us/?p=1" title="Hello world! Introducing the spread.">very first post on the
spread</a>:</p>
<blockquote>
<p>When advanced analysis <em>is</em> conducted, it's often behind closed
doors. Understandably, teams want to preserve any edge they find.
However, this is not only bad for the analytics community, it's bad
for the advancement of football analytics. As Ben Alamar pointed out
on the <a href="http://www.advancednflstats.com/search/label/podcast?max-results=100">Advanced NFL Stats
podcast</a>,
without peer review, isolated analysts often have no objective check
on the quality of their work.</p>
</blockquote>
<p>You might be asking yourself if this is just an esoteric point made by
out-of-touch academics who don't understand how the 'real world' works.
The answer is no. I know for a fact that professional sports
organizations use the findings presented at Sloan. The conference was
cofounded by the general manager of the Houston Rockets, and I've done
work for teams who have referenced work presented there.</p>
<p>When research is proprietary and not reproducible, you should be <strong>very
careful</strong> how much stock you put in the findings. I understand the
confidential nature of the data and the need to make a profit -- I work
in industry. But one need only look to the tools that the very best in
data science are using to understand the value of being transparent: R,
Python, scikit-learn, pandas, etc. These are all open-source software
packages. The best data scientists are moving away from expensive,
proprietary tools. Has data science suffered? On the contrary. Some of
the most successful tech firms have open sourced large parts of their
workflows. Releasing
<a href="https://www.facebook.com/note.php?note_id=89508453919">Hive</a> didn't
mean a copycat Facebook was able to open up shop.</p>
<p>Sports organizations want advanced analytics capacity to make better
decisions and get an edge over their competition. The way to get an edge
is not to have someone write a proprietary paper and then say 'trust me
on the findings.' That's how <strong>bad decisions</strong> are made.</p>
<p>Postscript: Here is some coverage of Michael's post on
<a href="http://regressing.deadspin.com/here-are-this-years-sloan-finalist-papers-and-their-bi-1518317761">Deadspin</a>.</p>Learning about football fandom using social media2014-02-05T05:14:00-08:00treycauseytag:thespread.us,2014-02-05:learning-about-football-fandom-using-social-media.html<p>Data science and football, together at Facebook</p>
<p>One of the most exciting opportunities created by the introduction of
data science to football is the ability to analyze massive amounts of
non-traditional data to learn more about the sport.</p>
<!--more-->
<p>Social media data is often noisy and unstructured, but presents a great
snapshot of what fans are thinking and saying in real time. Although a
great deal of existing work tries, with varying levels of success, to
<a href="https://www.cs.cmu.edu/~nasmith/papers/sinha+dyer+gimpel+smith.mlsa13.pdf">predict things with
Twitter</a> [PDF],
<a href="http://seanjytaylor.com">Sean Taylor</a> has taken a different tack and is
trying to learn about the fans themselves.</p>
<p>Sean is a data scientist at Facebook (and avid Eagles fan) who is also
wrapping up his PhD at NYU's Stern School of Business. You may have seen
his most famous work, <a href="https://www.facebook.com/notes/facebook-data-science/nfl-fans-on-facebook/10151298370823859">a map of every county in the
US</a>
with the most popular (on Facebook) NFL team there:</p>
<p><a href="http://thespread.us/images/528895_10151382327948415_1568495614_n.png"><img alt="528895_10151382327948415_1568495614_n" src="http://thespread.us/images/528895_10151382327948415_1568495614_n.png" /></a></p>
<p>Sean has now produced some new visualizations, using Facebook data to
<a href="https://www.facebook.com/notes/facebook-data-science/the-emotional-highs-and-lows-of-the-nfl-season/10152033221418859">track the sentiment of
fans</a>
during games and across the season to depict the emotional highs and
lows that fans all experience. There are quite a few nice little nuggets
of information in here.</p>
<p><a href="http://thespread.us/images/sentiment1.png"><img alt="sentiment" src="http://thespread.us/images/sentiment1.png" /></a></p>
<p>Sean has also conducted research into <a href="https://github.com/seanjtaylor/NFLRanking">ranking NFL
teams</a> and using Bayesian
inference to <a href="https://github.com/seanjtaylor/fantasy-football">model the value of
players</a> in fantasy
football drafts. I highly encourage fans of this blog to check all of
his work out.</p>Following up on usability and prediction: Downs don't matter?2014-02-02T10:57:00-08:00treycauseytag:thespread.us,2014-02-02:following-up-on-usability-and-prediction-downs-dont-matter.html<p>I was making some adjustments to the win probability model and found a
great example of the points I discussed in my <a href="http://thespread.us/?p=174" title="Thinking about statistical decision-making vs. prediction">previous
post</a> on
making sure that models are usable.</p>
<!--more-->
<p>I ran all of the current features through a feature selection
algorithm in scikit-learn that keeps only the <em>k</em> best features,
with <em>k</em> being provided by the user. Since I only have a small
number of features, I just ran the algorithm iteratively from <em>k</em>
= 1, ..., 9. The results were surprising, to say the least.</p>
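<p>The loop described above can be sketched roughly like this, using
scikit-learn's <code>SelectKBest</code> as the selection algorithm. The
feature names and data here are synthetic stand-ins, not the actual
play-by-play set:</p>

```python
# Sketch of iterating a k-best feature selector over k = 1..9.
# Synthetic data: only score_diff actually drives the outcome.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
feature_names = ["score_diff", "seconds_left", "down", "distance",
                 "yards_from_goal", "timeouts", "spread", "total", "quarter"]
X = rng.normal(size=(500, len(feature_names)))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)  # wins ~ score_diff

counts = {name: 0 for name in feature_names}
for k in range(1, len(feature_names) + 1):
    mask = SelectKBest(f_classif, k=k).fit(X, y).get_support()
    for name, kept in zip(feature_names, mask):
        counts[name] += int(kept)

print(counts)  # score_diff is kept at every k
```

<p>Tallying how often each feature survives across values of <em>k</em>
is what produces a plot like the one below.</p>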
<p><a href="http://thespread.us/images/feature_selection.png"><img alt="feature_selection" src="http://thespread.us/images/feature_selection.png" /></a></p>
<p>The features are on the y-axis and the number of times each feature was
selected as one of the <em>k</em> best features is on the x-axis. As you can
see, the score differential was selected every time. Meaning that even
if you can only use one feature, it should be score differential (makes
sense).</p>
<p>However, the current down was only selected once!</p>
<p>This really underscores my point. Perhaps down is not that predictive.
But can you honestly say that you believe that down is unimportant to
winning? Or could you tell a coach that? That being said, it's not clear
that it <em>hurts</em> predictability or makes the model worse (more on that
soon), but it doesn't seem to be doing much of the heavy lifting in
predicting outcomes.</p>
<p>My initial hunch is that down is relatively unimportant until interacted
with other features. Only one way to find out!</p>Thinking about statistical decision-making vs. prediction2014-02-01T10:12:00-08:00treycauseytag:thespread.us,2014-02-01:thinking-about-statistical-decision-making-vs-prediction.html<p>What's the problem with prediction?</p>
<p>So far I've emphasized constructing a win probability model that
generalizes as well as possible and maximizes out-of-sample
prediction. The argument for this has been that we want to create a model
that best captures the relationship between variance in the features
(variables like seconds remaining, score difference, etc.) and variance
in the outcome (does this play belong to a winning team or not).</p>
<!--more-->
<p>I've done a little feature engineering by hand. Using domain knowledge
that plays at the end of the game are more likely to be important to the
outcome than plays near the beginning, I created a feature that
increases in a non-linear fashion as game time dwindles. This seemed to
improve the performance of the model, but not by a tremendous amount.</p>
<p>Thinking about features like this is an important part of building a
predictive model. However, sometimes it only gets you so far. Sometimes
you have many more features than you know what to do with (known as the
<em><a href="http://www.stat.ucla.edu/~sabatti/statarray/textr/node5.html">curse of
dimensionality</a></em>),
or your features are highly correlated, or maybe you don't even know
what your features are. Maybe you find that taking the cube root of one
of your features produces a big jump in model performance but you can't
explain why.</p>
<p>Reducing the number of features can be accomplished in many ways. You
can combine features to create an index or you can reduce the
dimensionality by using a procedure like <a href="http://en.wikipedia.org/wiki/Principal_component_analysis"><em>principal components
analysis</em></a> (PCA).
PCA takes a set of features and transforms them into a new set of
uncorrelated features called principal components. This is extremely
useful if you suspect you have many correlations between your features
(called collinearity) and if you are using a modeling technique that has
a hard time with this state of affairs (like many linear models).</p>
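<p>To make the uncorrelated-components point concrete, here's a minimal
sketch with deliberately collinear synthetic features (not the football
data):</p>

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two nearly identical (collinear) features plus one independent one.
base = rng.normal(size=1000)
X = np.column_stack([base,
                     base + 0.1 * rng.normal(size=1000),
                     rng.normal(size=1000)])

pca = PCA(n_components=3).fit(X)
components = pca.transform(X)

corr = np.corrcoef(components, rowvar=False)
print(np.round(corr, 3))              # off-diagonals are ~0
print(pca.explained_variance_ratio_)  # first component soaks up the shared variance
```

<p>The first principal component absorbs most of the variance shared by
the two collinear features -- but notice that it no longer corresponds
to any single original feature, which is exactly the interpretability
problem discussed next.</p>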
<p>Oftentimes, preprocessing your data using something like PCA will
produce pretty significant model performance gains. The problem is that
instead of a set of features like [down, distance, yards from own goal,
etc.] you now have a set of features like [principal component 1,
principal component 2]. The units on the new features don't tend to make
a lot of sense, and it's not totally clear what they mean. Each feature
can contribute to multiple components to different degrees (we say that
they "load" on components) so we're no longer operating in a world where
we can say "each second that ticks off the clock is worth .001 win
probability points."</p>
<p>This highlights one of the fundamental challenges that data scientists
face -- communicating models to people who use them to make decisions.
One of the greatest things about data science is the amount of time you
can spend tuning your model's parameters and hyper-parameters, doing
feature engineering, and eking out every last little bit of prediction
accuracy that you can. However, if the model can't be used to make
decisions, it's not that useful. To me, this is what separates data
science in industry from data science in the academy.</p>
<p>Decision-making with statistics often occurs in situations where a model
must be used in certain ways. In the NFL, computers are not allowed on
the sidelines or in the booth. Thus, it's important that any predictive
model that's going to be used for decision-making be interpretable
without one. A coach or coordinator wants to know if he should go for it
on fourth down in a particular situation. He needs to know what the impact of
picking up a first down will be. Chances are, he's not going to be
pleased (and you probably won't have a job for that long) if you tell
him, "hold on, let me figure out how these features load on my principal
components."</p>
<p>That means that while you may be building a sub-optimal model in the
short term, you will be maximizing the utility of your model. However,
suppose you have built a vastly superior, but highly difficult to
interpret model. As a data scientist, your next step is figuring out a
way, possibly via a visualization, to communicate the results in a way
that is easily acted upon. Of course, if you're trying to build the best
possible model, for betting or general prediction purposes, you're not
as limited by such constraints.</p>
<p>In a nutshell, one of the most important things about building a model
is knowing not only the necessary "data janitor" work to get started,
not only the latest algorithms, but also knowing how the model's output
will be used.</p>
<p>On that note, happy Super Bowl weekend! Looking forward to a great game
and getting back to writing about data science and football in the
coming weeks.</p>New code up on Github2014-01-20T18:32:00-08:00treycauseytag:thespread.us,2014-01-20:new-code-up-on-github.html<p>... and we're back!</p>
<p>After an extended absence due to "real-life" responsibilities and an
extended illness, I'm back. Unfortunately, the NFL season is winding to
a close, but what better time to talk about football than the
off-season?</p>
<!--more-->
<p>In anticipation of this, I've pushed a bunch of code to the site's
Github page. There you'll find code to set up a Postgres database with
the Armchair Analysis play-by-play data, all of the data-munging taken
care of for you, and lots of other assorted goodies. I've uploaded a
simple IPython notebook showing how to use the data to build a random
forest as well. More on the way soon, I promise.</p>
<p>You can find it all on
<a href="http://github.com/treycausey/thespread/">Github</a>.</p>Building a win probability model part 4: Feature engineering and model evaluation2014-01-01T15:42:00-08:00treycauseytag:thespread.us,2014-01-01:building-a-win-probability-model-part-5-feature-engineering-and-model-evaluation.html<p>How do we continue to improve the model?</p>
<p>So far we've used a fairly simple set of features in the win probability
model. We saw that it performed pretty well on the training set and
performed slightly less well, but still much better than chance, on the
test set. Now it's time to delve deeper into increasing the accuracy of
the model and assessing attempts to do so. This involves two
processes: <em>feature engineering</em> and <em>model evaluation.</em></p>
<!--more-->
<p>I already covered some of the details of model evaluation in the
<a href="http://thespread.us/?p=136" title="Building a win probability model, part 3: What’s a good model?">previous
post</a>
in this series. We'll use the metrics I introduced in that post to
assess how good our feature engineering attempts are. Feature
engineering is the process of <em>selecting</em> the features that you think
will predict your target, <em>transforming</em> them to reflect their
relationship with the target, and potentially <em>creating</em> new features
out of the information you have to get the information you want.</p>
<p>We want to find the features that best classify plays as belonging to
winning or losing teams, so we want to try a variety of them. However,
every time you add new features to the model, you increase the
complexity of the model which can, in turn, lead to overfitting. This
means that you'll need to test how well your model does out of sample
each time you add new features to the model to see if it improves or
decreases the model's accuracy.</p>
<p>In doing so, though, you run the risk of overfitting to the <em>testing
set. </em>By repeatedly checking to see how well the model predicts
out-of-sample on the same sample, you're essentially using the testing
set as an extension of the training set. One way to work around this is
to have a small dataset called a <em>validation set</em> that you use for all
of this model tuning before you finally move on to evaluating its accuracy
on the test set. For simplicity, I'm not going to do that right now, but
we'll return to it in a later post. For a good explanation of the
differences between the training, validation, and test sets, see <a href="http://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set">this
Cross Validated
question</a>.</p>
<p><strong>Feature engineering</strong></p>
<p>We've already engaged in some feature engineering -- by <em>transforming</em>
the minutes and seconds to a 'seconds adjusted' feature. We're going to
add three more features to the model: the Vegas line, the Vegas total
line, and a new transformed time feature. The first two are fairly
obvious -- including the spread will allow us to incorporate (perceived)
team strength into our estimates and the total line will allow for
situations when various amounts of offense are expected.</p>
<p>The third feature is a way to account for the fact that time has an
unevenly distributed effect on win probability. Being down by 9 points
in the first quarter isn't great, but it's not the kiss of death, but
being down by 9 points with less than a minute remaining in the game is
a very different situation. So we need a number that stays relatively
constant early on but changes faster and faster as the number of seconds
remaining decreases. For this feature, we'll use 1/sqrt(seconds remaining + .01).</p>
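<p>A quick sketch of that transformation (variable names are mine, not
necessarily those in the actual code):</p>

```python
import numpy as np

def time_urgency(seconds_remaining):
    """1 / sqrt(seconds remaining + .01): nearly flat early in the
    game, changing rapidly as the clock approaches zero."""
    return 1.0 / np.sqrt(np.asarray(seconds_remaining, dtype=float) + 0.01)

print(time_urgency([3600, 1800]))  # ~0.017 vs ~0.024: barely moves in the first half
print(time_urgency([60, 10, 0]))   # ~0.129, ~0.316, 10.0: explodes late
```

<p>The +.01 offset keeps the feature finite when zero seconds remain.</p>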
<p>The training data, from <a href="http://armchairanalysis.com/">Armchair
Analysis</a>, already conveniently has the
spread and over/under in the data. The test data does not, so some data
munging is required. I'll post the code shortly. I obtained all of the
Vegas data <a href="http://www.repole.com/sun4cast/data.html">here</a>.</p>
<p><strong>Model evaluation</strong></p>
<p>Let's look at two models. The random forest model that we've been using
up to now, which we'll call the 'limited model', and the new random
forest model that includes our new features, which we'll call the 'full
model.' You can click on any image to see a larger version.</p>
<p><a href="http://thespread.us/images/roc_full_vs_limited.png"><img alt="roc_full_vs_limited" src="http://thespread.us/images/roc_full_vs_limited.png" /></a></p>
<p>It looks like the full model, with the blue line, does do slightly
better than the limited model, the purple line. To verify, let's look at
some evaluation statistics.</p>
<table>
<tr><th>Metric</th><th>Full model</th><th>Limited model</th></tr>
<tr><td>Training accuracy</td><td>0.80</td><td>0.76</td></tr>
<tr><td>Test accuracy</td><td>0.66</td><td>0.65</td></tr>
<tr><td>Precision</td><td>0.66</td><td>0.65</td></tr>
<tr><td>Recall</td><td>0.66</td><td>0.65</td></tr>
<tr><td>AUC</td><td>0.71</td><td>0.70</td></tr>
</table>
<p>WOW! That's not much of an improvement. Surprising, given how much new
information we've included. Only a percent improvement in accuracy here
and there. While there were some non-trivial gains in the training set,
the out-of-sample performance didn't change much (although it didn't
worsen, which is always a good thing). This is definitely something
we'll want to revisit over time. Feature engineering and model selection
are ongoing processes.</p>
<p><strong>Model diagnostics</strong></p>
<p>Since we're trying to make predictions about future events, we want to
know the model's weak spots. We've gotten a birds-eye view with the
above metrics, but is this accuracy evenly distributed throughout game
situations? Let's find out. One way to look at this is to look at the
average accuracy of the model at various points in the game. To do this,
I've tested the model for each minute from 0 to 60 and plotted the
accuracy score.</p>
<p><a href="http://thespread.us/images/accuracy_by_minute.png"><img alt="accuracy_by_minute" src="http://thespread.us/images/accuracy_by_minute.png" /></a></p>
<p>Unsurprisingly, the model gets more accurate as the end of a game
approaches. Notice that the accuracy increase in the test set isn't as
smooth as in the training set. This is to be expected. Interestingly
enough, there are actually points where the limited model is more
accurate out-of-sample than the full model, though not consistently.
This just underscores the idea that the answer to a modeling problem
often isn't 'a model' but rather 'several models.'</p>
<p>Now for something really interesting. Since we're making predictions
about wins, one metric we might be interested in is how <em>calibrated</em> our
estimated probabilities are. Plays that a well-calibrated model
estimates as having a win probability of 0.5 should be wins about 50% of
the time; a win probability of 0.75 should belong to winning teams about
75% of the time, and so on.</p>
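<p>Scikit-learn can compute exactly this kind of calibration check. The
sketch below uses simulated probabilities and outcomes (drawn so the
'model' is perfectly calibrated by construction) rather than the actual
win probability estimates:</p>

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
# Simulated predicted win probabilities, and outcomes drawn so that a
# play with predicted probability p actually wins with probability p.
p = rng.uniform(size=20000)
wins = (rng.uniform(size=20000) < p).astype(int)

# Bin the predictions and compare mean prediction to actual win rate.
frac_wins, mean_pred = calibration_curve(wins, p, n_bins=10)
for pred, actual in zip(mean_pred, frac_wins):
    print(f"predicted {pred:.2f} -> actual win rate {actual:.2f}")
```

<p>For a well-calibrated model the two columns track each other; gaps
at the extremes are exactly what the plot below reveals.</p>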
<p><a href="http://thespread.us/images/win_probability_by_wins.png"><img alt="win_probability_by_wins" src="http://thespread.us/images/win_probability_by_wins.png" /></a></p>
<p>This is super interesting. Essentially what we see here is that the
model is wrong at the extremes, but in different ways. Teams with very,
very low win probabilities still end up winning the game about 20% of
the time. Teams with win probabilities as high as 80% may only win the
game 60% of the time. I've also included the 95% confidence interval
here because this is based on a relatively small test set.</p>
<p>It's important to think about this, especially when you see games with
wild win probability graphs. Thinking about this is one of the reasons I
started this project in the first place. This is a really important
plot, and not one you'll see on a lot of sites. I tweeted about this,
and <a href="http://tempo-free-gridiron.com/">Tempo Free Gridiron</a>, who produces
win probabilities for college football, helpfully <a href="https://lh5.googleusercontent.com/-QFPvIAP1atY/UsSJ5AjUjCI/AAAAAAAAACQ/uI9W6n2hR1I/s800/test2.png">produced
one</a>
as well.</p>
<p>Brenton Kenkel <a href="https://twitter.com/brentonk/status/418482022915780608">asked me on
Twitter</a> why the
errors in the above plot aren't symmetric -- i.e., why isn't the model
wrong the same amount of time around 0 and around 1? I don't have an
immediate answer for this, but it's a good question! I'd love to hear
your thoughts.</p>
<p><strong>Closing</strong></p>
<p>Let's revisit the Baltimore-Denver game that opened up the 2013 season
and that I used as an example in a <a href="http://thespread.us/?p=45" title="Building a win probability model part 1">previous
post</a>.
Here's the win probability for that game using the full model. I've
altered the uncertainty estimates, though, and am instead using a
bootstrapped 95% confidence interval based on bootstrapped samples of
the estimated probabilities from each tree of the random forest.</p>
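<p>One way such an interval can be computed is sketched below; the
per-tree probabilities here are simulated stand-ins (one estimate per
tree for a single play), and the details of my actual implementation
may differ:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated per-tree win probability estimates for one play, standing
# in for the output of a 100-tree random forest.
tree_probs = rng.beta(6, 4, size=100)

# Resample the trees' estimates with replacement and take the mean,
# many times; the 2.5th and 97.5th percentiles give a 95% interval.
boot_means = [rng.choice(tree_probs, size=tree_probs.size, replace=True).mean()
              for _ in range(2000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"win probability {tree_probs.mean():.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```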
<p><a href="http://thespread.us/images/bal_den_win_prob_bootstrap.png"><img alt="bal_den_win_prob_bootstrap" src="http://thespread.us/images/bal_den_win_prob_bootstrap.png" /></a></p>
<p>The bootstrapped confidence interval got rid of some of the wild swings in
uncertainty in the final portions of the game, but the interval also
appears to be wider in general than the previous one. That being said, I
think I trust these estimates more than the previous hack-y ones.</p>
<p>Also of note is that Denver, which was a 7.5 point favorite in this
game, starts the game with approximately a 75% win probability and,
despite trailing at the half, never dropped below a 50% win probability
(although the confidence interval at halftime stretches from 20% to 90%).
That tells a much richer story than the ~58% mean win probability
expressed by the single blue line.</p>
<p>Progress! Next up, I'm implementing a tool that will allow you to
interactively compare and plot game situations with all of the
uncertainty and evaluation metrics included. I'm still working on live,
in-game win probability estimates for the playoffs. I don't know if that
will happen or not.</p>
<p>The code for this post will be posted ASAP!</p>Full model vs. limited model2014-01-01T15:13:00-08:00treycauseytag:thespread.us,2014-01-01:full-model-vs-model-moel.html<p>[["Metric","Full model","Limited model"],["Training
accuracy","0.80","0.76"],["Test
accuracy","0.66","0.65"],["Precision","0.66","0.65"],["Recall","0.66","0.65"],["AUC","0.71","0.70"]]</p>Building a win probability model, part 3: What's a good model?2013-12-22T16:04:00-08:00treycauseytag:thespread.us,2013-12-22:building-a-win-probability-model-part-3-whats-a-good-model.html<p>How do we know when we have a good model?</p>
<p>The win probability model is well on its way and we can now produce
probabilities for any given down, distance, score differential, field
position, and time remaining. Yet we haven't really evaluated our model.
How do we know if it's any good? This ends up being a more complicated
question than one might think at first.</p>
<!--more-->
<p>There are some standard techniques for evaluating the performance of a
classifier (remember -- the model is trying to predict if a given play
belongs to the 'winning' class or not). I'll walk through some of these
and discuss how they give us a better understanding for how much trust
we can put in the model. After all, if you want to base decision-making
on the model, you'd like to have some idea of how well it performs. As
far as I know, most existing win probability models out there don't
present any kind of model-checking diagnostics or let you see behind the
curtain.</p>
<p>I've already mentioned that we're primarily concerned with how well the
model <em>generalizes</em> out of our training set of plays. In other words, we
want to have confidence that the model will perform reasonably well when
it's presented with new data. We need more than a model that does a good
job of explaining past games -- we already know who won those games.
This is where the <em>learning</em> part of <em>machine learning</em> comes in. We
want our model to learn about the things that identify when a team is
likely to win a game and, once it's learned to identify those things, to
apply this knowledge in new places and situations.</p>
<p>But in order to do that, our model first needs to learn about
the <em>training data. </em>This means we need to know how well the model
performs on the 2002-2012 play-by-play data. There are a few ways to do
this. The easiest first stop is to ask our model to make predictions on
every play in the training set and then compare that against the truth.
This will give us an overall accuracy score. In the case of our random
forest model, our accuracy score is 0.76.</p>
<p>Put another way, when asked to guess if each play in the training data
belongs to a winning team or a losing team, the model is right 76% of
the time. Pretty good, right? Well, maybe. When you're evaluating how
well a prediction is doing, you need to know the <em>base rate</em> -- how
often each class occurs in your data set. Maybe it's just that winning
teams run a lot more plays than losing teams. If winning teams run 76%
of the plays, our model isn't really doing any better than chance.</p>
<p>Luckily, winning teams run 50.82% of the plays in the training data. So,
we're doing better than chance on that front.</p>
<p><strong>Out of sample testing</strong></p>
<p>So, how does the model do out of sample? Its average accuracy score on
2013 games is 0.63, or it gets the class right about 63% of the time. Uh
oh. That's 13 percentage points worse than the training set! Is this a problem? No. We'd
love for the model to do better on the test set if possible, but you
would expect performance to drop. The 2013 games might contain plays the
model has never seen before, or the same plays the model has seen before
but with different outcomes, or any combination of these two.</p>
<p>So far, we've only talked about 'accuracy.' But there are different ways
to be right and wrong. Each prediction the model makes has one of four
possible characteristics: a <strong>true positive</strong> (the model correctly
predicts a winning team), a <strong>false positive</strong> (the model predicts a
winning team that is actually a losing team), a <strong>true negative</strong> (the
model correctly predicts a losing team), or a <strong>false negative</strong> (the
model predicts a losing team that is actually a winning team). Building
a predictive model means figuring out the right tradeoffs between the
correct and the incorrect predictions. You can imagine there are cases
when a false positive is more costly than a false negative, so we might
pick the model that prioritizes one over the other.</p>
<p><a href="http://thespread.us/images/precision.png"><img alt="Precision and recall from
Wikipedia." src="http://thespread.us/images/precision.png" /></a></p>
<p>We can put a number to these rates. <em>Precision</em> is the proportion of our
predicted wins that were actually wins. The random forest model has a
precision score of .63, meaning of all the plays it predicted to belong
to winning teams, it was right 63% of the time. <em>Recall</em> is another
common metric: out of all the <em>actual</em> plays belonging to winning teams,
what proportion of them did the model catch? In the random forest model
case, it's .63 again. This is pretty good news, it means our model is
not biased toward positive or negative classification. For a visual
depiction, see the <a href="http://en.wikipedia.org/wiki/Precision_and_recall">graphic from
Wikipedia</a> above.</p>
<p>One common way of presenting how the model does on true/false positives
and negatives is a <em>confusion matrix. </em>It looks like this:</p>
<table>
<tr><th></th><th>Predicted loss</th><th>Predicted win</th></tr>
<tr><th>Actual loss</th><td>8314</td><td>5251</td></tr>
<tr><th>Actual win</th><td>4695</td><td>8719</td></tr>
</table>
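<p>Precision, recall, and accuracy can be read straight off a confusion
matrix. Plugging in the cell counts from the matrix above lands on or
near the ≈.63 figures quoted earlier (small gaps are likely rounding):</p>

```python
# Cell counts from the confusion matrix in this post.
tn, fp = 8314, 5251   # actual losses: predicted loss / predicted win
fn, tp = 4695, 8719   # actual wins:   predicted loss / predicted win

precision = tp / (tp + fp)   # of predicted wins, how many were wins?
recall = tp / (tp + fn)      # of actual wins, how many did we catch?
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```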
<p><strong>Thresholds</strong></p>
<p>So far, I've been talking about the model making predictions about
winning and losing, but we also know that the model is actually
producing probabilities. This was a little sleight of hand on my part.
Because the model is actually producing probabilities, we can select a
probability above which we're willing to believe the play belongs to a
winning team. Most software packages use .5, or 50/50, as the threshold,
but this is often not a good choice. Particularly in noisy
situations where we can't know the 'true' probability of an event, we may
just be willing to accept relative probabilities that are ranked in the right order.</p>
<p>So, how to pick a threshold? One popular way is to find the threshold
that maximizes the true positive rate and minimizes the false positive
rate. We can ask the model to make predictions at many different
thresholds and record these numbers for each threshold and then plot
them against each other. What this produces is something called
a <em>receiver operating characteristic</em> curve (often abbreviated as ROC
curve). It looks like this:</p>
<p><a href="http://thespread.us/images/rf_roc.png"><img alt="rf_roc" src="http://thespread.us/images/rf_roc.png" /></a></p>
<p>For our model, it turns out that the threshold that satisfies this
condition is 0.492906.</p>
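<p>One common way to formalize 'maximize the true positive rate,
minimize the false positive rate' is Youden's J statistic: pick the
threshold where TPR − FPR is largest. A sketch with simulated scores
(not the actual model's output):</p>

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# Simulated plays: winning-team plays (y=1) tend to get higher scores.
y = rng.integers(0, 2, size=5000)
scores = np.clip(0.5 + 0.15 * (2 * y - 1) + 0.2 * rng.normal(size=5000), 0, 1)

fpr, tpr, thresholds = roc_curve(y, scores)
best = thresholds[np.argmax(tpr - fpr)]  # Youden's J: max(TPR - FPR)
print(f"threshold maximizing TPR - FPR: {best:.3f}")
```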
<p>We can also use the ROC curve to compare different models against
each other. Here's an example where I compared our random forest to a
couple of other models, using gradient boosted decision trees and
logistic regression:</p>
<p><a href="http://thespread.us/images/model_rocs.png"><img alt="model_rocs" src="http://thespread.us/images/model_rocs.png" /></a></p>
<p>In this case, we want to select the model that maximizes the <em>area under
the curve</em> (AUC). Unfortunately, it looks like all three of these models
perform almost equally well. That's a potentially troubling sign: it
might mean that there's so much noise in the play-by-play data that
producing a better model will be hard work. Then again, we're using a
relatively small number of features right now.</p>
<p>You've made it this far and you might be asking yourself -- so, <strong>do we
have a good model here or not? </strong>And the answer to this question is,
unfortunately, another question -- <strong>compared to what? </strong>That's the
question data scientists are always asking themselves. Does this model
perform as well as any other model we've tried so far? Yes. Is it doing
better than chance? Yes. Is it doing a lot better than chance? Eh, we're
getting there. Can we do better? We're going to try.</p>
<p>Checking out how these models compare to one another with some new
features is the next step. We'll also examine how to decide how often to
retrain the model with new data. Until next time.</p>Confusion matrix2013-12-22T15:31:00-08:00treycauseytag:thespread.us,2013-12-22:confusion-matrix.html<p>[["","Predicted loss","Predicted win"],["Actual
loss","8314","5251"],["Actual win","4695","8719"]]</p>Building a win probability model, part 22013-12-18T20:17:00-08:00treycauseytag:thespread.us,2013-12-18:building-a-win-probability-model-part-2.html<p>Our first model</p>
<p>Now that we have play-by-play data in a format ready for analysis, have
selected our features and target, and have given some thought to why the
world needs another win probability model, it's time to start modeling.
<!--more--> The first modeling technique we'll be trying is called a
<em>random forest</em> (or, if you want to use the non-trademarked name,
forests of randomized decision trees). This is a really popular method
these days, and if you take a look at data science competitions on
Kaggle you'll see that many of the winners use random forests as part of
their modeling toolkit.</p>
<p>Random forests are popular for a number of reasons. First, they
don't <a href="http://thespread.us/?p=30" title="Win probability, uncertainty, and overfitting"><em>overfit</em></a>
as easily as many other methods. Second, they're really robust to
non-linear interactions among your features. Third, they're surprisingly
accurate in lots of modeling situations. And, fourth, they're easy to
run in parallel, which means that you can estimate random forests on
really honking big data sets across lots of computers with lots of CPUs.</p>
<p>So... what exactly <em>are</em> they and how do they help us estimate win
probability? A full discussion of random forests would take way too long
and, besides, there are lots of them out there. I'll try and give you
the high-level view, though. Random forests are a kind of ensemble
model. Instead of building one model, you build lots of small (usually
simple, bad) models and combine their output, which often leads to more
accurate predictions than using a single, highly tuned model.</p>
<p>One way that random forests do this is through a process known
as <em>bagging</em>, which is just short for <em>bootstrap aggregating</em>. Still
with me? OK. What's going on under the hood here is that many small,
random samples are taken from your original data <em>with
replacement</em>. Taking samples with replacement is a way of simulating
taking many random samples from the entire population when you only have
a sample.</p>
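<p>Bootstrap resampling is only a few lines of code. A toy sketch in NumPy (the percentages are a well-known property of sampling with replacement, not something specific to this model):</p>

```python
import numpy as np

rng = np.random.RandomState(42)
data = np.arange(10)  # pretend these are our plays
n = len(data)

# One bootstrap sample: draw n observations *with replacement*.
sample = data[rng.randint(0, n, size=n)]

# Repeat many times to simulate many datasets from the same process.
samples = [data[rng.randint(0, n, size=n)] for _ in range(1000)]

# On average, a fraction 1 - (1 - 1/n)**n of distinct observations
# (about 65% for n = 10) shows up in any one bootstrap sample.
frac_unique = np.mean([len(np.unique(s)) / n for s in samples])
print(frac_unique)
```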
<p>So, while you only have one dataset, you're simulating many datasets
that were (hopefully) produced by the same data generating process. You
then build a model for each of these little samples. Random forests take
this one step further and do the same thing with all of your features as
well. So, for each of these little models, you have a random subset of
your features. Each of these little models then makes a prediction about
the observations in its subsample -- for our purposes, that means for
each play, the small models each make a prediction about if the play
belonged to the winning team or not (and with what probability).</p>
<p>The predictions from these subsamples are sometimes
called <em>votes</em>. We can then tally up all of those votes using some
predetermined rule (often just a majority) and get more accurate
predictions. Each of the small models is a <em>decision tree
classifier</em>. These are basically flowcharts used to make predictions.
The random forest algorithm keeps splitting the subsamples up based on
values of the features until it reaches a predetermined stopping point.
There are a number of ways to determine how to make these splits, but
suffice to say that they come from information theory. Here's an example
of a decision tree from Wikipedia.</p>
<p><a href="http://thespread.us/images/Decision_tree_model.png"><img alt="Example decision tree from
Wikipedia" src="http://thespread.us/images/Decision_tree_model.png" /></a></p>
<p>A nice side effect of this and the random sampling of the features
between models is that we can estimate how important various features
are for predicting win probability. If we can remove a feature from our
model and our errors in predicting wins don't get much worse, that
feature probably isn't that important to our model. In the end, this
means that all of your observations and all of your features are used in
building the model, but not at the same time. This helps with
overfitting and allows you to use as much of your data as possible when
building the model.</p>
<p>Importantly, each of these votes also provides a predicted probability
of winning, which will enable us to quantify the uncertainty surrounding
our final win probability estimates. We'll just take each of the
predicted probabilities for each observation, sort them in ascending
order, and take the 2.5th and 97.5th percentiles to give us a 95%
confidence interval. Hopefully, this will mean that game situations that
usually belong to a winning team will have more precise estimates of win
probability than those situations that are less associated with winning.
We can also examine individual game situations visually. Here's an
example.</p>
<p><a href="http://thespread.us/images/touchback_probs1.png"><img alt="touchback_probs" src="http://thespread.us/images/touchback_probs1.png" /></a></p>
<p>This violin plot represents the distribution of estimated win
probabilities by the random forest model for a team receiving the ball
at the beginning of the game on their own 20 via a touchback from the
opening kick. The plot gets wider as there is more 'density' or more
votes at that probability. The dashed line represents the median vote --
the 50th percentile of the votes -- and the two dotted lines are the 25th
and 75th percentiles. Interestingly, the model says that teams who
receive the ball first have about a 47% win probability (.474,
actually). One thing we can ask -- does this make sense? We'll explore
that soon.</p>
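<p>The distribution behind a plot like this comes straight from the forest's individual trees. A sketch with scikit-learn on synthetic data (the real model uses the play-by-play features): each tree's predicted probability is one vote, and the 2.5th and 97.5th percentiles of the votes bound a 95% interval.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data; the real features are score, down, distance, etc.
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
forest = RandomForestClassifier(n_estimators=150, random_state=0).fit(X, y)

# Each tree in the ensemble casts its own probability "vote"
# for a single game situation (here, one row of features).
situation = X[:1]
votes = np.array([tree.predict_proba(situation)[0, 1]
                  for tree in forest.estimators_])

lower, upper = np.percentile(votes, [2.5, 97.5])
point = votes.mean()  # the forest's overall estimate
print(lower, point, upper)
```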
<p>I've used <a href="http://scikit-learn.org/">scikit-learn</a>, a machine learning
library for Python, to create my model, but you can use just about any
statistics package to build a random forest. <a href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-111866146X.html">You can even use
Excel</a>!</p>
<p>[Technical sidenote: Random forests require some <em>tuning</em> to find the
optimal number of observations to include in each subsample, how many
features to use in each model, how many 'trees to grow', and so on.
There are different ways to do this. I used what's known as grid search,
which tries many different parameters and selects the best one based on
cross-validation error. More technical posts on this later. My model
uses 150 trees, with a max of 3 features per tree, and a minimum of 100
samples per leaf.]</p>
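<p>A grid search like the one described can be sketched with scikit-learn's <code>GridSearchCV</code>. The grid values below just mirror the numbers mentioned in the sidenote; the actual grid searched for this model was presumably larger:</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data; substitute the real play-by-play features.
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# Try a small grid of the knobs mentioned above; each combination
# is scored by cross-validation and the best one is kept.
grid = {
    "n_estimators": [50, 150],
    "max_features": [2, 3],
    "min_samples_leaf": [50, 100],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```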
<p>In this post, I'll show you what plotting the win probability looks like
for a single game. In a following post, I'll discuss different ways
to <em>validate </em>a model; i.e., how do we know if we have a <strong>good</strong> model?
For now, suffice to say that I'm testing the model on 2013 data, which
were not used in building the model.</p>
<p><strong>Modeling a single game</strong></p>
<p>Let's start examining what the model can do by picking a single game,
this season's opening game. After a slow start and trailing at the half,
Denver defeated Baltimore by a score of 49-27. Here's what the win
probability plot looks like without any kind of uncertainty estimates.
The red lines indicate the start of each quarter.</p>
<p><a href="http://thespread.us/images/den_win_week12.png"><img alt="den_win_week1" src="http://thespread.us/images/den_win_week12.png" /></a></p>
<p>Some important things to notice off the bat. Roughly halfway through the
first quarter, we see that the model is giving Baltimore roughly 70%
probability of winning the game. Denver's only down by a touchdown and
it's not even the second quarter yet! Crazy. Clearly, this leads us to
question what it means to have a 'win probability' in the first quarter
(hat tip to <a href="https://twitter.com/joelgrus">Joel Grus</a> for that insight).</p>
<p>Let's take a look at a first pass at uncertainty. I stored the votes
from each of the 150 trees of the random forest for each play and used
them to construct what I'll call a 95% uncertainty interval. This isn't
really a confidence interval <em>or</em> a prediction interval, but it does
give us some idea of how stable the probability estimates are.</p>
<p><a href="http://thespread.us/images/den_win_with_uncertainty.png"><img alt="den_win_with_uncertainty" src="http://thespread.us/images/den_win_with_uncertainty.png" /></a></p>
<p>OK, clearly this needs some work -- what are those stalactites in the
3rd and 4th quarters all about? Is this a good way to quantify
uncertainty? One thing is clear: the 70% win probability attributed
to Baltimore in the first quarter came with a 95% interval stretching
from 52% to 84%. That's a pretty wide range, and conveys a
lot more information than a bare 70%. The uncertainty estimates get narrower as
the game goes on, as we would hope they would, with those obvious
exceptions.</p>
<p>In the coming posts, we'll test out some other models -- gradient
boosted decision trees and good old fashioned logistic regression and
see how they fare. We'll also look at <em>feature engineering</em> -- the
process of selecting features and combining features to increase the
accuracy of our model.</p>
<p>Code for this post is available on
<a href="http://github.com/treycausey/thespread/">Github</a>.</p>Probabilities, models, and reality2013-12-10T12:33:00-08:00treycauseytag:thespread.us,2013-12-10:probabilities-models-and-reality-2.html<p>Alternate title: statistics and smugness</p>
<p>If you follow me on <a href="http://twitter.com/treycausey">Twitter</a>, you may
have seen me rant here and there about sports analytics people and
treating statistical questions as settled and being too smug about the
conclusions they draw from existing analytical work. This might sound a
little crazy coming from someone who builds statistical models for a
living and for a hobby. Am I actually advocating for journalistic
narratives of grit and momentum?</p>
<!--more-->
<p>Not at all -- in fact, the best data scientists are up front about their
uncertainty and careful not to treat models as if they were reality. There's
even a term for that mistake:
<a href="http://en.wikipedia.org/wiki/Reification_(statistics)">reification</a>.
The models that we build are abstractions of reality. This is a feature,
not a bug. Model building is all about identifying the information that
best characterizes a phenomenon and generalizes across multiple
scenarios.</p>
<p>Recently, the <em>New York Times</em> introduced a new feature in conjunction
with Advanced NFL Stats, the <a href="http://www.nytimes.com/newsgraphics/2013/11/28/fourth-downs/">Fourth Down
Bot</a>.
Fourth down is one of those situations in which the analytically savvy
know that coaches are far, far too conservative. This bot tracks every
fourth down situation each weekend in the NFL and crunches the numbers
to see what would be an optimal call and whether or not the coaches made
the 'right call.'</p>
<p>So what's the problem? See Aaron Schatz below, or <a href="http://statsbylopez.wordpress.com/2013/12/04/my-quick-thoughts-on-the-4th-down-bot/">Michael
Lopez</a>.</p>
<blockquote>
<p>The problem with <a href="https://twitter.com/NYT4thDownBot">@NYT4thDownBot</a>
as currently built is it encourages the idea that 4th D is always a
cut-and-dry decision.</p>
<p>— Aaron Schatz (@FO_ASchatz) <a href="https://twitter.com/FO_ASchatz/statuses/407989774730141696">December 3,
2013</a></p>
</blockquote>
<p>If you read Grantland on a regular basis, you'll know that win
probabilities are often given as arguments for why coaches did or didn't
do the right thing. <a href="http://www.grantland.com/contributor/_/name/bill-barnwell">Bill
Barnwell</a> has
a regular feature, <em>Thank You For Not Coaching</em>, in which he routinely
does so.</p>
<p>As I mentioned in an <a href="http://thespread.us/?p=30" title="Win probability, uncertainty, and overfitting">earlier
post</a>,
these win probabilities are <em>estimates</em>. There's some error and
uncertainty associated with these estimates. Unfortunately, we don't
usually know their magnitude. Treating these numbers as if they're the
actual probability of winning is a fallacy.</p>
<p>To expand upon this further, take the example of the Advanced NFL Stats
Win Probability model. If it seems like I'm picking on Brian Burke's
work, let me assure you quite the opposite is true. Brian's a giant in
the field and his work is the standard against which all other work is
often compared (which is why the <em>Times</em> partnered with him).</p>
<p>Recently, some major enhancements were added to the ANS model. Most
notably? Incorporating team strength into the model. Think about this
for a second -- the existing model that most everyone was using to make
their arguments about coaching had not yet taken into account if one
team was known to be better than the other team. Now, Brian lays out the
case for being agnostic about team quality when building a win
probability model -- and it is a good one -- and the model that we
initially build here won't incorporate team strength, either.</p>
<p>[Technical sidenote: As far as I know, Brian's model is not Bayesian,
but it uses the spread for a game as a sort of prior that decays in its
impact as the game is played. You could think of this as a Bayesian
prior with influence that is eventually overtaken by the likelihood.
There needs to be a lot more Bayesian work in sports.]</p>
<p>Yet, this is a major update. I have no idea how much this changes
estimates and what previous estimates would look like when using the new
model. But it serves as an important corrective -- the model you're
using is never <strong>the </strong>model. It's just a current iteration of a set of
models. The estimates that the model produces aren't reality. They're
the model's best guess about reality using a limited set of simplified
inputs.</p>
<p>This is why it's so tremendously important, as data scientists and as
sports analysts, to iterate, iterate, iterate, then validate, validate,
validate. Models become stale the minute they're deployed. Are they
robust to outliers and extreme values (so-called 'black swans?') Has the
data generating process changed? Are there heterogeneous populations
that are being modeled as homogeneous?</p>
<p>These are the questions you need to ask of yourself and your models.</p>
<h3>Building a win probability model, part 1 (2013-12-08)</h3>
<p><strong>Data preparation</strong></p>
<p>This begins a series of posts on building a win probability model. In
actuality, we're going to be building a lot of models that will be
combined into one, a technique known as <em>ensemble learning</em>. There are
several advantages to doing this. First, we can start very simply and
measure how our win probability model does against existing models and,
second, it will allow us to iterate and improve in small steps.
<!--more--></p>
<p>Tackling any data science problem requires (wait for it...) data. I'll
be using play-by-play data from <a href="http://armchairanalysis.com">Armchair
Analysis</a>, combined with data from <a href="http://nflsavant.com/about.php">NFL
Savant</a> and <a href="http://www.advancednflstats.com/2010/04/play-by-play-data.html">Advanced NFL
Stats</a>.
The Armchair Analysis data costs $25, but it's been cleaned quite well
and comes separated into various tables to be loaded into a SQL
database. It does not include 2013 data, so in order to make predictions
and do out-of-sample testing, I'll supplement using the NFL Savant and
ANS data.</p>
<p>We're going to approach this particular data science problem as an
example of a <em>classification </em>problem (or, more specifically, a <em>class
probability estimation</em> problem). In plain English, we're trying to
figure out which <em>class</em>, or group, each of our plays belongs to --
winners or losers? What our classification model will do is
estimate the probability that each play belongs to either the winners
class or the losers class.</p>
<p>To begin to estimate the probability that a team wins before a given
play situation, we need a few things. We need our variables,
or <em>features</em> as they are called in machine learning, and we need an
outcome, or <em>target</em> in machine learning terms. If you have a more
traditional science or experimental background, you might recognize
these as the independent and dependent variables, respectively. We'll
try and figure out what combination of our features best predicts the
target and with what probability. What are our features and target in
this case?</p>
<ol>
<li>Score. We need the score as it was before each play was run. In
order to simplify things, we're actually going to use the <em>score
differential</em> with respect to the offense. This is, at its heart, an
offensive model, so it makes sense to frame things offensively. So,
each row of our dataset should have the point differential. If it's
positive, the offense is leading by that many points, if negative,
the offense is trailing by that many points, and if it's zero, the
teams are tied.</li>
<li>Down and distance. Pretty self explanatory, but we'll need separate
columns for down and yards to go until the next first down.</li>
<li>Field position. Similar to score differential, we'll code this in
terms of the offense. Rather than use 50-yard increments, we'll
convert field position into 100-yard increments and code it as
distance from own end zone. A team that is receiving the ball for
the first time following a touchback would be scored as on the 20;
the same position in the other team's half (the beginning of the
'red zone') would be the 80.</li>
<li>Time remaining. Again, we could break this up a bunch of different
ways: quarters, minutes, seconds, etc. If we begin with an
assumption that win probability estimates become more certain as the
amount of time in the game decreases, we should use a constant unit.
I'll use seconds remaining as my unit.</li>
<li>Outcome. If we're going to build the model, we need to know if teams
actually ended up winning from a given position or not. This might
require a little data fu to compile for some. Again, we'll code this
in terms of the offense, coding this variable as a 1 if the offense
ended up winning and a 0 otherwise.</li>
</ol>
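<p>The five items above can be sketched as a feature-building step in pandas. The raw column names here are hypothetical -- the real ones depend on the data source (Armchair Analysis, NFL Savant, etc.):</p>

```python
import pandas as pd

# Hypothetical raw play-by-play rows; actual column names will differ
# depending on the source.
plays = pd.DataFrame({
    "off_score":    [0, 7, 7],
    "def_score":    [0, 0, 3],
    "down":         [1, 2, 3],
    "yards_to_go":  [10, 4, 8],
    "yardline_own": [20, 35, 80],      # yards from own end zone
    "secs_left":    [3600, 3000, 1500],
    "off_won":      [1, 1, 0],         # did the offense end up winning?
})

# Features coded with respect to the offense, as described above.
features = pd.DataFrame({
    "score_diff": plays["off_score"] - plays["def_score"],
    "down": plays["down"],
    "yards_to_go": plays["yards_to_go"],
    "yards_from_own_goal": plays["yardline_own"],
    "seconds_remaining": plays["secs_left"],
})
target = plays["off_won"]
print(features.head())
```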
<p><span style="font-size: 13px; line-height: 18px;">We'll add more
variables, like time outs remaining, as we build the model, but this is
a good start. For now, I've excluded kickoffs, no plays, onside kicks,
two-point conversions, punts, and field goal/extra point lines from the
data as well as all post-season games. </span></p>
<p>Why exclude the post-season? When data scientists create models, they're
often operating under the assumption that all of the data were produced
by the same <em>data generating process. </em>This is just a way of saying that
the same basic decision-making processes were used by coaches to
generate the data we have here. We can't know what the coaches were
thinking or what they meant to do, only what they did. The post-season
is a different monster. Since one loss will end your season and send
your team on vacation, coaches may employ different strategies and
attempt things they might not normally do during the season. This
implies that the post-season data might be generated via a different
process.</p>
<p>Now, notice that I just made an assumption about the data generating
process. I don't know if it's true. One of the things we can do later is
test it. That's one of the foundations of doing good data science --
make your assumptions explicit and test to see whether or not the data
support them.</p>
<p><span style="font-size: 13px; line-height: 18px;">Here are the first
five rows of my data, which I've stored in a Postgres database (note I
start using data from 2001, even though Armchair Analysis begins with
2000, because of a data issue):</span> [table id=2 /]</p>
<p><strong>Inspecting the data</strong></p>
<p>The next step is to check the quality of the data and see if there are
any obvious extreme values (i.e., a field position of greater than 100
yards, seconds greater than 3600, etc.) Some tables and plots when you
first get your hands on a new data set can go a long way toward avoiding
headaches later on.</p>
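<p>Those range checks are cheap to automate. A sketch in pandas (column names are the hypothetical ones from the feature list above):</p>

```python
import pandas as pd

# A few hypothetical rows of the prepared data.
df = pd.DataFrame({
    "yards_from_own_goal": [20, 80, 50],
    "seconds_remaining": [3600, 1500, 0],
    "down": [1, 3, 4],
})

# Cheap range checks before any modeling; a failure means dirty data.
assert df["yards_from_own_goal"].between(1, 99).all()
assert df["seconds_remaining"].between(0, 3600).all()
assert df["down"].isin([1, 2, 3, 4]).all()
print(df.describe())
```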
<p>To do this, I'll be using my language of choice, Python, but you can do
any of this in any language. All of my Python code will be submitted to
my <a href="https://github.com/treycausey/thespread">Github account</a>. If you're
completely new to this, you may want to check out <a href="http://shop.oreilly.com/product/0636920023784.do">Python for Data
Analysis</a> to get the
basics of the PyData stack and/or <a href="http://shop.oreilly.com/product/0636920018483.do">Machine Learning for
Hackers</a> for an
overview of some of the methods used here (although in R).</p>
<p><a href="http://thespread.us/images/seconds_remaining2.png"><img alt="seconds_remaining_603" src="http://thespread.us/images/seconds_remaining2.png" /></a></p>
<p>Just looking at the distribution of seconds remaining across all of the
plays, we already see a few interesting things. The beginnings and ends
of quarters and halves immediately jump out -- more plays will have
3600, 2700, 1800, or 900 seconds remaining. We also see a bunching up
around halftime and the end of the game (0 on the x-axis), presumably
because more time outs are called here. Everything looks good so far.</p>
<p><a href="http://thespread.us/images/down_distribution.png"><img alt="down_distribution" src="http://thespread.us/images/down_distribution.png" /></a></p>
<p>This looks good too. Let's look at two more, field position and final
score differential before we wrap up this post.</p>
<p><a href="http://thespread.us/images/yards_from_own_goal.png"><img alt="yards_from_own_goal" src="http://thespread.us/images/yards_from_own_goal.png" /></a></p>
<p>Again, nothing too surprising here. Most teams don't spend a whole lot
of time backed up near their own end zone, but we see a big spike at 20
yards due to touchbacks. This then decreases steadily from about the
halfway mark.</p>
<p>Finally, let's look at the final score differential.</p>
<p><a href="http://thespread.us/images/score_diff_distribution_20.png"><img alt="score_diff_distribution_20" src="http://thespread.us/images/score_diff_distribution_20.png" /></a></p>
<p>Looks like the most common score differential is pretty low with a few
extreme values at the other end of the distribution from rare blowouts.
Let's take a more granular look at the same data, increasing the number
of bins in the histogram.</p>
<p><a href="http://thespread.us/images/score_diff_distribution_601.png"><img alt="score_diff_distribution_60" src="http://thespread.us/images/score_diff_distribution_601.png" /></a></p>
<p>Now we see that the most common score differential is 3 points, with the
next bump at 7. This makes perfect sense.</p>
<p>Wrapping up, we haven't found any crazy surprises in simple plots of our
data. This is great news! Our next step is to start building the model.
If you want a sneak preview of how we'll do that initially, check out
Yhat on <a href="http://blog.yhathq.com/posts/random-forests-in-python.html">building random forests in
Python</a>.</p>
<h3>Football analytics on Twitter (2013-12-04)</h3>
<p>Per a request I got on Twitter, I created <a href="https://twitter.com/treycausey/lists/football-analytics">this
list</a> of
football analytics accounts on Twitter. Feel free to let me know if you
notice omissions or want to be added.</p>
<h3>Win probability, uncertainty, and overfitting (2013-12-01)</h3>
<p><strong>Uncertainty estimates</strong></p>
<p>As a first exercise, I'm building a per-play win probability calculator.
In an effort to be more transparent and make this more of a teaching
exercise, I'll walk you through my thought process on why there's room
for another win probability calculator as I show you how I build
it.<!--more--></p>
<p>A few already exist, the most well-known being the <a href="http://wp.advancednflstats.com/winprobcalc1.php">Advanced NFL Stats
WP model</a>. <a href="http://www.pro-football-reference.com/about/win_prob.htm">Pro
Football
Reference</a> has
also implemented one this year based on Wayne Winston's model in
<em><a href="http://www.amazon.com/gp/product/0691154589/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=0691154589&linkCode=as2&tag=thespread08-20">Mathletics</a></em>.
One of the areas that I see as ripe for improvement with these models is
the explicit incorporation of uncertainty estimates.</p>
<p>Most models we encounter produce what are called <em>point estimates</em>;
these are the single numbers that represent our best guess of a particular
outcome (like the probability of winning a game). Most of these guesses
are probabilistic, though, which means we can be more or less certain
that our guess is going to be close to the real outcome.</p>
<p>If you think in terms of probability, you're already used to
incorporating uncertainty into your estimates whether or not you realize
it. For instance, if you say there's a 10% chance of a team winning a
game, you're saying that your best guess (your point estimate) is that
they'll lose the game, but that there's a small (but non-zero)
chance that they'll actually win.</p>
<p>We can take this a step further. When we say that a team has a 10%
chance to win, are we saying that the team has <strong>exactly</strong> a 10% chance
of winning? No, 10% is a point estimate. Depending on how much
information we have about teams in this particular position, we can be
more or less certain about that 10%. If it is a situation with gobs of
data, we may actually think that the team has between a 5% and 15%
chance of winning, with 10% being the middle of that uncertainty
interval. However, maybe only a few teams have ever been in this
situation, so it's actually the case that the team has between a 1% and
a 19% chance of winning. In the second scenario, the team may have
almost twice the probability of winning as in the first scenario, but
both are reported as a 10% win probability.</p>
<p>I think about this a lot when I see
<a href="http://www.sportsplusnumbers.com/2013/10/whats-matter-with-win-probability.html">articles</a>
talking about wild swings in win probability or teams with extremely
high win probabilities before tanking in the end. There's no question
that these things happen, I just wonder how wild the swings are. Small
(and sometimes bigger) changes in win probability from one play can fall
comfortably within the uncertainty estimate of the win probability on
the next play. Not only does providing uncertainty estimates allow for a
more accurate representation of the likelihood of an outcome, it also
helps in decision-making.</p>
<p><strong>Overfitting</strong></p>
<p>This brings me to my second point: overfitting. This is a topic that
deserves its own series of posts, but I'll hit the high points here
before discussing it at more length later. When data scientists build
predictive models, we usually build the models on a set of data known as
the <em>training set </em>(the model-building step is also called model
training). But what we really want, in many cases, is to be able to make
predictions about the future.</p>
<p>There are two obvious problems with that goal. First, we can't check and
see how good we are at predicting the future until it actually happens.
Second, doing that could be tremendously costly. So what we normally do
is reserve a portion of the data we have, called the <em>test set. </em>We
don't use this set of data when we build the model. Once the model is
built, we use it to make predictions using all of the data in the test
set but not telling the model the outcome. We let the model make
predictions and then expose the actual outcomes and see how well we did
in predicting them. How well the model does is known as <em>out-of-sample
performance.</em></p>
<p>Why is this important? Why wouldn't we want to use all of our data in
building the absolute best model? We do -- and more posts are to come
about this. The problem, though, is when we gauge the model's
performance only by using the data used in building the model, we are at
risk of <em>overfitting. </em>The goal of the model is to find the underlying
patterns that do the best job of predicting outcomes in as many
situations as possible; that is, the goal is for the model
to <em>generalize. </em>If we only have one set of data and we use all of it in
the model, we could continue to build a model more and more complicated
until it perfectly predicted all of the outcomes in the training set.
The only problem is that in doing so we have built a complicated
description of the data we already have and that probably will fall
apart some when faced with new data.</p>
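<p>The gap described above is easy to demonstrate. In this sketch, an unconstrained decision tree memorizes noisy synthetic training data perfectly but does noticeably worse on held-out data -- the signature of overfitting:</p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, loosely mimicking the randomness of real games.
X, y = make_classification(n_samples=2000, n_features=10, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# An unconstrained tree keeps splitting until it memorizes the training set.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(train_acc, test_acc)  # training accuracy is (near) perfect; test is worse
```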
<p>This is a really important concept, and it's taught much more in
computer science/machine learning than in traditional statistics
courses. I'll be returning to it frequently.</p>
<p><strong>Summing up</strong></p>
<p>So far we know our model needs two things to build upon the win
probability models that already exist -- quantification of uncertainty
and a measure of out-of-sample performance. Luckily, there's a class of
models that let us do these things easily. They're called <em>ensemble
methods;</em> I'll discuss them in the next post.</p>
<p><strong>Update</strong>: Here's a <a href="http://blog.kaggle.com/2012/07/06/the-dangers-of-overfitting-psychopathy-post-mortem/">great
post</a> on
Kaggle detailing the dangers of overfitting.</p>
<h3>About (2013-11-30)</h3>
<p><strong>What's all this?</strong></p>
<p>I'm Trey Causey. I'm a data scientist. I'm a football fan. This site is
an attempt to bring the two together, in the hopes of achieving two
goals. First, to kickstart the use of methods from data science in the
football analytics world. Second, to teach some introductory data
science using interesting, substantive, real-world examples.</p>
<p>The name of the site is a riff on both the spread offense and the spread
used in betting. This is not a betting site, does not offer any advice
on betting, or endorse sports betting in any way. That being said, the
betting world is often a few steps ahead of the game when it comes to
analytics and forecasting.</p>
<p>It's a great time to be involved in sports analytics. Baseball has
already seen its sabermetric revolution. Basketball is quickly
following, especially with the introduction of SportVU and related
technologies.</p>
<p>Football has been slower to warm to advanced statistics. Of course,
absolutely fantastic work is being done by Brian Burke at <a href="http://advancednflstats.com">Advanced NFL
Stats</a>, the <a href="http://www.footballoutsiders.com/">Football
Outsiders</a> crew, and Chase Stuart
at <a href="http://www.footballperspective.com">Football Perspective</a> (to name
only a few). Yet, football analytics remains largely dominated by simple
cross-tabs, linear regression, and ad hoc analyses that select on the
dependent variable, fail to check model assumptions, eschew
out-of-sample testing, and generally don't capitalize on tremendous
advances in probabilistic modeling. And if you don't know what this
means, hopefully I can teach you.</p>
<p>When advanced analysis <em>is</em> conducted, it's often behind closed doors.
Understandably, teams want to preserve any edge they find. However, this
is not only bad for the analytics community, it's bad for the
advancement of football analytics. As has been pointed out on
the <a href="http://www.advancednflstats.com/search/label/podcast?max-results=100">Advanced NFL Stats
podcast</a> (I
can't remember who, sorry!), without peer review, isolated analysts
often have no objective check on the quality of their work.</p>
<p>Let's fix that. You can follow me
on <a href="http://twitter.com/treycausey">Twitter</a> or drop me
an <a href="mailto:trey@thespread.us">email</a> if you have questions or want to
contribute in some way.</p>
<p>Data science and football. Together at last.</p>Hello world! Introducing the spread.2013-11-30T15:30:00-08:00treycauseytag:thespread.us,2013-11-30:hello-world.html<p>What's all this?</p>
<p>I'm Trey Causey. I'm a data scientist. I'm a football fan. This site is
an attempt to bring the two together, in the hopes of achieving two
goals. First, to kickstart the use of methods from data science in the
football analytics world. Second, to teach some introductory data
science using interesting, substantive, real-world examples.</p>
<p>You'll notice there's not much here yet. Obviously I'm not a designer
(if you want to help with that, especially with the header/logo, please
get in touch!). I figured the site would never get off the ground if I
waited until I had a finished product to roll out. So, I'll just update
it as I go. I know it looks terrible on mobile right now. A responsive
design is coming soon.</p>
<p>The name of the site is a riff on both the spread offense and the spread
used in betting. This is not a betting site; it does not offer any advice
on betting or endorse sports betting in any way. That being said, the
betting world is often a few steps ahead of the game when it comes to
analytics and forecasting.</p>
<p>It's a great time to be involved in sports analytics. Baseball has
already seen its sabermetric revolution. Basketball is quickly
following, especially with the introduction of SportVU and related
technologies.</p>
<p>Football has been slower to warm to advanced statistics. Of course,
absolutely fantastic work is being done by Brian Burke at <a href="http://advancednflstats.com">Advanced NFL
Stats</a>, the <a href="http://www.footballoutsiders.com/">Football
Outsiders</a> crew, and Chase Stuart at
<a href="http://www.footballperspective.com">Football Perspective</a> (to name only
a few). Yet, football analytics remains largely dominated by simple
cross-tabs, linear regression, and ad hoc analyses that select on the
dependent variable, fail to check model assumptions, eschew
out-of-sample testing, and generally don't capitalize on tremendous
advances in probabilistic modeling. And if you don't know what any of that
means, hopefully I can teach you.</p>
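<p>If "out-of-sample testing" is new to you, the idea is simply to judge a model on data it never saw during fitting. A minimal sketch, using simulated games rather than real data (every number here is made up for illustration):</p>

```python
import random

random.seed(42)

# Simulated per-game records: (pregame point spread, did the favorite win?).
# Purely illustrative -- not real football data.
games = []
for _ in range(200):
    spread = random.gauss(0, 7)
    won = 1 if spread + random.gauss(0, 10) > 0 else 0
    games.append((spread, won))

# Hold out 30% of the games; the "model" never sees them while fitting.
split = int(0.7 * len(games))
train, test = games[:split], games[split:]

# Toy "model": choose, on the training games only, the spread threshold
# that best separates wins from losses.
best_t = max(range(-10, 11),
             key=lambda t: sum((s > t) == bool(w) for s, w in train))

# In-sample accuracy flatters the model; the held-out games give the
# honest estimate of how it will do on games it hasn't seen.
in_sample = sum((s > best_t) == bool(w) for s, w in train) / len(train)
out_of_sample = sum((s > best_t) == bool(w) for s, w in test) / len(test)
```

<p>The gap between the two accuracy numbers is exactly the kind of check that ad hoc analyses skip when they evaluate a model on the same data used to build it.</p>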
<p>When advanced analysis <em>is</em> conducted, it's often behind closed doors.
Understandably, teams want to preserve any edge they find. However, this
is not only bad for the analytics community, it's bad for the
advancement of football analytics. As has been pointed out on the
<a href="http://www.advancednflstats.com/search/label/podcast?max-results=100">Advanced NFL Stats
podcast</a>
(I can't remember who, sorry! [Edit: It was Ben Alamar on Episode Six of
the podcast, per Dave Collins, the host of said podcast, in the
comments.]), without peer review, isolated analysts often have no
objective check on the quality of their work.</p>
<p>Let's fix that. You can follow me
on <a href="http://twitter.com/treycausey">Twitter</a> or drop me
an <a href="mailto:trey@thespread.us">email</a> if you have questions or want to
contribute in some way.</p>
<p><strong>What's coming?</strong></p>
<p>The first project I'm tackling is an ensemble play-level win probability
calculator. In an ensemble model, you build several (sometimes many)
models to forecast the same outcome and combine their outputs into a
(usually) more accurate prediction. The framework is mostly in place and
will be posted soon. I hope to have it working in real time before the
season is over, but I can't make any promises. If you have experience
building Django or Flask apps, I'd love to hear your input. Second up is
reconceptualizing the idea of 'field goal range' and devising a new
visualization for kick probabilities.</p>
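<p>To make the ensemble idea concrete, here's a toy sketch of the combining step. The three stand-in "models" and every number below are invented for illustration -- this is not the actual calculator, just the averaging mechanic:</p>

```python
import math

def ensemble_win_prob(game_state, models, weights=None):
    """Combine several win probability models by (weighted) averaging."""
    preds = [m(game_state) for m in models]
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    return sum(w * p for w, p in zip(weights, preds))

# Three deliberately crude stand-in models, each mapping a game state
# to a win probability for the team with the ball:
def score_diff_model(s):
    # logistic curve on the score differential
    return 1.0 / (1.0 + math.exp(-0.2 * s["score_diff"]))

def field_position_model(s):
    # small linear bump for field position (yardline 0 = own goal line)
    return min(max(0.5 + 0.002 * (s["yardline"] - 50), 0.0), 1.0)

def time_model(s):
    # the current leader's edge grows as the clock runs down
    lead = 1.0 if s["score_diff"] > 0 else 0.0
    frac_elapsed = 1.0 - s["seconds_left"] / 3600.0
    return 0.5 + (lead - 0.5) * frac_elapsed

state = {"score_diff": 3, "yardline": 60, "seconds_left": 900}
wp = ensemble_win_prob(state, [score_diff_model, field_position_model, time_model])
```

<p>A real version would fit each component model to play-by-play data and learn the weights (e.g., by how well each model calibrates out of sample) rather than averaging equally, but the combination step itself stays this simple.</p>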
<p>Thanks for checking out the spread. Data science and football. Together
at last.</p>