the spread: the (data) science of sports

Elo ratings (part 4)

Sat 05 July 2014

From chess to football

We've almost arrived at the end of the ratings and rankings tutorials. I'll do one more post on Markov ratings, then a couple of posts on ensemble ratings, and then it'll almost be time for the season. This week I'll be talking about Elo ratings. Originally used to rate and rank chess players, Elo ratings are now used in a number of sports, including by Jeff Sagarin for USA Today. They're a very simple and elegant way to create ratings.

Elo ratings are built on the idea that each team has an underlying level of quality, mu, around which its observable performance varies randomly. The only way this measure can change is if the team consistently performs well above or below its expected level. If a team plays a team with roughly the same skill level, its rating won't change much, regardless of the result. Similarly, if a heavily favored team wins as expected, it shouldn't lead to a big increase in the perceived quality of that team.

On the other hand, if the underdog pulls off the upset, it should be rewarded more substantially. The idea is that we have this unknown variable mu and we're continually calibrating our estimates of it based on the performance of each team. Elo is all about strength of schedule in this way.

We can make predictions about future outcomes using the mu values, where mu~ij~ is the expected number of points that team i will score when it plays team j. Or, since we've been using point proportions with a smoother, this value will be the expected proportion of points that team i will score in that game. If mu~ij~ = 0.5, we predict a tie; if mu~ij~ is greater than 0.5, we predict a win for team i; and if mu~ij~ is less than 0.5, we predict a loss for team i.

We start off by setting each team's rating before any games are played to zero. By doing this, we have a "memory-less" rating system, i.e., one that does not take into account any past events and considers all teams equally skillful. Obviously we could change this by including preseason odds of some kind. One nice side effect of taking the zero approach, though, is that all of the ratings will sum to zero and the mean of the ratings will always be zero. This means teams with a positive rating are above average and teams with a negative rating are below average.

Elo ratings are then calculated on a week-by-week basis as follows:

r~new~ = r~old~ + K(S - mu)

where r~new~ is the new rating for a team, r~old~ is the previous rating for that team, K is some constant (called the K-factor), S is the team's actual performance in the game, and mu is the team's expected performance (defined below). K can take on any value, but it's meant to ensure that ratings are not too volatile and that teams don't get rewarded/penalized unduly for beating/losing to teams of lesser quality.
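As a sketch in Python (with hypothetical numbers, and K defaulting to 32 purely for illustration), the update rule is a one-liner:

```python
def elo_update(r_old, S, mu, K=32):
    """New rating: old rating plus K times (actual minus expected performance)."""
    return r_old + K * (S - mu)

# A heavy favorite (mu = 0.9) that performs exactly as expected doesn't move:
print(elo_update(100, 0.9, 0.9))   # 100.0
# An underdog (mu = 0.1) that takes all the points gains K * 0.9 = 28.8:
print(elo_update(-100, 1.0, 0.1))
```

Notice that K caps the maximum swing from a single game at K points, which is exactly what keeps the ratings from being too volatile.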

To update ratings once two teams have played, there are only a couple more steps. We make the assumption that mu~ij~ is the output of a logistic function of the pre-game ratings differential between the two teams. Logistic functions have the form:

f(x) = 1 / (1 + 10^(-d~ij~ / 1000))

where d~ij~ = r~i,old~ - r~j,old~, the difference in the two teams' pre-game ratings. Note that for the first games of the season, f(x) = 1 / (1 + 10^0) = 1 / 2 = 0.5, meaning that, absent other information, we expect each team to score 50% of the points.
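A minimal Python version of this expected-score function (the 1000 in the denominator is the xi parameter discussed next):

```python
def expected_score(r_i, r_j, xi=1000):
    """Expected proportion of points for team i when playing team j."""
    return 1.0 / (1.0 + 10 ** (-(r_i - r_j) / xi))

print(expected_score(0, 0))     # 0.5: two unrated teams are a toss-up
```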

Where did that 1000 come from? It's empirically derived (i.e., derived from past results) and it's called the logistic parameter xi. The value is set so that for every xi rating points of difference between two teams, the higher-rated team has roughly ten times the probability of winning as the lower-rated team. It can be tweaked to account for how much parity there is in a league. Many chess ratings use xi = 400.

Finally, we have all of the information we need to rate and rank the 2013 teams after each game is played using the following formula:

r~i,new~ = r~i,old~ + 32(S~ij~ - mu~ij~)

where S~ij~ is simply the proportion of points (plus the Laplacian correction) scored by team i when playing team j.
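Putting the pieces together, here's a sketch of the full two-team update for a single game. I'm assuming the smoother from the earlier posts is a Laplace-style correction that adds one point to each team's total; the exact smoother may differ.

```python
def expected_score(r_i, r_j, xi=1000):
    return 1.0 / (1.0 + 10 ** (-(r_i - r_j) / xi))

def play_game(r_i, r_j, points_i, points_j, K=32):
    """Update both ratings after one game, using smoothed point proportions."""
    # Assumed Laplace-style smoother: add one point to each team's total.
    S_ij = (points_i + 1) / (points_i + points_j + 2)
    delta = K * (S_ij - expected_score(r_i, r_j))
    # Zero-sum update: whatever team i gains, team j loses.
    return r_i + delta, r_j - delta
```

Because the update is symmetric, the ratings always sum to the same total, which is why the league mean stays at zero when every team starts there.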

Here's how the season shakes out.

Our top four is pretty consistent, with Seattle, Denver, San Francisco, and Carolina claiming those spots. The Chiefs do surprisingly well here, and the Bengals also do well (see the power rankings post for more on this). Jacksonville is basically in free fall until week 8 of the season, when it won its first game of the year; it recovered slightly by year's end. Houston and Washington round out the bottom three. The league's most average team award belongs to the Ravens, almost exactly average with an Elo rating of 0.27.

Prediction. How predictive are these ratings? Using the previous week's rating for each head-to-head matchup, I projected each game using the metric listed above. Each home team is given a home-field advantage boost of 15 points in the projection (but this isn't added to its actual rating). Without this HFA boost, the Elo ratings predict the winner of each game (straight up) in 60% of 2013 games. Factoring in HFA improves that to 62.5% straight up. Langville and Meyer suggest changing the value of K for the last weeks of the season to account for meaningless games. I tried doing so, setting K = 16 instead of 32 for the final two weeks of the season, but this actually decreased predictive accuracy to 62.08%.
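The straight-up projection with the home-field boost can be sketched like this (the 15 points are added only inside the prediction, never to the stored rating):

```python
def predict_home_proportion(r_home, r_away, hfa=15, xi=1000):
    """Expected point proportion for the home team, with a temporary HFA boost."""
    return 1.0 / (1.0 + 10 ** (-((r_home + hfa) - r_away) / xi))

# Two evenly rated teams: the home side becomes a slight favorite.
print(predict_home_proportion(0, 0) > 0.5)   # True
```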

Road to the Super Bowl. How do things look for the teams that ultimately made it to the Super Bowl?


Seattle is ranked higher than Denver every week after week 1, and really pulls away starting in week 9. However, because both teams are so good, our week 17 prediction for the eventual Super Bowl matchup would have been very close. Using the logistic formula above with Seattle's and Denver's week 17 ratings, Seattle is predicted to score 50.91% of the points -- essentially a toss-up.

Although we know Seattle ended up dominating that game, there is also an interesting methodological question: if the teams are so far apart in ratings, why is the predicted outcome so close? The answer lies in the logistic function used to make the prediction. Because we set the logistic parameter to 1000, most games will be predicted to be pretty close: the difference in ratings has to be 1000 before team i becomes ten times more likely to beat team j. We could always experiment with setting this to a different constant.
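To see how xi controls the spread of the predictions, compare the same hypothetical 200-point rating gap under the xi = 1000 used here and the chess-style xi = 400 (a quick sketch):

```python
def expected_score(d, xi):
    """Expected point proportion for a rating gap d under logistic parameter xi."""
    return 1.0 / (1.0 + 10 ** (-d / xi))

# Same rating gap, two different logistic parameters:
for xi in (1000, 400):
    print(xi, expected_score(200, xi))
```

A smaller xi turns the same rating gap into a more lopsided prediction, which is why chess-style ratings separate players more sharply than these do.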

Next up, Markov ratings!
