
Reproducible research and sports

Sat 08 February 2014

Good science requires transparency

Allow me to digress from models and plots for a minute to address an important topic. If you read the spread, chances are you're familiar with the Sloan Sports Analytics Conference. It's a yearly gathering of people from sports franchises, academia, and industry. It's part research conference, part vendor gathering, and part social networking. Sloan is also seen as the premier venue for discussing and presenting sports analytics work. I'm tentatively planning to be there this year myself, pending the resolution of some ticket snafus. What could be wrong with that?

I was therefore distressed to see this post by Michael Lopez, a graduate student in biostatistics and contributor to Regressing, Deadspin's sports science blog. To summarize, Michael submitted a research paper to the Sloan paper competition, which boasts a top prize of $20,000, free admission to the conference (roughly a $500 value), and a chance to present your work to some of the best and brightest in sports.

Through a series of confusing emails, Michael eventually works out that a) his paper was not accepted, but that he can present a poster, b) there is no prize money for this, c) there is no free admission for this, and d) there are no more tickets remaining to the conference.

He digs further in and finds that the authors of the winning papers have an unusually high rate of association with either the conference's sponsors (ESPN and its parent company Disney) or the conference's host school (MIT). He also notes that six of the eight winning papers use proprietary data.

In his post, Michael explains why this is a problem. When research is not reproducible, it is difficult to verify its findings. How many models did the authors estimate before arriving at the one in the paper? Are the results robust to different model specifications? What happens when you include or exclude various variables? Without the data and the code, these questions are unanswerable.
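To make that concrete, here is a minimal sketch of the kind of robustness check a reviewer might run: fit the same outcome under several model specifications and watch how the coefficient of interest moves. The data and variable names below are invented purely for illustration; the point is that this check takes a dozen lines when data and code are open, and is impossible when they are not.

```python
# Illustrative robustness check across model specifications.
# All data and variable names are made up for demonstration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "points": rng.normal(100, 10, n),
    "pace": rng.normal(95, 5, n),
    "rest_days": rng.integers(0, 4, n),
})
# Simulated outcome with a known relationship plus noise.
df["win_margin"] = 0.5 * df["points"] - 0.3 * df["pace"] + rng.normal(0, 5, n)

# Fit the same outcome under increasingly rich specifications and
# compare the coefficient of interest. If it swings wildly as
# variables enter or leave the model, be skeptical of the result.
specs = [
    "win_margin ~ points",
    "win_margin ~ points + pace",
    "win_margin ~ points + pace + rest_days",
]
for spec in specs:
    fit = smf.ols(spec, data=df).fit()
    print(f"{spec:45s} coef on points = {fit.params['points']:.3f}")
```

A stable coefficient across specifications doesn't prove a finding, but a coefficient that flips sign or collapses is a red flag, and readers of a closed, proprietary paper have no way to run even this basic check.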

As I pointed out in the very first post on the spread:

When advanced analysis *is* conducted, it's often behind closed doors. Understandably, teams want to preserve any edge they find. However, this is not only bad for the analytics community, it's bad for the advancement of football analytics. As Ben Alamar pointed out on the Advanced NFL Stats podcast, without peer review, isolated analysts often have no objective check on the quality of their work.

You might be asking yourself if this is just an esoteric point made by out-of-touch academics who don't understand how the 'real world' works. The answer is no. I know for a fact that professional sports organizations use the findings presented at Sloan. The conference was cofounded by the general manager of the Houston Rockets, and I've done work for teams who have referenced work presented there.

When research is proprietary and not reproducible, you should be very careful how much stock you put in its findings. I understand the confidential nature of the data and the need to make a profit; I work in industry. But one need only look at the tools the very best in data science are using to understand the value of transparency: R, Python, scikit-learn, pandas, etc. These are all open-source software packages. The best data scientists are moving away from expensive, proprietary tools. Has data science suffered? On the contrary. Some of the most successful tech firms have open sourced large parts of their workflows. Releasing Hive didn't mean a copycat Facebook was able to open up shop.

Sports organizations want advanced analytics capabilities so they can make better decisions and get an edge over their competition. The way to get that edge is not to have someone write a proprietary paper and then say 'trust me on the findings.' That's how bad decisions get made.

Postscript: Here is some coverage of Michael's post on Deadspin.
