DFS Baseball Spring Training -- Predicting Runs
Baseball. Just typing that word puts a smile on my face. Before we know it, pitchers and catchers will be reporting, followed by spring training. Then suddenly it is Opening Day, we are blessed with Kevin Roth and his weather reports, and we are all deciding which ace to pick. (Hint: Zack Greinke gets a home game against the Rockies in his first start as a Diamondback.) With the predictable lineups and known probable starters, the world of MLB DFS cannot come soon enough. Whether you are automating your research process, analyzing your previous results, or scouring the internet for baseball material, I hope to play a part in improving your game.
I began playing DFS as a hobby last spring. I have always loved the game of baseball, and learning the ins and outs has been an adventure. I firmly believe one should never stop learning, and I hope to both learn and maybe even educate as we progress towards the season. This series will attempt to cover all topics of daily fantasy baseball as we try to discover new techniques, or even new methods of researching players. I hope you enjoy the ride, and please leave any questions, comments or concerns.
Part I – Predicting Runs
The original plan was to work on predicting when a pitcher’s ERA would regress to his xFIP level. However, I was swamped with schoolwork towards the end of the semester and didn’t have the time to get the research together. One of my final projects, for an introductory course in Python, R and MATLAB, was to use RStudio to perform some form of data analytics. Naturally, I selected something baseball related. The project called for some form of linear regression and, ideally, machine learning. The first was a success; the latter was not. My aim was to show that old-style statistics such as batting average and on-base percentage are not as good at predicting runs as sabermetric stats such as wOBA, wRC and ISO.
First, a way of measuring runs was needed. I chose to create my own measure. I call it RAPA, or runs accounted per plate appearance. It is calculated by taking the sum of a player’s runs and RBIs and subtracting his home runs, to avoid double counting (a home run credits the hitter with both a run and an RBI for the same run). The result is then divided by the player’s plate appearances. The ratio was created to avoid any bias against players who were injured or did not play every day. That bias was already limited in this sample, as I only used the 143 hitters FanGraphs listed as qualified for the 2015 season. As will be seen, RAPA is a bit biased by how a player’s team performed. Runs and RBIs are easier to attain when your team puts more runners on base and has better hitters to drive you in.
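For anyone who wants to compute RAPA themselves, here is a minimal Python sketch of the definition above. The function name, argument names and the sample stat line are my own illustrations, not the original RStudio code, and the numbers are made up rather than taken from a real player.

```python
def rapa(runs, rbi, home_runs, plate_appearances):
    """Runs accounted per plate appearance: (R + RBI - HR) / PA."""
    if plate_appearances <= 0:
        raise ValueError("plate_appearances must be positive")
    # Subtract home runs so a homer is not counted as both a run and an RBI.
    return (runs + rbi - home_runs) / plate_appearances


# Hypothetical stat line, not a real 2015 season.
example = rapa(runs=90, rbi=100, home_runs=30, plate_appearances=640)
print(round(example, 3))  # (90 + 100 - 30) / 640 = 0.25
```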
It was then time to identify the best statistic to use. To decide, I chose to compare the r-squared value of each statistic against RAPA. In short, r-squared measures how close the data is to the fit line created by the model. It ranges from 0 to 1, with 1 being an exact fit. To learn more, check out the wiki page. Batting average had the lowest r-squared value, while wOBA had the highest at .54. The graph below shows how the stats measured up against one another in terms of r-squared values.
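To make the comparison concrete, here is a rough sketch of how each stat could be scored against RAPA with a one-variable least-squares fit. It is written in plain Python rather than the R used for the project, and the five data points per stat are invented for illustration, so the printed ranking says nothing about the real 2015 numbers.

```python
def r_squared(x, y):
    """R-squared of a one-variable least-squares line fit of y on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_res / syy


# Made-up values for five hypothetical hitters, NOT the 2015 sample.
rapa_values = [0.20, 0.22, 0.25, 0.27, 0.30]
stats = {
    "AVG": [0.280, 0.250, 0.300, 0.260, 0.290],
    "wOBA": [0.310, 0.330, 0.350, 0.365, 0.400],
}

# Rank the stats from best to worst fit against RAPA.
ranking = sorted(stats, key=lambda name: r_squared(stats[name], rapa_values),
                 reverse=True)
print(ranking)
```

With the real data, the same loop would simply run over one column per stat for all 143 qualified hitters.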
Last season I would constantly see people debating the best stat to use when selecting hitters. I believe this shows that if you are only going to look at one stat, wOBA would be the best, with wRC right behind it. My initial reaction was that we don’t just rely on runs and RBIs for DFS purposes; we also need doubles, triples and home runs. Fortunately, wOBA accounts for these statistics (the formula is here), and from the data we can conclude that, of the stats I tested, wOBA is the best single stat to use in evaluating hitters each day. We still need to account for statistics like BABIP, K%, and BB%, but for quick analysis and building a foundation for each day, wOBA is our best bet.
From here I went on to perform a multivariate regression to create a model. I ended up using five statistics and reached an adjusted R-squared value of .64. To tie everything together, I then looked for players whose 2015 RAPA fell well short of what their peripherals suggested. The three players were Joey Votto, Nelson Cruz, and Joc Pederson. Based on their wOBA, wRC and OBP, these players vastly underperformed in RAPA for the 2015 season. For the assignment I concluded that Cruz and Votto had the same fit value, yet Votto’s salary was $6 million higher, and that Pederson was worth about as much as Melky Cabrera, who made $8 million in 2015. Pederson was a rookie making only $500k. The idea of the conclusion was that a small-market team could use the model to find value players. Then, using their fit values, I found their expected runs for the 2015 season versus their actual, as seen in the graph below.
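A quick note on the adjusted R-squared mentioned above: it is the plain R-squared penalized for the number of predictors, so adding more stats to the model does not automatically look like an improvement. The sketch below uses the standard formula. The sample size of 143 and the five predictors come from this article, but the plain R-squared of 0.65 fed into it is a hypothetical placeholder, since I only reported the adjusted value.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# 143 hitters and 5 predictors as in the article; 0.65 is a made-up plain R^2.
print(adjusted_r_squared(0.65, n_samples=143, n_predictors=5))
```

With 143 hitters and only five stats the penalty is small, which is why the adjusted and plain values sit close together in a sample like this one.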
Each player should have accounted for about 30 more runs in the 2015 season than he actually did. As previously stated, RAPA is somewhat biased in that it also depends on team performance. I went and looked at this, and at least for this sample my thinking was correct. Each player’s team scored well below the league-average number of runs in 2015.
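Here is roughly how the expected-versus-actual comparison works, assuming the fit value is the model’s predicted RAPA and that expected runs accounted is that prediction multiplied by plate appearances. That reading, the function names and the stat line below are illustrative assumptions, and the numbers do not belong to Votto, Cruz or Pederson.

```python
def expected_runs_accounted(fitted_rapa, plate_appearances):
    """Expected runs accounted, assuming fitted RAPA times PA."""
    return fitted_rapa * plate_appearances


def actual_runs_accounted(runs, rbi, home_runs):
    """Actual runs accounted: R + RBI - HR (the numerator of RAPA)."""
    return runs + rbi - home_runs


# Hypothetical hitter, not one of the three players discussed above.
expected = expected_runs_accounted(fitted_rapa=0.30, plate_appearances=600)
actual = actual_runs_accounted(runs=80, rbi=95, home_runs=25)
shortfall = expected - actual  # positive means the player underperformed the model
print(round(expected), actual, round(shortfall))
```

A large positive shortfall is the signal I was looking for: the peripherals say the production should have been there, so the gap may be down to team context rather than the hitter.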
My thinking is to use the same idea for DFS baseball. Based on a player’s stats, a DFS-specific model can give us a better understanding of when a player is truly underpriced. Finding underpriced players is vital in roster construction: it can free up enough salary to pay up for pitching in cash games and uncover the low-owned players needed to take down a GPP. The next step is to look at several players from the 2015 season and compare their wOBA to their DK points. Since the data may be hard to obtain, I plan on using players whose splits are almost equal. Let me know what you guys think, and please feel free to post any questions!