Projections 101 - Let's build a projection system


A step-by-step description of building a rudimentary projection system.

Projections by nature are flawed, yet can be incredibly useful when used with other data sources to draw a conclusion about the potential future performance of a player. When considering projections, it is important to have a fundamental understanding of how they work so you can most accurately interpret them for any given player and weigh them appropriately with other information to make inferences.

The purpose of this article is to provide a digestible overview of how projection systems can work by taking you step by step through the process of creating a very simple, rudimentary system. It is not intended to describe how popular systems such as PECOTA or ZiPS work, because those systems are much more complex than what is presented here. The idea is not to create the most accurate system, but to provide a conceptual understanding of the statistics behind projections, which reveals both why they are useful and why they have limitations. The ultimate goal is that, across a series of articles, these projections will gradually become more complex, and you will learn more by following along with the fun as I explain the process. I am not an expert on projections by any means, but I have a general understanding of how they can work; these articles will basically be a first-person account of my exploring them on my own and ultimately creating an actual system.

We are going to build a very basic projection system to forecast three offensive variables: base hits (H), home runs (HR) and weighted on-base average (wOBA). My philosophy behind this system is that a player's production in any given year can be predicted from the prior year's result by observing how players have historically performed from one year to the next.

Basically, I am attempting to forecast the high temperature for tomorrow based on the high temperature of today by looking at historical data that tells me what the high was on these same two dates each year across time. If, on average, the high for tomorrow is one degree cooler than today, the projection system will take the recorded high for today and subtract one degree to arrive at a prediction for tomorrow. It is a very basic method, but a great place to start to learn more about how different statistics can vary from year to year and understand which are more likely to be predicted accurately.
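The weather analogy can be sketched in a few lines of Python. The temperatures here are invented for illustration, and the function name is hypothetical.

```python
def average_change_forecast(history, today):
    """Forecast tomorrow's high from today's high.

    history: list of (high_today, high_tomorrow) pairs, one per past year.
    The forecast is today's high plus the average historical change.
    """
    changes = [tomorrow - past_today for past_today, tomorrow in history]
    mean_change = sum(changes) / len(changes)
    return today + mean_change


# Invented data: in past years the next day ran one degree cooler on average.
past_years = [(70, 69), (65, 63), (72, 72), (68, 67)]
forecast = average_change_forecast(past_years, today=71)
```

With these made-up numbers the average change is one degree cooler, so a high of 71 today gives a forecast of 70 for tomorrow.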

The first step is to obtain the needed historical player data, which came from Sean Lahman's website and my own database manipulation. The sample used to create the projections consists of all players with 100 or more plate appearances in each of their first ten major league seasons between 1871 and 2015, which results in 1,428 players. To control for the effects of playing time and injury, hits and home runs were converted to rate statistics per plate appearance. The database was set up to have one record for each player with H per PA, HR per PA and wOBA for each season. The result is a database like the sample below, which shows H per PA across the first ten seasons for each player.

Spreadsheet Sample - Hits per PA David Rand
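As a rough sketch of that preparation step, the sample filter and the rate conversion might look like this in Python. The function names and the example numbers are hypothetical and are not taken from the Lahman data; wOBA is left out because its calculation is not covered here.

```python
MIN_PA = 100       # minimum plate appearances per season, as described above
NUM_SEASONS = 10   # only the first ten seasons are used


def qualifies(pa_by_season):
    """True if a player has 100+ PA in each of the first ten seasons."""
    first_ten = pa_by_season[:NUM_SEASONS]
    return len(first_ten) == NUM_SEASONS and all(pa >= MIN_PA for pa in first_ten)


def to_rates(hits, home_runs, plate_appearances):
    """Convert counting stats to per-plate-appearance rates."""
    if plate_appearances <= 0:
        raise ValueError("plate_appearances must be positive")
    return {
        "H_per_PA": hits / plate_appearances,
        "HR_per_PA": home_runs / plate_appearances,
    }


# Invented season line: 150 hits and 20 home runs in 600 plate appearances.
example = to_rates(hits=150, home_runs=20, plate_appearances=600)
```

Working with rates means a player who missed half a season is compared on the same footing as one who played every day.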

The first thing to look at is how closely each year's results are related to the following year's. This can be done by looking at the correlation between Year 1 and Year 2, Year 2 and Year 3, Year 3 and Year 4, etc., which gives an idea of how well one year of performance can predict the next. The correlation coefficients (r) for H per PA are listed below along with the mean for each year. In general, a correlation of 0.7 or higher can be considered a strong relationship; for reference, the correlation between height and weight among all baseball players between 1871 and 2015 is .692, just under that threshold.

Means and Correlations - Hits per PA David Rand

The observed correlations for H per PA range from an r of .409 between Year 1 and Year 2 to .567 between Year 6 and Year 7, and all of the comparisons can be considered statistically significant. Statistical significance indicates that the relationship in the data is unlikely to be explained by random chance alone. When you combine all of the yearly comparisons into one lump comparison, you end up with an r of .515, which is statistically significant but less than what is generally considered a strong correlation.
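For readers who want to see what goes into an r value, here is a minimal pure-Python sketch of the Pearson correlation coefficient. It illustrates the standard formula and is not the tool used to produce the tables in this article; the function name and the sample numbers are made up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a list is constant")
    return sxy / sqrt(sxx * syy)


# Invented values standing in for Year 1 and Year 2 rates of four players.
year_1 = [1, 2, 3, 4]
year_2 = [1, 3, 2, 4]
r = pearson_r(year_1, year_2)
```

An r near 1 means players keep roughly the same ranking from one year to the next, while an r near 0 means the first year says little about the second.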

This example highlights one of the fundamental limitations of projection systems: individual player performance varies from year to year and is subject to large fluctuations that cannot be predicted. The r of .515 means only about 26% of the variance in H per PA can be accounted for, or predicted, by the performance of the prior year. That means, statistically, 74% of the variance in H per PA is a result of other factors that aren't necessarily captured by the prior year's performance. It is easy to see how projections can be limited in predicting future outcomes simply by the nature of the variance in different offensive outcomes.
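The step from r to "variance accounted for" is just squaring the correlation. A small sketch, with a hypothetical function name, using the aggregate r for H per PA:

```python
def variance_explained(r):
    """Share of variance accounted for in a one-predictor linear
    relationship, which is the square of the correlation (r squared)."""
    return r ** 2


r_hits = 0.515                     # aggregate year-to-year r for H per PA
explained = variance_explained(r_hits)
unexplained = 1 - explained        # share left to other factors
```

Because the value is squared, even a moderate correlation leaves most of the variance unexplained, which is why a single prior season is a limited predictor.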

Among the three variables being considered, H per PA has the lowest year-to-year correlation. One could conclude that this reinforces the idea that BABIP-fueled batting averages are subject to regression to the mean. Below are the correlation tables for HR per PA and wOBA.

Means and Correlations - HR per PA David Rand

Means and Correlations - wOBA Seasons 1-10 David Rand

These offensive outcome data have a closer relationship from year to year than hits. In particular, home run rates have strong correlations, ranging from .719 between Year 1 and Year 2 to .813 between Year 4 and Year 5. When all of the yearly comparisons are aggregated into one number, you get an r of .784. With wOBA, the coefficients range from .508 between Year 1 and Year 2 to .683 between Year 6 and Year 7 and the aggregate coefficient among all years of comparison is .632.

This means that, in general, about 61% of the variance in HR per PA (.784 squared) and approximately 40% of the variance in wOBA (.632 squared) can be explained, or predicted, by the prior year's performance. The higher the correlation between the variables, the better your chances are of using one to predict the other.

Now that statistical significance of the year to year relationships of these variables has been established, the next step is to determine the average year to year change across the ten seasons. This can be done by creating a scatter plot which represents Year 1 (X) performance on the x-axis and Year 2 (X+1) performance on the y-axis. This means in the Year 2-3 comparison, Year 2 becomes Year 1 (X) on the x-axis and Year 3 becomes Year 2 (X+1) on the y-axis; in the Year 6-7 comparison, Year 6 becomes Year 1 (X) on the x-axis and Year 7 becomes Year 2 (X+1) on the y-axis. The resulting scatter plots for the three offensive outcomes are presented below. Each black circle represents a player's intersection of their Year 1 (X) baseline and subsequent Year 2 (X+1) result.

Scatterplot - Hits per PA - Year X vs Year X+1 David Rand

Scatterplot - HR per PA 2 David Rand

Scatterplot - wOBA David Rand
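The re-labeling of every consecutive pair of seasons as (X, X+1) can be sketched as follows. The function names and the player data are hypothetical; this is one way to build the pooled points, not necessarily how the plots above were produced.

```python
def consecutive_pairs(seasons):
    """Turn one player's season values into (Year X, Year X+1) pairs."""
    return list(zip(seasons[:-1], seasons[1:]))


def pool_pairs(players):
    """Pool the (X, X+1) pairs of every player into one list of points."""
    pooled = []
    for seasons in players.values():
        pooled.extend(consecutive_pairs(seasons))
    return pooled


# Invented H per PA values for two made-up players (three seasons each).
players = {
    "player_a": [0.240, 0.255, 0.250],
    "player_b": [0.270, 0.260, 0.265],
}
points = pool_pairs(players)
```

A player with ten seasons contributes nine points, so if every player in a 1,428-player sample contributes all nine pairs, each scatter plot holds 12,852 points.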

The idea behind this method is that the data are correlated highly enough to form a noticeable cluster pattern when plotted on a graph. A trend line representing the statistical average of each data set can then be drawn through the data as a straight (linear) line. More sophisticated trend lines can be plotted (such as the curved quadratic line above), but for these purposes we are making linear projections.
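A straight trend line of this kind is commonly found with ordinary least squares. The sketch below is a plain-Python stand-in for that calculation, assuming least squares is the method behind the linear trend line; it is not the software procedure used for this article, and the function name and sample points are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of a straight line through (x, y) points.

    Returns (slope, intercept) of the line that minimizes the squared
    vertical distances between the points and the line.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented points: Year X values on the x-axis, Year X+1 on the y-axis.
slope, intercept = fit_line([1, 2, 3, 4], [1, 3, 2, 4])
```

The slope says how much of a difference in Year X carries over to Year X+1, and the intercept is the constant added to every projection.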

Any straight line on an x-y axis can be represented by the equation y = mx + b. Year 2 (X+1) is represented by y, Year 1 (X) is represented by x, m is the slope (how much y changes for each one-unit change in x, which tells you whether the line goes up or down) and b is a mathematical constant (a pre-determined number that you add to every calculation regardless of x). I am taking the linear trend line from each of the three scatter plots above and converting it to a mathematical equation. I won't bore you with the details of what I did in order to find the equations for each trend line, but you can do a Google search for 'create linear regression equation SPSS' if you are interested. The result means that we can project Year 2 (X+1) outcomes of H per PA, HR per PA and wOBA based on the observed Year 1 (X) results by using the math formulas below.

Hits per Plate Appearance (H per PA): y = .514*x + .119

Home Runs per Plate Appearance (HR per PA): y = .792*x + .004

Weighted on Base Average (wOBA): y = .632*x + .126
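Plugging a prior-year value into these formulas is a one-line calculation. Here is a minimal sketch; the dictionary keys and the function name are my own labels, while the slopes and constants are the ones listed above.

```python
# (slope, constant) for each formula listed above.
COEFFICIENTS = {
    "H_per_PA": (0.514, 0.119),
    "HR_per_PA": (0.792, 0.004),
    "wOBA": (0.632, 0.126),
}


def project(stat, year_x_value):
    """Project the Year X+1 value of a stat from its Year X value."""
    slope, constant = COEFFICIENTS[stat]
    return slope * year_x_value + constant


low = project("wOBA", 0.300)    # 0.632 * 0.300 + 0.126
high = project("wOBA", 0.400)   # 0.632 * 0.400 + 0.126
```

A .300 wOBA projects to about .316 and a .400 wOBA projects to about .379. Because every slope is less than 1, low values are nudged up and high values are pulled down.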

To check how this rudimentary process for projecting these outcomes turned out in a quick and dirty way, we will use the Royals' individual 2015 batting statistics to predict individual performance for 2016. Below are the players with at least 100 plate appearances in both 2015 and 2016 along with their 2015 actual performance, 2016 predicted performance and 2016 actual performance. Note that instead of predicting playing time, the rate statistics for H per PA and HR per PA were multiplied by the actual number of PA for 2016 to yield results.

Predicted 2016 vs. Actual - H, HR and wOBA David Rand

***Note, all 2016 statistics presented are based on games played through September 28th and don't include the final four games of the year***
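The conversion from a projected rate back to a season total can be sketched like this. The function name and the example player are hypothetical; the slope and constant are the H per PA values from the formula above.

```python
H_SLOPE, H_CONSTANT = 0.514, 0.119   # H per PA formula from above


def project_total(prior_count, prior_pa, next_pa, slope, constant):
    """Project a counting stat for next season.

    1. Convert last season's count to a per-PA rate.
    2. Apply the linear formula to get next season's projected rate.
    3. Multiply by next season's actual plate appearances.
    """
    prior_rate = prior_count / prior_pa
    projected_rate = slope * prior_rate + constant
    return projected_rate * next_pa


# Invented player: 150 hits in 600 PA last year, 500 PA this year.
projected_hits = project_total(150, 600, 500, H_SLOPE, H_CONSTANT)
```

For this made-up player the projected rate is .2475 hits per PA, or 123.75 hits over 500 plate appearances. Using actual plate appearances keeps playing-time guesses out of the comparison.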

The results show that even a basic projection system like the one we created can give you some idea about a player's season performance based on the previous season's result. Total hits were predicted within +/- 10 hits for all players except for Paulo Orlando and Alex Gordon. Total home runs were within +/- 3 for six players and within +/- 4 for another two. The wOBA predictions were within +/- .02 for seven of the 11 players.
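A quick way to tally results like these is to count how many predictions land within a chosen tolerance. The sketch below uses invented numbers, not the Royals data, and hypothetical function names.

```python
def count_within(predicted, actual, tolerance):
    """Count predictions whose absolute error is within the tolerance."""
    return sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)


def mean_absolute_error(predicted, actual):
    """Average size of the misses, ignoring their direction."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)


# Invented hit totals for four made-up players.
predicted_hits = [120, 95, 140, 60]
actual_hits = [125, 110, 138, 61]

hits_within_10 = count_within(predicted_hits, actual_hits, tolerance=10)
average_miss = mean_absolute_error(predicted_hits, actual_hits)
```

In this made-up example three of the four predictions fall within 10 hits and the average miss is 5.75 hits. The same tally could be run with a tolerance of 3 for home runs or .02 for wOBA.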

In conclusion, the very nature of making projections in baseball is difficult because there is a lot of variance in performance from year to year. We have created a very basic projection system for a limited number of variables and it did a somewhat reasonable job of predicting actual performance when looking at one sample of 11 players. This is far from scientific, but provides a general idea of the math on which projection systems can be based.

In order to improve the predictive validity of our projection system, data input points must be found that result in higher correlations with output performance. Finding ways to increase the accuracy of the projections is now a matter of exploration and experimentation. This is the big secret that statisticians won't tell you: start with something born out of actual mathematics, then guess and check until you get it right.

Can accuracy be increased if two or three years of input are used to predict a year of output? Was the cut point for players being included in the sample to generate the projections too low since it was only set at 100 plate appearances? Would it be more accurate to set the cut point at 250 plate appearances? Are the projections too dependent on league averages or is there room for additional regression to league average in our system? What if the extreme outliers were removed from the sample?

These are all questions to be considered, among others, to determine next steps for this exercise. If this has proven to be informative and has helped you better understand projections, please vote in the poll and let me know what your next step would be to make the projection system more accurate. If your conclusion is that you can guess off the top of your head as well as what was shown above, please share your opinions in the comments below. The purpose of this is to explore the middle ground in a polarizing issue.

Crazy, right?

Limited Use License - NOTE* - Data in this article are based on downloads from Sean Lahman's site. This database is copyright 1996-2016 by Sean Lahman. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.  For details see: