Using R Programming to Analyze Baseball: Part 1


The following analysis explores batting performance across eight historical eras of MLB. My goal is to study how the game has changed over time. The eras used for this analysis are the Dead Ball era (1901-1920), the Live Ball era (1921-1942), the Integration era (1943-1961), the Expansion era (1962-1977), the Free Agent era (1978-1994), the Steroid era (1995-2004), the Analytics era (2005-2015), and the Three True Outcomes era (2016-present). SLG, OBP, walks, and home runs are the batting metrics I will use to compare batting performance.


I used the Batting data frame from the Lahman package to compare the different eras. To see how strongly batting statistics vary with era, I separated the data in Batting into eight groups, each labeled with the name of its era. Using the dplyr package in R, I assigned each observation a color based on its era, and then used the ggplot2 package to plot the data across multiple graphs. Looking at each graph, I was able to judge whether the historical eras correlated with batting statistics.


library(Lahman)   # provides the Batting data frame
library(dplyr)    # filter(), mutate(), and the %>% pipe
library(ggplot2)  # plotting

head(Batting)
#  playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP
#1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0 0 NA NA NA NA 0
#2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4 0 NA NA NA NA 0
#3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2 5 NA NA NA NA 1
#4 allisdo01 1871 1 WS3 NA 27 133 28 44 10 2 2 27 1 1 0 2 NA NA NA NA 0
#5 ansonca01 1871 1 RC1 NA 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA 0
#6 armstbo01 1871 1 FW1 NA 12 49 9 11 2 1 0 5 0 1 0 1 NA NA NA NA 0

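# Keep only seasons from 1901 onward
batting <- Batting %>% filter(yearID > 1900)

# Assign each season to an era. cut() intervals are right-closed, so the
# break at 2015 keeps 2015 in Analytics and starts Three.True.Outcomes in
# 2016. Inf as the last break keeps the most recent seasons labeled.
batting_eras <- cut(batting$yearID,
                    breaks = c(1900, 1920, 1942, 1961, 1977, 1994, 2004, 2015, Inf),
                    labels = c("Dead.Ball", "Live.Ball", "Integration", "Expansion",
                               "Free.Agent", "Steroid", "Analytics", "Three.True.Outcomes"))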


levels(batting_eras)
#[1] "Dead.Ball" "Live.Ball" "Integration"
#[4] "Expansion" "Free.Agent" "Steroid"
#[7] "Analytics" "Three.True.Outcomes"
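As a quick sanity check on the era boundaries, the sketch below applies `cut()` to a handful of boundary seasons. It assumes the era ranges listed in the introduction (so the Analytics era ends in 2015), and the helper name `era_of` is illustrative rather than part of the analysis itself.

```r
# Illustrative helper: map seasons to era labels.
# Breaks follow the era ranges given in the introduction; cut() intervals are
# right-closed by default, so 1920 falls in Dead.Ball and 1921 in Live.Ball.
era_of <- function(year) {
  cut(year,
      breaks = c(1900, 1920, 1942, 1961, 1977, 1994, 2004, 2015, Inf),
      labels = c("Dead.Ball", "Live.Ball", "Integration", "Expansion",
                 "Free.Agent", "Steroid", "Analytics", "Three.True.Outcomes"))
}

# First and last season of several eras
boundary_years <- c(1901, 1920, 1921, 1994, 1995, 2004, 2005, 2015, 2016)
data.frame(year = boundary_years, era = era_of(boundary_years))
```

Any season that comes back as `NA` here falls outside the breaks, which is an easy way to spot an off-by-one in the boundaries.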


# Attach the era label to each row so the plots can color by era
batting$eras <- batting_eras

head(batting)
#  playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB
#1 anderjo01 1901 1 MLA AL 138 576 90 190 46 7 8 99 35 NA 24 21 NA
#2 bakerbo01 1901 1 CLE AL 1 4 0 0 0 0 0 0 0 NA 0 0 NA
#3 bakerbo01 1901 2 PHA AL 1 3 0 1 0 0 0 1 0 NA 0 0 NA
#4 barreji01 1901 1 DET AL 135 542 110 159 16 9 4 65 26 NA 76 64 NA
#5 barrysh01 1901 1 BSN NL 11 40 3 7 2 0 0 6 1 NA 2 3 NA
#6 barrysh01 1901 2 PHI NL 67 252 35 62 10 0 1 22 13 NA 15 22 NA
#  HBP SH SF GIDP eras
#1 3 4 NA NA Dead.Ball
#2 0 0 NA NA Dead.Ball
#3 0 0 NA NA Dead.Ball
#4 5 7 NA NA Dead.Ball
#5 1 0 NA NA Dead.Ball
#6 2 12 NA NA Dead.Ball

# One color per era, in the same order as the era labels
cbPalette <- c("#00e64d", "#00e6c4", "#00c0e6", "#0049e6",
               "#a100e6", "#e60036", "#e6d300", "#07023b")

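top_OBP <- batting %>%
  filter(AB > 502) %>%
  # PA definition assumed: AB + BB + HBP + SH + SF, with unrecorded (NA)
  # values counted as 0 so that early seasons are not dropped
  mutate(PA = AB + BB + coalesce(HBP, 0L) + coalesce(SH, 0L) + coalesce(SF, 0L),
         OBP = (H + BB + coalesce(HBP, 0L)) / PA) %>%
  group_by(yearID) %>%
  arrange(desc(OBP)) %>%
  ggplot(., aes(x = yearID, y = OBP)) +
  geom_point(aes(colour = eras), pch = 19) +
  scale_colour_manual(values = cbPalette) +
  theme_classic() +
  ggtitle("On-Base Percentage by Era")
top_OBP  # display the plot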
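To make the OBP formula concrete, here is a small worked example on the first 1901 row shown in the data preview above (anderjo01). The function names `na0` and `obp` are illustrative, and both the plate-appearance definition (AB + BB + HBP + SH + SF) and the choice to count unrecorded values as zero are assumptions made for this sketch.

```r
# Illustrative only: on-base percentage with PA assumed to be
# AB + BB + HBP + SH + SF, treating NA (unrecorded) counts as 0.
na0 <- function(x) ifelse(is.na(x), 0, x)

obp <- function(H, BB, HBP, AB, SH, SF) {
  pa <- AB + BB + na0(HBP) + na0(SH) + na0(SF)
  (H + BB + na0(HBP)) / pa
}

# anderjo01, 1901: AB = 576, H = 190, BB = 24, HBP = 3, SH = 4, SF = NA
anderson_obp <- obp(H = 190, BB = 24, HBP = 3, AB = 576, SH = 4, SF = NA)
round(anderson_obp, 3)
```

Under these assumptions the calculation is 217 / 607, or roughly .357. Without the `NA` handling the result would be `NA`, because sacrifice flies were not recorded in 1901.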



Walks <- batting %>%
  group_by(yearID) %>%
  arrange(desc(BB)) %>%
  ggplot(., aes(x = yearID, y = BB)) +
  geom_point(aes(colour = eras), pch = 19) +
  scale_colour_manual(values = cbPalette) +
  theme_classic() +
  ggtitle("Walks by Era")
Walks  # display the plot



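SLG <- batting %>%
  filter(AB > 502) %>%
  # Total bases per at-bat: singles + 2 * doubles + 3 * triples + 4 * home runs
  mutate(SLG = (H - X2B - X3B - HR + X2B * 2 + X3B * 3 + HR * 4) / AB) %>%
  group_by(yearID) %>%
  arrange(desc(SLG)) %>%
  ggplot(., aes(x = yearID, y = SLG)) +
  geom_point(aes(colour = eras), pch = 19) +
  scale_colour_manual(values = cbPalette) +
  theme_classic() +
  ggtitle("Slugging Percentage by Era")
SLG  # display the plot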
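The slugging formula can be checked by hand in the same way. This sketch uses the first 1901 row from the data preview (anderjo01); the function name `slg` is illustrative.

```r
# Illustrative only: slugging percentage as total bases per at-bat.
# Singles are hits that were not doubles, triples, or home runs.
slg <- function(H, X2B, X3B, HR, AB) {
  singles <- H - X2B - X3B - HR
  total_bases <- singles + 2 * X2B + 3 * X3B + 4 * HR
  total_bases / AB
}

# anderjo01, 1901: AB = 576, H = 190, X2B = 46, X3B = 7, HR = 8
anderson_slg <- slg(H = 190, X2B = 46, X3B = 7, HR = 8, AB = 576)
round(anderson_slg, 3)
```

Here the 129 singles plus extra-base hits give 274 total bases, so the result is 274 / 576, or roughly .476.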



StolenBases <- batting %>%
  group_by(yearID) %>%
  arrange(desc(SB)) %>%
  ggplot(., aes(x = yearID, y = SB)) +
  geom_point(aes(colour = eras), pch = 19) +
  scale_colour_manual(values = cbPalette) +
  theme_classic() +
  ggtitle("Stolen Bases by Era")
StolenBases  # display the plot




Figure 1, surprisingly, shows very little change in OBP across the eras. This is surprising because in recent years hitters with high OBPs have become highly valued, and teams have fundamentally changed to value OBP more than batting average. This lack of change in OBP can be explained by pitching performance. Although MLB teams have focused on acquiring and playing hitters with high OBPs, pitchers have greatly improved as well, with the average velocity of an MLB pitcher steadily going up. Had hitters continued to face the same level of pitching across the eras, OBP most likely would have risen; however, as hitters and game strategies got better, so did pitchers, resulting in a similar OBP across the eight eras.

Figure 2 shows very little change in walks per player over time. This can also be attributed to pitching performance. Although teams have begun to prioritize the walk, pitchers have become much better, resulting in a roughly constant number of walks across the seasons.
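One way to check a "little change" reading numerically, rather than by eye, is to average a statistic within each era. The sketch below only illustrates the mechanics: `era_mean` is a hypothetical helper, and the small data frame holds made-up values, not real Lahman data.

```r
# Hypothetical helper: mean of one statistic per era, ignoring missing values.
era_mean <- function(df, stat) {
  out <- aggregate(df[[stat]], by = list(eras = df$eras),
                   FUN = mean, na.rm = TRUE)
  names(out)[2] <- paste0("mean_", stat)
  out
}

# Made-up rows for illustration only (not real player seasons)
toy <- data.frame(
  eras = factor(c("Dead.Ball", "Dead.Ball", "Live.Ball", "Live.Ball"),
                levels = c("Dead.Ball", "Live.Ball")),
  BB   = c(20, 40, 30, 50)
)

walks_by_era <- era_mean(toy, "BB")
walks_by_era
```

On the real data the call would be `era_mean(batting, "BB")`, assuming `batting` carries the `eras` column used to color the plots. Similar per-era means would support the reading of the figure; large gaps would call it into question.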

Figure 3 shows very little change in slugging percentage across the history of MLB. As with OBP, this can be attributed to pitching performance.


Figure 4 shows a sharp jump in home runs between the Dead Ball era and the Live Ball era, followed by a steady increase in home runs per player until the Steroid era, when home runs per player jumped significantly. The sharp rise between the Dead Ball and Live Ball eras can largely be attributed to a change in baseball strategy. During the early 1900s, MLB teams prioritized small ball, focusing on bunts, steals, and putting the ball in play. It was not until 1919, when Babe Ruth hit over 20 home runs in a season, that teams began to value the home run. From 1919 until 1994, the number of players hitting over 20 home runs in a season steadily increased. In 1995, MLB fans witnessed a significant increase in both total home runs and the number of players hitting over 20. This sudden increase is due to the use of steroids in baseball. Players like Mark McGwire, Sammy Sosa, and Barry Bonds began to mash home runs with the help of juicing. Home run totals returned to normal after the 2005 season, as MLB began to crack down on steroid use.
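The count of players with more than 20 home runs in a season can be tallied directly. The sketch below is a hedged illustration: `count_20hr` is a hypothetical helper, the toy rows are invented to show the mechanics, and it assumes the Lahman column names `playerID`, `yearID`, and `HR`. Because a player who changes teams mid-season appears once per `stint`, home runs are summed per player-season before the threshold is applied.

```r
# Hypothetical helper: number of players per season with more than `threshold`
# home runs, after combining stints for players who changed teams mid-season.
count_20hr <- function(df, threshold = 20) {
  season <- aggregate(HR ~ playerID + yearID, data = df, FUN = sum)
  sluggers <- season[season$HR > threshold, ]
  # Keep every season in the result, including seasons with zero such players
  counts <- table(factor(sluggers$yearID, levels = sort(unique(df$yearID))))
  data.frame(yearID = as.integer(names(counts)), players = as.integer(counts))
}

# Invented rows for illustration only (not real player seasons)
toy <- data.frame(
  playerID = c("a", "a", "b", "c", "c"),
  yearID   = c(2000, 2000, 2000, 2001, 2001),
  stint    = c(1, 2, 1, 1, 2),
  HR       = c(12, 10, 25, 8, 9)
)

hr_counts <- count_20hr(toy)
hr_counts
```

In the toy data, player "a" only clears the threshold once both stints are combined, which is the case a row-by-row filter would miss. Applied to the full data, `count_20hr(batting)` would return one row per season that could then be plotted against `yearID`.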


It seems that key rate stats such as slugging and OBP, and as a result OPS, have remained constant over time, while counting statistics like home runs have changed. This information has large implications for the game of baseball. The data is available to front offices, and more than likely they are using it to build their rosters. Front offices will continue to buy high-achieving players in hopes of raising their team above league average in these select statistics, which usually results in more wins. The problem is that small market teams do not have the luxury of buying these types of players, as the Mike Trouts of baseball are extremely hard to come by for a cheap team. Being unable to afford these players typically means small market teams must be okay with league average or even below average production at the plate. One possible solution to this problem is chance, which the Kansas City Royals exploited in 2015. By chance, I mean taking advantage of immeasurable variables. No amount of data could have predicted Jose Bautista dropping the fly ball in the ALCS, just as no amount of data would have predicted Carlos Correa making an error in the ALDS: the 2015 Royals succeeded on the back of creating chaos. If a team could find a way to measure the value of human error and inconsistency, it would have an easy path to success. The question is, how do we go about exploiting the human side of baseball?

This FanPost was written by a member of the Royals Review community. It does not necessarily reflect the views of the editors and writers of this site.