Advanced Statistics: A Few Discussion Points
I've spent some time recently reading up on some of the more complex statistics out there, and it sparked some questions and ideas that I thought might resonate with the readers of this site. I'll start with a couple of questions and then move on to my idea for a stats project.
1) The original makeup of wOBA does not include an allowance for stolen bases or SB%. However, I've seen work on this site by Devil Fingers, where he says he's accounted for SBs in his wOBA. When you see a wOBA quoted, can you assume that it takes into account SBs? Is that now the generally accepted practice?
2) There is a statistical advantage to hitters in certain defensive alignments, such as when the infield is in double play depth or playing against the bunt. Essentially, in certain situations, the adage about runners on base causing havoc for the pitcher/defense can hold true. Is this statistical advantage used in calculations, such as wOBA? If not, how can defensive alignment be incorporated into an analysis of a hitter?
3) We've talked a bit on this site about using regression analysis. I don't know what all is out there, but I'm wondering what could be accomplished with stats and regression analysis, particularly using powerful software such as SAS. Specifically, I'm thinking about predictive values for hitters and pitchers.
I'd like to see some advanced regressions run on the base stats that comprise some of the more complex measures such as wOBA and WAR. I'd like to see what items or combination of items have the most predictive power. For example, it might seem obvious that previous seasons' HRs have the most predictive accuracy for a typical power hitter, but HRs alone won't give you an accurate wOBA or WAR forecast for a hitter in the coming year. However, what if you combined HRs, 2Bs, and BBs - how predictive do you think that might be? These are the sort of things regressions can tell us.
I'd like to see how simple you could make an equation that still has a reasonable degree of accuracy. That way it would be easy for a fan sitting at home to do some simple forecasts for his favorite players. What I think would be interesting is having different equations for different types of players. So the example above would be for a power hitter, but maybe a speedster would have something like total hits, stolen bases, and CS%.
I think you'd need to use basic counting stats for these exercises and not stats like OBP and SLG because those stats include too many variables already. For example, in SLG and wOBA a triple is obviously worth more than a double. But triples will have less predictive value than doubles because they occur far less frequently. I want to strip out the items that have minimal predictive value.
Pitchers would have something similar, but you could probably use slightly more advanced stats such as K/9, BB/9, or GB to FB ratio. For relievers, you'd have to add holds to the equation to make it worthwhile. Just kidding! Just seeing if you were still paying attention. I'm not exactly sure how to tweak the equation for relievers. Of course all of this is theoretical anyhow. I haven't run the regressions and I don't have access to powerful stats software like SAS or SPSS which would make this stuff much easier than grinding it out in Excel.
I don't know if this would be feasible for defense. I bet that's something that isn't being done though. Predictive regressions using defensive stats. Could be groundbreaking material!
Not sure if there is any value in this. Mostly it would simplify predictions without sacrificing too much in the way of accuracy. I thought it might be nice to have a few quick and dirty equations to forecast your favorite players for the coming year.
Comments
There is a statistical advantage to hitters in certain defensive alignments, such as when the infield is in double play depth or playing against the bunt. Essentially, in certain situations, the adage about runners on base causing havoc for the pitcher/defense can hold true. Is this statistical advantage used in calculations, such as wOBA? If not, how can defensive alignment be incorporated into an analysis of a hitter?
I don’t think so.
There is a demonstrated advantage hitters have with men on base, but I don’t think it’s accounted for in wOBA. As for the infield positioning thing… no
I bet some teams have a house stat or something that basically takes away all of a guy’s cheap hits, but it isn’t in wOBA
I wonder if the Royals
have a house stat for, well, anything…
the sad thing is
many of the advanced stats are basically scouting/stat hybrids when you think about them
they are based off of close, repeated observation
and yet the royals just prefer, say, errors
To prove a point
We should replace Jose Guillen in RF with, say, the statue of David. I’m sure the statue will have fewer errors and eventually earn a 3 year deal from the team.
And cost less
I used to be an A's fan until they left town and got good.
by philofthenorth on Jan 22, 2010 2:42 AM EST
One of the difficulties with that is isolating the pitcher skill ...
… bad pitchers put more men on base. Consequently, the population of at bats with men on base is a data set which is biased in favor of bad pitching.
As an extreme case, to illustrate the point: let’s say that there are only two types of pitchers. The first type does not allow any base runners at all. Every batter they face makes an out. That is, the BAA for those pitchers is .000. The second type of pitcher has a BAA of .300 in all situations, whether there are men on base or not. Let’s say the two types of pitchers face equal numbers of hitters.
Then the overall BAA for players will be .150. But the BAA with runners on base will be .300. It would be totally incorrect to infer that players hit better with men on base because the BAA with men on base is .300 vs. an overall BAA of .150.
You need to consider whether the BAA with men on base is higher for each pitcher. Then you look at the population of pitchers and determine whether the deviation from norm for the pitcher population is statistically significant.
by Steve Nelson on Jan 23, 2010 3:17 AM EST
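The two-pitcher example above is easy to check with a few lines of arithmetic. This is only a sketch of that thought experiment: the pitcher labels and the 1000 batters faced per type are made-up placeholders, and every batter faced is treated as an at bat.

```python
from fractions import Fraction

# Hypothetical two-type pitcher population from the example above.
# Each entry: (at bats against, batting average against).
pitchers = {
    "perfect": (1000, Fraction(0)),       # never allows a baserunner
    "hittable": (1000, Fraction(3, 10)),  # .300 BAA in every situation
}

def pooled_baa(groups):
    """At-bat-weighted batting average against across the given groups."""
    total_ab = sum(ab for ab, _ in groups)
    total_hits = sum(ab * baa for ab, baa in groups)
    return total_hits / total_ab

# Every at bat, against both types of pitcher.
overall = pooled_baa(pitchers.values())

# Only the hittable pitchers ever have men on base, so every at bat with
# runners on comes against them. The exact count of such at bats does not
# matter because their BAA is the same in every situation.
runners_on = pooled_baa([pitchers["hittable"]])

print(float(overall), float(runners_on))
```

The gap between the two numbers comes entirely from which pitchers are in the sample, not from any change in how hitters perform, which is why the comparison has to be made pitcher by pitcher.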
OT: but...
RT @TBrownYahoo Rick Ankiel agrees to a one-year contract with Royals worth $3.25 million guaranteed. Second-year mutual option worth $6 million.
....my quick smells like french toast...
quickie on wOBA and SBs
the original formula doesn’t have them, and I don’t think Stat Corner (the first site to implement wOBA) uses them. Tango later noted that you could add them in to the “original formula” by using +.25 for SBs and -.5 for CSes
however, the most commonly cited implementation is at FanGraphs, and that DOES include SBs (although they don’t include ROEs). It should also be noted that the FanGraphs version doesn’t use the standard weights of the “original,” but uses Tango’s script (that anyone who works with Access or MySQL can get if you search) for generating custom weights for each season’s run environment.
I'm not a sabermetrician, but I do play one at FanGraphs.
Can't get enough of me? Check out my Twitter feed.
On sample sizes/regression
see this post by Eric Seidman (now of BP) based on the work of Russell Carleton (now also of BP) done at the now sadly defunct MVN site StatSpeak, which kicked f---ing ass.
I'm not a sabermetrician, but I do play one at FanGraphs.
Can't get enough of me? Check out my Twitter feed.
Interesting.
I think you’d have to run any of my theoretical predicting stats through the sample size test to see if they qualify as robust predictors. Of course you could always get over that hurdle by including minor league stats, but then you’d have to rely on a conversion metric which considerably muddies the waters.
I've been staring at the post for a while trying to decide how to respond
specifically on your point 3….
first of all, using something like SAS isn’t going to change the results. Whether you do an OLS regression in Excel or SAS, you will get the same result (and Excel is actually easier; I use both regularly at work). That said, some more advanced techniques would become available as you wean yourself off of Excel, but programs like R, Stata, etc. are just as capable, while much cheaper (base SAS is like $6k per user, plus extra for all the add-ons and stuff). A program like R is free and, as far as the statistical capabilities needed for sabermetrics go, equivalent.
As for using more regression analysis, the short answer is of course it would be a great step forward. It has always bothered me how little statistical rigor most sabermetrics is done with. Things like error analysis, distributions, and simple linear regressions are often either ignored or very poorly done. Not to say there aren’t some good people out there doing it, but a vast majority of the work you see on blogs doesn’t get much more advanced than a high school level stats class. Again, don’t get me wrong, that doesn’t mean the current analysis out there is wrong, but it does mean “Truth” and “Fact” about baseball are a lot less concrete than some would have you believe (on either side of the stats debate). Part of the problem is the understanding gap between the baseball fans who demand this analysis and the researchers who supply it. I could (and have tried to) talk all day about the economic fallacies of linearly priced WAR and salary caps, but unless you have an excellent understanding of economics, it gets discounted as an unknown, and the simple explanation and rule is what gets propagated around the net.
That said, to be more specific about your desire for better projection systems, one thing in particular that I have always thought about is properly estimating weights for past years. You often hear about weighting past years by 5/3/1 or some variation of that to predict next year. The thing about that is those numbers are completely arbitrary. It would not be terribly hard for someone with access to all the data to actually run a regression and estimate the coefficients/weights of past years. Even with respect to age curves, position, etc. there is a lot of room to improve.
Going a step further however, a lot of projection systems actually already do use some regression analysis. Pecota comes to mind first, Silver used an extensive set of regressions (the way I understand it) to map players onto their comps. But I don’t know too much more detail.
Certainly a good post jsolo, important questions to think about.
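The idea in the comment above, estimating past-year weights by regression instead of picking something like 5/3/1, can be sketched with NumPy. Everything here is an assumption for illustration: the data is synthetic (a made-up talent level that drifts a little each year, plus noise), and the means and spreads are placeholders, not real baseball numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_players, n_years = 20000, 4

# Synthetic data only: each hypothetical player has a "true talent" that
# drifts a little every year, and each observed season is talent plus noise.
talent = rng.normal(0.330, 0.030, size=n_players)
observed = np.empty((n_players, n_years))
for year in range(n_years):
    if year > 0:
        talent = talent + rng.normal(0.0, 0.010, size=n_players)
    observed[:, year] = talent + rng.normal(0.0, 0.030, size=n_players)

# Regress the final season on the three seasons before it (oldest first).
# The column of ones gives the model an intercept.
X = np.column_stack([np.ones(n_players), observed[:, :3]])
y = observed[:, 3]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, weights = coef[0], coef[1:]

print("weights, oldest to newest:", np.round(weights, 3))
print("intercept:", round(float(intercept), 3))
```

Because talent drifts in this toy setup, the most recent season ends up with the largest coefficient, and the weights sum to less than one, with the intercept supplying the pull toward the population average. With real data the same regression would let the data choose the weights rather than convention.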
Thanks Zep. Extremely valuable insight.
I think there are strides that can be made with projections. I really wonder if there is anything out there resembling a predictor for defensive stats. Not sure how much value that would bring, but I bet it could gain some traction with some of the more advanced stats users.
I’d do some of this stuff myself if I had more time, but I don’t see my free time increasing anytime in the near future.
I think for defense
the advancement that needs to be made isn’t one with stats, it’s with data.
Hit F/X and defensive positioning will be vital to truly understanding defense. Are players good defenders because they are athletic, lucky, or had a good coach that started them in the right spot?
lateral movement
by Matt Klaassen on Jan 22, 2010 10:38 AM EST
...
I really wonder if there is anything out there resembling a predictor for defensive stats.
There are plenty of projections for defensive stats.
CHONE
Steve Sommer
Jeff Zimmerman
Are among the most prominent.
by vivaelpujols on Jan 22, 2010 6:53 PM EST
Well, let me restate that
I’m sure there are defensive stat projections, but do we know what the best predictors of future defensive performance are? For example, we know number of errors basically means nothing, but perhaps something like range factor might be a decent predictor. In terms of defensive stats, I also think multicollinearity may be less of an issue – of course there just aren’t that many stats to choose from at this point.
Regarding Jeff Zimmerman, as a rule I don’t trust anyone who lives in Tucson. :)
Where is Jeff anyhow? I would have expected him to weigh in on this thread by now!
I am around, the questions here are loaded.
Actually, I no longer live in Tucson, the wife, kids and I moved to Wichita a couple of months ago.
I have had a couple of run ins with defensive stats. The problem is that they have a small sample size each year (30 people), and if someone gets hurt or changes position, the zero value changes. I don’t have access to the good data ($$$$), but I would love to have an all year baseline. It would be nice to see that the SS in one year all were better than the year before because someone was hurt.
Range factor is ok, but it needs to be adjusted to PA per team at least. I have my own query based off of range factor that I use as an idiot check. I checked and it lines up decently with UZR.
As for predicting stats, I am not sure, but I know range and arm decrease while errors usually decrease (not as many stupid throws).
Jeff Zimmerman - Protecting the world from RBI's and Wins from my mom's guest house.
by Jeff Zimmerman on Jan 22, 2010 10:34 PM EST
we need to get together and do our own open-source retrosheet defensive thing
I know TotalZone is out there, but couldn’t we hack it?
by Matt Klaassen on Jan 22, 2010 10:39 PM EST
by "hack it" I don't mean TotalZone
I mean couldn’t we figure out something… did you ever mess with Colin’s SZR?
by Matt Klaassen on Jan 22, 2010 10:39 PM EST
Yea, let's get it done.
I am busy for a bit, Fangraphs deadline of ST and also going to begin writing Pitch F/X for a really cool and mainstream website soon.
Chone put his stuff at the Hardball Times. He sent me the links once.
by Jeff Zimmerman on Jan 22, 2010 10:59 PM EST
I'm swamped a bit myself
but at some point
you know, as soon as my “awesome” projection system is at a good place
by Matt Klaassen on Jan 22, 2010 11:23 PM EST
Are you entering Tango's contest this year?
by Jeff Zimmerman on Jan 22, 2010 11:34 PM EST
I've thought about it
but only anonymously… is it just hitters, or pitchers, too?
My projections are so awesome they have like, one more double or walk or one less hit than Marcels
by Matt Klaassen on Jan 23, 2010 1:04 AM EST
I'd rather work on getting a gameday fielding metric.
I should talk to Peter Jensen.
by vivaelpujols on Jan 23, 2010 12:14 AM EST
Truthfully, I would love to dive into Defense when Hit FX is 100% available
Some more detailed work of what I started here:
http://www.beyondtheboxscore.com/2009/6/18/913774/zones-of-scoring-using-hit-f-x-to
by Jeff Zimmerman on Jan 23, 2010 12:28 AM EST
I really hated dealing with Heidi SQL for some reason
So I never got into Hit f/x. I agree that it’s probably the next step in defense (but not as much as people think, as you can’t calculate spin from Hit f/x).
by vivaelpujols on Jan 23, 2010 12:38 AM EST
I might not be great, but I would love to see more.
On the work I did, you can see the run values decrease near positioned players. I guess for me, I could get a general area for the player (might need to bring R into the equation) and then see if it increases or decreases over time.
by Jeff Zimmerman on Jan 23, 2010 12:48 AM EST
you guys should just ask people
like Sean Smith (CHONE), or especially Dan Szymborski, who, like you guys, is an economist.
CHONE does “win” most of the “wars” though.
The Seidman/Pizza Cutter article I mention above is relevant to the regression of specific components, if not the year-to-year weighting.
The most common weighting I’ve seen for position players on the ’net is 5-4-3 (or 5-4-3-2), or 0.8^(projected season-source season). Brian Cartwright (Oliver) uses .75^(projected season-source season), I think. He found that decreased the RMSE, but that posting is gone with StatSpeak.
Actually, Tango (or those following him) gets the Marcels weights (basically 5-4-3) for hitters by doing 0.9994^((projected season – source season)*365.25), but I don’t know how they get that. I use this for “my” projections (just a souped-up Marcels, deciding on a name, either GOB, FREDO, or DAYTON) since that’s what Colin Wyers had in his script, and the weights turn out to be (for projecting 2010):
2009: 0.803148403073148
2008: 0.645047357358948
2007: 0.518068754969393
For pitchers, it seems that people use more extreme weights to reflect the apparently more frequent change in “true talent” with pitchers; I don’t know if that’s true or not. People use 3-2-1 or 5-3-2 (basically the same). I think 0.9987^((projected season – source season)*365.25) approximates that:
2009: 0.621801789146064
2008: 0.386637464985247
2007: 0.240411867478725
by Matt Klaassen on Jan 22, 2010 10:38 AM EST
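The weights listed in the comment above follow directly from the daily decay formula it quotes, so they can be reproduced in a few lines. The function and variable names below are made up for this sketch; only the formula and the 0.9994 / 0.9987 decay rates come from the comment.

```python
def season_weight(projected_season, source_season, daily_decay=0.9994):
    """Weight for a past season: daily_decay raised to the days elapsed."""
    days_ago = (projected_season - source_season) * 365.25
    return daily_decay ** days_ago

def scaled_to_newest(weights, newest_value):
    """Rescale weights so the most recent season equals newest_value."""
    newest = max(weights)  # latest season year
    return {yr: newest_value * w / weights[newest] for yr, w in weights.items()}

years = (2009, 2008, 2007)
hitter_weights = {yr: season_weight(2010, yr) for yr in years}
pitcher_weights = {yr: season_weight(2010, yr, daily_decay=0.9987) for yr in years}

# Compare against the conventional 5-4-3 (hitters) and 3-2-1 (pitchers).
hitters_vs_543 = scaled_to_newest(hitter_weights, 5)
pitchers_vs_321 = scaled_to_newest(pitcher_weights, 3)

for yr in years:
    print(yr, hitter_weights[yr], pitcher_weights[yr])
```

Rescaled so the newest season is 5, the hitter weights come out to roughly 5 : 4.0 : 3.2, and rescaled to 3 the pitcher weights are roughly 3 : 1.9 : 1.2, which is the sense in which the formulas approximate 5-4-3 and 3-2-1.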
good detail
i guess my question comes back to the point that what makes:
0.9994^((projected season – source season)*365.25)
the correct formula. Reducing error using one simple formula over another doesn’t answer the question what is the best, it only says what is better.
I don't know, either
I was hoping you guys would have some ideas… I’ll see if I can dig up Tango’s post on it.
by Matt Klaassen on Jan 22, 2010 10:48 AM EST
Just digging up some stuff
Tango says the 5-4-3 and 3-2-1 weights are based on regression, but doesn’t elaborate:
http://www.insidethebook.com/ee/index.php/site/comments/near_sighted_marcels/#3
Phil, the general equation for weighting is:
weight(daysAgo) = .9994^daysAgo
If you are on a steep slope, then that .9994 would be .9990 or something, perhaps even .9986. If you are on a gentle slope, then it would be say .9996 or .9998.
Just a little research will give you what you need.
Wish there was more
by Matt Klaassen on Jan 22, 2010 11:04 AM EST
after reading his post
I’m pretty sure I understand what he did. Gives me more confidence in the weighting than I previously had. He just rounds to nice whole numbers for simplicity.
Hey! I went 37-5 in Tango’s head-to-head drafts this year.
More seriously, I use 8-5-4-2. I took every 5-year run in MLB history since 1920, including 30 years of minor league translations (park and league-neutralized), and modeled what contribution Years 1, 2, 3, and 4 had in projecting year 5. The simplest way to do this, multiple regression, yielded very similar results to more robust methods (if I start talking about heteroskedasticity, eyes will be glazed).
The numbers are rounded in normal conversation because I’m not going to go around talking about a “7.74-4.93-4.12-1.84” weighting.
--
Dan Szymborski
dan@baseballprimer.com
by D.Szymborski on Jan 22, 2010 12:02 PM EST
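To see how little the rounding in the comment above costs, here is a small sketch that applies both the quoted 7.74-4.93-4.12-1.84 weights and the rounded 8-5-4-2 version as a plain weighted average. The ordering (newest season first) and the four wOBA values are assumptions made up for illustration, and a real projection system does much more than this (regression to the mean, aging, and so on).

```python
# Weights quoted in the comment above; ordering assumed to be newest first.
raw_weights = [7.74, 4.93, 4.12, 1.84]
rounded_weights = [round(w) for w in raw_weights]

# Share of the total weight that each season receives.
shares = [w / sum(raw_weights) for w in raw_weights]

def weighted_average(rates, weights):
    """Weighted average of past-season rates, listed in the same order as weights."""
    return sum(r * w for r, w in zip(rates, weights)) / sum(weights)

# Hypothetical wOBA values for one hitter, newest season first.
past_woba = [0.350, 0.340, 0.360, 0.330]

from_raw = weighted_average(past_woba, raw_weights)
from_rounded = weighted_average(past_woba, rounded_weights)

print(rounded_weights)
print(round(from_raw, 4), round(from_rounded, 4))
```

For this made-up hitter the two results differ by well under a point of wOBA, which fits the remark that the rounded numbers are fine for normal conversation.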
sorry for the oversight, Dan
glad you stopped by, I was hoping… congrats on the big win
Since you are here, can I ask what data sources you use? I have bdb, and I guess I need to start using retrosheet, but can one get minor league data free (and easily importable into a database format?)?
by Matt Klaassen on Jan 22, 2010 12:11 PM EST
As for the larger predictive models
the initial problems i see:
1. Multicollinearity b/t the variables. For starters, the # of HRs you hit affects the # of BBs you get, and vice versa. Measuring how valuable those are and reporting it is one thing; trying to use the two in a predictive manner is more difficult. Now throw in 20 more variables, all of which exist in a web of collinearity, and you’ve got some work to do. I think the econometric analysis tools to deal with this are out there, but it’s not simple stuff.
2. Defining the model. The regression results are always going to be subject to the model’s accuracy. While we have a pretty good idea as to what generates value, the theoretical solidity of the model will always be up for debate.
Conversation b/t Special baseball operations consultant Zapp Brannigan and GM Dayton Moore: "...but paper covers rock and rock crushes scissors...we have a conundrum. Get me some paper, a rock, and some scissors."
by SagehenMacGyver47 on Jan 22, 2010 3:39 PM EST
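One common way to put a number on the multicollinearity problem raised in the comment above is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R^2). The sketch below uses synthetic columns that merely stand in for HR, BB and SB; the shared "power" component and all the numbers are assumptions, not real player data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic stand-ins: "hr" and "bb" share a common component (so they are
# strongly correlated), while "sb" is unrelated to both.
power = rng.normal(size=n)
hr = power + 0.3 * rng.normal(size=n)
bb = power + 0.3 * rng.normal(size=n)
sb = rng.normal(size=n)
X = np.column_stack([hr, bb, sb])

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2), where R^2 comes
    from regressing column j on the remaining columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(dict(zip(["hr", "bb", "sb"], np.round(vifs, 2))))
```

A VIF near 1 means a predictor is close to independent of the rest. The two correlated columns come out far higher, which signals that their individual coefficients in a regression would be unstable even if the combined prediction is fine.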
Multicollinearity is definitely one of the more annoying challenges! It would be a lot easier on my computer to sim a million seasons of a player if the variables were completely independent.
by D.Szymborski on Jan 22, 2010 7:01 PM EST
I hadn't thought about multicollinearity when I was writing this post
But it’s an excellent point. It really does make regression analysis difficult – at least if you are trying to do basic linear regression. It would be pretty difficult to compensate for this I think, but not impossible.
there are ways around it, to an extent
that i have read about, but of course i have never actually used them (or remember their strategies, for that matter). one way is to transform the variables (i.e. HR/BB or HR + BB, etc.) but then you open up a new can of worms in that transformed variables reduce the power of the regression in their own way, iirc.
by SagehenMacGyver47 on Jan 22, 2010 10:39 PM EST
Regression is a very limited tool in Sabermetrics
I’ll give you an example.
1) Linear weights. If you used regression to calculate linear weights, you would get results that just don’t make sense. That’s because of the massive multicollinearity involved in baseball stats.
Linear weights can be much more accurately calculated by actually looking at the average change in runs for each event. Regression tries to infer what that is through team or game statistics, but you don’t have to infer, you can actually find it out exactly!
by vivaelpujols on Jan 22, 2010 6:52 PM EST
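The direct approach described in the comment above, averaging the change in runs for each event, can be sketched as follows. The run value of a play is taken as the runs scored on it plus the change in run expectancy. The tiny play log and its run expectancy numbers are invented purely for illustration; a real calculation would use a full season of play-by-play data.

```python
from collections import defaultdict

# Hypothetical play log. Each row:
# (event, run expectancy before, run expectancy after, runs scored on the play)
# All values are made up for illustration.
plays = [
    ("single", 0.50, 0.90, 0),
    ("single", 1.10, 0.90, 1),
    ("out", 0.50, 0.27, 0),
    ("out", 0.90, 0.52, 0),
    ("home_run", 0.50, 0.50, 1),
    ("home_run", 0.90, 0.50, 2),
]

def linear_weights(plays):
    """Average run value per event type: runs scored plus change in run expectancy."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for event, re_before, re_after, runs in plays:
        totals[event] += runs + re_after - re_before
        counts[event] += 1
    return {event: totals[event] / counts[event] for event in totals}

weights = linear_weights(plays)
print({event: round(value, 3) for event, value in weights.items()})
```

Nothing is inferred here: each event's weight is just an observed average, so correlations between event types never enter the calculation the way they would in a regression on team or game totals.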
I shouldn't say very limited
It has plenty of uses; however, for the most part, you’ll get more accurate results using other methods besides regression.
by vivaelpujols on Jan 22, 2010 6:55 PM EST
I talked to a friend who is a sports nut and teaches stats.
He says regression gets you close. Which it does with determining run value, but there is definitely error. Even though the standard deviation gives some level of error, you don’t know which way it goes.
A huge problem is the run environment. More runs were being scored 10 years ago than now. Now if you use the season totals and regress the numbers for all the years, your results will be off for both now and then. If you only use this year’s data, you only have 30 samples, way too few to get accurate results.
by Jeff Zimmerman on Jan 22, 2010 10:54 PM EST
the differing run environment sounds like the kind of situation
that econometric analysis can really deal with. i mean, if the run environment is a variable that could be controlled for.
by SagehenMacGyver47 on Jan 23, 2010 1:28 AM EST
#3 Regression has some uses, but needs to be kept in check
Here is a great article on predicting attendance with regression:
Done with extra variables and t values compared. I think it is good for finding the values that matter with a bunch of unknown inputs.
Also I used it to find effects on parks for runs scored:
A problem I had was that the higher the outfield wall the more runs were scored, which doesn’t make sense.
There is a place for regression, but its results also need to be examined for common sense.
I think it is more important to ask a question and then find the solution however you can. Publish results. Get torn apart. Go back and fix.
by Jeff Zimmerman on Jan 22, 2010 10:47 PM EST
It makes perfect sense
A problem I had was that the higher the outfield wall the more runs were scored, which doesn’t make sense.
Home runs are rally killers.
by billexgordler on Jan 22, 2010 11:06 PM EST
Expect Jeff to get a 3/36 contract as sabermetric apologist for the Royals soon
by Matt Klaassen on Jan 22, 2010 11:24 PM EST
I don't follow, as a Royals fan, what is this rally thing you speak of?
by Jeff Zimmerman on Jan 22, 2010 11:36 PM EST
Low outfield walls
inspire the outfielders to make fancy leaping catches that look good in the highlight reel. You have to take into account the intangibles!
We have enough statistical brainpower here to start our own baseball stats outsourcing firm
How do the Royals have so many sophisticated fans, yet have so little use for stats in the FO? Maybe we should run a regression on that…
The Royals seem years behind other teams.
I remember last season Luke H was needing help determining which pitch he was giving a tell on. Supposedly the Royals were watching film on it, but it was obvious one pitch and the teams that know are obvious.
I have done some work for one team and they have wanted some pretty decent stuff. I hope to expand that work, but I will see.
by Jeff Zimmerman on Jan 22, 2010 11:51 PM EST
