Interview Questions for a Data Analyst Job with an MLB Team

A couple of years ago I applied for a job with the Arizona Diamondbacks. I filled out the application and then got a written interview as the next step in the process. I spent a few days answering the questions and sent them back. Then this happened just a couple days later:

A.J. Hinch has been fired as manager of the Arizona Diamondbacks. Josh Byrnes has also been relieved of his duties as General Manager.

From that point on, I never heard another word from the Diamondbacks. I sent them a few emails and got no response.

I have been writing a new resume, saw all this old work, and thought it may be interesting to some people. Some of the information is dated, like the draft pick valuations under the new CBA, but much is still applicable. Enjoy the novel.

Here are the questions I answered.

As the next step of our interview process, we’d like you to provide your responses to the below questions. For each, please describe in as much detail as possible:

  • The data set(s) that you would use and where you would look to obtain that data.
  • The technical tools/software packages that you would use to obtain, aggregate and filter the information.
  • The algorithm/model that you would use to derive results.
  • Briefly describe what metrics/way you would quantify the output of your project.

  1. How would you assign a dollar value to each draft pick # in the June amateur draft?
  2. What defensive metrics have the most impact on a team’s Win-Loss record?
  3. How would you optimize the Diamondbacks 2010 lineup?
  4. What components of Minor League performances predict future Major League success?
  5. What impact would going to a 4-man rotation have on a Club’s bullpen?

Please remember, it is not necessary to perform the analysis itself, rather to walk through your process and methodology. Feel free to use any resources available to you, but be certain that your responses reflect your personal analyses, opinions, and reasoning.

How would you assign a dollar value to each draft pick # in the June amateur draft?

To get the dollar value of each pick, a data set of previous draft picks going back 10 years would need to be created (player, college or high school, position, draft pick, amount paid, etc.). Data going further back could be used, but teams are now spending more and more money on the draft, so the most recent 10 years should be more accurate. Currently, there is no such data set freely available to the public.

A couple of methods may be used, depending on how the data needs to be collected. It may be available to teams that pay for STATS or BIS data. If it is not available from them, it will need to be obtained from the internet, either by copying and pasting the data into a spreadsheet or by using a web spider written in a language like Perl.

Once the data is collected, a few problems with the draft data would need to be addressed first in a spreadsheet program like Excel.

First, all the values should be adjusted for yearly inflation. I would weight this value with two inputs. The yearly inflation from the previous year's draft would be given a weighting of 2. The inflation of this year's free agent class would get a weighting of 1. The second input is there to catch any change in inflation trends since last year's draft.
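A minimal sketch of the 2-to-1 weighting described above. The function name and the idea of expressing inflation as a decimal rate (0.06 for 6%) are assumptions made for illustration.

```python
def blended_inflation(draft_inflation, free_agent_inflation,
                      draft_weight=2.0, free_agent_weight=1.0):
    """Weighted average of two yearly inflation rates (0.06 means 6%).

    The default weights follow the text: 2 for last year's draft
    inflation, 1 for this year's free agent class inflation.
    """
    total_weight = draft_weight + free_agent_weight
    weighted_sum = (draft_weight * draft_inflation
                    + free_agent_weight * free_agent_inflation)
    return weighted_sum / total_weight
```

With draft inflation of 6% and free agent inflation of 3%, the blended rate works out to 5%.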

The second issue is how to deal with unsigned players. There can actually be quite a few of these players, especially in the later rounds of the draft. For the players who did not sign, I would assign a value based either on the midpoint between what the team offered and what the player was asking (if known) or on a weighted average of the players drafted around them.

The third problem is teams under- or over-paying the slotted money due to each player at that pick. This could cause some issues. Say a lower pick, like the 17th in the draft, had a couple of players picked at it over slot. The amount spent there would on average be a couple million more than the picks around it over the time frame. To handle this, I would put the players in order from highest dollar contract to lowest. So if the 6th overall pick ends up being the 2nd highest paid, that player's data would become the number 2 pick in the data set and all the rest of the players would be moved down a spot.
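The re-ranking step could be sketched as follows for a single draft year. The tuple layout and the function name are hypothetical, and the bonus figures in the usage note are made up.

```python
def rerank_by_bonus(picks):
    """Re-slot one draft's picks by the amount actually paid.

    picks: list of (original_pick_number, bonus) tuples.
    Returns a list of (new_slot, original_pick_number, bonus) tuples,
    where new_slot 1 is the highest paid player in that draft.
    """
    ordered = sorted(picks, key=lambda pick: pick[1], reverse=True)
    return [(slot, original, bonus)
            for slot, (original, bonus) in enumerate(ordered, start=1)]
```

For example, if the 6th overall pick received the 2nd largest bonus in its draft, it comes back with a new slot of 2 and the original 2nd pick moves down.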

After that, I would divide the data set into 2 groups. The first group would be rounds 1-4, including the supplemental and extra picks (awarded for not signing a first round player the previous year), and the other group would be the rest of the draft. I would then average the top paid player from each of the last 10 drafts, giving more weight to the more recent picks. The most recent pick would get a weight of 10 and the pick from ten years ago would get a weight of 1. The weighting puts more emphasis on recent picks and trends than on ones from 10 years ago.
This procedure would also need to be done with the rest of the rounds. The extra few picks at the end of the first 4 rounds, which won't have 10 total samples because the number of draft picks differs each year, would need to be adjusted to the picks around them.
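A short sketch of the recency weighting, assuming the bonuses for one slot are already inflation adjusted and listed oldest first. The function name is hypothetical.

```python
def recency_weighted_average(bonuses_oldest_first):
    """Average the bonuses for one slot, weighting recent drafts more.

    The oldest draft gets weight 1 and each later draft one more, so
    with ten drafts the most recent gets weight 10, as in the text.
    """
    weights = range(1, len(bonuses_oldest_first) + 1)
    weighted_total = sum(w * b for w, b in zip(weights, bonuses_oldest_first))
    return weighted_total / sum(weights)
```

The same function works for the slots at the end of a round that have fewer than ten samples, although the text suggests those should also be adjusted toward the picks around them.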

Other data should also be included besides the averaged values. Using the pre-ranked data, I would include the highest and lowest dollar amounts paid at this pick and the highest amount paid to any lower pick. This would give a decent idea of how much teams have previously gone over or under slot on this pick. Also, the standard deviation of how much was paid for the pick should be included, to see the general variance in value. These additional values give a team a general idea of how much they could go over or under slot for the pick.
Here is an example of how the final data would look (numbers are totally made up):

What defensive metrics have the most impact on a team's Win-Loss record?

Basic defensive metrics like errors, assists and putouts are readily available, but to tell how the standings are affected, a run value needs to be placed on the play that the player did or did not make. To get this data, an in-house metric would need to be created. The extent of this type of data available is limited, but with enough time, it can be created. The data needs to be collected on a group of teams (from a division up to the entire league) to be able to compare results. If the data was just collected for one team, it could show that they made 10 great diving catches to stop some hits, but how does that compare to other teams?

To get the best information for this in-house defensive metric, three main pieces of information are needed.

The first piece of data needed is how hard and in what direction the ball was hit. Hit F/X data, supplied to all teams, gives the direction and speed that each ball was hit. If this data is not available to the team, BIS and STATS provide data on where each ball was hit on the field in various zones.

The next set of data, which is not currently available, is the location of the defenders. This data could be obtained by watching every play of every game (or some subset of teams). Also, you could take the hitter tendencies (pull hitter, hits to all fields, etc.) and find the average position of the fielder. Getting the position of each player may be rather difficult and time consuming. If the resources are available, I would collect all the data on each ball in play. If not, I would collect a sample of data and see if it matters significantly where the defender is positioned. If we find the variation due to defensive positioning to be, say, 4 plays per 100 balls hit in the player's direction, this level of variation can be added in later.

Finally, data would need to be obtained on whether the player made the catch, threw the ball for an out, or committed an error. This data is readily available from STATS or BIS or could be obtained easily from the which has all the data in an easy-to-get .xml format.

Once the data has been collected, a run expectancy chart needs to be generated for the league. It shows the average number of runs generated given the runners on base and the number of outs in an inning. Say a batter hits the ball and gets to first base on a ball barely missed by the shortstop. The average number of runs with a runner on first and no outs could be 1.1 runs (example). Now if the fielder got to the ball and threw out the hitter, it would be 1 out with no one on base. The average number of runs scored in this situation could be 0.4 runs. So by making or not making the play, the average number of runs expected to be scored changes by 0.7. This is the run value for the play.
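The run value calculation can be sketched with a small lookup table. The two entries below are the example numbers from the text, not real league values, and the state encoding (a base string plus the number of outs) is an assumption made for illustration.

```python
# Run expectancy by (bases occupied, outs). "1--" means a runner on
# first only and "---" means bases empty. Values are the made-up
# example numbers from the text.
RUN_EXPECTANCY = {
    ("1--", 0): 1.1,
    ("---", 1): 0.4,
}


def run_value(state_if_missed, state_if_made, table=RUN_EXPECTANCY):
    """Runs saved by making the play instead of missing it."""
    return table[state_if_missed] - table[state_if_made]
```

A real chart would have an entry for each of the 24 base and out combinations, built from league play-by-play data.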

The data for different players can then be compared to see which players made the most plays in a given situation and to determine the average percentage of times any given player makes a play. Once this percentage is known, runs lost or saved can begin to be assigned to players. For example, say 50% of the time a shortstop gets the runner out at 1st on a ball hit 4 steps to his left. A shortstop who only makes 3 out of 10 of these plays is 2 plays below average, which at the 0.7 run value from the example above costs the team on average 1.4 runs.
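The arithmetic in this example can be written as a small function. The name and parameters are hypothetical, and the single flat run value per play is a simplification.

```python
def runs_saved(plays_made, chances, league_rate, run_value_per_play):
    """Runs saved (positive) or cost (negative) relative to average.

    league_rate is the share of these chances an average fielder
    converts, e.g. 0.5 for the 50% shortstop example in the text.
    """
    expected_plays = league_rate * chances
    return (plays_made - expected_plays) * run_value_per_play
```

Plugging in 3 plays made on 10 chances, a 50% league rate, and 0.7 runs per play gives about -1.4 runs, matching the example.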

Finally, the number of runs saved or lost per team would be known and these values would be summed. Over the past few years, about 10 runs prevented or scored have been needed to get an additional win. Taking the difference in runs prevented above or below average and dividing it by 10 will give the number of wins or losses that a team's defense generates.
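As a sketch, the conversion from runs to wins is a sum and a division. The 10 runs per win figure comes from the text; the function name and the per-player input list are hypothetical.

```python
RUNS_PER_WIN = 10.0  # rule of thumb quoted in the text


def defensive_wins(player_runs_saved, runs_per_win=RUNS_PER_WIN):
    """Wins above (or below) average generated by a team's defense.

    player_runs_saved: runs saved relative to average for each
    fielder, negative for fielders who cost the team runs.
    """
    return sum(player_runs_saved) / runs_per_win
```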

How would you optimize the Diamondbacks 2010 lineup?

There has been quite a bit of previous work done on this subject and there is no reason to re-invent the wheel. Studies have shown that the difference between the best and worst possible lineups is only 3 wins a year. While 3 wins is important, most teams do put their best hitters at the top of the lineup and the worst hitters at the bottom, so the real difference is only a win or two.

Though the difference is small, it should not be ignored. I would use the work already done by Tom Tango, Mitchel Lichtman and Andrew Dolphin in The Book to set the lineups.

First, I would collect 3-4 of the better publicly available projection systems (ZIPS, PECOTA, CHONE or OLIVER) for each player. The data could be downloaded or just copied by hand, since we are only dealing with 25 players at any one time counting the pitchers. I would then average the values to come up with a projected talent level for the player, with an emphasis on getting each player's wOBA, a measure of total production (a combination of on base percentage and slugging percentage) per at bat.

Besides projected numbers, I would average in the current season's data. The two sets of data would be combined to get the true talent of the players. The reason for the combination is that the current season's sample is too small to accurately determine the player's ability, but a player may have changed from previous seasons, so the projection may not be accurate either. Finally, I would calculate the player's historic split between right handed and left handed pitchers.

The Book shows the league average platoon splits for right and left handed hitters against right and left handed pitchers. It also shows how long it takes for a hitter to establish a split larger or smaller than the league average. I would compare the number of at bats the player has had so far with the number needed to determine whether the player's split really differs from the league average, then weight the two values accordingly. For example, say a player has a career wOBA against right handed pitching of 0.300 in 1000 at bats, and the league average split says he should be at 0.310 if he has under 2000 at bats. I would use 0.300 for half of the player's value and 0.310 for the other half, giving a total of 0.305.
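The weighting in this example can be sketched as a linear blend. This follows the simple proportional weighting described above rather than any formula taken from The Book, and the function and parameter names are hypothetical.

```python
def blended_split_woba(observed_woba, at_bats,
                       league_expected_woba, at_bats_needed):
    """Blend a player's observed split with the league expectation.

    The observed value gets weight at_bats / at_bats_needed (capped
    at 1.0) and the league expectation gets the remainder.
    """
    observed_weight = min(at_bats / at_bats_needed, 1.0)
    return (observed_weight * observed_woba
            + (1.0 - observed_weight) * league_expected_woba)
```

With 1000 of the 2000 needed at bats, the observed 0.300 and the expected 0.310 each get half the weight, which gives the 0.305 from the example.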

The players in the game would then be ranked from best to worst against both right and left handed pitchers. The Book found that the lineup should be set as follows.

The best three hitters should be in the #1, #2, and #4 spots. Put players with higher OBP higher in the order and hitters with higher SLG lower in the order. The fourth and fifth best hitters are next, with the 4th best hitter in the fifth spot and the 5th best hitter in the third spot. Finally, place your four remaining hitters in decreasing order of overall hitting ability, with base stealers ahead of singles hitters in order to utilize their speed.
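The slotting rule above could be sketched like this. The Hitter record and function name are hypothetical, overall ability is assumed to be measured by wOBA, and ordering the top three purely by OBP is a simplification of the OBP versus SLG guidance. The base stealer tie-break for the last four spots is left out.

```python
from typing import NamedTuple


class Hitter(NamedTuple):
    name: str
    woba: float  # overall hitting ability (assumed ranking stat)
    obp: float
    slg: float


def book_style_lineup(hitters):
    """Return nine hitter names in batting order, spots #1 through #9."""
    if len(hitters) != 9:
        raise ValueError("expected exactly nine hitters")
    ranked = sorted(hitters, key=lambda h: h.woba, reverse=True)
    # Best three hitters fill spots 1, 2 and 4, higher OBP batting earlier.
    top_three = sorted(ranked[:3], key=lambda h: h.obp, reverse=True)
    order = [None] * 9
    order[0], order[1], order[3] = top_three
    order[4] = ranked[3]      # 4th best hitter bats fifth
    order[2] = ranked[4]      # 5th best hitter bats third
    order[5:9] = ranked[5:9]  # remaining four in decreasing ability
    return [h.name for h in order]
```

The function would be run twice, once with projections against right handed pitching and once against left handed pitching.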

Now the lineup can be adjusted a little more. One consideration is to see if the left and right handed hitters can be split up as much as possible. If one of the 2 players in contention for the 1 and 2 spots is right handed and the other is left handed, and the player in the 3 spot is left handed, the first hitter should also be left handed so the right handed hitter bats between them.

The main goal with setting a lineup is to find a player's true talent as accurately as possible and put the better players at the top of the lineup.

What components of Minor League performances predict future Major League success?

The first priority would be to find out what the team considers to be Major League success. Is it plate patience, on-base percentage, power, speed, etc. for hitters? Is it ERA, strikeouts, walks, or ground ball rate for pitchers?

Once the team has decided on what stats to use to determine major league success, the data for all minor and major leagues needs to be obtained. This can be done either by getting the data from BIS or STATS or by creating a program (written in Perl or another scripting language) to pull the major and minor league data from for all the available players. I would only use the first 1 to 3 years of ML data (depending on availability) so the data being compared is uniform for new players and not 20 year vets.

Next, I would need to get each team's home and away stats to determine the park factor for each minor league stadium. If a stadium is shown to increase a team's statistics by 10% (say, a small stadium at a high altitude), then the team's stats for that stadium would be lowered. Once this number is found for all stadiums, the players' numbers would be adjusted for the home park or, if needed, for each park the team played in.
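A deliberately simple sketch of the park adjustment, assuming the park factor is just the ratio of a team's home rate to its away rate for some stat. Real park factor calculations are usually more careful than this, and both function names are hypothetical.

```python
def park_factor(home_rate, away_rate):
    """Ratio of a team's per-game rate at home to its rate on the road.

    A value of 1.10 means the home park inflates the stat by 10%.
    """
    return home_rate / away_rate


def park_adjust(stat_in_park, factor):
    """Scale a stat accumulated in a park back to a neutral park."""
    return stat_in_park / factor
```

A team scoring 5.5 runs per game at home and 5.0 on the road would get a factor of about 1.10, and 110 runs produced in that park would be adjusted down to about 100.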

Next, the player data would need to be adjusted for the league the team plays in. For example, the AAA Pacific Coast League may bloat offensive statistics by 5% compared to the AA Texas League. At this point there should be a set of data for each player that contains stats for each level of competition they played at.

Finally, the player's age needs to be factored into the stats. This is done by grouping all the players of the same age together and seeing if there needs to be an adjustment. This is done because a 22 year old with 4 years of college who starts at A ball would be expected to do better than an 18 year old who was just drafted onto that same team.

Since the data is adjusted for parks, leagues and age, all the data can be added together to get the player's minor league totals.

All the preceding data manipulation could be done in SQL or in a basic spreadsheet like Excel. At this point, the best method would be to link the SQL database with the statistical program R. Using R in conjunction with a scripting language like Perl, I would begin running a regression of the minor league stats against each of the major league stats that show success. When the regression is run, parameters could be set to remove certain players (i.e. set a minimum required number of at bats in the minors and/or majors).

The minor league data will need some human involvement here to make sure the data being looked at makes sense. If we are looking at runs allowed, we would not want to look at total strikeouts, because those who were in the minors longer would have more. Instead, the data will need to be in a rate stat like strikeouts per nine innings. For example, say major league ERA is the stat that is considered to be important. The regression might show that a high K/9 and low HR/9 and BB/9 are the minor league stats that most closely lead to major league success.
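As a rough first pass before a full regression, each minor league rate stat could be screened by how strongly it correlates with the chosen major league stat. The sketch below is written in Python rather than the R setup described above, uses a plain Pearson correlation instead of a multi-variable regression, and all names are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


def rank_predictors(minor_rate_stats, mlb_stat):
    """Rank minor league rate stats by strength of correlation.

    minor_rate_stats: dict mapping a stat name (e.g. "K/9") to a list
    of values, one per player, in the same player order as mlb_stat.
    Returns (name, r) pairs sorted by absolute correlation, strongest
    first.
    """
    scores = {name: pearson_r(values, mlb_stat)
              for name, values in minor_rate_stats.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
```

Players below the minimum at bat or innings thresholds would be filtered out before the lists are passed in.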

With the regression run for all the players, we would know how much error and variance is involved in the final data.

What impact would going to a 4-man rotation have on a Club's bullpen?

If a team were to go to a four man rotation, several different pieces of data would need to be collected. The first piece of data I would collect is from the team and coaches on how many pitches the starters would be allowed to throw.

Most of the data on 4 man rotations is from at least 20 years ago, when starters went more innings than they currently do. I could look back at these patterns and see what worked then or didn't, but if the team has no plans to have each of their starting pitchers go 130 pitches, the data would be useless. Also, I would need to know how long the relievers would be allowed to pitch on any night and how many days in a row they would be allowed to pitch. Finally, I would need to know when the coaches want to use the relievers, such as the closer only being used in a save situation in the ninth inning or someone else only being used for long relief when needed.

Once the team's limitations are known, I would take the 4 known starters and all the possible starters that could be used when one of the four initial starters gets hurt or their performance is subpar. Using data from (.xml data) imported into a SQL database, I would find how many innings each starter can go given the limited pitch count. If the pitcher was previously allowed to throw more pitches in a 5 man rotation, I would cut off each start at the inning when the new pitch count was reached. The average distance the starter goes can then easily be determined, along with how often the starter could go each distance in any game depending on how much they have pitched in previous games.

Next, a bullpen usage grid would need to be created for all possible relievers. For example, the closer can throw 1 inning two nights in a row, as long as he didn't throw 2 innings the previous day. This could get a little complex, but the grid would need to be fully completed to determine the feasibility of the 4 man rotation.
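One row of such a usage grid could be encoded as a small rule function. This sketch covers only the closer example from the text. Treating a third straight day as off limits is an added assumption, and the function name is hypothetical.

```python
def closer_available(recent_innings):
    """Decide whether the closer can pitch today.

    recent_innings: innings thrown on each previous day, most recent
    day last (0 for a day off).
    """
    yesterday = recent_innings[-1] if recent_innings else 0
    day_before = recent_innings[-2] if len(recent_innings) >= 2 else 0
    if yesterday >= 2:
        return False  # threw 2 or more innings the previous day
    if yesterday > 0 and day_before > 0:
        return False  # assumption: no third straight day of work
    return True
```

Each reliever role would get its own rule, based on the limits supplied by the coaches.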

The process can now be automated by writing a program that would take the starters and the available bullpen and see how the bullpen would be used given the usage limits on the starters and relievers. For example, in the first game the starting pitcher goes 6 innings and three relievers would be needed for an inning each to finish the game.

The program at this point could be as simple or complex as needed. It could run with just the set number of pitchers or actually factor in possible injuries. The game's score could be added in, depending on how far the starting pitcher could go in the game. It would then be run several thousand times to look for all possible situations in which the bullpen is inadequate.
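A heavily simplified sketch of such a simulation is shown below. It tracks a single long reliever and counts the games where a short start calls for him while he is still resting. The short start probability, the rest rule, and all names are assumptions made for illustration. A real version would draw starter innings from each pitcher's own history and track the whole bullpen.

```python
import random


def simulate_season(short_start_prob, rest_days, games=162, rng=None):
    """Count games where the long reliever is needed but still resting."""
    if rng is None:
        rng = random.Random()
    days_until_ready = 0
    shortfalls = 0
    for _ in range(games):
        needs_long_relief = rng.random() < short_start_prob
        if needs_long_relief:
            if days_until_ready > 0:
                shortfalls += 1  # needed, but not yet rested
            else:
                # Used today, then unavailable for rest_days games.
                days_until_ready = rest_days + 1
        days_until_ready = max(0, days_until_ready - 1)
    return shortfalls


def average_shortfalls(short_start_prob, rest_days, seasons=2000, seed=0):
    """Average shortfall count over many simulated seasons."""
    rng = random.Random(seed)
    total = sum(simulate_season(short_start_prob, rest_days, rng=rng)
                for _ in range(seasons))
    return total / seasons
```

Comparing the average for a 4 man and a 5 man rotation, each with its own short start probability, would give a first indication of whether a second long reliever is worth carrying.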

One possible problem detected would be the bullpen not being able to keep up, with its usage exceeding what is available. Say the team's current bullpen only has 1 long reliever. Due to the tendencies of the starting pitchers, there may be cases where a 2nd long reliever is needed before the first is fully rested.

Another aspect that could be detected is the usage of each reliever. If the team is carrying 3 long relievers, but only two are needed, the program would show which are being under- or over-utilized.