The SSS (small sample size) argument always seems to pop up. When is there enough data to draw a meaningful conclusion? Is the sample size large enough to mean anything? The great thing is that you can answer these questions with some simple statistics.
In particular, I injected a witty comment into a fanpost that can be seen here regarding Salvador Perez and his offensive statistics in games after a day off versus games after he had played the day before. Someone said there may be enough of a sample size and someone else questioned whether that was true. So, I did what any good basement-dwelling dork would do: I went to fangraphs and exported all of Salvy's game and plate appearance data into a spreadsheet that you can find here (note: two separate worksheets). I manually added columns to the game spreadsheet to identify whether he had a day off prior (either rest or a team day off), whether he played catcher and, for each particular game, how many days in a row he had played as of that day's game.
*NOTE - There are a handful of games where Salvy either played 1B, DH or came in as a PH. In order to control for this, I removed any game where he had 1 PA or fewer from the sample and counted it as a 'day off' instead. You can rip my dirty stats; it won't bother me.
Using this spreadsheet, I generated some aggregate statistics based on whether or not Salvy had a day off before the game. I came up with a total of 270 games for Salvy; 818 PA in those games came after playing the day before and 250 PA came after he had a day off (or 1 PA or fewer). On days where he had a day off prior (or 1 PA or fewer), he put up a 0.325/0.348/0.472/0.820 quadruple slash (BA/OBP/SLG/OPS) over 250 PA. On days where he did not have a day off, he put up a 0.286/0.314/0.397/0.711 slash over 818 PA. His K% and BB% over the two samples were pretty similar (see graph below).
Wow, that's a pretty big difference. I don't need no damn stat geek telling me there's not a difference in those numbers, I can see it plain as day for myself. Almost .040 higher in BA, .034 higher in OBP and .075 in SLG; his OPS was over .100 higher!!!!!!! Suck it with your freaking sample sizes.......
But wait, how do we know if the sample size is large enough? Should we just argue it out using our gut instinct? Again, the great thing is that you can figure this out using simple statistics.
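For a first gut check before getting into correlations, here is a rough sketch of a two-proportion z-test on the OBP split from above. This is not the test used in the rest of this post; it is just a quick way to ask whether .348 over 250 PA is distinguishable from .314 over 818 PA. It assumes that PA is close enough to the true OBP denominator and that every plate appearance is independent, neither of which is strictly true.

```python
import math

def two_proportion_z_test(p1, n1, p2, n2):
    """Two-tailed z-test for the difference between two proportions."""
    # Pooled rate under the null hypothesis that both samples share one true rate.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-tailed p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# OBP and PA totals from the splits above.
# Assumption: PA stands in for the exact OBP denominator.
z, p_value = two_proportion_z_test(0.348, 250, 0.314, 818)
print(f"z = {z:.2f}, two-tailed p = {p_value:.2f}")
```

Under those assumptions the z-score comes out around 1 and the p-value around 0.3, nowhere near the usual 0.05 cutoff, so the eye test is already on shaky ground.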
Instead of just looking at his cumulative statistics, I wanted to see if there was a correlation between whether or not he had an off day and his offensive production. In order to simplify, I used the outcomes of hits and total bases to correlate with whether or not he had a day off (0 if no, 1 if yes). In addition, I wanted to see if there was a correlation between the number of days in a row that Salvy had played and his hits and total bases. There are various ways you can do this, but for simplicity, I ran a 2-tailed Pearson Correlation.
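If you want to try this without a stats package, here is a minimal sketch of the same idea in plain Python. It computes Pearson's r between a 0/1 day-off flag and hits per game, then estimates a two-tailed p-value with a permutation test (shuffling the day-off flags) rather than the t-based significance a stats package reports, so the numbers will not match package output exactly. The game rows are made up for illustration and are not Salvy's actual log; the function names are my own.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def permutation_p_value(xs, ys, trials=5000, seed=0):
    """Two-tailed p-value: how often shuffled labels give an |r| at least as large."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(xs)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        # Small tolerance so floating-point ties count as "at least as large".
        if abs(pearson_r(shuffled, ys)) >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p of 0.
    return (extreme + 1) / (trials + 1)

# Hypothetical toy data, NOT the real game log: 1 = had a day off before the game.
day_off = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
hits_per_game = [2, 1, 0, 1, 1, 0, 2, 3, 1, 0, 1, 1]

r = pearson_r(day_off, hits_per_game)
p = permutation_p_value(day_off, hits_per_game)
print(f"r = {r:.3f}, permutation p = {p:.3f}")
```

The permutation approach is a nice fit for the question being asked here, because it literally measures how often random chance alone produces a correlation this big.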
As you can see below, there is a slight positive correlation between the number of hits and total bases Salvy collects and whether or not he had a day off the day before. The coefficients are only about 0.08 to 0.10, and since the share of variance explained is the coefficient squared, that works out to roughly 1% or less of the variance in his hits and total bases being explained by whether or not he had a day off. On top of that, when you look at the significance (Sig. (2-tailed)), you will see that it does not meet the requirement to indicate statistical significance. So, what does this mean? Well, basically, while there is a positive relationship between the two in this sample, it is not strong enough to rule out random chance as the explanation.
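One thing worth keeping straight when reading a correlation table: the Pearson coefficient r is not itself the share of variance explained; that share is r squared. Here is a rough sketch that converts r into both r squared and an approximate two-tailed p-value using the Fisher z-transform. It assumes the 8-10 figure above is the coefficient (r of 0.08 to 0.10) and that the correlation was run on one row per game across all 270 games; both are assumptions on my part, and the Fisher approximation will differ slightly from the t-based value a stats package prints.

```python
import math

def describe_correlation(r, n):
    """Return (r squared, approximate two-tailed p-value) for a Pearson r over n pairs."""
    r_squared = r * r
    # Fisher z-transform: atanh(r) is roughly normal with standard error 1/sqrt(n - 3).
    z = math.atanh(r) * math.sqrt(n - 3)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return r_squared, p_value

N_GAMES = 270  # assumption: one row per game went into the correlation

for r in (0.08, 0.10):  # assumed coefficients, read off the "8-10" figure
    r_squared, p_value = describe_correlation(r, N_GAMES)
    print(f"r = {r:.2f}: explains {r_squared:.1%} of variance, p ~ {p_value:.2f}")
```

With those inputs, both p-values stay above 0.05, which lines up with the non-significant result in the table.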
Additionally, if you look at the correlation between the number of days Salvy played in a row and his hits and total bases, you will see a slight negative correlation for both. That is, the more days he plays in a row, the fewer hits and total bases he gets. However, again, when you look at the statistical significance of the correlation, it is basically non-existent. The coefficients are only about -0.01 to -0.02, which means the number of days in a row he has played explains well under 1% of the variance in his hits and total bases. Again, nothing that cannot be explained by simple random chance.
So, how do sample sizes play into this? Well, it's pretty simple. What matters is how big the effect is compared to the noise in your data. The bigger the real difference and the smaller the game-to-game variance, the smaller the sample you need to show statistical significance. The reverse is also true: the smaller the difference or the noisier the data, the larger your sample needs to be. You can show statistical significance with a small sample as long as the difference is large relative to the variance. You can also show statistical significance for a small difference, as long as you have a large enough sample.
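To put rough numbers on that trade-off, here is a sketch of the standard Fisher z approximation for how many paired observations (games, in this case) you would need to reliably detect a true correlation of a given size. The 0.05 significance level and 80% power are conventional defaults I am assuming, not values taken from the analysis above, and `games_needed` is just an illustrative name.

```python
import math
from statistics import NormalDist

def games_needed(r, alpha=0.05, power=0.80):
    """Approximate number of paired observations needed to detect a true correlation r."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    # Fisher z approximation: n = ((z_alpha + z_power) / atanh(r))^2 + 3
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)

for r in (0.10, 0.30, 0.50):
    print(f"true r = {r:.2f}: about {games_needed(r)} games")
```

Under those assumptions, a correlation as weak as 0.10 needs somewhere near 780 games, while a strong one around 0.50 needs only a few dozen. With 270 games in the sample, a weak effect can easily hide in the noise.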
In conclusion, there is a difference in Salvy's numbers depending on whether or not he had a day off, but there is not enough data to show it is anything other than random chance. That said, if you start combining this data with other data points (such as direct observation or measurement of perception), you can make a basic inference that may or may not be true.
And then we can argue about it.