So your pitchers suck? Do you know how badly?
Posted: Fri Dec 07, 2012 7:59 pm
Damn! This took a lot more effort than I anticipated. One thing I track for Carolina during the season that helps tremendously is the variance of my team stats, which adds context to the league rankings given in OOTP. When the 2007 season ended, I decided to run the numbers for every team and provide them for consumption. The spreadsheet for all of MBWBA, along with graphs (explained below) for each club on separate tabs, can be found at the link below:
MBWBA_TMstats
Here's a quick explanation if needed:
Unique Stats
Offensive stat: BB% - Takes the number of walks earned by a team and expresses the rate at which the team earns walks: BB% = BB / PA
Offensive stat: K% - Takes the number of strikeouts by a team and expresses the rate at which the team strikes out: K% = K / PA
Fielding stat: UR% - Attempts to provide another overall assessment of team defense by expressing the percentage of total runs allowed that were unearned runs: UR% = (RA - ER) / RA
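For anyone who would rather script these than build them in a spreadsheet, here is a minimal sketch of the three formulas above. The function names and the team totals are made up for illustration; only the formulas themselves come from the definitions above.

```python
def rate(numerator, denominator):
    """Return numerator / denominator, or 0.0 if the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def bb_pct(bb, pa):
    """BB% = BB / PA"""
    return rate(bb, pa)


def k_pct(k, pa):
    """K% = K / PA"""
    return rate(k, pa)


def ur_pct(ra, er):
    """UR% = (RA - ER) / RA"""
    return rate(ra - er, ra)


# Hypothetical season totals for one team (not real MBWBA numbers).
team = {"PA": 6200, "BB": 558, "K": 1054, "RA": 700, "ER": 651}

print(f"BB% = {bb_pct(team['BB'], team['PA']):.3f}")
print(f"K%  = {k_pct(team['K'], team['PA']):.3f}")
print(f"UR% = {ur_pct(team['RA'], team['ER']):.3f}")
```

The zero check in `rate` is just a guard for empty rows early in a season; it is not part of the stat definitions.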
Variance vs. Ranking
So why care about variance at all? Let's say that a team ranked 22nd out of the 24 teams in MBWBA in stolen bases and also 22nd out of 24 in errors. The easy analysis is that the team has as much work to do in catching up to the rest of the league in stolen bases as it does in defense (errors). That's not necessarily true, however. Imagine you took every team in the league and plotted their number of stolen bases on a straight line from least to most, and then did the same for errors. It might be the case that there is generally a lot of room between teams on the stolen bases line but on the errors line the teams are clumped up very closely. It would be obvious, then, that the team in question is actually much worse in stolen bases (relative to the rest of the league) than in errors, even though they rank #22 in both. That's a very critical piece of information. As most of you likely know, stats that are widely spread out from least to most have high variance and stats where teams are clumped together have low variance.
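Here is a rough illustration of that point with invented numbers. The two lists below are not real MBWBA totals; I made them up so that one team ranks 22nd of 24 in both categories, far behind the pack in stolen bases but close to it in errors. The helper name `rank_and_score` is also just for this sketch.

```python
from statistics import mean, stdev

# Made-up totals for a 24-team league, each list sorted from best to worst.
stolen_bases = [180, 175, 170, 165, 160, 155, 150, 148, 145, 142, 140, 138,
                135, 132, 130, 128, 125, 122, 120, 118, 115, 60, 55, 50]
errors = [88, 89, 90, 90, 91, 92, 92, 93, 94, 94, 95, 95,
          96, 96, 97, 97, 98, 98, 99, 99, 100, 101, 102, 103]


def rank_and_score(values, team_value, higher_is_better):
    """Return (rank, standard deviations from the mean, positive = better)."""
    if higher_is_better:
        rank = 1 + sum(v > team_value for v in values)
    else:
        rank = 1 + sum(v < team_value for v in values)
    z = (team_value - mean(values)) / stdev(values)
    return rank, (z if higher_is_better else -z)


# The team in question: 60 stolen bases and 101 errors.
sb_rank, sb_z = rank_and_score(stolen_bases, 60, higher_is_better=True)
err_rank, err_z = rank_and_score(errors, 101, higher_is_better=False)

print(f"Stolen bases: rank {sb_rank}, {sb_z:+.2f} standard deviations")
print(f"Errors:       rank {err_rank}, {err_z:+.2f} standard deviations")
```

Both ranks come out as 22, but the stolen base number sits much further below the league average than the error number does, which is exactly what the rank alone hides.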
How to analyze the team graphs
You probably expect a long explanation of normal distributions and bell curves here, but it's not really needed. Here is what I did for each team with each stat:
1. Calculate the league average for every stat (i.e., the mean).
2. Use Excel's built-in capabilities to calculate the standard deviation for every stat. Standard deviation is simply the square root of the variance, so it carries the same information in the stat's own units and I don't need the variance itself. Calculating it assumes nothing about the data, but reading it the way I do below assumes that all of the teams' stats in one category fall roughly into a normal distribution (most teams found around the average with a small number being much higher and a small number being much lower). The distribution itself could be tall and skinny or wide and flat, but regardless, if I start from the average and go n standard deviations below it, I will have covered essentially the same percentage of teams as if I had gone n standard deviations above the average.
For anyone needing a primer, if you look at the stats spreadsheet you'll see that the standard deviation (STDEV) for each stat is different. The "number" of standard deviations is easily explained by considering the case where the average of a set of numbers is 100 and the standard deviation is 10. In that case, 90 would be -1.0 standard deviations from the average, 80 would be -2.0, 85 would be -1.5, 110 would be +1.0, 130 would be +3.0, and so on.
3. Given (1) and (2), I calculate the difference between a team's stat (say, number of strikeouts) and the league average.
4. Taking the result from (3) and dividing by the standard deviation from (2) gives the number of standard deviations away from league average for each team. This is the number that is plotted on the graphs. One correction is made at this step: stats that are better the lower they are (ERA and WHIP, for example) are adjusted so that better than average always plots as a positive number, which makes the graphs easier to read. In every case, GREEN = GOOD and RED = BAD.
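The four steps above can be sketched in a few lines of Python. This is not the spreadsheet itself, just my reading of the steps: I am assuming the sample standard deviation (which is what Excel's STDEV function computes) and assuming the lower-is-better correction is a simple sign flip. The `LOWER_IS_BETTER` set and the three-team ERA example are made up for illustration.

```python
from statistics import mean, stdev

# Assumption: stats where a lower value is better get their sign flipped.
LOWER_IS_BETTER = {"ERA", "WHIP"}


def z_score(value, average, std_dev):
    """Steps 3 and 4: distance from the average, in standard deviations."""
    return (value - average) / std_dev


def standard_scores(stat_name, values_by_team):
    """Return {team: standard deviations from league average, + = good}."""
    values = list(values_by_team.values())
    average = mean(values)   # step 1
    std_dev = stdev(values)  # step 2 (sample standard deviation)
    sign = -1.0 if stat_name in LOWER_IS_BETTER else 1.0
    return {team: sign * z_score(v, average, std_dev)
            for team, v in values_by_team.items()}


# The primer example: average 100, standard deviation 10.
for value in (90, 80, 85, 110, 130):
    print(value, z_score(value, 100, 10))

# Hypothetical three-team league; the lowest ERA should come out positive.
era_scores = standard_scores("ERA", {"A": 3.0, "B": 4.0, "C": 5.0})
print(era_scores)
```

Keeping the sign flip in one place means every graph can use the same color rule no matter which stat is being plotted.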
The Real Skinny
Basically, the further your team is from zero, the more extreme its 2007 season was in that stat: positive numbers (GREEN on the graphs) are better than league average and negative numbers (RED on the graphs) are worse. For reasons I won't delve into, anything within +/- 0.68 is considered average, +/- 1.50 means your team is either very good or very bad, and beyond +/- 2.0 is extremely good or bad.
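If it helps, here is a small sketch of those cutoffs. The labels for "average", "very", and "extremely" follow the numbers above; the "somewhat" band for values between 0.68 and 1.50 is my own filler, since that range isn't named above. The `coverage` helper uses Python's standard `statistics.NormalDist` to show what share of a perfectly normal distribution falls within a given number of standard deviations, which is one way to see where cutoffs like these might come from.

```python
from statistics import NormalDist


def coverage(z):
    """Share of a normal distribution within +/- z standard deviations."""
    return 2 * NormalDist().cdf(abs(z)) - 1


def describe(z):
    """Turn a graph value into a rough label using the cutoffs above."""
    size = abs(z)
    direction = "good" if z > 0 else "bad"
    if size <= 0.68:
        return "average"
    if size > 2.0:
        return f"extremely {direction}"
    if size >= 1.5:
        return f"very {direction}"
    return f"somewhat {direction}"  # assumption: this band is not named above


for cutoff in (0.68, 1.5, 2.0):
    print(f"within +/- {cutoff}: {coverage(cutoff):.1%} of teams, if normal")

for z in (0.3, 1.0, -1.7, 2.4):
    print(f"{z:+.1f} -> {describe(z)}")
```

With only 24 teams the real percentages will wander from the ideal ones, so treat the labels as a rough guide.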
If you see corrections that are needed or have questions, or just for discussion (including opinions on whether this is useful or not, or how to improve it), use this thread.
As I said, I do this for my team throughout the season and have it as automated as possible. I'll gladly provide that spreadsheet if anyone wants to do the same...there's just not enough time to do this for every team for posting or I would.