So the first thing I did was analyze how important the various pitching ratings are for predicting pitcher outcomes in 2045 BBA. The results are included below. But first I want to describe how I reached these results.
I took all the pitcher data and combined it with the ratings information. For all the analyses included herein, I used the vL and vR splits so that I had true measures of how each pitcher did against both left-handed and right-handed batters.
Then I ran stepwise regression models for each of the outcome measures I was looking at. For those who don't know, I will describe how these work and then what I did for each one. Suppose I am trying to predict FIP for pitchers using a variety of ratings. In this case, I might use STU, MOV, CON, gb_pct (a conversion of NEU, FB, GB, EX GB, etc. to an estimated ground-ball percentage), etc. The model first asks, "Which variable is MOST predictive of FIP out of all those choices?" Then it asks, "Am I confident enough in that prediction that only once in a 'blue moon' (defined as a probability called alpha) could the data have been arranged to look like this if that variable were not really predictive?"
Once it has the "best predictor variable", it says: "Now I am going to subtract out the influence of that variable on the target, and try again with all remaining variables to repeat the process." But before it does that, it asks, "Are there any variables that I added as predictors before that no longer need to be in the model in order to get adequate predictions?" If so, then it removes those variables.
It continues the process until it can no longer find any additional predictor variables that have a significant influence at that alpha probability. It then returns an equation, like so: <Prediction> = Intercept + w1*<var1> + w2*<var2> + ... for all significant predictor variables. The intercept corresponds to the value of the prediction if all other variables are 0. For each w1, w2, etc. term, there is also a "standard error" computed for it, which indicates how much the actual weight might vary from the computed weight. Generally, if we knew the actual relation, there would be about a 68% chance that each weight would be within one standard error of what we compute, and about a 95% chance within two standard errors.
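To make the equation and the standard-error rule of thumb concrete, here is a minimal Python sketch. The function names and the numbers are purely illustrative (they are not from my analysis); it just shows how a fitted equation turns ratings into a prediction and how a weight's rough 95% range comes from two standard errors.

```python
def predict(intercept, weights, values):
    """Linear prediction: intercept + sum of weight * value for each predictor."""
    return intercept + sum(weights[name] * values[name] for name in weights)


def weight_interval(weight, std_err, n_se=2.0):
    """Return (low, high) bounds n_se standard errors around a fitted weight."""
    return (weight - n_se * std_err, weight + n_se * std_err)


# Illustrative numbers only (hypothetical, not results from the analysis).
example_weights = {"STU": -0.30, "MOV": -0.50}
example_values = {"STU": 2, "MOV": 1}
print(predict(7.0, example_weights, example_values))
print(weight_interval(-0.30, 0.03))
```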
Ok, so now that I have explained it, I'll tell you what I did. I first separated all the pitchers into "starters" (those who had started at least 85% of their games) and "relievers" (those who had started less than 10% of their games). Any pitchers who had fewer than 50 batters faced (bf) or who did not meet either criterion were excluded from the analysis.
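For anyone who wants to reproduce the grouping, a rough Python sketch of the rule follows. The function and argument names are my own placeholders, not column names from the actual export.

```python
def classify_role(games, games_started, batters_faced):
    """Return "starter", "reliever", or None (excluded) for one pitcher.

    Thresholds follow the rule described above: at least 50 batters faced,
    starters began at least 85% of their games, relievers fewer than 10%.
    """
    if batters_faced < 50 or games <= 0:
        return None
    # Integer comparison avoids floating-point edge cases at exactly 85% or 10%.
    if games_started * 100 >= 85 * games:
        return "starter"
    if games_started * 100 < 10 * games:
        return "reliever"
    return None  # swingmen who fit neither group are dropped


print(classify_role(games=32, games_started=32, batters_faced=800))
print(classify_role(games=60, games_started=0, batters_faced=250))
print(classify_role(games=40, games_started=10, batters_faced=400))
```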
So then I did analyses for all combinations of pitcher handedness X batter handedness X reliever vs. starter (i.e. 8 analyses), in each case trying to predict walks per batter faced (bb_bf), strikeouts per at-bat against (k_ab), home runs allowed per at-bat (hr_ab), FIP, and ERA. In each case, I did a "weighted" regression where I weighted by bf so that pitchers who pitched more had a stronger effect on the results. The variables that I used to predict each one were the "appropriate" STU, CON, MOV (i.e. if I was looking at vR batters, I used the vR variants; if I was looking at vL batters, I used the vL variants), gb_pct, kb (1 or 0 for whether they were a knuckleballer or not), kc (1 or 0 for whether they had a knuckle curve or not), PIT (# of pitches in their repertoire), good_pit (# of pitches in their repertoire rated higher than 2), and Stamina. In all my tests, I used an alpha of .01, meaning it would be ok if 1% of the time it thought something was significant that really wasn't.
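Here is what "weighted by bf" means in practice, as a minimal sketch for the one-predictor case. This is a plain-Python illustration of weighted least squares, not the stats package I actually used, and the sample numbers are made up.

```python
def weighted_linear_fit(x, y, w):
    """Weighted least squares for y = intercept + slope * x.

    Each observation counts in proportion to its weight (here, batters faced),
    so pitchers with more batters faced pull the line harder.
    """
    total_w = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up example: CON (already centered at 5), walk rate, batters faced.
con = [0, 1, 2, 4]
bb_rate = [0.12, 0.10, 0.08, 0.04]
bf = [300, 800, 150, 600]
print(weighted_linear_fit(con, bb_rate, bf))
```

The multi-variable version works the same way, just with more columns; the weights only change how much each pitcher's row counts.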
The result was that the ONLY significant variables in any of those regressions were the relevant STU, MOV, and CON ratings. I was surprised that good_pit did not matter for starters, so I checked and there was only one pitcher with a "starter" role with at least 50 batters faced in all of BBA in 2045 that had fewer than three good pitches. It didn't find an effect because there weren't enough examples to show it. I therefore removed that one player (Miguel Angel Garza) from the rest of my analysis.
The other thing that I found was that the weights on STU, MOV, and CON were within roughly one "standard error" of each other across the four starter analyses (LvL, LvR, RvL, RvR), and likewise across the four reliever analyses. That suggests the engine really is using CONvL, CONvR, etc. as the prediction for what will happen: i.e. left-handed relievers weren't getting any "boost" from the engine compared to right-handed relievers, etc.
Since this was true, for each pitcher, I could just take the actual percentage of time that pitcher faced a left-handed or right-handed batter, compute their combined stat based on that actual percentage (e.g. if a certain pitcher faced righties exactly 2/3 of the time, I would compute MOV = 2/3*MOV_vR+ 1/3*MOV_vL for that pitcher). I then could do an analysis across only two groups, starters and relievers, again weighted by batters faced (for starters and relievers, there WERE significant differences in the computed weights).
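The blending step is simple enough to write out. A small Python sketch, with hypothetical argument names:

```python
def blended_rating(rating_vl, rating_vr, bf_vl, bf_vr):
    """Combine vL and vR ratings using the pitcher's actual batter mix."""
    total = bf_vl + bf_vr
    if total == 0:
        raise ValueError("pitcher has no batters faced")
    share_vr = bf_vr / total
    return share_vr * rating_vr + (1 - share_vr) * rating_vl


# A pitcher who faced righties exactly 2/3 of the time,
# with (made-up) MOV of 9 vR and 6 vL.
print(blended_rating(rating_vl=6, rating_vr=9, bf_vl=100, bf_vr=200))
```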
The other thing I did was to check diagnostic plots to see whether a "linear" function, as mentioned above, was a good fit for the data, or whether a more complicated function would be needed. What I found was that a linear function worked great except when a pitcher's CON was below 5. As you can verify from the player editor (and as I have also found previously looking at Perfect Team and Perfect Team tournament data), each point of CON below 100 in the editor (or 5, for us) is about equivalent to two points of CON above 5. So I adjusted the scores by subtracting five from STU, MOV, and CON for each pitcher. For CON, if that left a negative number, I then multiplied it by two (so a CON of 4 would be converted to -2, a CON of 3 would be converted to -4, etc.).
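In code form, the adjustment looks roughly like this (a sketch with my own function names):

```python
def center_rating(rating):
    """Shift a rating so that 5 becomes the zero point."""
    return rating - 5


def adjust_con(con):
    """Center CON at 5 and count each point below 5 double."""
    centered = center_rating(con)
    return centered * 2 if centered < 0 else centered


for con in (3, 4, 5, 9):
    print(con, adjust_con(con))
```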
Now that I have explained the procedure, let me go through the results. First, here is a table for the results predicting bb_bf (walks per batter faced):
bb_bf:
role | _TYPE_ | _RSQ_ | Intercept | STU | MOV | CON |
---|---|---|---|---|---|---|
relieve | PARMS | 0.60326 | 0.12829 | . | . | -0.017912 |
relieve | STDERR | . | 0.00256 | . | . | 0.001000 |
starter | PARMS | 0.72557 | 0.11872 | . | . | -0.017348 |
starter | STDERR | . | 0.00239 | . | . | 0.000871 |
Since we have subtracted five from each stat before doing this analysis, this tells us (from the Intercept) that a reliever with CON of 5 vR and vL would expect to walk 12.83% of batters, while a CON=9 reliever would expect to walk about 5.66% (12.83% - 1.79%*4) of batters in this league.
Starters have a slightly lower baseline: a starter with a control of 5 would expect to walk 11.87% of batters (footnote: is this statistically different from relievers? The gap of about 0.96 percentage points is roughly four times either intercept's standard error, so yes, as a baseline starting pitchers walk a lower percentage of batters at the same CON value than a reliever would). However, the change in walk percentage per point of CON is almost exactly the same for starters as it is for relievers, so while starters have a lower baseline, the effect of a point difference for starters appears to be about the same as for relievers (the two weights are within one standard error of each other).
So, now let's look at the percentage of at-bats that result in strikeouts (k_ab):
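If you want to plug in your own pitchers, here is a small Python sketch that evaluates the bb_bf table. The weights are copied from the table; the function name is mine, and I am assuming the same "double below 5" CON adjustment described earlier.

```python
# (intercept, CON weight) from the bb_bf table above.
BB_BF = {
    "reliever": (0.12829, -0.017912),
    "starter": (0.11872, -0.017348),
}


def expected_walk_rate(role, con):
    """Predicted walks per batter faced for a given role and CON rating."""
    intercept, slope = BB_BF[role]
    centered = con - 5
    adjusted = centered * 2 if centered < 0 else centered  # low CON counts double
    return intercept + slope * adjusted


for role in ("reliever", "starter"):
    for con in (4, 5, 9):
        print(role, con, round(expected_walk_rate(role, con), 4))
```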
role | _TYPE_ | _RSQ_ | Intercept | STU | MOV | CON |
---|---|---|---|---|---|---|
relieve | PARMS | 0.80793 | 0.064011 | 0.038591 | . | . |
relieve | STDERR | . | 0.007822 | 0.001295 | . | . |
starter | PARMS | 0.71182 | 0.089721 | 0.026739 | . | . |
starter | STDERR | . | 0.006446 | 0.001389 | . | . |
Note something else: a baseline 5 STU reliever gets about 2.6 percentage points fewer strikeouts than a baseline 5 STU starter. Well, duh... recall that a starter gains about a point or so in STU when set as an RP, which probably not coincidentally makes them roughly even (i.e. a 5 STU starter would strike out about 9% of at-bats, and a 6 STU reliever about 10.3%). However, each point in STU results in MORE strikeouts for relievers than it does for starters. A 10 STU starter would strike out about 8.97% + 5*2.67% = 22.3%, while as an 11 STU reliever that pitcher would strike out about 6.4% + 6*3.86% = 29.6% of at-bats against him.
Moral of the story: high STU pitchers might work better as relievers.
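To check the starter-versus-reliever comparison for any STU value, here is a quick Python sketch using the k_ab weights from the table. The function name is mine, and the "+1 STU in relief" bump is just the rule of thumb mentioned above, not something the regression measured.

```python
# (intercept, STU weight) from the k_ab table above.
K_AB = {
    "reliever": (0.064011, 0.038591),
    "starter": (0.089721, 0.026739),
}


def expected_k_rate(role, stu):
    """Predicted strikeouts per at-bat for a given role and STU rating."""
    intercept, slope = K_AB[role]
    return intercept + slope * (stu - 5)


# Compare each starter STU with the same pitcher one STU point higher in relief.
for stu in (5, 7, 10):
    as_starter = expected_k_rate("starter", stu)
    as_reliever = expected_k_rate("reliever", stu + 1)  # assumed +1 STU bump
    print(stu, round(as_starter, 3), round(as_reliever, 3))
```

With the table's weights, the gap in favor of the bullpen grows as STU rises: about 1.3 points at STU 5 and about 7.2 points at STU 10.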
Let's look at hr_ab:
role | _TYPE_ | _RSQ_ | Intercept | STU | MOV | CON |
---|---|---|---|---|---|---|
relieve | PARMS | 0.32600 | 0.051478 | . | -.008992451 | . |
relieve | STDERR | . | 0.001521 | . | 0.000890140 | . |
starter | PARMS | 0.29843 | 0.048840 | . | -.006994349 | . |
starter | STDERR | . | 0.001561 | . | 0.000875614 | . |
Now, let's look at the measures that you have all been waiting for, FIP and ERA. First, FIP, since a pitcher has more control over that:
role | _TYPE_ | _RSQ_ | Intercept | STU | MOV | CON | STU_pct | MOV_pct | CON_pct |
---|---|---|---|---|---|---|---|---|---|
relieve | PARMS | 0.49788 | 7.50995 | -0.33721 | -0.57310 | -0.18424 | 0.30808 | 0.52360 | 0.16832 |
relieve | STDERR | . | 0.22906 | 0.02609 | 0.06365 | 0.04386 | . | . | . |
starter | PARMS | 0.53263 | 7.23568 | -0.26882 | -0.36175 | -0.31752 | 0.28354 | 0.38156 | 0.33491 |
starter | STDERR | . | 0.19906 | 0.02924 | 0.06184 | 0.04191 | . | . | . |
Recall that differences in MOV and STU seem to result in bigger performance differences in relievers than starters, while CON differences were about the same. That seems to account for the pattern in the weights here: STU and MOV carry more of the load for relievers, and CON carries more for starters. To me, this implies that you want to make your high CON pitchers starters and your high STU and MOV pitchers relievers.
Something else to note: a 5-5-5 pitcher would not expect a very good FIP in this league: about 7.51 as a reliever or 7.24 as a starter. Each point across the board is worth about 1.09 FIP as a reliever and about 0.95 FIP as a starter: so an 8-8-8 reliever would expect about 4.2 FIP and an 8-8-8 starter about 4.4 FIP.
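Here is a Python sketch for evaluating the FIP table on any rating line. The weights are copied from the table; the function names are mine. The share calculation reflects my reading of the STU_pct, MOV_pct, and CON_pct columns as each weight divided by the sum of the three weights, which matches the printed values.

```python
FIP_WEIGHTS = {
    "reliever": {"Intercept": 7.50995, "STU": -0.33721, "MOV": -0.57310, "CON": -0.18424},
    "starter": {"Intercept": 7.23568, "STU": -0.26882, "MOV": -0.36175, "CON": -0.31752},
}


def expected_fip(role, stu, mov, con):
    """Predicted FIP from blended STU/MOV/CON ratings (each centered at 5)."""
    w = FIP_WEIGHTS[role]
    con_adj = con - 5
    if con_adj < 0:
        con_adj *= 2  # low CON counts double, as described earlier
    return w["Intercept"] + w["STU"] * (stu - 5) + w["MOV"] * (mov - 5) + w["CON"] * con_adj


def weight_shares(role):
    """Each rating's share of the total STU + MOV + CON weight."""
    w = FIP_WEIGHTS[role]
    total = w["STU"] + w["MOV"] + w["CON"]
    return {name: w[name] / total for name in ("STU", "MOV", "CON")}


for role in ("reliever", "starter"):
    print(role, round(expected_fip(role, 8, 8, 8), 2), weight_shares(role))
```

With the weights in the table, this gives roughly 4.23 for an 8-8-8 reliever and 4.39 for an 8-8-8 starter.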
Here is the table for ERA:
role | _TYPE_ | _RSQ_ | Intercept | STU | MOV | CON | STU_pct | MOV_pct | CON_pct |
---|---|---|---|---|---|---|---|---|---|
relieve | PARMS | 0.31513 | 7.39922 | -0.32010 | -0.56718 | -0.20660 | 0.29263 | 0.51850 | 0.18887 |
relieve | STDERR | . | 0.32625 | 0.03717 | 0.09066 | 0.06246 | . | . | . |
starter | PARMS | 0.28348 | 7.04395 | -0.25327 | -0.31463 | -0.32401 | 0.28396 | 0.35276 | 0.36328 |
starter | STDERR | . | 0.32236 | 0.04735 | 0.10015 | 0.06787 | . | . | . |
Note how similar it looks to FIP, only the r-square is lower since fielding is now a big component.
I must admit, though, I was very surprised by the ERA results. Not by what is there, but by what isn't. What isn't clear to a lot of people is that MOV is also a composite: of the built-in MOV (basically, the percentage of fly balls that are home runs) and the ground-ball percentage. Pitchers who force more ground balls, everything else being equal, will give up fewer home runs (and thus show a higher MOV). There is generally a trade-off for this, however. Generally, ground-ball pitchers give up more hits (since a greater percentage of ground balls than fly balls are hits). In previous analyses I have done for Perfect Team, ground-ball percentage played a significant role in predicting ERA, but not necessarily FIP, since ERA includes BABIP as part of it.
So I did a final analysis of BABIP. I don't need to show the table for that. None of the ratings had a significant effect on BABIP, not even gb_pct. Perhaps one season is just not enough data to show it. But I find it perplexing to some extent.
So I have really geeked out here, and it might be that none of this is of interest to anyone except the geekiest of us all (myself). I currently plan to do two or three more features: batting, defense, and perhaps running. But it is also time-consuming to do all this analysis and to write it up as well.
Is this of interest to people? If it really isn't, I could just keep it all to myself (and use it to secretly beat everybody).
Thanks for your attention, assuming you read this far.
Oh, one final note: I also ran the analyses WITHOUT counting CON below 5 as double. Results were clearly better (higher r-squared) for bb_bf, FIP, and ERA when low CON was double-counted.