Relative Ratings: Perhaps you don’t want to go here…

Post by RonCo » Sat May 09, 2020 6:04 pm

Okay…this chart is, um, intense. I’m serious. You may want to turn back now. Turning back will not hurt your game play, and in fact may just serve to keep you sane. I know it’s intense because I keep looking at it myself, constantly trying to (1) remember where the information comes from, and (2) determine exactly what it means.

So, yes, seriously, this foray into relative ratings is so esoterically geeky that even I am almost embarrassed to have spent two days on it.

Almost.

I mean, who am I kidding, right?

Anyway...you might consider scanning the chart and moving on. Otherwise, pull up a table because this one is going to take some real thinking.
Origin Story: Bottom line, on one of our podcasts Justin said he liked relative ratings because he could look at a 50 and say that was average. Or at least that was the gist. My initial reaction was “I don’t know that this is really right,” but to be honest, I didn’t really know how relative ratings work. So, yesterday, I sat down to try to figure them out. What is a relative rating? What things get adjusted, and which things don’t? And how, exactly, do these things move?
TLDR: Executive Summary

Here are some things I think I’ve determined (*)
  • Relative ratings change only overall ratings (and potentials) and primary components (Contact, Gap, Power, Eye, AvK). I assume the same goes for Stuff, Control, and Movement, though I didn’t do this dive for them.
  • Relative ratings do not change Speed, Stealing, Baserunning, Bunting, or defensive component ratings.
  • Relative ratings definitely do not use the league average rating of each component as their “50” point.
  • Instead, it seems more likely that, for each individual component, the game first determines the standard deviation of the population, then attempts to back into a rating structure based on that standard deviation.
(*) I say “I think I’ve determined” because there are still some things, especially in bullet 4 above, that I’m having difficulty getting my mind around.

The Process: I took an old test league I’d used to study something else and gave it the once-over to be sure the ratings were representative of a general league. In this case, the league had 40 teams. I had used it to study injury and team doctors, so all the doctors were controlled, but the players had ratings scattered in a normal fashion. I then did the following:
  • Set scouting off, ratings to 1-100
  • Set ratings to true (not relative), and “rescouted” to get the ratings set. I then copied off the ratings of every player.
  • Set ratings to 20-80, “rescouted,” then copied the ratings
  • Set ratings to relative, then rescouted and copied.
  • Set ratings to “relative by all” and rescouted and copied
  • Exported the “raw” 1-250 ratings from the game via the CSV export.
The Results/Analysis: At this point I got knee-deep into Excel and began counting players at each rating level and mapping their ratings by component as they converted from “Raw” (1-250) to “20-80,” to “20-80 Relative.” At each level I took averages and standard deviations and whatnot just so I could compare the groups.
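
For anyone who’d rather script that step than live in Excel, here’s a minimal sketch of the per-component averaging. The component names and rating values below are made-up placeholders, not the actual test-league export:

```python
# Per-component mean and standard deviation for a ratings population.
# Values here are hypothetical placeholders, not the real export data.
from statistics import mean, pstdev

# Hypothetical raw 1-250 ratings, keyed by component.
raw_ratings = {
    "Contact": [29, 60, 88, 95, 110, 150, 170],
    "Power":   [15, 40, 70, 86, 120, 150, 210],
}

for component, values in raw_ratings.items():
    # Population (not sample) standard deviation, since we have the whole league.
    print(f"{component}: mean={mean(values):.2f} stdev={pstdev(values):.2f}")
```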

Then I took a deep breath and started staring at my notes and results, trying to figure out what it all meant. The first thing that was obvious: no, a “50” is not the average of the ratings. In fact, a “50” seems to be calculated differently for each population. I think this is probably a key learning (which we probably knew, but what the heck):
  • If the population changes, what relative ratings means changes, too.
Ultimately, I was struggling to get my mind around the best way to look at things, so I slept on it and woke up with some ideas for how to put this together in ways that made more sense. (Aside: that’s a great thing about humans, right? Our brains never really stop working on things we’re interested in.)

Anyway, the chart is a doozy. I mean, it’s almost embarrassing, so brace yourself, man … here it is.

Rel-Ratings-Chart.PNG

Bottom line, there’s still stuff I don’t know. I have to stare at it for a long time to really understand…but I’ll do my best to describe...

How to Look at That Chart

First, the upper sections:

1) The top row is broken into 250 columns to represent each 1 point of rating on the raw scale. I’ve colored every 10th block orange and every 50th red in order to get a better visual on where a rating might be. A “46,” for example, is four orange blocks plus six (or one red block, minus four).
2) Rows 2-4 are the raw ratings scales. These are rock-solid, standard structures for “true” ratings. A “true 46” (to stick with that number) would therefore always be a “3” on the true 1-10 scale and a 35 on the true 20-80 scale. A “true 110,” for another interesting example, will score a “6” on the 1-10 scale and a “55” on the 20-80 scale.
3) Rows 6-10 are where things get interesting. This is the segmentation that occurs for this collection of hitters when relative ratings are selected. You should note that Contact, Gap, Power, Eye, and AvK all have scales unique to them. This, I assume, is because their populations are different.
4) You might notice the “?” toward the end of the scales on both sides. This is where the population of hitters I had as examples was missing cases of raw 1-250 scale ratings. For example, the “25?” for Contact means the demarcation between it and the “20” box (which does not exist in this set) is uncertain; in this case the line between 25 and 30 could be off by 1-2 points of raw score. There is, however, a clear-cut break between 30 and 35 on the Contact scale, so there’s no “?” in the 35 box.
5) To understand how to read these, here are some examples:
a. The smallest Contact rating in the sample is a “raw 29” in the 1-250 scale. Moving down the chart, this becomes a “2” on the true 1-10 scale, and a “30” on the true 20-80 scale. But it’s a “25” on the relative contact rating scale.
b. A “150” raw rating (the third red bar at the top), would be an “8” on the 1-10 scale, and a “65” on a 20-80 scale. It converts to a “70” on this relative rating scale for contact. Note that the same “150” power rating would be a 65 on that scale or a “60” if it were an Avoid K rating.
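
For what it’s worth, the example “true” conversions above can be reproduced with a simple linear fit. These formulas are my own guess, fit by eye to the data points quoted in this post, and are in no way the game’s actual code:

```python
# Toy reconstruction of the raw-to-"true" scale conversions, fit to the
# examples in this post (46 -> 3/35, 110 -> 6/55, 150 -> 8/65, 29 -> 2/30).
# These mappings are guesses, not OOTP's real formulas.
from math import floor

def true_20_80(raw: float) -> int:
    """Raw 1-250 rating -> true 20-80 rating (multiples of 5)."""
    return max(20, min(80, 5 * floor((raw + 110) / 20)))

def true_1_10(raw: float) -> int:
    """Raw 1-250 rating -> true 1-10 rating."""
    return max(1, min(10, round(raw / 20 + 0.5)))

for raw in (29, 46, 110, 150):
    print(raw, true_1_10(raw), true_20_80(raw))
```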

If you’re following me at this point, you get the idea behind the ratings flow part of this chart.

Whew.

But we’re not done yet. Not by a mile. Now that we have these boxes identified, two more questions come to mind:
  • How did the game come to this segmentation for each skill set?
  • What does it look like in practice?
Let’s take them one at a time.

HOW DOES THE GAME DECIDE THESE RELATIVE RATING BREAK POINTS?

Bottom line, I don’t know. I’ve looked at this hard for a couple days now, and I’d be hard pressed to say exactly what’s happening. That said, I’ve got some ideas—and they are ideas that come from data in the tables to the right of this chart.

Before I go further, let me say what is obviously not happening: the game is not taking the “average” raw rating and making it the center “50” point (and then presumably spreading out from there). This, I think, is what Justin was suggesting was happening, and honestly it’s what I would have expected myself. It’s not the case, though.
  • Example: Look at the table on the right. The average “raw Contact” rating is 95.37. By the flow above, this would convert to a “5” on the 1-10 true scores and a “50” on the 20-80 true score structure. In relative ratings, however, it becomes a 45. You’ll find the same kind of behavior in Power, whose average of 86.03 converts to 45 in both true and relative 20-80 scales.
That said, the relative rating feature is compressing and warping the rating scale in interesting ways—which you can see merely by letting your eyes play across each rating population.

So, what do I think is happening?

Well, I think (but don’t know) that rather than look for a center point, the game starts with a population’s standard deviation, creates a set of rating groups based on that, and then fiddles with center points until it finds the one it likes best for the population. This would make some sense, given that a 2-8 scale is based on standard deviations from the mean anyway. And if you look at the size of each rating spread, you’ll find they are essentially consistent within each rating type (Contact, Gap, Power, etc.). Since the 20-80 scale is broken into 13 buckets instead of the 2-8 scale’s 7, I think the gaps are effectively half a standard deviation.

Take Power, for example. Its population has a standard deviation of 47. Half of that is about 23, which is very close to the spans the game is using to separate raw 1-250 scores into relative buckets. Note here that if this is true, and if the game were to use the true population average of 86 as its center, it would quickly run out of space on the bottom end (you’d see a few 35 Powers on the low end, and a LOT of 80s on the high end). So instead, for this population, it moves the scale up (to the right) somehow.

Not all components need that last step, though. AvoidK, for example, with its 18-ish half-standard-deviation, fits pretty nicely all on its own, with the scale expanding out a little to give some additional pretend deviation.
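
Here’s a sketch of that half-standard-deviation hypothesis in code. The center value passed in is an arbitrary assumption for illustration; as noted above, the game appears to shift it away from the plain population average in ways I can’t pin down:

```python
# Sketch of the hypothesized relative-rating bucketing: one 20-80 step
# per half standard deviation, centered on some value the game chooses.
# The center and stdev used below are illustrative assumptions only.

def relative_20_80(raw: float, center: float, stdev: float) -> int:
    half = stdev / 2  # one 20-80 step per half standard deviation
    step = round((raw - center) / half)
    return max(20, min(80, 50 + 5 * step))

# Hypothetical Power-like population: stdev ~47, so steps of ~23.5.
sd = 47.0
print(relative_20_80(86, 86, sd))    # the chosen center always maps to 50
print(relative_20_80(150, 86, sd))
print(relative_20_80(250, 86, sd))   # extremes clip at 80
```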

Anyway, that’s the purpose of the tables on the right. You can also see, in the conversion from 1-250 raw to 20-80 true to 20-80 relative, that none of the ratings center exactly on 50, and in fact some actually move the “wrong” way if you think that’s the design criterion.

And, finally, this chart addresses the last question, too:


WHAT DOES THIS LOOK LIKE IN PRACTICE?

Ultimately, what we really care about in the end is whether we are reading the data right. On that front, this study has both good news and bad.

Bad News First: I have no idea how this directly relates to the BBA.

The populations of players I’m looking at are NOT BBA players. And I’ll guess that every game environment could be different based on its scatter. So, let me warn you that trying to translate directly from these tables based on our players is probably a bad idea. I’ll also note that we use a 1-10 scale for components rather than a 20-80, so there are translation issues, too.

What does seem helpful, though, is to realize these scales exist, and to do our best to translate them based on our own thinking/guessing about the population and their performances. Therein lies the rub.

Regardless, if you’re this far into this piece, you’ve already looked ahead and scanned the histograms on the chart. These are what they appear to be: “before” and “after” looks at ratings populations as relative ratings are applied. Scanning them (blue bars are “true 20-80,” orange are “relative 20-80”), you can see that the game does seem to be attempting to normalize the curve a bit. You’ll note I’ve also included current and potential Overall ratings on the right-hand side for good measure.
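
The before/after comparison itself is easy to reproduce for any population. Here’s a sketch using made-up rating lists (not the chart’s data) to show the bucket counts side by side:

```python
# Count players per 20-80 bucket under two scalings and print them
# side by side. The two rating lists are hypothetical placeholders.
from collections import Counter

true_ratings = [35, 40, 45, 45, 50, 50, 50, 55, 60, 65]
rel_ratings  = [30, 40, 45, 50, 50, 50, 55, 55, 60, 70]

before = Counter(true_ratings)   # "true 20-80" distribution (blue bars)
after = Counter(rel_ratings)     # "relative 20-80" distribution (orange bars)
for bucket in range(20, 85, 5):
    print(f"{bucket}: true={before[bucket]:2d} relative={after[bucket]:2d}")
```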


CONCLUSION

In the end, I’m not sure what it all means when it comes to playing the game.

More bars is more better, right?

I can say I would be careful about calling a “50” average, because I don’t think that’s quite right, and that’s even before we get to the selection bias of our GMs sending the best guys to the plate. I would also say that, given the shifting of scales going on under our feet, the ability to look at stats and weigh sample sizes is even more important than normal. Relative ratings become an additional layer of that fog of war folks talk about, which is maybe fun as long as you know how to read things a bit … or can at least convince yourself you know what you’re doing, which is a totally different thing, and which in my case is perhaps as much a curse as anything else.

Grin.
GM: Bikini Krill
Nothing Matters But the Pacific Pennant
Post by jleddy » Sat May 09, 2020 7:41 pm

Ron conveniently posted this the day before the deadline to try and distract other Heartland teams from making moves. I applaud the effort.
"My $#!? doesn't work in the playoffs." - Billy Beane
Joe Lederer

Post by RonCo » Sat May 09, 2020 7:47 pm

I'm not above such.


Post by Spiccoli » Sat May 09, 2020 8:20 pm

I have no idea how to interpret this...

One thing I’ve noticed is the ratings graphs in the scout tabs don’t always correlate to their ratings (actual or potential).

I’m thinking the graphs are the “real” ratings before being adjusted to make them relative.

Could very well be wrong, too... the whole game is a mystery sometimes, and that’s just fine with me.
Scott Piccoli, GM, Twin Cities

Post by jleddy » Sat May 09, 2020 8:21 pm

Here's a much shorter, less scientific, and not remotely meaningful analysis...

Literal average ratings of all current BBA batters:

OVR   POT   CON  GAP  POW  EYE  AVK  SAC  BFH  DEF  SPEED  STEAL
50.6  53.2  6.1  6.4  5.8  5.2  6.3  3.9  2.2  7.5  5.8    6.7

Post by aaronweiner » Sun May 10, 2020 9:00 am

For sure those curves look more or less normalized.

A normal curve would, of course, explain why there are just ten SPs with a 70 rating or higher (and just three with a 75+), and it confirms the BBA's left-skew hitting explosion, with about 10% of hitters above a 70.

Maybe they take minor leaguers into account when calculating.
