It’s been a surprising 2017 for (name of player) so far. The (team) (position) has put up (adjective) numbers so far and is one of the reasons his team is (record). In the offseason, (name) worked with (name of famous trainer) on (new trick). “I made it a point in spring training to really work on (new trick) and to (action) (adverb).”
The early results have been (breathless adjective). As of last night, (name) has put up a (numbers) stat line and the question that everyone in (city) is wondering is whether he can keep it up. Are we seeing a new (name)?
There’s evidence that the answer is yes! Ten years ago a pseudonymous internet hack who called himself (auxiliary kitchen utensil) wrote an article which detailed the point at which certain statistics “stabilize.” We’re not at the point in the season where things like OBP or ERA are stable, but there are certain indicators about a player that are reaching that stability point.
(Name) has a (stat) of (number) percent compared to his (number) rate from last year, and that statistic tends to stabilize more quickly than other stats. In fact, (auxiliary kitchen utensil)’s work suggests that it’s enough to say that he really is a changed man from last year. So (team) fans, the outlook is (adjective) for this year going forward, at least for (name). I’m not saying print the World Series tickets now, but do get the printer warming up.
***
Ahem.
YOU
ARE
DOING
IT
ALL
WRONG.
(Don’t worry. I have too.)
Warning! Gory Mathematical Details Ahead!
We need to talk about what “stabilize” means, and the details are really important. You keep using that word, but I don’t think it means what you think it means.
When I wrote that original article 10(!) years ago, I wanted to answer a specific question. If I wanted to see, for example, whether walk rate was correlated with something else over some period of time, I obviously couldn’t have batters in my sample with five plate appearances. I needed some empirical way to set my filter for PA > X. Reliability analysis is perfect for that sort of thing.
Reliability analysis answers the question: “I see that Smith has had 100 PA this year and a walk rate of X. If I were to go back in time and give Smith another 100 PA in the same basic circumstances, how confident would I be that he would reproduce the same performance?”
We express that level of confidence in the language of a correlation. Now, I can’t really go back in time and have a batter repeat the same set of circumstances, but I can do the next best thing. I can take a sample of 200 PA that he really did have and split it into two equal parts. Perhaps I can take the even-numbered and odd-numbered PA in that sample, so that I can get some PA from the same day or against the same pitcher in each basket. (Mathematically, I can do something even better using a technique known as Cronbach’s alpha, but the idea is pretty much the same.)
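The original work ran Cronbach’s alpha on real plate-appearance logs, which aren’t reproduced here. As a rough sketch of the simpler even/odd split-half idea, here is a simulation in Python; every number in it (the spread of true walk rates, 300 batters, 200 PA) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def split_half_reliability(outcomes):
    """Even/odd split-half reliability of a binary outcome (walk / no walk).

    outcomes: 2-D array, one row per batter, one column per PA, in order.
    Returns the correlation between each batter's rate in his
    even-numbered PA and his rate in his odd-numbered PA.
    """
    even = outcomes[:, 0::2].mean(axis=1)
    odd = outcomes[:, 1::2].mean(axis=1)
    return float(np.corrcoef(even, odd)[0, 1])

# Simulate 300 batters whose true walk rates vary around 8 percent,
# each with 200 PA drawn under (by construction) identical circumstances.
true_rates = rng.beta(8, 92, size=300)             # spread of true talent
pa = rng.random((300, 200)) < true_rates[:, None]  # True = walk

r = split_half_reliability(pa)
print("split-half reliability:", round(r, 2))
```

Because the simulation holds the circumstances perfectly constant, whatever falls short of a reliability of 1.0 here is pure binomial sampling noise; real data adds all the messiness discussed below.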
I looked for the point where the reliability estimate/correlation crossed .70, because at that point, the R-squared is (just shy of) 50 percent. We have accounted for 50 percent of the variance just by keeping the batter and the circumstances (roughly) the same. That’s helpful, because it allows me to say “within that timeframe, I can have pretty good confidence that Smith’s performance really was (past tense) consistent with his true talent level.”
There’s nothing magical about .70. Any line between “reliable” and “not reliable” has a bit of arbitrariness to it. The only thing that .70 has going for it is that it’s the point where the majority of the R-squared is accounted for by factors that are “endogenous.” It’s not a perfect method, but it’s the best that we’re going to do without a time machine. And I’m fresh out of time machines, because honestly, if I had one, I would go back and tell myself not to hit “publish” on that article.
Five years ago, I published another article in which I said this:
The generally accepted "stability numbers" chart is a good chart for researchers who are doing retrospective research. I think it's also a good one to look at in terms of understanding which stats stabilize more quickly relative to others, which I think can show us some interesting truths about the game. However, I would kindly point out that they are not nearly as powerful in predicting future performance as people seem to believe that they are.
The problem with using that chart as some sort of indicator of future performance is that it’s asking something of the chart that it was never intended to do.
In that same article, I looked at strikeout rate, which “stabilizes” (according to the chart) at 60 PA. The problem is that most people use the 60 PA threshold in a way that fundamentally changes the question that’s being asked. Instead of asking “what would happen if I gave a batter 60 more PA in the same basic circumstances?” it becomes “what would happen if I gave him 60 more PA in a completely different set of circumstances?” He’s older than he was in the first 60. He’s facing a different set of pitchers. He may have made a change in his approach or maybe he has a nagging injury now that he didn’t have back then (or maybe he has healed a bit since those last 60).
It turns out that when you look at sequential blocks of 60 PA for strikeout rate, you get correlations around .50 (which means an R-squared of 25 percent). That’s not a horrible correlation, but let’s put that in context. Using the split-half method, where the two baskets of plate appearances were drawn from the same games, I was able to get a correlation of .70 (and R-squared of 50 percent). By using a method where we kept the batter the same, but can’t assume that the circumstances of those plate appearances were the same, we lost half of our R-squared from when we were keeping the circumstances (roughly) the same. That means that the circumstances are as important as the batter’s talent level.
If the question you want answered is “how good is Smith, going forward, and how well does his performance to date predict his future performance?” then you can’t use the old chart, at least if you plan on using my arbitrary cutoff of .70 for when the reliability coefficient hits “stability.”
We could do the sort of analysis where we look at sequential blocks of plate appearances and look to see how long it takes to get back to a reliability estimate (this time, we can use straight up Pearson correlation) of .70. For strikeouts, it happens a little bit after 150 PA, rather than the 60 PA on the original chart. So, to feel safe that a player’s performance is stable, you either have to wait longer into the season or you have to live with a lower threshold for “reliable.” Your pick.
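One way to see how that plays out is to simulate it. The sketch below invents a population of hitters whose strikeout talent is stable but whose “circumstances” shift a bit between one block of PA and the next, then scans block sizes for the first one where the sequential Pearson correlation crosses .70. All of the parameters (the talent distribution, the size of the drift) are made up, so the exact crossing point is illustrative, not a real estimate:

```python
import numpy as np

rng = np.random.default_rng(1)

def sequential_r(n_pa, n_batters=500, drift_sd=0.02):
    """Pearson r between K rate in a batter's first n_pa PA and his next
    n_pa PA, with a random 'circumstances' drift between the two blocks."""
    talent = rng.beta(10, 40, size=n_batters)        # true K rates around .20
    drift = rng.normal(0, drift_sd, size=n_batters)  # circumstances change
    block1 = rng.binomial(n_pa, talent) / n_pa
    block2 = rng.binomial(n_pa, np.clip(talent + drift, 0, 1)) / n_pa
    return float(np.corrcoef(block1, block2)[0, 1])

# Scan block sizes and report the first one where r crosses .70.
n_found = None
for n in range(30, 301, 30):
    if sequential_r(n) >= 0.70:
        n_found = n
        break
print("sequential reliability first crosses .70 around", n_found, "PA")
```

Shrink `drift_sd` toward zero and the crossing point moves back down toward the split-half answer; the gap between the two is the price of the changed circumstances.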
But even then, we need to deal with a couple of other issues. There are two major assumptions that go along with “he’s been good in his first X plate appearances (and X is “enough!”), therefore, we can expect this from him the rest of the way!” One is that his body of work in April is still going to hold in September. If a batter has struck out 20 percent of the time in his first 60 PA, then it’s reasonable to assume (knowing nothing else) that in his 61st PA, he has a 20 percent chance of striking out, but would that hold in his 461st?
I again used strikeout rate as my bellwether, looking at all player seasons from 2012-2016 with a minimum of 480 PA. The reason for 480 was that I created eight sequential blocks of 60 PA (that is, PAs 1-60, 61-120, 121-180, etc.). I then looked to see how well strikeout rate in those first 60 PA correlated with strikeout rate in the next set of 60 PA. And then how well the first and third sets correlated. And then first and fourth, and on down the line.
The answer, in a bit of a surprise, is that all of the blocks correlated with the early-season block at a correlation around .5 (not exactly, of course, but it was within .48 to .52 each time). I ran a second batch of the same analyses, this time with walks, and found the same basic thing. Early-season performance is correlated with performance through the rest of the season at roughly the same rate. So, we’ve at least cleared that hurdle.
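A toy version of that eight-block check, on simulated data with invented parameters: fixed talent plus an independent per-block “circumstances” bump is one simple model that reproduces the flat pattern.

```python
import numpy as np

rng = np.random.default_rng(2)

# 500 simulated batters, eight sequential blocks of 60 PA each.
n_batters, n_blocks, block_pa = 500, 8, 60
talent = rng.beta(10, 40, size=n_batters)                # true K rates ~ .20
bumps = rng.normal(0, 0.02, size=(n_batters, n_blocks))  # per-block shifts
rates = np.clip(talent[:, None] + bumps, 0, 1)
blocks = rng.binomial(block_pa, rates) / block_pa        # observed K rates

# Correlate the first block with each later block.
for k in range(1, n_blocks):
    r = np.corrcoef(blocks[:, 0], blocks[:, k])[0, 1]
    print(f"block 1 vs block {k + 1}: r = {r:.2f}")
```

Because the per-block shifts are independent, the correlation with block 1 stays roughly flat across lags instead of decaying, consistent with the real result; a random-walk model of circumstances would instead fade with distance.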
But there’s another assumption that’s lurking in all of this that we need to call out. Past performance is a good predictor of future performance … until it isn’t. In fact, when these early season “no really, it’s stable!” articles are written, they’re usually written about players who have had conspicuous changes in their performance. No one bothers to write the article that “Mike Trout is having an amazing season so far and, well, that’s basically what he’s done for the last five years, so we’re not surprised, but hey, we’ve reached the point of stability, so we’re pretty sure that Mike Trout is still awesome.”
Why do people write articles about a player who has clearly made some sort of big change, and then assume that because we’ve hit some magical point in the season, he will never change again? Sure, if it’s working, he’ll probably want to keep his new approach, but people backslide into bad habits all the time. Maybe some other change will come along and undo all the good that the first change did.
There’s a fascination with stats that “stabilize quickly” because there’s not a lot to write about numbers-wise early in the season without sounding silly. Quick-stabilizing stats offer a chance to talk about the numbers without having to hear the dreaded “but it’s a small sample size” comment. Usually, the stats that stabilize most quickly are the ones that are more tightly in the control of the player himself. Swing rate is a good one, because the batter is the one who decides whether he will swing or not. (What happens when he does swing involves a complicated series of bounces, which may or may not go his way.)
Well, that sword cuts two ways. If the batter is more fully in charge of the decision that the statistic represents, then there is a danger that he will simply start making different decisions tomorrow, which will render false our idea that we now have enough information to know deep secrets of his soul. People are constantly growing and changing. We don’t live in a steady-state universe.
We also need to ask whether X PA, even once we adjust for all of the issues above, is a good cutoff for all hitters (or pitchers). For example, are older hitters more likely to show inconsistency over time? Are low-contact hitters more likely to show variance in their strikeout rate? We know that certain players tend to show marked variations in their abilities over the course of a season. Maybe there are certain types of players who are simply more given to change over time. That particular area has gone largely unstudied (oddly enough, outside of aging curves, which have plenty of work behind them).
All Statistics Face Backwards
A statistic is a reflection of what a player has done in the past. Using it as a forecast is an assumption that he will continue to do the same in the future. It might not even be a bad assumption, but it is an assumption and, as we’ve seen today, a methodologically shaky one. Sometimes players really do change for the better (or the worse), but if we’re going to do real research, we need a more solid approach to the question.
I get the love for “quick-stabilizing” stats, and maybe we’ll eventually find that some of them really do stabilize quickly, or we’ll find some mark which portends stability, but I think that for 10 years, this chart and this idea of quick stabilizing stats and the assumptions that the idea implies have gone critically unexamined. That’s a problem.
Please, in the next couple of weeks, resist the urge to pull out that 10-year-old table (or the updates I made to it five years ago) and write the Mad-Lib article at the beginning of this column. You keep using that word, “stabilize.” I don’t think that it means what you think it means.
Thank you for reading
This is a free article. If you enjoyed it, consider subscribing to Baseball Prospectus. Subscriptions support ongoing public baseball research and analysis in an increasingly proprietary environment.