I think that we've really misunderstood pitcher BABIP over the years.
One of the main tenets of what's become known as DIPS Theory is that there are three "true" outcomes of a plate appearance from a pitcher's perspective, and that what happens when the ball is in play is mostly luck. It's one of those assumptions that's been around so long that it's baked into a lot of what sabermetricians hold dear. We have component ERAs that assume that a pitcher should have a league-average BABIP. We confidently state that a pitcher will regress to the league mean as if it were a matter of course. We predict doom for pitchers who have a .260 BABIP and salvation delayed for pitchers who have an "unlucky" .350 mark. "Danger Will Robinson! (Insert name of pitcher) has been running on luck and will collapse any moment now!" makes for an easy article. I know, I've written plenty of them. In fact, as recently as last week, I predicted that the Orioles would relapse into mediocrity because four of their main relievers from last year had BABIPs in the .260 range and thus, their success was a vast mirage. Because once the ball leaves the bat, it's all random chance, right?
At this point, there's a pretty good consensus that the real answer to the question is "Yeah, but… hang on a minute, there's more to it." There are a bunch of logical factors that can influence BABIP.
- For one, line drives that don't leave the park tend to fall for hits about 70 percent of the time (70.9 percent in 2012), while ground balls find a hole around a quarter of the time (23.8 percent in 2012), and fly balls drop in a place that is not over the wall 13 percent of the time (13.1 percent in 2012). The rate at which pitchers give up these various types of batted balls is fairly stable over time. BABIP could simply be a function of what sorts of batted balls a pitcher likes to yield. More than that, my former BP (and Statistically Speaking) colleague Matt Swartz also found that pitchers who yielded a lot of ground balls tended to have lower BABIPs on ground balls than would be otherwise expected. Not all grounders are created equal!
- My former BP (and Statistically Speaking) colleague Mike Fast found that BABIP can depend a great deal on whether batters tend to hit pulled or opposite field air balls off a pitcher. Opposite field hits are more likely to be line drives. Liners are more likely to fall for hits. Pulled balls are more likely to be fly balls and to fly over the fence (which takes them out of the BABIP discussion).
- Mike Fast also found that pitchers have some amount of control over how hard a ball is hit, and that harder-hit balls tend to go for hits more often.
- Also, another researcher going by the improbable pseudonym "Pizza Cutter" found that ball-strike count made a difference. Balls put in play in pitcher's counts were less likely to go for hits than those in hitter's counts. Getting to an advantageous count was an outcome that appeared to have some stability for pitchers.
- The kitchen utensil guy also found that while BABIP might take around 3,800 balls in play to show enough statistical reliability to be considered stable, it does eventually stabilize.
Still, most of the common ERA estimators (and a fair number of writers) continue to assume that BABIP is something that is out of the pitcher's control, and that, over time, it will return to league average (or at least a small window around that average).
Maybe we've been wrong all along. What if BABIP isn't a random event? What if we've just massively misunderstood the concept?
Warning! Gory Mathematical Details Ahead!
(This one is very dense and very math-heavy, but I promise it's worth fighting through.)
Proof no. 1: If BABIP is random, then why can I find a nice easy predictor of what's coming on the next ball in play?
Let's start with a fairly obvious question. In addition to groundball/flyball/pull/opposite field tendencies, wouldn't BABIP vary by how well the pitcher in question was throwing on that day? It's well known that pitchers vary in how much stuff they have from start to start. Any given pitcher might also have some minor injury that he pitches through (don't we all?) that still affects him over two or three starts. So, I decided to look at whether recent BABIP performance might predict the outcome of a single plate appearance.
To do this, I pulled a new trick out of the bag. For the years 1993-2012, I isolated all balls in play and coded whether they fell for a hit or not. I found the league average for that year. The way that BABIP is currently conceptualized, this should be the only number that we need. I converted the league BABIP into the natural log of the odds ratio. In addition to the league number, I calculated what had happened on the 10 previous balls in play for this pitcher within this season. I did this as a moving average, so each ball in play had as a predictor the average fate of the 10 balls immediately before it. Again, I converted that BABIP to a logged odds ratio.
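To make the setup concrete, here is a minimal sketch (not the code used for the study) of the rolling predictor and the log-odds transform, assuming a hypothetical pandas DataFrame named `bip` with one row per ball in play, in chronological order within each pitcher-season:

```python
import numpy as np
import pandas as pd

def log_odds(p):
    """Natural log of the odds ratio: ln(p / (1 - p))."""
    return np.log(p / (1.0 - p))

def add_rolling_predictor(bip: pd.DataFrame, window: int = 10) -> pd.DataFrame:
    """Attach the logged odds of a pitcher's previous `window` BIP as a predictor.

    Assumed (hypothetical) columns:
      'pitcher_season' - identifier for the pitcher-season
      'hit'            - 1 if the ball in play fell for a hit, 0 otherwise
      'lg_babip'       - league BABIP for that year
    """
    out = bip.copy()
    # Mean outcome of the previous `window` balls in play, shifted so the
    # current ball is not included in its own predictor.
    recent = out.groupby("pitcher_season")["hit"].transform(
        lambda s: s.shift(1).rolling(window).mean()
    )
    with np.errstate(divide="ignore"):
        out[f"lor_last{window}"] = log_odds(recent)
    # 0-for-window (and window-for-window) streaks have undefined log odds;
    # the article excludes those cases, so mark them as missing here.
    out[f"lor_last{window}"] = out[f"lor_last{window}"].replace([np.inf, -np.inf], np.nan)
    out["lor_league"] = log_odds(out["lg_babip"])
    return out
```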
At first, I ran a logistic regression using only the previous 10 BIP as a predictor, controlling for the league BABIP for that year. And I got…nothing. There was no significant association between recent performance and what happened on the next ball in play. It looked like each ball in play, once it left the bat, was equally likely to fall in as any other. Or at least like recent performance wasn't going to help me.
But then I changed to the previous 20 BIP as my sampling frame, and a funny thing happened. Significance. Pulling in more data from the pitcher's recent past made the predictor better. I went to 30 BIP and got significance again, and somewhat stronger significance at that. I went to 40, and then 50, and it kept getting better. There's a way to tell whether a predictor in a binary logistic regression is better or worse than another. It's a model fit statistic called -2 log likelihood. All you need is a consistent set of cases. Run a series of predictors on the same set of cases, and the one that gives you the greatest amount of change in the -2 log likelihood is your best bet. You can also compare the -2 log likelihood contributions of different variables in the model. I isolated cases where I could calculate a running mean from 10 BIP to 250 BIP (in 10-BIP increments) within a season (thus, the pitcher needed to have at least 251 BIP for that season, and only plate appearances from the 251st ball in play onward were used). Because you can't take a logarithm of zero, streaks where a pitcher had an 0-for-10 groove going were excluded from all analyses.
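A sketch of that model-fit comparison, assuming the hypothetical `bip` frame from the previous sketch with a `lor_last100` column built the same way for a 100-BIP window, might look like this in statsmodels (again, not the original code):

```python
import statsmodels.api as sm

# Assumes `bip` has been run through add_rolling_predictor(bip, window=100)
# from the earlier sketch, so that it carries 'lor_league' and 'lor_last100'.

def neg2_log_likelihood(df, predictors, outcome="hit"):
    """Fit a binary logistic regression and return its -2 log likelihood."""
    X = sm.add_constant(df[predictors])
    fit = sm.Logit(df[outcome], X).fit(disp=0)
    return -2.0 * fit.llf

# Restrict to a consistent set of cases where every predictor is available.
cases = bip.dropna(subset=["lor_league", "lor_last100", "hit"])

base = neg2_log_likelihood(cases, ["lor_league"])
full = neg2_log_likelihood(cases, ["lor_league", "lor_last100"])

# The drop in -2 log likelihood from adding the rolling predictor is the
# improvement attributable to the pitcher's recent results on balls in play.
print(base - full)
```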
Looking at the comparisons of how the moving averages fared against the league-average BABIP was a revelation. At 10 BIP, the league BABIP had a 4-to-1 edge in predictive power, consistent with what we've been taught about BABIP all these years. But as the sampling frame crept up, the pitcher's recent results on balls in play started to become a relatively stronger predictor. By the 100-BIP sampling frame, a pitcher's recent performance was the stronger of the two predictors. Around 150 BIP, it was about a 60/40 split in favor of the pitcher's recent results, and it stayed around that ratio up to 250 BIP.
It's hard to argue that a pitcher's recent performance is unrelated to some sort of underlying skill that he has, and the sampling frame needed to show that is much shorter than we would have imagined. (We'll talk about that "skill" in more detail in a minute.) If BABIP is simply a matter of luck and pitchers are tethered to the league average, why is this skill-related predictor doing a better job than league average of predicting the results of the next ball in play?
Proof no. 2: It's not defense…
One obvious critique of the above is that I may simply be picking up on the effects of the defense behind a pitcher. A groundball pitcher with four vacuum-cleaner infielders behind him will look amazing when it comes to BABIP. We need a way to separate what the pitcher is doing from how much his defense picks him up. Another less-obvious critique is that a pitcher's BABIP might depend more on the quality of the batter whom he faces.
Fortunately, from 1993-1999, Retrosheet data contain an indicator of what sort of ball the batter hit (ground ball? line drive? fly ball?) and where on the field the ball was hit, based on a grid system. Now, these data have to be treated with some caution. Stringers classifying batted balls have biases. A line drive vs. a fly ball is something of a judgment call. And so is location. There is likely a tendency to record a ball that gets through the infield as hit to the '56' zone (between the third baseman and shortstop), while the same ball, if another shortstop manages to get to it, is recorded in the '6' zone (right at the shortstop). Some of these data points are also 20 years old, and we have no data on how hard the ball was hit. It's not perfect, but it will do for now.
For each ground ball (excluding bunts), I looked at what zone the ball was recorded as entering. For each zone, I calculated the league-wide expected BABIP for a ball hit to that area. By doing this, I was able to get both the pitcher's and batter's overall expected BABIP on grounders, based solely on the location of where the balls were hit. If the pitcher was steering grounders to areas where his fielders should have gotten them, and the fielders were simply subpar or he was facing batters who were good at "hitting it where they ain't," this method should account for that.
I also calculated the BABIP for the pitcher's team on ground balls over the course of the season in question, excluding those that happened with the current pitcher on the mound. This will give us a rough estimate of the team's defensive quality overall. Finally, I calculated the league BABIP on grounders. I converted all of the above to logged odds ratios again. I created a logistic regression for all ground balls in the data set coded for whether they went for hits or not. I entered each of the four indicators above as predictors, including only plate appearances where both the batter and pitcher had 100 grounders or more during the year.
After that was done, I went back and did the same for line drives, and then for fly balls/pop-ups. (For line drives, I dropped the inclusion criterion to 50 or more.)
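As a rough sketch of that setup (hypothetical column names; it skips the 100-grounder inclusion cutoff and, for brevity, does not remove the current pitcher's own balls in play from the team-defense number), the ground-ball version might look something like this:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def log_odds(p):
    return np.log(p / (1.0 - p))

# `gb` is a hypothetical DataFrame of ground balls (bunts excluded) with
# columns: 'zone', 'hit', 'batter', 'pitcher', 'team_season', 'season'.

# League-wide hit rate for each hit-location zone.
gb["zone_xbabip"] = gb.groupby("zone")["hit"].transform("mean")

# Expected BABIP for each batter and pitcher, built only from where their
# grounders were hit; team BABIP on grounders as a rough defense estimate;
# league BABIP on grounders as the baseline.
gb["batter_x"] = log_odds(gb.groupby("batter")["zone_xbabip"].transform("mean"))
gb["pitcher_x"] = log_odds(gb.groupby("pitcher")["zone_xbabip"].transform("mean"))
gb["defense_x"] = log_odds(gb.groupby("team_season")["hit"].transform("mean"))
gb["league_x"] = log_odds(gb.groupby("season")["hit"].transform("mean"))

X = sm.add_constant(gb[["batter_x", "pitcher_x", "defense_x", "league_x"]])
fit = sm.Logit(gb["hit"], X).fit(disp=0)
print(fit.summary())
```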
I found the -2 log likelihood contributions for each of the predictors, similar to how I apportioned blame/credit in this article. Below is a table showing how well each of the predictors performed relative to each other for each type of batted ball.
| Batted Ball Type | Batter | Pitcher | Defense | League Mean |
| --- | --- | --- | --- | --- |
| Ground Ball | 47% | 29% | 13% | 11% |
| Fly Ball/Pop Up | 39% | 26% | 21% | 13% |
| Line Drive | 46% | 28% | 13% | 13% |
We see that the batter's tendency to hit the ball where they generally ain't holds the greatest amount of sway over whether the ball will go for a hit. This squares with what we know about batter BABIP being a much more stable stat than pitcher BABIP. But the pitcher's tendency to direct ground balls and fly balls to where the defense can generally get to them checks in as more important than the defense's general ability to turn batted balls into outs (the spread is closer for fly balls). And the league mean is present, but not a very strong predictor.
Far from being tethered to the league average, pitcher BABIP has a perfectly rational set of factors that influence it, and a good chunk of it belongs to the pitcher. Sure, roughly 70 percent of the equation lies outside the pitcher's control, but his contribution is generally twice as strong as that of the league average being used as a predictor.
Proof no. 3: An outcome and a skill are not the same thing.
Let's start this one with the language that surrounds the idea of DIPS and BABIP (Note: Always study the language that someone uses. Always. Language always betrays hidden assumptions.) In Voros McCracken's original BABIP study, there were four types of outcomes of a plate appearance: a strikeout, a walk, a home run, or a ball in play. Everything was kept in its own separate box, as if these were completely separate things, but within the box the assumption was that they were completely unified skill sets.
The three true/one false outcomes model of a plate appearance assumes that we should classify events based on whether they are discrete outcomes on the scoreboard, rather than whether they reflect some underlying skill of the pitcher. Because we equated outcomes with skills, we saw that while strikeouts, walks, and home runs (somewhat less so) were repeatable from year to year, BABIP wasn't. The consensus on BABIP was "no skill involved." Maybe it should have been "poorly designed construct." Maybe the problem with BABIP isn't that it's all luck, but that getting outs on balls in play encompasses different skills in different situations, some of which are more influenced by factors outside the pitcher's control—whether luck or defense or the batter—than others. Maybe getting outs on grounders is a different skill than getting outs on fly balls that don't leave the park.
Statistically, it's hard to create a meaningful single number that represents the sum of a wide range of only mildly related (both in terms of covariance and conceptually) components. Those who are familiar with the statistical technique of factor analysis will be familiar with this idea. For those who aren't, a quick example: Suppose that I wanted to create an index of how sad and depressed someone is. I might ask questions like how often the person feels hopeless about the future or how often the person has uncontrollable crying spells or how often the person feels that even things that used to be fun just aren't anymore. As the answer to one of these questions goes up, the answer to the others will probably also go up as well. (For the initiated, they will have high factor loadings.)
Now, let's say that I tried to add in a question about how often the person had intrusive and obsessive thoughts. Obsessive thoughts are certainly a problem and may happen along with depression, but one can have depression and no obsessive thoughts or have obsessive thoughts but no depression. If I tried to shove this extra question into my measure, it will make the measure less stable.
Maybe we've been trying to put too many unrelated skills under the umbrella of BABIP. And for some reason, we've been surprised when it doesn't work. I'd argue that instead of a component ERA, maybe the first step is a component BABIP (like an xBABIP, which BP's Derek Carty has shown to be a good indicator of future performance).
Enough of this theoretical musing. The gory math awaits!
For the year pairs 2003-2004 to 2011-2012, I found all pitchers who had at least 250 balls in play in each year. Among these pairs, the year-to-year BABIP correlation was .193, which is the sort of lowly correlation that got this whole DIPS thing started. (Note: yes, I know I'm violating assumptions about the independence of data points. For just 2011-2012, it's .205. Happy?)
I ran a regression predicting the following year's BABIP using outcomes from the previous year that everyone assumes are "true": strikeouts per PA (year-to-year correlation of .77), walks per PA (.66), HR per PA (.30), GB% (.81), and FB% (.79), as well as BABIP.
The following equation produces a prediction that correlates with the next year's BABIP at a multiple-R of .305. That's not huge, but it's a) better than .193 and b) the same number as the year-to-year correlation for home run rate.
The equation: .291 + .143 * BABIP * GB_rate – .057 * K_per_PA – .630 * BB_per_PA + 1.765 * BABIP * BB_per_PA.
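Written out as a small function for convenience (with the rate inputs assumed to be fractions, e.g. a 45 percent ground-ball rate entered as .45):

```python
def predicted_babip(babip, gb_rate, k_per_pa, bb_per_pa):
    """Next-year BABIP prediction from the regression equation above."""
    return (0.291
            + 0.143 * babip * gb_rate
            - 0.057 * k_per_pa
            - 0.630 * bb_per_pa
            + 1.765 * babip * bb_per_pa)

# Example: a .290 BABIP, 45 percent ground-ball, 20 percent strikeout,
# 8 percent walk pitcher projects to roughly .289 the following year.
print(round(predicted_babip(0.290, 0.45, 0.20, 0.08), 3))
```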
When we try a very simple component-level prediction for next year's BABIP, our predictive power goes up. Suddenly, this doesn't all look so cut-and-dried. The point is that when you take a more component-based view of BABIP, the skills—plural—and the interactions between those skills tend to come out. Maybe there is no difference between major-league pitchers in their ability to prevent hits on balls in play. But there certainly are differences in the abilities that go into preventing hits.
Well then… why does BABIP always seem to regress to .300?
My hope in writing this article is that we can finally put to bed the idea that every time a batter hits the ball, but not over the fence, the pitcher is some luckless (or lucky) dolt in the matter. There most certainly is skill in preventing hits on balls in play. We've just been conceptualizing the problem (and thus, measuring it) in the wrong way.
But the cynic will point out that despite all this, while BABIP may not be a unitary skill, it is an outcome that makes a large amount of difference in what happens on the scoreboard. And it does not correlate well from year to year. And yes, most guys who have a .260 BABIP one season follow it up with a .300 season the next year and show a resulting decrease in their headline stats.
I still hold to the idea that BABIP is (multi-)skill-based, and have no trouble reconciling these two facts in my head. I offer the following three thoughts:
1) There will always be random variation in any measurement from year to year, and the smaller the sample size, the more likely that random variation creeps in. There probably are seasons where a pitcher had a good BABIP that really was just good luck, and we'll expect him to revert back to form in the following year. But if we took a more component-based look at BABIP, we'd probably be able to tell which inputs are more or less given to randomness. If a pitcher got lucky on an indicator that we know really is luck-based, we might predict regression. But if it was on a skill that we know to be stable, we might predict that the magic will continue. Being able to discern who got lucky vs. who might sustain that performance would be a massively interesting talent, now wouldn't it? I think a component-based view of BABIP gets us closer to that.
2) If BABIP really does consist of several skills acting in concert, a "lucky" season is likely to be the result of a pitcher who has put it together on several different skills over the course of a year. The problem might be that a loss of one of those skills might be enough to tilt him back toward the mean, and while maintaining good form on one skill is hard enough, what if it's four or five different skills? That's four or five things on which the pitcher might mess up, and the result is that he would become simply ordinary again.
3) I think there's one other measurement error that we tend to make in sabermetrics. We assume that a player is his yearly average throughout the course of the season. This makes about as much sense as noting that the average high temperature in the city of Chicago is around 50 degrees, and packing for crisp, autumn weather—in January. Sure, the overall average is 50, but as seasons change, the climate changes too, and you have to adjust your expectations. We wouldn't make that mistake in packing for a trip, yet we do it all the time in sabermetrics.
In proof no. 1, we saw that a moving-average approach to predicting BABIP was quite effective in predicting what happened next, and at that, we needed to look back at only 100 BIP before it overtook the league average as a good predictor. This leaves open the possibility that whatever the skill or skills are that are involved in BABIP, it or they may fluctuate over time. These fluctuations may not represent random variation around a mean, as is often assumed. They might be real changes in true-talent level.
There's probably a natural floor (and ceiling) to how good a pitcher can be in preventing hits on balls in play. Major-league hitters will eventually square up even the toughest pitcher. But maybe the untapped concept that differentiates the regresser from the maintainer is the ability to hold on to a good true-talent level over a long period of time. Maybe that's a talent unto itself. Maybe studying those variations from month to month and seeing who is steady across time vs. who fluctuates wildly from week to week will shed some light on the subject.
More than anything, I hope that what we've learned is that saying "He got lucky!" isn't enough anymore. I worry that for too long, we didn't question the DIPS hypothesis strongly enough. I believe that the preponderance of evidence points to there being real differences between pitchers in their abilities to prevent hits on balls in play and that the assumption that the league-average BABIP is the best baseline going forward is false. Balls in play are not completely within the pitcher's control, but the pitcher's contribution is not trivial. We should build our assessments of pitcher quality with that knowledge in mind going forward.
The only part I didn't really follow was Proof #2. I think you created an xBABIP based on where the ball was hit and compared that to actual BABIP, or something like that. But did you try to account for a skill in which a pitcher induces *where* the ball was hit? I guess I just don't understand how that analysis was pulled together.
Also, what's a logged odds ratio? I understand logs and I understand odds ratios, but what are you doing when you put them together? Perhaps there's an article somewhere where you explain it?
BTW, fantastic job. That said, whenever I read one of these studies that debunk DIPS theory, I'm still always struck that pitcher BABIP is dang hard to predict. Which, to me, is all that DIPS theory is.
(Natural) log of the odds ratio is just a statistical trick that I used because I used a lot of logit regression. It has to do with raw percentages not being normally distributed, and using LOR corrects for that. Also, when logit does its actual modeling, it spits out a function that gives you the LOR of the probability that you want to model.
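Concretely, for a probability p, the logged odds are just ln(p / (1 - p)); a .300 BABIP, for example, works out to ln(.300 / .700), or roughly -0.85.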
In #2, the idea was to see how well these predictors performed relative to each other from the point of view of variance explained (as much as logit lets you do that.) Was it the pitcher's general talent in steering the ball toward a fielder? Was it the sparkling defense? Was it the batter steering the ball himself?
Enjoying my premium subscription because of content like this. Keep it up.
I'd love to see you run a simulation with existing season-data to see how well your component-BABIP-predictor does against the "dumb" TTO/regress to league avg predictor. Would also be really cool to regress to career BABIP in addition.
Cool research, but I would love to see a simulated test of the prediction.
Then I can use it to my advantage in fantasy ;). After all, that's what really matters.
You say:
"In addition to groundball/flyball/pull/opposite field tendencies, wouldn't BABIP vary by how well the pitcher in question was throwing on that day?"
Then you say:
"At first, I ran a logistic regression using only the previous 10 BIP as a predictor, controlling for the league BABIP for that year. And I got...nothing. There was no significant association between recent performance and what happened on the next ball in play. It looked like each ball in play, once it left the bat, was equally as likely to fall in as any other. Or at least like recent performance wasn't going to help me."
10 BIP is probably about 2-3 innings of work. I'm not sure of the average number of BIP in a game, but I'd think it'd be around 30-40. So, basically, 10 BIP is not predictive of BABIP, though 20 has some significance, 30 is stronger, 40 stronger still, etc...
To return to the original quote, is this saying BABIP does not do a good job at predicting performance on that one specific day?
Is it also saying that BABIP does a better job at predicting performance over multiple starts (and might be pretty much useless for relievers over the course of two months).
Then, thirdly, if BABIP is far from the league norm, which matters more: the length of time the BABIP is measured at or the variance from the league norm (and the amount of "snap back"/regression to the mean)?
On the third point, if BABIP is far from the league norm over the last 100 BIP (say it's .240), then from a variance explained point of view, the recent personal history of the pitcher is more important than the league average. However, understand that the recent personal history of the pitcher is not a static number.
Maybe it might be interesting to take pitchers who have extremely low or extremely high BABIP over 300 innings last year and see how much that predicts their performance for this year?
I'd love to see a follow-up about which pitchers seem to possess which skill and how that component positively/negatively affects their BABIP.
Matt Cain and Zack Greinke come to mind.
On the third point, if BABIP is far from the league norm over the last 100 BIP (say it's .240), then from a variance explained point of view, the recent personal history of the pitcher is more important than the league average.
In my uneducated (from a statistics perspective) brain, if the recent history is MORE IMPORTANT than the league average, that implies that you would regress a pitcher's last 100 BIP BABIP less than 50% toward the league mean in order to predict the BABIP of his next few BIP. Clearly, that is not the case. Give me a pitcher who is .240 through his last 100 BIP and I will show you a pitcher who is .2997 (or whatever) through his next 10 or 20 or 100 BIP, where league average is .300 (for pitchers with similar profiles, like GB rate), after adjusting for the opposing team, his defense, and the park. So I don't understand what you mean by "more important" or even what those relative percentages in the chart mean.
So at that moment, he is better described as a .240 BABIP pitcher than as a .300 BABIP pitcher.
I'm trying to imagine the implications of this. So, you're saying it is way too much of a leap to say that a pitcher's tendency to have a more consistent BABIP, or one more consistently above or below average, happening in swings of 150-250 balls put in play does not imply that his overall effectiveness would be more likely to swing over the course of the same interval? Would that be worth looking into next?
Secondly, I'm trying to speculate how this happens. You mentioned the change of seasons as a metaphor, but taking it more directly, I thought the notion of cold-weather pitchers vs. warm-weather pitchers was overstated, if it exists at all... the same for early-season pitchers vs. late-season pitchers (or is that another issue to be studied?). Are pitchers streaky - do they have grooves, then get sloppy after so many appearances, then take the same number of outings to get back on track? Again, I'm getting ahead of the scope of your study, but I'm looking for ways to apply it.
One technical question: do you know if a pitcher's BABIP overall improves at the same rate as it degrades?
As to your second point, I'd love to know how this works too! If the fundamental message of what I'm trying to say here holds, then it opens up a lot of different avenues of investigation!
On your third question, I don't know that one yet.
Using the same basic framework as I did in the original, I took the league average and the past 100 BIP and let them fight it out in the same logistic regression.
I took only cases where the last 100 BIP yielded a prediction of .280 or lower, then .275 or lower, then .270 or lower, and so on. There does come a point where league BABIP is a better predictor, and it seems to happen somewhere between .270 and .265. However, it should be noted that the past 100 BIP still holds some significant sway, even as you descend further.
Perhaps .240 is too lucky to believe, but .270 is not.
I'm surprised, but .265 does seem more likely than .240. This is a very interesting finding.
2) Is there any better correlation within an appearance?
3) If 50 is where past BABIP becomes more predictive than league average, what's the next crossover point where league average is again better?
In addition, he was not "stable" inside any range. His performance followed a random distribution around that .232 mean.
That's what I find important about BABIP: it can be random from one season to the next. It isn't necessarily totally random in a large enough sample size but baseball players tend to be judged with arbitrary end-points so it's important to consider what role BABIP is playing in a player's success over the course of 10, 50, or 162 games.
Players are not the same day-to-day, much less year-to-year. Here's a very basic example from my own experience.
I tinkered with my mechanics way too often, daily really. I would find certain "feels" that would produce immediate, good results and for a week I would mash at the plate or throw a filthy hammer curve. But concentrating on that feel would cause me to over correct and produce new, less optimal mechanics.
Obviously, I'm not and wasn't a professional. 99.9% of major leaguers are probably much better at maintaining consistent mechanics. But those little mechanical variations throughout a season mean that a player's "true talent level" is constantly in flux around some moving average. Even the most mechanically consistent player will still fight this battle due to minor injuries.
That said, simple constructs like DIPS are helpful for forming a null hypothesis when analyzing a player or set of players. Given the current state of the art, it's impossible to evaluate players perfectly. But that shouldn't prevent us from trying; there are a lot of people asking us to try!
Thanks for this article.
I suspect that this is likely confirmation bias, or a placebo effect. The question just got me thinking. Thank you for your work!
I also like the "count" analysis. We've always heard about batters "waiting for their pitch" (I suppose so they can "square it up" and hit it harder) vs. "protecting the plate" when behind in the count. Your analysis is intuitively congruent.
I wonder, do pitchers with a higher "nasty factor" (i.e., more movement) tend to have lower BABIPs? Does a pitcher's fastball velocity impact BABIP? How about "break" on other pitches? So much potential analysis, so little time.
Thanks again.