Let’s talk about batted balls.
I’m sure we’re all familiar with the category labels that we use to describe batted balls—ground balls, line drives, fly balls, and popups. Precise definitions vary, but David Cortesi gives a succinct set of criteria:
A ground ball is a batted ball that touches the ground short of the outfield grass. The line drive, the fly ball and the popup are all balls that are hit into the air and are caught before they hit the ground, or if they aren’t caught, fall to earth in the outfield.
As I have discussed previously, there is evidence for park biases in the way batted balls are assigned these category labels. The question becomes—what do we do about it? It turns out that sabermetricians have a handy tool to use in handling these issues—park factors. But how best to apply this tool to the problem at hand?
Prior Art
There have been efforts to park adjust batted ball rates in the past, of course, and it would be remiss of me not to acknowledge them.
- Brian Cartwright wrote one of the first articles that I think really brought the question of line drive scoring to light. He discusses his line-drive park factors briefly, and he later clarifies that he looked at line drives per air balls.
- David Gassko presented batted ball park factors as well—again, looking at rates of batted-ball types per all batted balls.
There may be others, but I can’t locate them—if you know of any, please drop me a line or leave a comment.
So, what do we know about batted balls, and how can we use that in the construction of park factors?
Hold Steady
The key thing to keep in mind is that when we say that a park causes more ground balls, either in fact or in the perception of the scorer, there has to be less of something else created. The total number of events is fixed. And when looking at rates of batted ball types, there are certain constraints as to what there can be less of: a higher grounder per batted-ball rate doesn’t directly affect the number of strikeouts or walks. (And if we’re talking about a scorer bias effect, rather than an atmospheric effect on the type of batted balls, there isn’t even an indirect effect.)
So if we see a park effect creating more ground balls, those ground balls have to be coming at the expense of other batted balls. The epiphany that I had is that most of this “theft” has to be coming from the most adjacent batted ball type.
Think about it—for any particular batted ball that is “borderline,” there are two categories it can be placed in. And the park effects have to act primarily upon these borderline batted balls, don’t they? It doesn’t matter if it’s scorer bias caused by parallax or an actual change of the trajectory of the batted ball due to atmospheric effects.
The Method
Here’s how I did the park adjustments (all numbers are for illustrative purposes). First, I calculated the ground-ball rate (per batted ball) for a team and its opponents, both home and away, each year from 2003-09. (This includes the batting and pitching side for each team.) Each of those rates was then regressed to the mean, to try to cut out the effect of random variance.
Taking regressed home GB rate over regressed road GB rate gives you a one-season park factor. Taking a three-year average of those gives us three-year, regressed park factors.
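A minimal sketch of that calculation, in Python. The article does not say how the rates were regressed, so the shrinkage used here (blending each observed rate with a league-average rate by adding `prior_bb` league-average batted balls) is an assumption, as are the function names and every sample number.

```python
from statistics import mean


def regress_rate(events, batted_balls, league_rate, prior_bb=500.0):
    """Shrink an observed rate toward the league rate.

    Assumption: regression is done by adding `prior_bb` league-average
    batted balls to the sample. The article does not specify its method.
    """
    return (events + league_rate * prior_bb) / (batted_balls + prior_bb)


def one_season_park_factor(home_gb, home_bb, road_gb, road_bb, league_rate):
    """Regressed home GB rate divided by regressed road GB rate."""
    home_rate = regress_rate(home_gb, home_bb, league_rate)
    road_rate = regress_rate(road_gb, road_bb, league_rate)
    return home_rate / road_rate


def three_year_park_factor(season_factors):
    """Simple average of three one-season park factors."""
    return mean(season_factors)


# Hypothetical counts (team plus opponents), not taken from the article.
pf_one_season = one_season_park_factor(2000, 4400, 1900, 4400, league_rate=0.44)
pf_three_year = three_year_park_factor([1.04, 1.05, 1.06])
print(round(pf_one_season, 3), round(pf_three_year, 2))
```

The larger `prior_bb` is, the harder each rate is pulled toward the league average, and the closer the resulting park factor sits to 1.00.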
So let’s say a team has 2,000 ground balls at home, cumulative, and a park factor of 1.05. We take:
2000 – 2000/1.05 = 95
That’s 95 ground balls more than the team “should” or “would” have hit in a neutral park. What do we do with those 95 ground balls? We add them to line drives, to get our first set of adjusted line drives.
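That arithmetic can be sketched as a small Python helper. The function name `excess_events` and the starting line-drive total are hypothetical; only the 2000 ground balls and the 1.05 park factor come from the equation above.

```python
def excess_events(count, park_factor):
    """Events above (positive) or below (negative) the neutral-park total."""
    return count - count / park_factor


home_gb = 2000
home_ld = 805  # hypothetical starting line-drive total
shift = excess_events(home_gb, 1.05)

adjusted_gb = home_gb - shift  # what the team "would" have hit in a neutral park
adjusted_ld = home_ld + shift  # surplus grounders are reassigned to line drives

print(round(shift))  # prints 95
```

Because the surplus is moved rather than discarded, the combined total of ground balls and line drives is the same before and after the adjustment.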
Now, we start the process over again. We take our ground ball adjusted line drives to figure home LD rate (this time, looking only at air balls, or batted balls minus ground balls), and regress that. We also regress the observed road LD rate. From there, we derive a park factor for the ground ball adjusted liners.
Now, we adjust line drives a second time. Say we have 900 LDs, after adjusting for GB rate, and a LD park factor of 0.90. We take:
900 – 900/0.90 = -100
The result is negative, so the adjustment runs the other way: we subtract 100 fly balls from our team totals and count them as line drives.
The process repeats one more time, as we adjust fly balls per fly balls plus popups.
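Putting the three passes together, here is a rough Python sketch of the cascade. It takes the park factors as given inputs, whereas in the process described above each factor is estimated from the totals adjusted in the previous pass. The function name, the dictionary keys, and the sample counts and factors are all hypothetical.

```python
def cascade_adjust(counts, gb_pf, ld_pf, fb_pf):
    """Park-adjust batted-ball totals, passing each surplus to the adjacent type.

    counts: dict with keys "GB", "LD", "FB", "PU" (hypothetical layout).
    """
    adj = dict(counts)
    # Walk outward from grounders: GB -> LD -> FB -> PU.
    steps = (("GB", "LD", gb_pf), ("LD", "FB", ld_pf), ("FB", "PU", fb_pf))
    for src, dst, pf in steps:
        surplus = adj[src] - adj[src] / pf  # negative when the park suppresses src
        adj[src] -= surplus
        adj[dst] += surplus
    return adj


raw = {"GB": 2000, "LD": 805, "FB": 1300, "PU": 295}  # made-up totals
adjusted = cascade_adjust(raw, gb_pf=1.05, ld_pf=0.90, fb_pf=1.02)

# The adjustment only relabels batted balls, so the grand total is unchanged.
assert abs(sum(adjusted.values()) - sum(raw.values())) < 1e-9
print({k: round(v) for k, v in adjusted.items()})
```

Each pass works on the totals left by the pass before it, which is why the line-drive step sees the grounders that were just reassigned.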
To give you a sense of what these park factors look like, the complete 2003 set:
YEAR_ID | HOME_TEAM_ID | GB_PF | LD_PF | FB_PF
2003 | ANA | 1.00 | 1.01 | 0.99
2003 | ARI | 0.96 | 0.98 | 1.02
2003 | ATL | 0.98 | 0.92 | 1.00
2003 | BAL | 0.97 | 0.95 | 1.00
2003 | BOS | 1.01 | 0.84 | 1.02
2003 | CHA | 0.97 | 0.95 | 0.97
2003 | CHN | 1.00 | 0.92 | 0.99
2003 | CIN | 0.95 | 1.16 | 0.99
2003 | CLE | 1.09 | 1.15 | 1.02
2003 | COL | 0.95 | 1.01 | 1.02
2003 | DET | 1.01 | 1.03 | 0.99
2003 | FLO | 0.98 | 0.87 | 0.98
2003 | HOU | 1.06 | 0.90 | 1.00
2003 | KCA | 1.00 | 1.07 | 1.00
2003 | LAN | 1.02 | 1.09 | 0.98
2003 | MIL | 0.98 | 1.09 | 0.98
2003 | MIN | 1.03 | 0.77 | 1.01
2003 | MON | 1.06 | 1.14 | 1.02
2003 | NYA | 1.00 | 0.96 | 1.00
2003 | NYN | 1.03 | 0.94 | 0.99
2003 | OAK | 0.99 | 1.07 | 0.98
2003 | PHI | 0.99 | 1.09 | 1.00
2003 | PIT | 0.99 | 0.90 | 1.01
2003 | SDN | 1.06 | 1.04 | 1.02
2003 | SEA | 0.97 | 0.91 | 0.97
2003 | SFN | 1.03 | 1.19 | 1.01
2003 | SLN | 0.98 | 1.08 | 1.00
2003 | TBA | 0.95 | 0.90 | 0.98
2003 | TEX | 0.99 | 1.14 | 1.00
2003 | TOR | 1.02 | 1.03 | 1.03
The “line-drive” factors (the name is really a misnomer, since they’re a far greater part of our adjustment of fly balls than the factor I’m calling “FB” here) have the greatest variability. In other words, the fly ball-line drive distinction is the one that varies most from park to park. That isn’t to say that the ground ball-line drive boundary is “stable,” or at least as stable as we may have thought. (To present all years here would take an egregious amount of space; the full set of park factors is available here.)
The Next Step
The trick is that we’ve park-adjusted the batted balls, but we haven’t park-adjusted the batted-ball outcomes. Say we want to take this and apply it to ground-ball BABIP—how would we go about doing that? I don’t know.
Let’s say, again, we know that there are 95 balls that “shift” from GB to LD when we do our park adjustment. The question is, how many of those are hits?
Well, we know those are “really” line drives, and line drives are more likely to be hits than grounders. So, if all else is equal, they’re likely to have a higher hit rate than your typical GB (but perhaps lower than your typical LD).
But—is all else equal? In other words, is a scorer as likely to have trouble scoring a batted ball if it’s a hit or an out? Or does the very act of catching a ball affect its scoring?
Consider—for the ground-ball/line-drive boundary, what matters is where the ball lands—or in the case of a ball that is caught before it lands, where the ball would have landed if not acted upon by the fielder.
So you’re essentially presenting the scorer with two different tasks, depending on whether the batted ball was a hit or an out. So my supposition is that you will see a disproportionate number of outs among these borderline batted balls. But that’s a supposition only; I can’t tell you how many there would be.
Nice. I can agree with that, Colin. No matter how much (or how little) scorer bias exists, the net park factor is real. It's next to impossible to isolate one from the other with just one data source.
Seeing Colorado near the top almost annually suggests there's a lot of natural park influence in there.
-Ben