Ball and strike calls are among the most fraught and noticeable ways umpires shape the game. In theory, making them seems like a matter of simple, objective truth: did the ball cross through the TV broadcast’s floating rectangle, or not? In reality, it’s a lot more complicated, and a thousand tiny factors play into whether the pitcher gets the favorable call or not, from the count to home field advantage to the stance of the hitter immediately before seeing the pitch. These factors complicate the ongoing case for robotic replacement of this vital umpire function.
But regardless of how you feel about robot umps, one very important factor that we should all be able to agree shouldn’t be part of the ball/strike call is the race of the pitcher or the hitter. And yet, a new study suggests that umpires are granting thousands more strike calls to white pitchers, and thousands fewer to non-white, specifically Black and Hispanic, players. With the largest and most powerful sample of pitches to date, the study by Claremont McKenna student Hank Snowdon shows significant racial biases in how umpires call pitches.
Previous studies of bias in baseball have returned mixed results, especially with regard to umpires’ called-pitch decisions. Some studies have found significant effects; others have found that those effects depend strongly on exactly how the models are specified, meaning the findings may not be robust. It’s worth noting that in other domains of baseball, such as whether organizations promote BIPOC players through the minors equally or when umpires decide to eject players, there is evidence of significant racial bias.
Focusing on ball and strike calls offers an interesting way to measure bias for two reasons: first, there are hundreds of thousands of these calls per year; second, thanks to Statcast and PITCHf/x, we know an astonishing amount about whether a given pitch should be called a ball or a strike to begin with. That makes quantifying the errors much easier.
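To make the idea of a "mistaken call" concrete, here is a minimal sketch of how a tracked pitch might be checked against the zone. It assumes Statcast-style fields (`plate_x` and `plate_z` for where the ball crosses the plate, `sz_top` and `sz_bot` for the batter’s zone, all in feet) and treats the zone as a flat rectangle widened by the ball’s radius, since any part of the ball touching the zone counts as a strike. The constants and the 2D simplification are assumptions for illustration, not the definition Snowdon’s study used.

```python
# Approximate strike-zone geometry in feet, measured from the center of home plate.
# These constants are illustrative assumptions, not an official tracking definition.
PLATE_HALF_WIDTH_FT = 17 / 2 / 12   # home plate is 17 inches wide
BALL_RADIUS_FT = 2.9 / 2 / 12       # a baseball is roughly 2.9 inches in diameter


def should_be_strike(plate_x: float, plate_z: float, sz_top: float, sz_bot: float) -> bool:
    """Return True if any part of the ball touches a simplified 2D rulebook zone.

    plate_x: horizontal position (ft) as the ball crosses the plate (0 = center)
    plate_z: height (ft) at the same point
    sz_top, sz_bot: top and bottom of this batter's zone (ft)
    """
    # Expand the rectangle by one ball radius on every side: touching the edge is a strike.
    in_width = abs(plate_x) <= PLATE_HALF_WIDTH_FT + BALL_RADIUS_FT
    in_height = (sz_bot - BALL_RADIUS_FT) <= plate_z <= (sz_top + BALL_RADIUS_FT)
    return in_width and in_height


def is_missed_call(called_strike: bool, plate_x: float, plate_z: float,
                   sz_top: float, sz_bot: float) -> bool:
    """A call is 'missed' when the umpire's call disagrees with the zone check."""
    return called_strike != should_be_strike(plate_x, plate_z, sz_top, sz_bot)


# A pitch down the middle called a strike is correct; one a foot off the plate is not.
print(is_missed_call(True, 0.0, 2.5, 3.4, 1.6))  # False
print(is_missed_call(True, 1.2, 2.5, 3.4, 1.6))  # True
```

A real classifier would also account for the three-dimensional zone over the plate and tracking measurement error, which is part of why "should be a strike" is itself an estimate.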
For his paper, Snowdon uses the entire data set from the pitch-tracking era: millions of pitches from 2008 through 2020. Previous studies relied on less precise and less plentiful data. It’s a bit like hunting for something microscopic with a magnifying glass; you might not see it even if it were there. Snowdon brings a high-powered microscope to the problem, and he immediately finds evidence of biased calls.
He breaks those calls down in several ways, including whether they were balls-called-as-strikes or strikes-called-as-balls, and based on whether the pitcher or the batter shares a racial category with the ump. (Roughly 90 percent of umpires were white in the studied time period, a severe lack of diversity relative to the league’s player base.) But no matter how he slices it, he finds that umpires tend to make more advantageous calls when they share the same race as the person who would be advantaged.
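The breakdown described above can be sketched as a raw tabulation: a ball called a strike helps the pitcher, a strike called a ball helps the batter, and each favorable miss is grouped by whether that player shares the umpire’s race category. The record layout (keys like `ump_race` and `true_strike`) and the toy records are hypothetical, and a serious analysis would need to control for count, location, and other factors rather than compare raw rates like this.

```python
from collections import defaultdict


def favorable_miss_rates(calls):
    """Share of taken pitches on which a missed call favored the pitcher or the batter,
    split by whether that player shares the umpire's race category.

    Each record is a dict with hypothetical keys: 'ump_race', 'pitcher_race',
    'batter_race' (category labels), 'called_strike' and 'true_strike' (bools).
    Returns {(role, same_race): rate}.
    """
    counts = defaultdict(lambda: [0, 0])  # (role, same_race) -> [favorable misses, pitches]
    for c in calls:
        # A ball called a strike helps the pitcher; a strike called a ball helps the batter.
        helps_pitcher = c["called_strike"] and not c["true_strike"]
        helps_batter = c["true_strike"] and not c["called_strike"]
        for role, helped in (("pitcher", helps_pitcher), ("batter", helps_batter)):
            same_race = c[f"{role}_race"] == c["ump_race"]
            counts[(role, same_race)][0] += int(helped)
            counts[(role, same_race)][1] += 1
    return {key: misses / total for key, (misses, total) in counts.items()}


# Four made-up pitches, for illustration only.
toy = [
    {"ump_race": "white", "pitcher_race": "white", "batter_race": "Black",
     "called_strike": True, "true_strike": False},
    {"ump_race": "white", "pitcher_race": "Hispanic", "batter_race": "white",
     "called_strike": False, "true_strike": True},
    {"ump_race": "white", "pitcher_race": "white", "batter_race": "white",
     "called_strike": True, "true_strike": True},
    {"ump_race": "white", "pitcher_race": "Hispanic", "batter_race": "white",
     "called_strike": False, "true_strike": False},
]
print(favorable_miss_rates(toy))
```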
These effects are small, but large enough to be noticeable. According to the study, mistaken calls are about 0.3 percentage points more likely because of race effects. Snowdon estimates that umpires called about 18,000 pitches differently over the 13-year study period because of racial bias, or roughly 1,400 changed calls per year. Any individual player might receive only a handful of these in a season, but for Black players in the league already struggling against discrimination in other respects, any additional barrier is a significant problem.
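The per-season figure is simple division of Snowdon’s total over the 13 seasons. The per-team number below is an added illustration that assumes, hypothetically, that changed calls are spread evenly across 30 clubs; the even per-season average also ignores that 2020 was a shortened 60-game season, so a full-season figure would run somewhat higher.

```python
changed_calls_total = 18_000            # Snowdon's estimate for 2008-2020
seasons = 2020 - 2008 + 1               # 13 seasons of pitch-tracking data
per_season = changed_calls_total / seasons
per_team_per_season = per_season / 30   # hypothetical: calls spread evenly over 30 clubs

print(f"{per_season:.0f} changed calls per season, ~{per_team_per_season:.0f} per team")
# 1385 changed calls per season, ~46 per team
```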
One of the most contentious and difficult aspects of any study of racial discrimination (in baseball and elsewhere) is that in reality, racial identities are much more complex than can be indicated in a single, one-word description (in statistical terms, a categorical variable: “white” vs. “Black” vs. “Latino” vs. “Asian”). The truth of racial identity is that people can be treated very differently depending on the circumstances and the biases of the people around them, and many people have multiple, overlapping and intersecting backgrounds. This issue is especially pronounced in baseball, where many players are Afro-Latino, some hailing (or with ancestors hailing) from the Caribbean.
Snowdon’s study can’t resolve this problem, but its central result is a strong difference in the treatment of white and non-white players, regardless of whether they are Black, Latino, or both. Since the main variable of interest is whether the umpire is of the same race as the pitcher or hitter, a player’s specific demographic background matters less for these statistics, as long as it differs from the ump’s. Snowdon collected his information from a combination of Wikipedia pages, country-of-origin information, manual inspection of photos, and other sources, and even somewhat inaccurate demographic data is often enough to detect bias. Indeed, for this discrimination to be a false finding would require a very high level of racial misclassification, which seems unlikely.
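A back-of-the-envelope model shows why labeling errors tend to hide a bias rather than invent one. The sketch assumes, purely for illustration, that each pitch’s same-race label is flipped at random with probability `flip_rate`, independent of the call; `share_same` is the (hypothetical) fraction of pitches that truly are same-race. Under that assumption, mislabeling only dilutes the measured gap toward zero, and the gap disappears only when labels are no better than a coin flip. Real misclassification need not be this well behaved, and flipping the match label directly is a simplification of misclassifying individual players.

```python
def observed_gap(true_gap: float, share_same: float, flip_rate: float) -> float:
    """Expected same-race minus different-race gap in favorable-call rates when
    each pitch's same-race label is flipped at random with probability flip_rate.

    Assumes non-differential misclassification: the flip is independent of the call.
    """
    s, m = share_same, flip_rate
    # Bayes' rule: chance a pitch labeled "same" really is same-race, and likewise for "different".
    p_same_given_same_label = s * (1 - m) / (s * (1 - m) + (1 - s) * m)
    p_same_given_diff_label = s * m / (s * m + (1 - s) * (1 - m))
    # Each labeled group is a mixture of true groups, so only the difference in mixtures survives.
    return true_gap * (p_same_given_same_label - p_same_given_diff_label)


# A hypothetical 0.3-point true gap, with half of pitches truly same-race:
for flip_rate in (0.0, 0.1, 0.2, 0.4, 0.5):
    print(flip_rate, round(observed_gap(0.003, 0.5, flip_rate), 5))
```

With `share_same` at 0.5 the measured gap works out to `true_gap * (1 - 2 * flip_rate)`: a 20 percent error rate shrinks the gap by 40 percent but does not erase it.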
A downside of the approach that focuses on whether the ump and player have the same race is that it makes an implicit assumption that Hispanic umps will be as biased against white players as white umps are against Latinos. (The study itself uses the term “Hispanic,” which is a linguistic category, while Latino is an ethnic one.) For various reasons, the inherent, structural racism that forms the historical background of our society prime among them, that’s unlikely to be the case. And in fact, in further analysis, Snowdon finds that Hispanic umps display a bias against non-white players, not against white players as the initial approach assumes. This finding echoes a large body of research in policing showing that hiring more BIPOC officers does not always defuse racial disparities. There is pressure to play the part, and in umpiring, that may mean making slightly biased calls.
A missing piece in the study is that catchers play a big role in how a pitch is called, and their race or ethnicity may be relevant as well. Unfortunately, due to biases in how players of different races get channeled into certain positions, there is a severe dearth of Black and Asian catchers (but not Latino catchers). That makes studying racial biases at the position significantly harder, albeit not impossible.
This thesis isn’t the final word on the existence or impact of discrimination in the league. However, as data accumulates at an ever finer grain, we are building an increasingly powerful microscope to isolate and quantify racial bias. By comparison, previous studies, working with worse-quality data over a shorter timespan, were less able to detect discrimination.
The study prompts a number of follow-up questions, but it also adds to the growing case for robot umpires. With the advent of more advanced tracking technology, some of the technical issues that plagued earlier iterations of Statcast are diminishing, while evidence continues to pile up, from this study and elsewhere, that umpires’ calls respond to variables that have nothing to do with the pitch itself. Even if these errors are rare, it is worth investing in a system that does not react to a player’s race, status, age, or prestige, making for a game that is more welcoming to talented players wherever they come from and whoever they are.
This is a free article. If you enjoyed it, consider subscribing to Baseball Prospectus. Subscriptions support ongoing public baseball research and analysis in an increasingly proprietary environment.
Do we have to use the word abolish? Couldn't we use a phrase that's more appealing to centrists, like "sensible reform" or "bias training" or "work together with umpires to create a more just environment in baseball"?
That's about 1 missed call every 20 games for a batter, about 7 or 8 per full season.
It's about 2 every 7 games for a starting pitcher, or about 9-10 per full season.
It's about 1 every 17.5 relief appearances for a reliever, or about 3-4 per season for a reliever who pitches about 51-72 games in a season.
Estimate the impact of one pitch call going the wrong way at a gain or loss of about .300 OPS per changed pitch (about .100 when the batter is ahead in the count, and about .400 when even or behind). The net effect of gaining or losing .300 OPS per missed call is then about .004 OPS per player-season.
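The commenter’s arithmetic can be roughly reproduced as follows. The season workloads (150 games for an everyday batter, 32 starts, about 600 plate appearances) are assumptions filled in for illustration, and spreading an OPS swing evenly over plate appearances is only an approximation, since slugging percentage is computed per at-bat rather than per plate appearance.

```python
# Hypothetical full-season workloads (assumptions, not figures from the comment).
BATTER_GAMES = 150
STARTER_STARTS = 32
RELIEVER_APPEARANCES = (51, 72)
BATTER_PA = 600
OPS_SWING_PER_CALL = 0.300  # commenter's average swing per changed call

batter_calls = BATTER_GAMES / 20                          # 1 changed call every 20 games
starter_calls = STARTER_STARTS * 2 / 7                    # 2 every 7 starts
reliever_calls = [g / 17.5 for g in RELIEVER_APPEARANCES] # 1 every 17.5 appearances

# Spread the per-call OPS swing across a full season of plate appearances.
ops_per_season = batter_calls * OPS_SWING_PER_CALL / BATTER_PA

print(f"batter: {batter_calls:.1f} calls, starter: {starter_calls:.1f}, "
      f"reliever: {reliever_calls[0]:.1f}-{reliever_calls[1]:.1f}")
print(f"net OPS effect per player-season: {ops_per_season:.5f}")
```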
Is it wrong? Yes. Should something be done to change it? Why not, if possible? It's awfully hard to think of what can be done other than robot umps. Anything else might be disproportionate to the size of the problem. Is it significant? Depends on what you think significant is. Really, compared to other disparate impacts throughout society, this is small potatoes. I thought it might be considerably worse than it actually is. Ignore it? Fortunately, we don't have to.
Robot umps would take care of this problem, I would think. And as an added benefit, it would eliminate a pretty large number of ejections, at least half of which have to be over ball-strike calls, and probably much more than half. And that would significantly cut whatever disparity exists in that area, just by brute force reduction in total ejections.
Of course, that might eliminate some catchers who are around mainly for their strike-zone warping receiving abilities. If those catchers are disproportionately Latino, then that might initiate a different problem.
Unless I'm missing something, his conclusions seem like an enormous stretch to me.
With a 0.3-percentage-point difference, this looks more like a textbook example of a model having such strong statistical power that even an extremely tiny difference is "statistically significant," which is why measures of effect size are so important. Just because a statistical test gives a p-value < .05, that doesn't mean the difference is practically meaningful.
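The point about sample size can be illustrated with a standard pooled two-proportion z-test. The missed-call rates (10.0 percent vs. 10.3 percent) and sample sizes below are hypothetical, chosen only so the two groups differ by 0.3 percentage points: the same gap is far from significant with 10,000 pitches per group and overwhelmingly significant with a million, which is why the size of the effect has to be judged separately from its p-value.

```python
import math


def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test. Returns (z statistic, two-sided p-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical missed-call rates 0.3 percentage points apart, at two sample sizes.
for n in (10_000, 1_000_000):
    z, p = two_proportion_z(0.103, 0.100, n, n)
    print(f"n={n:>9,} per group: z={z:.2f}, p={p:.2g}")
```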