“I never questioned the integrity of an umpire. Their eyesight, yes.”
— Leo Durocher
“Whenever you have a tight situation and there’s a close pitch, the umpire gets a squawk no matter how he calls it.”
–Red Barber
—
Umpires have a difficult job. Not only is their work publicly visible (and viewable over and over again through the miracle of instant replay), but when the job is done well, no accolades come their way. Players at least can enjoy the adulation of the fans when they excel, despite knowing too well that cheers can turn to boos in a heartbeat. With the availability of PITCHf/x data, as reported through MLB.com’s Gameday application, we now have another tool with which to judge the men in blue. So, on the heels of some tabulation here at BP, today we’ll look at what we can learn about the accuracy and psychology of home plate umpires.
Accuracy
Before we delve into the data, a couple of caveats are in order. To decide whether a particular pitch crossed the strike zone, we use a model of the zone that provides a one-inch buffer for both called balls and called strikes, as shown in the following diagram.
![image 1](news/images/6502_01.gif)
The width of the regulation strike zone is set by the width of the plate and is constant, while the height of the zone depends on the batter and is input by the PITCHf/x operator for each plate appearance using the rule book definition. We use the one-inch buffer because the PITCHf/x system is reported to track a pitch to within an inch of its actual location, so we want to count as a strike any pitch where any portion of the ball (assuming a diameter of 2.9 inches) touches the zone in green, and count as a ball any pitch that does not touch the zone in blue. This gives the full benefit of the doubt to the umpire. In addition, keep in mind that the system determines the location of the pitch in two dimensions at the front of home plate.
Since the strike zone is three-dimensional, some pitches that catch part of the zone behind the front of the plate are indeed strikes, yet our calculations may call them balls. Finally, there has been some recent discussion suggesting that the reported location data is inaccurate for some subset of the pitches. I’m working on this, as are several others, to determine how large the problem is.
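The buffered zone described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the numbers in this article. It assumes that a called strike is checked against the rule book zone grown by one inch and a called ball against the zone shrunk by one inch, which is one reading of "full benefit of the doubt" (the diagram's green and blue zones are not reproduced here). The field names `px`, `pz`, `sz_top`, and `sz_bot` are illustrative names for the pitch location and zone limits in feet, and the corners of the zone are treated as square for simplicity.

```python
from dataclasses import dataclass

PLATE_HALF_WIDTH_FT = 17.0 / 2 / 12  # home plate is 17 inches wide
BALL_RADIUS_FT = 2.9 / 2 / 12        # ball diameter of 2.9 inches, per the text
BUFFER_FT = 1.0 / 12                 # one-inch allowance for tracking error


@dataclass
class Pitch:
    # All field names are assumptions made for this sketch.
    px: float      # horizontal location at the front of the plate, feet from center
    pz: float      # height at the front of the plate, feet
    sz_top: float  # top of this batter's zone, feet
    sz_bot: float  # bottom of this batter's zone, feet


def touches_zone(pitch: Pitch, margin_ft: float) -> bool:
    """True if any part of the ball touches the rule book zone grown by
    margin_ft (a negative margin shrinks the zone). Corners are square."""
    reach = BALL_RADIUS_FT + margin_ft
    in_width = abs(pitch.px) <= PLATE_HALF_WIDTH_FT + reach
    in_height = pitch.sz_bot - reach <= pitch.pz <= pitch.sz_top + reach
    return in_width and in_height


def call_agrees(pitch: Pitch, called_strike: bool) -> bool:
    """Assumed benefit-of-the-doubt rule: a called strike agrees if the ball
    touches the enlarged zone; a called ball agrees if it misses the shrunken one."""
    if called_strike:
        return touches_zone(pitch, +BUFFER_FT)
    return not touches_zone(pitch, -BUFFER_FT)


middle = Pitch(px=0.0, pz=2.5, sz_top=3.5, sz_bot=1.5)
edge = Pitch(px=0.83, pz=2.5, sz_top=3.5, sz_bot=1.5)   # ball just grazing the edge
wide = Pitch(px=1.5, pz=2.5, sz_top=3.5, sz_bot=1.5)
```

Under this rule a pitch on the edge agrees with either call, so only pitches clearly inside or clearly outside the zone can count against the umpire.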
To calculate how well umpires are doing, we’ll use the same basic methodology that we used when examining whether or not certain pitchers and hitters get the benefit of the doubt. By counting the number of strikes called by the umpire and how many of those we think were regulation strikes, we can create a called strike agreement percentage (CSAgree%). We can then do the same for called balls (CBAgree%), and calculate an overall agreement percentage (Agree%).
Finally, we create a derived metric that measures to what extent the umpire has favored pitchers. PAdv is the number of extra strikes awarded to the pitcher minus the number of actual strikes that were not called as such, and PAdv% is PAdv divided by the total number of pitches. A positive value for either metric indicates an advantage for the pitcher, and its size tracks the magnitude of that advantage.
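As a rough sketch of how these five metrics fit together, the function below computes them from four counts. The function and argument names are made up for illustration. The example uses Jeff Nelson's row from the table that follows; the two "agreed" counts (187 and 365) are not published in the table and are back-calculated here from his rounded percentages, so treat them as an assumption.

```python
def agreement_metrics(cs: int, cs_agree: int, cb: int, cb_agree: int) -> dict:
    """cs / cb: called strikes and called balls.
    cs_agree / cb_agree: how many of those calls match the model zone."""
    pitches = cs + cb
    extra_strikes = cs - cs_agree    # called strikes the model says were balls
    missed_strikes = cb - cb_agree   # called balls the model says were strikes
    padv = extra_strikes - missed_strikes
    return {
        "CSAgree%": cs_agree / cs,
        "CBAgree%": cb_agree / cb,
        "Agree%": (cs_agree + cb_agree) / pitches,
        "PAdv": padv,
        "PAdv%": padv / pitches,
    }


# Jeff Nelson: 234 called strikes, 372 called balls (agreed counts back-calculated).
nelson = agreement_metrics(cs=234, cs_agree=187, cb=372, cb_agree=365)
for name, value in nelson.items():
    print(name, round(value, 3))
```

Rounded to three places, these inputs give .799, .981, .911, 40, and .066, matching his row in the table.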
Here are all umpires who have called 500 or more balls and strikes behind the plate ordered by Agree%.
| Umpire | Pitches | CS | CSAgree% | CB | CBAgree% | Agree% | PAdv | PAdv% |
|---|---|---|---|---|---|---|---|---|
| Eric Cooper | 519 | 162 | .877 | 357 | .964 | .936 | 7 | .013 |
| Tim McClelland | 922 | 275 | .844 | 647 | .975 | .936 | 27 | .029 |
| Marvin Hudson | 856 | 270 | .885 | 586 | .956 | .933 | 5 | .006 |
| Gerry Davis | 1068 | 331 | .882 | 737 | .953 | .931 | 4 | .004 |
| Jim Reynolds | 1024 | 306 | .827 | 718 | .968 | .926 | 30 | .029 |
| Ed Montague | 870 | 266 | .850 | 604 | .952 | .921 | 11 | .013 |
| Tony Randazzo | 590 | 172 | .837 | 418 | .955 | .920 | 9 | .015 |
| Angel Hernandez | 1652 | 510 | .833 | 1142 | .957 | .919 | 36 | .022 |
| Tim Tschida | 930 | 287 | .826 | 643 | .960 | .918 | 24 | .026 |
| Randy Marsh | 1017 | 316 | .861 | 701 | .943 | .917 | 4 | .004 |
| Joe West | 797 | 262 | .817 | 535 | .964 | .916 | 29 | .036 |
| Wally Bell | 545 | 191 | .812 | 354 | .972 | .916 | 26 | .048 |
| James Hoye | 767 | 223 | .821 | 544 | .954 | .915 | 15 | .020 |
| Paul Emmel | 884 | 303 | .838 | 581 | .955 | .915 | 23 | .026 |
| Hunter Wendelstedt | 830 | 264 | .852 | 566 | .943 | .914 | 7 | .008 |
| Derryl Cousins | 932 | 302 | .864 | 630 | .938 | .914 | 2 | .002 |
| Chris Guccione | 892 | 288 | .823 | 604 | .954 | .911 | 23 | .026 |
| Jim Joyce | 1172 | 358 | .799 | 814 | .961 | .911 | 40 | .034 |
| Jeff Nelson | 606 | 234 | .799 | 372 | .981 | .911 | 40 | .066 |
| Sam Holbrook | 1175 | 355 | .848 | 820 | .938 | .911 | 3 | .003 |
| Charlie Reliford | 537 | 187 | .834 | 350 | .951 | .911 | 14 | .026 |
| Alfonso Marquez | 983 | 297 | .818 | 686 | .950 | .910 | 20 | .020 |
| Dale Scott | 1020 | 353 | .836 | 667 | .949 | .910 | 24 | .024 |
| C.B. Bucknor | 567 | 186 | .855 | 381 | .934 | .908 | 2 | .004 |
| Bob Davidson | 1111 | 360 | .786 | 751 | .967 | .908 | 52 | .047 |
| Jeff Kellogg | 746 | 234 | .855 | 512 | .932 | .908 | -1 | -.001 |
| Mark Carlson | 1377 | 465 | .800 | 912 | .959 | .906 | 56 | .041 |
| Rick Reed | 950 | 318 | .814 | 632 | .951 | .905 | 28 | .029 |
| Brian O'Nora | 779 | 251 | .785 | 528 | .960 | .904 | 33 | .042 |
| Greg Gibson | 563 | 157 | .834 | 406 | .929 | .902 | -3 | -.005 |
| Tim Timmons | 1156 | 367 | .807 | 789 | .947 | .902 | 29 | .025 |
| Bill Miller | 982 | 313 | .815 | 669 | .943 | .902 | 20 | .020 |
| Brian Knight | 967 | 319 | .777 | 648 | .963 | .902 | 47 | .049 |
| Dan Iassogna | 587 | 197 | .807 | 390 | .946 | .899 | 17 | .029 |
| Kerwin Danley | 1049 | 318 | .821 | 731 | .932 | .898 | 7 | .007 |
| Gary Cederstrom | 852 | 272 | .794 | 580 | .947 | .898 | 25 | .029 |
| Marty Foster | 754 | 261 | .782 | 493 | .959 | .898 | 37 | .049 |
| Mike Winters | 940 | 297 | .825 | 643 | .932 | .898 | 8 | .009 |
| Larry Young | 1557 | 474 | .793 | 1083 | .943 | .897 | 36 | .023 |
| Paul Schrieber | 603 | 169 | .852 | 434 | .915 | .897 | -12 | -.020 |
| Brian Gorman | 1090 | 385 | .795 | 705 | .952 | .896 | 45 | .041 |
| Mike Everitt | 523 | 163 | .761 | 360 | .956 | .895 | 23 | .044 |
| Ted Barrett | 1191 | 410 | .785 | 781 | .950 | .893 | 49 | .041 |
| Bill Welke | 522 | 182 | .835 | 340 | .924 | .893 | 4 | .008 |
| Rob Drake | 673 | 217 | .797 | 456 | .936 | .892 | 15 | .022 |
| Mark Wegner | 532 | 182 | .813 | 350 | .926 | .887 | 8 | .015 |
| Ed Rapuano | 999 | 326 | .801 | 673 | .929 | .887 | 17 | .017 |
| Tom Hallion | 789 | 266 | .767 | 523 | .945 | .885 | 33 | .042 |
| Ed Hickox | 573 | 177 | .763 | 396 | .934 | .881 | 16 | .028 |
| Ron Kulpa | 1018 | 330 | .782 | 688 | .929 | .881 | 23 | .023 |
| Doug Eddings | 603 | 227 | .758 | 376 | .955 | .881 | 38 | .063 |
| Larry Vanover | 823 | 258 | .771 | 565 | .929 | .880 | 19 | .023 |
| Chuck Meriwether | 1056 | 328 | .805 | 728 | .911 | .878 | -1 | -.001 |
| Dana DeMuth | 768 | 251 | .821 | 517 | .905 | .878 | -4 | -.005 |
| Paul Nauert | 547 | 172 | .738 | 375 | .936 | .874 | 21 | .038 |
| Brian Runge | 811 | 263 | .787 | 548 | .907 | .868 | 5 | .006 |
Overall, agreement for this group was 81.5 percent on called strikes, 94.6 percent on called balls, and 90.4 percent across all calls. From this list we see that Eric Cooper and Tim McClelland have enjoyed the most agreement with the regulation strike zone at 93.6 percent, while Brian Runge finds himself at the bottom at 86.8 percent by “missing” 107 of the 811 pitches that he’s called. In terms of called strikes, Paul Nauert had the lowest agreement at 73.8 percent, and Marvin Hudson the highest at 88.5 percent. Dana DeMuth had the lowest called ball agreement at 90.5 percent, and Jeff Nelson the highest with 98.1 percent. Finally, Jeff Nelson can be said to have benefited the man on the mound the most by swinging 6.6 percent of the pitches he’s called their way; in other words, pitchers have gotten the advantage on 40 more pitches than hitters out of the 606 he’s called. On the other hand, Paul Schrieber has given the advantage to the hitter on 12 of 603 pitches, or 2 percent.
When looking at lists like these, it’s important to ask whether what we’re seeing reflects a real difference in how these umpires approach their jobs, or whether the differences can be fully explained by random variation. After all, if the distribution here is random, then we can’t attribute the differences to the umpires themselves.
One way of doing this would be to see if our list comports with anecdotal evidence and the judgments of broadcasters, players, coaches, and managers. Is Jeff Nelson really a pitcher’s umpire, and is Paul Schrieber actually more favorable to hitters? Absent a survey of those on the field, one thing we can do is to analyze the variation in the data to determine if there is more variation than we would expect, using the idea that the observed variance equals the variance due to randomness plus the variance due to underlying skill.
When we do this for Agree%, for example, we find that there are indeed real differences in how pitches are called, and that they account for 60 percent of the variance in the data. Given the underlying skill difference between umpires, we would then expect 68 percent of umpires to actually fall between .892 and .916, and 95 percent of them to fall between .881 and .928. Still, at less than 5 percentage points that range is small, even at the extremes, and means that the best umpires may miss (in terms of the regulation strike zone) around six fewer pitches than the average umpire, and the worst about six more. When we follow the same procedure for PAdv% we find that 90 percent of the variance reflects skill differences. Once again, however, the range is small: 95 percent of the umpires should actually fall between -.009 and .055, a difference of 15 or so pitches a game.
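For readers who want to try the decomposition themselves, here is a minimal Python sketch of one common way to split observed variance into a random part and a skill part. It assumes that each umpire's random variance is binomial, p(1 - p)/n, using the pooled agreement rate; the article does not spell out its exact procedure, so this is an assumption. The nine rows are every seventh umpire from the table above plus the last one, so the output will not reproduce the 60 percent figure, which was computed on all umpires.

```python
from math import sqrt
from statistics import mean, pvariance

# (pitches called, Agree%) for a subset of umpires from the table.
sample = [
    (519, 0.936),   # Eric Cooper
    (1652, 0.919),  # Angel Hernandez
    (830, 0.914),   # Hunter Wendelstedt
    (983, 0.910),   # Alfonso Marquez
    (779, 0.904),   # Brian O'Nora
    (852, 0.898),   # Gary Cederstrom
    (1191, 0.893),  # Ted Barrett
    (1018, 0.881),  # Ron Kulpa
    (811, 0.868),   # Brian Runge
]


def variance_split(rows):
    """Return (observed, random, skill) variance of the rates in rows."""
    rates = [rate for _, rate in rows]
    total = sum(n for n, _ in rows)
    pooled = sum(n * rate for n, rate in rows) / total
    observed = pvariance(rates)
    # Assumed binomial noise for each umpire, averaged over the group.
    random_part = mean(pooled * (1 - pooled) / n for n, _ in rows)
    skill_part = max(observed - random_part, 0.0)
    return observed, random_part, skill_part


observed, random_part, skill_part = variance_split(sample)
skill_share = skill_part / observed
center = mean(rate for _, rate in sample)
skill_sd = sqrt(skill_part)

print(f"share of variance attributed to skill: {skill_share:.2f}")
print(f"68% range: {center - skill_sd:.3f} to {center + skill_sd:.3f}")
print(f"95% range: {center - 2 * skill_sd:.3f} to {center + 2 * skill_sd:.3f}")
```

The 68 and 95 percent ranges come from taking one and two standard deviations of the skill component around the group mean, which is how the ranges quoted in the text appear to be constructed.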
Psychology
Several years ago I took my older daughter on a fossil hunting trip in the Cretaceous Badlands of western Kansas, in what was then the Western Interior Seaway. After just a few minutes of struggling to see the bits of fossilized bone, shell, and teeth that our guide could see so well, the concept of “search image” had become crystal clear for both my daughter and me. The basic idea is that we see what we’re trained to see. Our minds interpret the data coming from our eyes using predefined patterns that have been influenced and built up from experience. So what was, to our guide, clearly a shark tooth of the genus Cretoxyrhina, literally right in front of our noses, was for us simply another piece of jagged rock. I’m happy to report that we eventually caught on and made a contribution or two as the day wore on.
I was reflecting on this experience as I examined the called ball and strike accuracy of umpires when broken down by count. To understand why, examine the following table, keeping in mind that the mean CSAgree% is 81.4 percent and the mean CBAgree% is 94.6 percent:
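| Count | Pitches | CS | CSAgree% | CB | CBAgree% | Agree% |
|---|---|---|---|---|---|---|
| 1-0 | 6653 | 2690 | .803 | 3963 | .945 | .887 |
| 2-0 | 2349 | 1078 | .828 | 1271 | .943 | .891 |
| 3-0 | 1155 | 710 | .883 | 445 | .948 | .908 |
| 1-1 | 5066 | 1199 | .772 | 3867 | .950 | .908 |
| 1-2 | 3871 | 400 | .670 | 3471 | .962 | .932 |
| 2-1 | 2359 | 616 | .756 | 1743 | .952 | .901 |
| 2-2 | 2637 | 336 | .732 | 2301 | .969 | .939 |
| 3-1 | 1048 | 388 | .799 | 660 | .952 | .895 |
| 3-2 | 1168 | 184 | .788 | 984 | .966 | .938 |
| 0-0 | 20415 | 8960 | .837 | 11455 | .928 | .888 |
| 0-1 | 6971 | 1450 | .790 | 5521 | .944 | .912 |
| 0-2 | 3078 | 233 | .695 | 2845 | .968 | .947 |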
Several of these numbers differ in a statistically significant way from the overall mean at the 95 percent confidence level. Notice how far they deviate from the means: at 3-0, over 88 percent of called strikes are actually strikes, while at 1-2 and 0-2 the percentages drop to 67 percent and 69.5 percent, respectively. In other words, at 3-0 (and 2-0 to a lesser extent), umpires are more likely to see the pitch as a ball, and with two strikes (likewise at 2-2), they’re more likely to see the pitch as a strike.
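One simple way to check a claim like this is a two-sided z-test for a proportion, comparing each count's CSAgree% against the overall mean of 81.4 percent. The sketch below is an assumption about how such a test could be run, not necessarily the test used for this article, and it treats the overall mean as a fixed baseline even though it includes the counts being tested. The called strike totals and percentages come from the table by count.

```python
from math import sqrt

MEAN_CS_AGREE = 0.814  # overall mean CSAgree% quoted in the text
Z_CRITICAL_95 = 1.96   # two-sided critical value at the 95 percent level

# count -> (called strikes, CSAgree%) for the counts singled out in the text
by_count = {
    "3-0": (710, 0.883),
    "1-2": (400, 0.670),
    "0-2": (233, 0.695),
}


def z_score(n: int, rate: float, baseline: float = MEAN_CS_AGREE) -> float:
    """Standard errors between an observed rate and the baseline."""
    standard_error = sqrt(baseline * (1 - baseline) / n)
    return (rate - baseline) / standard_error


def differs_at_95(n: int, rate: float) -> bool:
    return abs(z_score(n, rate)) > Z_CRITICAL_95


for count, (n, rate) in by_count.items():
    print(count, round(z_score(n, rate), 2), differs_at_95(n, rate))
```

A positive z-score means called strikes at that count agree with the zone more often than average (umpires are stingier with strikes), and a negative one means they agree less often.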
Note that with two strikes this is exactly the opposite of what the pitcher intends: experience tells us that pitchers typically try to get hitters to chase in those counts, which should result in more thrown balls. One possible explanation is that umpires, even in the short span of several pitches, have their search image modified, and as a result tend to model their calls on the prevailing trend.
What this indicates is that while umpires may, in the words of George Will, be “natural republicans – dead to human feelings,” they are prone to at least some of the same biases and perceptions as the rest of us.