A little more than a week ago, Jon Heyman of CBS sent out a tweet wondering why it was that Starling Marte and Bryce Harper had the same WAR. Heyman was quoting Baseball-Reference's version of WAR, which at that moment in time showed Marte and Harper tied at 1.7 wins. Harper had clearly been the superior hitter, but drilling down, it turned out that the fielding metric used by Baseball-Reference loved Marte's defense enough (and thought Harper's was average enough) to call them equals.
The problem with any sort of number this early in the season is that on many measurements, we're still at a point where players haven't logged enough playing time for the measure to be considered reliable. But of course, some measures are more reliable than others. The more reliable a measure, the sooner we can be confident that it actually reflects what the player's talent level was during that time. The less reliable it is, the more likely it is that there will be fluky spikes and valleys over short (and sometimes long) periods of time. Fielding metrics are an estimate of how many outs a player saved from Opening Day onward, and what that was worth. However, just as a player who went 3-for-4 on Opening Day is technically a .750 hitter for the moment without really being a .750 hitter, a fielding metric might need some time to stabilize before we get a good read on what's going on.
There's been research on how quickly various batting and pitching statistics stabilize, but in general, few people have asked the question of how reliable our fielding metrics are. One reason is that several of the most commonly cited fielding metrics (UZR, among others) are built on proprietary batted-ball data, which makes them hard for outside researchers to test.
Warning! Gory Mathematical Details Ahead!
There is a publicly available data set that has batted ball type and hit location data for Major League Baseball. Retrosheet (put them in the Hall of Fame!) data files from 1993-1999 have the type of ball hit (grounder, fly ball, line drive), as well as zone data on where the ball was hit. This isn't the ideal data set for a few reasons. First, the zones aren't very granular, and they were input by stringers scoring the game from the press box, so the difference between a line drive and a fly ball might be in the eye of the beholder. Also, the youngest of these data are old enough to be enrolling in high school this fall. However, if anyone would like to show me a publicly available data set that is better…
I started by looking at ground balls for infielders. First, I calculated which zones "belonged" to an infielder. For each zone, I looked at which infielder(s) made the play at least 25 percent of the time (when the ball did not scoot through) over all seven years in the data set. When a zone had more than one fielder assigned to it (for example, a ball in the 56 zone, between short and third, might belong to either the shortstop or the third baseman), I did not penalize the third baseman for not fielding the ball if the shortstop got there first. It simply went down as a "no play" for the third baseman. Conversely, I did not reward the shortstop for somehow making a play in short right field. (What the heck was he doing out there anyway?)
My criterion for success was whether or not an out was recorded on the ground ball (either by force out or just good ol' throwing the ball to first). I played around with alternative definitions, such as whether he got to the ball (regardless of whether he finished the play) and whether he both fielded and threw cleanly. (If the first baseman dropped the throw, whose fault is that?) It didn't change the results all that much. All events were coded 0/1 (not out/out).
This is a simpler model than is actually used in the major defensive metrics. What I've created here is a basic "outs per ball in zone" metric. The more developed measures control for more factors and adjust for the difficulty of each play, and they are better off for it. But then again, all defensive metrics boil down to "How many balls was he near and how many did he turn into outs?" I'm happy to concede that I'm dealing with a rough approximation and that your mileage may vary if your model is fueled by more granular data. But this ought to give us some order of magnitude to work with.
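For the programmatically inclined, here's a rough sketch of that bookkeeping in Python. The record layout (a zone code, the fielder who made the play, and a 0/1 out flag) and the function names are my own simplified stand-ins, not the actual Retrosheet field layout or the exact code behind the numbers below.

from collections import defaultdict

def assign_zones(ground_balls, threshold=0.25):
    """Map each hit-location zone to the infielder(s) who converted at least
    `threshold` of the grounders in that zone (so a shared zone like 56 can
    belong to both the shortstop and the third baseman)."""
    outs = defaultdict(lambda: defaultdict(int))  # zone -> fielder -> outs recorded
    totals = defaultdict(int)                     # zone -> balls that didn't scoot through
    for gb in ground_balls:                       # gb: {'zone': ..., 'fielder': ..., 'out': 0/1}
        if gb["out"]:
            totals[gb["zone"]] += 1
            outs[gb["zone"]][gb["fielder"]] += 1
    return {zone: {f for f, n in fielders.items() if n / totals[zone] >= threshold}
            for zone, fielders in outs.items()}

def outs_per_ball_in_zone(ground_balls, zone_owners):
    """Code each ball 0/1 for every fielder responsible for its zone. A ball
    converted by a co-responsible teammate is dropped as a 'no play', and a
    fielder gets no credit for plays made outside his own zones."""
    chances = defaultdict(list)                   # fielder -> list of 0/1 outcomes
    for gb in ground_balls:
        for fielder in zone_owners.get(gb["zone"], set()):
            if gb["out"] and gb["fielder"] != fielder:
                continue                          # a teammate got there first
            chances[fielder].append(1 if gb["out"] else 0)
    return chances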
I used the Kuder-Richardson formula 21 (KR-21) to look at reliability. KR-21 is specifically set up to measure reliability in binary outcomes. I considered the stat stable when KR-21 crossed .70. I looked at sample sizes of up to 600 balls per fielder, meaning that I can see stability numbers for sampling frames of up to 300 in real life. If a measure failed to reach .70 within the frame available, I used the Spearman-Brown prophecy formula to estimate the point at which it would cross that reliability line in the sand.
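For those who want the formulas themselves, here's a minimal sketch of KR-21 and the Spearman-Brown extrapolation; the function names are my own shorthand, and this is an illustration rather than the exact script used here.

from statistics import mean, pvariance

def kr21(score_matrix):
    """Kuder-Richardson formula 21. `score_matrix` is a list of equal-length
    0/1 lists, one row per fielder, each entry a chance coded out = 1."""
    k = len(score_matrix[0])                     # number of chances ("items")
    totals = [sum(row) for row in score_matrix]  # total outs per fielder
    m, var = mean(totals), pvariance(totals)
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var))

def chances_needed(r_observed, k_observed, r_target=0.70):
    """Spearman-Brown prophecy: how many chances would be needed to stretch
    an observed reliability out to the target (.70 here)."""
    lengthen = (r_target * (1 - r_observed)) / (r_observed * (1 - r_target))
    return lengthen * k_observed

So, for instance, if the reliability at 300 grounders came out at .50, chances_needed(0.50, 300) would put the .70 line at roughly 700 grounders. (Those inputs are made up for illustration; the real estimates follow.)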
The results for ground balls to the infielders:
First basemen: We need 290 GB at or near the first baseman before our crude measure of fielding stabilizes
Second basemen: 540 GB
Shortstops: 420 GB
Third basemen: 400 GB
Next, I looked at fly balls and pop-ups for all seven non-battery positions. I used the same basic logic, except that I assigned each zone to the fielder who made more than 50 percent of the plays in that zone. I excluded fly balls that left the park. Also, this does not include line drives, and catching those is largely a matter of luck. I coded each fly ball 0/1 based on whether or not the fielder caught the ball.
For infielders, I was only able to go out to a sampling frame of 200 pop-ups (so my top resolution was 100 pop-ups). For outfielders (who get more fly balls), I was able to go to 500 (so my estimates run to 250 fly balls).
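In terms of the sketch above, the fly-ball version is the same machinery with a stricter ownership threshold and a different pool of batted balls, something like:

fly_zone_owners = assign_zones(fly_balls, threshold=0.50)  # fly_balls: hypothetical list of in-park flies/pop-ups
fly_chances = outs_per_ball_in_zone(fly_balls, fly_zone_owners)

The results: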
First basemen: 48,000 pop-ups.* Really.
Second basemen: 400 pop-ups.
Shortstops: 320 pop-ups.
Third basemen: 3,240 pop-ups.*
Left fielders: 370 fly balls.
Center fielders: 280 fly balls.
Right fielders: 210 fly balls.
*Those corner infielder numbers are mostly the result of the fact that reliability numbers barely budged from zero in the tested sample. We'll talk a bit more about what this means in a minute.
To give some context around those numbers, the average team in 2012 had to take care of 6.3 ground balls and 4.7 fly balls/pop-ups per game. (Surprised?) That means that even if Starling Marte had played every inning of every game for the Pirates in left field and every single fly ball that the other team hit was hit his way, after 40 games, we would only expect him to have 188 fly balls hit his way, and that's only about halfway to a reliable measure of his outfield range. However, after 40 games, there are certain parts of Starling Marte's batting line that can be considered reliable.
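As a quick sanity check on that "halfway" claim, using the league rates above:

fly_balls_per_team_game = 4.7   # 2012 league average, from above
games = 40
lf_threshold = 370              # fly balls needed for a reliable left-field measure

expected = fly_balls_per_team_game * games  # 188 fly balls
print(expected / lf_threshold)              # ~0.51 -- about half of a reliable sample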
(Careful readers will note that I didn't address throwing arm stats, and that was more a matter of sample size than anything. In the past I've found that performances in throwing runners out on the bases aren't very stable year to year, primarily because there just aren't a lot of chances that a player gets to show off his arm.)
What it Means
Here's the dirty little secret about WARP. It's an amalgamation of a bunch of different measures, converted into the same denominator so that they all add up, and sold as a coherent whole. Packaging all of the parts together creates the illusion that they are all on equal footing with one another. They aren't. WARP is what happens when you add offense, defense, and baserunning (and compare them against a position-adjusted baseline). The problem is that we can be a lot more confident a lot more quickly that a player's offensive numbers accurately portray both how he's performed and what his true talent level has been over the course of a season. Colin Wyers has much more to say on that subject today.
With defensive numbers, that point of reliability just doesn't come that fast. It takes longer for a player's true colors to shine through on defense. When a guy like Starling Marte puts up a big defensive number, it describes what the metric saw him do to date, but we can't be completely confident that it captures how well he actually fielded during that time, and there's even more uncertainty about who he is deep down. And even if the metric isn't overstating his performance, we can't be sure whether he's really the best fielder in the league or just enjoyed a convenient spike of luck. Either way, we need to be careful to drill down a bit to see what is driving a high (or low) value and to frame our understanding accordingly.
Two points. First, the assumption that all grounders or pop flies are of equal difficulty is obviously wrong (not that you claim otherwise, to be sure), and it leads to the inclusion of lots of plays in your database that really don't contribute much in terms of discriminatory power. Any ground ball hit within 5 feet of a fielder is going to turn into an "accepted chance" for that fielder, to use 60-year-old terminology, unless the guy is immobile on the scale of a late-career Frank Howard. Those chances may shed light on the inadequacy of guys with real hands of stone, or a terminal case of the throwing yips (think Steve Sax or Chuck Knoblauch), but otherwise they don't contribute much except added statistical clutter.
Second, the contention that all that clutter "cancels out in the wash" is dubious, because not all fielders have the same proportion of non-trivial plays attributed to them. There are a number of reasons for that, ranging from the fielders' own reputations to the reputations of teammates to the surfaces they played on to the pitchers they played behind, and so on. In essence, they aren't all taking the same fielding exam -- which again is one of the key points about KR21.
Yes, I understand now that with the limitations of the data set, you probably can't do better. But with the "right" data set, that is, a reduced set that looks only at balls in play that really do have discriminating power, I'd be pretty confident that the numbers required to achieve some degree of stability would be much reduced, although you'd have to use a more powerful algorithm to test that claim.
I've heard Harper's rookie year characterized as "the greatest ever by a teenager." Tony Conigliaro sprang immediately to mind, so I checked the numbers. Using basic stats, in Tony C's rookie season at age 19 in 1964 he hit .290/.354/.530 with an OPS+ of 137. Harper at age 19 hit .270/.344/.477, OPS+ 120. Tony C's rate stats were better, and it certainly appears that he had the better teenage season. But then you turn to WAR... Conigliaro's was 1.6, Harper's 5.2, and I'm sure that's where analysts trumpeting Harper as the best teenager ever are basing their statements. That's a big difference, and it appears to be in dWAR. Was Tony C that bad a defender? And given that it was 1964, how do we know? Should we even compare the two using strictly WAR?
The broader point is that Harper's claim to being the best teenager ever rests a good deal on his superior (in the eyes of the metrics) fielding performance. Conigliaro was clearly the more productive hitter. We can put more faith in the reliability of those hitting metrics than the defensive metrics. There's a decent case to be made that Conigliaro deserves a second look and that the case in favor of Harper is not so clear cut.
Still, I remember an article on this site years ago that quantified the defensive value of a good defensive catcher vs. a good offensive catcher, and the defensive metrics at the time showed a completely negligible difference between the best and worst defensive catchers. So small that if it were that small at the toughest position on the field, there would never be a reason to consider defense when filling out a lineup card as long as the player was adequate. Fast forward to last year, where I saw Yadier Molina as a top-10 WAR player and Mike Trout rack up more WAR than Miguel Cabrera despite Miggy's significant edge not just in the triple crown stats but in OPS*, and I could only conclude that the brand new stats had gone from underrating defense to overrating it.
This article confirms what I've suspected ever since the MVP debate: the formula overstates defense and baserunning. Just because all three are part of the game doesn't mean they are equal, and until a corrective factor is found, supporters of WAR should hold off on calling non-supporters dinosaurs. Either that or start hyping Marte as much as Harper.
*(Before anyone tries to read team-based bias into my arguments, my favorite Tiger is Justin Verlander and I still voted for David Price for Cy Young last year on the strength of his ERA.)
From looking at Trout's performance last year, it appears around 3/4 of his WAR was generated by his bat. It's simply not true to say that hitting, baserunning and defence are equally valued.
On the other hand, what is possibly more broken is PWARP. It has gotten to the point that one of the BP writers actively dismisses PWARP. It is very easy to find examples where "worth" as defined by PWARP appears to be contrary to other more traditional, but still acceptable, metrics. As a simple example:
Barry Zito
Year  IP   W-L   QS-BQS  ERA
2010  199  9-14  19-1    4.15
2012  186  15-8  17-0    4.15
After a quick glance at the numbers and ignoring W-L, one would suspect the two years are nearly equivalent, but by PWARP, Zito was nearly 2 wins more valuable in 2010 than 2012.
And at 7-0 with 5 QS out of 8 and a 2.44 ERA, Matt Moore has been nearly replaceable this year: PWARP = 0.3.
Hopefully some food for thought.
I have no problem with this interpretation.
So as a metric, what is the error associated with it? It appears to be a systematic error, thus affecting its accuracy. Given sufficient samples, the random error should disappear. Or is it a random error, in the sense that all FIP pitchers eventually regress toward the mean?