At his website, Bill James recently published a column entitled “Judge and Altuve,” as well as a follow-up column. Therein, James argues that Wins Above Replacement (WAR) is wrong to evaluate Aaron Judge’s run contributions as equivalent in “win” value to those of Jose Altuve, because the Astros won more games than the Yankees.
The backdrop for the criticism is this: wins obviously arise from runs, specifically from the difference between the runs a team scores and the runs it allows. The question is how many runs should be considered equivalent to a win, and whether that value should be static or dynamic.
James’ argument, as I understand it, is that there needs to be a 100 percent equivalency between the games a team actually wins and the runs it actually scores or prevents. Thus, his “run-to-win” value would be dynamic and vary by team. WAR(P), by contrast, uses the overall league-average relationship between runs and wins to assign win value. James’ ire was focused on Baseball Reference’s WAR measurements in particular (Altuve 8.3; Judge 8.1), but the criticism applies to any system with a similar philosophy, and he does not limit it to MVP evaluation. Rather, it is clear that James sees the MVP situation as a symptom of a larger defect in how WAR operates.
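To make the distinction concrete, here is a minimal sketch in Python of the two conversion philosophies, along with James’ own Pythagorean expectation for turning run differential into expected wins. The runs-per-win constant, the exponent, and the totals below are illustrative assumptions, not actual 2017 figures.

```python
# Two ways to convert a run contribution into wins, plus the Pythagorean
# expectation. All constants here are illustrative assumptions.

def pythagorean_win_pct(runs_scored: float, runs_allowed: float,
                        exponent: float = 1.83) -> float:
    """James' Pythagorean expectation; 1.83 is one common exponent choice."""
    rs, ra = runs_scored**exponent, runs_allowed**exponent
    return rs / (rs + ra)

LEAGUE_RUNS_PER_WIN = 9.7  # assumed league-average conversion; varies by season

def wins_league_average(run_contribution: float) -> float:
    """Static conversion: one league-wide runs-to-wins rate for everybody."""
    return run_contribution / LEAGUE_RUNS_PER_WIN

def wins_team_specific(run_contribution: float,
                       team_runs_per_win: float) -> float:
    """Dynamic conversion: the rate floats with the player's own team."""
    return run_contribution / team_runs_per_win

# A hypothetical 50-run contributor, valued both ways:
print(wins_league_average(50.0))      # ~5.2 wins at the league rate
print(wins_team_specific(50.0, 8.5))  # ~5.9 wins on a team that won "cheaply"
print(pythagorean_win_pct(858, 700))  # expected win pct from run differential
```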
While his argument is interesting, I don’t think it proves as much as James seems to believe. It is fine to point out that WAR—by using the MLB average relationship of runs to wins—can produce some curious results on the margins; but that does not mean those incongruities are automatically meaningful or indicative of a true problem. Back-fitting a team’s actual wins to a player’s value, as James suggests, repackages the same problem in a different form: we still have hitters who get on base but don’t get driven in, or pitchers who keep the ball on the ground but have poor fielders behind them, and we then have to decide how to fairly adjust for those situations.
In fact, James built his career upon observing and skewering such incongruities, so it seems rather strange for him to criticize a more statistically reasonable approach—using the grand mean value of a run to the entire league—as opposed to the noisier estimate of what a run ended up meaning to a particular team (and even then, still only the average value to that particular team).[1] The fact that a player’s value is not fully realized does not mean that player has no unrealized value. Put another way, even if Reds hitters struck out to complete every inning in which Joey Votto drew a walk, it seems odd to claim that those walks were that much less worth doing.
Ultimately, player value depends heavily on your assumptions, and particularly on how you decide to measure and compile a player’s supposed contribution. Let’s take James’ apparent position and label it as Position A: the value of a particular player is identified by that player’s production of events which are valued on the league-average values for those events in runs, adjusted for park/environment, and then further adjusted by the average value of runs to wins for the player’s particular team during a given season. Let’s also add a second, very important assumption that James implicitly makes, but does not discuss: that the sole events worthy of consideration are the outcomes that actually occurred.
By contrast, let’s label the traditional WAR approach as Position B: the expected value of a player is identified by that player’s production of events which are valued on the league-average values for those events in runs, adjusted for park/environment. Here, value depends not on your team, but on the overall average relationship between runs and wins in baseball during a given season. Once again, this valuation is judged solely by the outcomes that actually happened, and by assuming, as James does, that the players credited with the play’s results are 100 percent responsible for them.
In my opinion, both Positions A and B, although arguably reasonable, are inferior to what I will call Position C: that the value of a particular player is identified by that player’s production of events which are valued on the league-average values for those events in runs, adjusted for park/environment and other contextual factors, but—critically, because there are multiple individuals involved in every baseball play—must be further adjusted by the run value that each player most likely contributed to an outcome. Position C is what increasingly motivates our new statistics here at Baseball Prospectus, including Deserved Run Average (DRA), Swipe Rate Above Average (SRAA, for stolen bases), and Catcher Framing. I also believe that Position C best reflects the approach generally taken by state-of-the-art front offices in baseball.[2]
Let’s summarize these positions and their components as follows:
Position | Win Value | Contributions Considered
---------|-----------|--------------------------
A | Average team run value | Actual results only
B | Average MLB run value | Actual results only
C | Average MLB run value | Most likely contribution to actual results
The elephant in the room here, as usual, is variance. James’ articles seem to brush off randomness as mere “luck,” but variance cannot be dismissed so cavalierly. Baseball analysts hear and talk a great deal about linear weights, which are the average values of various batting events in baseball. On average, over the course of a half-inning, an out is worth about -0.3 runs, a single about +0.7 runs, and a home run about +1.4 runs.
What we ought to hear much more about is the variance among those events. This variation can be estimated: assign its linear-weight run value to every event in baseball for a season (~185,000 events in 2017), fit a mixture model to accommodate the multiple modes, and take the weighted mean of the components’ standard deviations. The result is about 0.3 runs. This number is interesting for a number of reasons, not least because that standard deviation is essentially the same as the value of an out itself.
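For the curious, the procedure can be reproduced in outline. The sketch below is a toy version: the event counts and cluster parameters are simulated stand-ins chosen to mirror the linear weights quoted above, not real 2017 play-by-play data.

```python
# A toy reconstruction of the variance estimate. Event-level run values are
# simulated to cluster near the linear weights quoted above; real play-by-play
# data would replace them.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
run_values = np.concatenate([
    rng.normal(-0.3, 0.30, 130_000),  # outs
    rng.normal(0.7, 0.32, 45_000),    # singles and other positive events
    rng.normal(1.4, 0.25, 10_000),    # home runs
]).reshape(-1, 1)                     # ~185,000 events, as in 2017

# Fit a mixture model to accommodate the multiple modes...
gmm = GaussianMixture(n_components=3, random_state=0).fit(run_values)

# ...then take the weighted mean of the components' standard deviations.
component_sds = np.sqrt(gmm.covariances_.ravel())
weighted_sd = float(np.sum(gmm.weights_ * component_sds))
print(f"weighted mean SD: {weighted_sd:.2f} runs")  # ~0.3 by construction
```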
In other words, any play, regardless of how capable the players are, can end up being an out, even if, on average, it should be something quite different. Thus, an out can actually win a game if the out scores a runner; likewise, a double combined with someone else’s baserunning blunder can guarantee a loss. This is why we watch; it is why we smile stupidly and scream “baseball!” after a particularly improbable sequence; it is why no game is “over” until the final out has actually been made. It is, at bottom, the same variance that gives us the predicament in question.
So, how do we account for this variance? According to Position A, the only thing that matters about Joey Votto’s walks is how the other Reds hitters capitalized on them. Any walks that did not translate into runs scored were, statistically speaking, a waste of everybody’s time. Votto might as well have struck out and spared us the terrific battles between him and so many pitchers. As James points out, this approach has the advantage of ensuring that everyone’s “contributions” retroactively sum to zero; however, it also has the disadvantage of seeming ridiculous—at least to me and others interested in evaluating Votto’s contributions in various contexts.
Virtually all run-scoring events require timely assistance from other teammates. Why should the inherent value of a player depend almost entirely on the contributions of other players, with the sheer randomness of those contributions often reducing his good events to, in effect, undeserved outs? If Votto’s on-base skills were plopped onto the Astros, under James’ system, his “value” would skyrocket, as the remaining Astros sprayed hits all over the place, uniquely rewarding his on-base skills. While this might true up the ultimate “results” of any team, a player whose value depends heavily on his teammates is not being given his inherent “value” at any time. If this is truly what you prefer, that of course is fine, and it is fine for James to prefer it for his own purposes. But most people, I suspect, would find it highly problematic.
Let’s check in with Position B, the traditional “WAR” position. James disagrees with the usage of average MLB run-to-win values. But WAR does this because it sees the best measure of a player’s value as an inherent, neutralized number: a number which does not penalize the player for the team to which they were assigned or the stadiums in which they were ordered to play. As James notes, this of course results in unexplained variance, the sort that causes the Yankees to win only 91 games instead of the 100 that their statistics suggested they should have won.
But so what? Sticking with Baseball Reference, we can correlate team WAR (batting and pitching combined) with winning percentage, and the correlation is .93,[3] which means that bWAR accounts for about 87 percent[4] of the variance in team winning percentage. That’s pretty darn good, and not atypical for the various WAR systems. Does that leave 13 percent of what happens on a baseball field unaccounted for? It sure does. But again, so what? WAR doesn’t pretend this variance does not exist; it merely refuses to punish individual players for the inherent volatility we enjoy seeing in the game. And while there are those who enjoy complaining about WAR for this reason, my sense is that many of these people would complain about WAR regardless.[5]
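For those who want the r-to-variance-explained step spelled out, a minimal sketch follows; the arrays are placeholders, as the real inputs would be all 30 teams’ combined bWAR and winning percentages.

```python
# The r-to-r-squared step, spelled out. The arrays are placeholders; the real
# inputs would be all 30 teams' combined bWAR and winning percentages.
import numpy as np

team_war = np.array([55.0, 48.2, 40.1, 33.5, 25.0, 18.7])  # hypothetical
win_pct = np.array([0.617, 0.562, 0.531, 0.500, 0.438, 0.395])

r = np.corrcoef(team_war, win_pct)[0, 1]  # Pearson correlation
print(f"r = {r:.2f}, r^2 = {r**2:.2f}")   # with the real data: r ~ .93, r^2 ~ .87
```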
That leaves Position C. Whereas Position A declares that variance is always somebody’s fault, and Position B assumes that variance is not worth accounting for, Position C embraces the variance and tries to work within its constraints. This requires attacking the assumption that each play’s outcome is 100 percent caused by the players officially credited for that outcome. Position B—the WAR approach—still relies on the outcomes of each batting event as a true reflection of the credited player’s entire contribution to each play; to get around this unrealistic assumption for other purposes, its adherents typically use “regression to the mean” to try to get a sense of the player’s true “ability” and likely future contributions. This doesn’t affect the credited WAR, but is one way to ensure that the present does not unduly cloud the future.
Position C rejects this duality: instead, it focuses on reasonably apportioning each player’s responsibility for each play at the time it is measured and compiled into win values. When you address the credit issue up front, there is no need to worry about the issue later, and no need to “regress” any player’s statistics to get to some better place eventually. Instead, you focus on getting it right the first time: use shrinkage and prior information to give credit only when it is most probable, and allow simple variance to take credit for the rest. This is how Deserved Run Average works, and how Baseball Prospectus’ pitcher WARP operates as well.
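To give a flavor of what “credit only when it is most probable” can mean in practice, here is a deliberately simple beta-binomial shrinkage sketch. To be clear, DRA rests on mixed models and much richer contextual information; nothing below is the actual DRA machinery, and the league rate and prior strength are arbitrary assumptions.

```python
# A minimal illustration of shrinkage toward the league mean. Small samples
# stay close to the league rate; large samples earn their own number.

def shrunken_rate(successes: int, chances: int,
                  league_rate: float, prior_strength: float = 200.0) -> float:
    """Blend an observed rate with the league rate via a beta-binomial prior."""
    alpha = league_rate * prior_strength
    beta = (1 - league_rate) * prior_strength
    return (successes + alpha) / (chances + alpha + beta)

LEAGUE_OBP = 0.320  # assumed league on-base rate, for illustration only

print(shrunken_rate(16, 40, LEAGUE_OBP))    # a hot 40-PA stretch: ~.333
print(shrunken_rate(240, 600, LEAGUE_OBP))  # a full .400 season: ~.380
```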
This approach has real added value. As we’ve shown, DRA manages both to substantially describe what has actually occurred and to do a better job of consistently anticipating player results from season to season. Since Position C already discounts the way the play has been officially credited, it makes sense to stick with the WAR approach of evaluating wins by the average run-to-win value, rather than any team’s particular value. This makes Position C, in the end, almost the polar opposite of James’ Position A when it comes to player valuation. But it is a position we find to be much more sensible and reflective of how well a player has most likely contributed, both to his team and to baseball in general.
We also believe that Position C is the future. The field of statistics increasingly seems likely to coalesce around the understanding that the discipline is about appreciating uncertainty, not precision. By embracing uncertainty, you recognize that the correct approximation of a player’s win value is neither “8.3” nor “8.1” per se, but rather “8.3 plus or minus 1.5 runs” versus “8.1 plus or minus 1.2 runs.”[6] The comparison between the two players then is not between two-tenths of a run, but rather the extent to which the uncertainty intervals around those two players actually overlap.
The extent to which they don’t overlap tells you the percentage likelihood that WAR(P) is missing something, and the analyst can then sensibly consider what additional factors—clutchness, tough ballparks, terrific managing, or what have you—can fill in the gap. In doing so, analysts (and columnists alike) can consider additional information with an appreciation of how much, or how little, those additional factors most likely can be said to actually matter.
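Under a loose normality assumption, one way to quantify that overlap is the probability that the nominally trailing player was actually the more valuable one. Using the illustrative numbers from above:

```python
# P(A > B) for two independent normal estimates of player value. The 8.3/1.5
# and 8.1/1.2 figures are the article's illustrative numbers, read as means
# and standard deviations.
from math import erf, sqrt

def prob_a_exceeds_b(mean_a: float, sd_a: float,
                     mean_b: float, sd_b: float) -> float:
    """Probability that normal A exceeds normal B, assuming independence."""
    z = (mean_a - mean_b) / sqrt(sd_a**2 + sd_b**2)
    return 0.5 * (1 + erf(z / sqrt(2)))

print(prob_a_exceeds_b(8.3, 1.5, 8.1, 1.2))  # ~0.54: nearly a coin flip
```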
We readily concede that we have a ways to go in our effort to make Position C a reality. The point of this discussion isn’t to brag about what we’ve done so much as to recognize how much we and many others have left to do. We have started down the path of Position C, but much of it remains unfinished. While we consider variance in generating many of our WARP estimates, that isn’t true for all of them (most notably, for offensive statistics). Even the WARP estimates that embrace variance haven’t yet given you those intervals around our point predictions.
Hopefully we can start doing that soon. In time, we believe that effort will be recognized by most to have been well spent, and that Position C will come to be seen as the best way to evaluate player contributions in sports. Indeed, its goals are so distinct that it may ultimately be helpful to find some entirely different term to describe what Position C aims to compile. For now, our addition of the “P” to the end of “WAR” will have to do.
In the meantime, I don’t really care whether you decide to take Position A, B, or C (and this includes Bill James!), as long as you disclose which method you chose. Regardless of your preference, it certainly isn’t worth getting upset about.
Footnotes:
1. It is particularly strange to see this argument coming from the originator of the so-called Pythagorean Theorem of Baseball, which advocates looking at a team’s deserved wins rather than their actual wins, the former being determined by run differential.
2. Admittedly, this may be because front offices usually care little about past performance, and instead focus on ability level, with an eye toward the future. This caveat is important, but I suspect most advanced analysts would favor Position C even if asked solely to grade past performance.
3. Pearson correlation.
4. The square of the Pearson correlation, aka R-squared, assuming a generally linear relationship.
5. Thoughtful WAR criticisms are always welcome, but many of WAR’s critics seem to be frustrated by WAR’s tendency to discourage contrary and more convenient narratives of player contribution.
6. The ranges are for illustration only.
Comments:
This is the first time I've ever seen "true" used as a verb, though, and I'm not sure I like it.
Using WARP & PWARP, Yankees players produced about 55 wins above replacement. Assuming a replacement team wins 48 games, the Yankees 'should have' won 103 games. However, they won only 91, or 12 fewer than expected. Contrast that with the Astros: their expected win total based on WARP was 101, just what they won.
Therefore, how can we really say Judge produced the 7.4 wins WARP suggests (or Gardner 3.9, or Frazier 1.1) when there is such a big disconnect between the team's WARP-expected wins and its actual wins? The difference is so big that if you throw out Judge's 7.4 'wins,' the Yankees still under-perform their expected number, almost suggesting his contributions were worthless (which, of course, they were not). The question, then, is this: if the Yankees only really won 43 games above replacement (instead of 55), how do you value Judge's 7.4? Maybe that 7.4 was really only worth 5.8 actual wins (7.4 x 43/55).
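Spelling that arithmetic out (and to be clear, the 43/55 rescaling is my own back-of-the-envelope adjustment, not any official WARP correction):

```python
# The rescaling arithmetic from the comment above, spelled out.
team_warp = 55.0         # combined Yankees WARP, per the comment
replacement_wins = 48.0  # assumed wins for a replacement-level team
actual_wins = 91.0

expected_wins = replacement_wins + team_warp    # 103 expected wins
realized_warp = actual_wins - replacement_wins  # only 43 wins above replacement

judge_warp = 7.4
judge_adjusted = judge_warp * realized_warp / team_warp
print(round(judge_adjusted, 1))  # 5.8
```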
Furthermore, James believes the key to working this out lies in evaluating and considering context, which goes against the religion of WAR/WARP. Saying the difference is all a matter of luck (or, for the Yankees this year, the lack thereof) just does not cut it for James, nor should it, considering how much data is now easily available.
His point, then, is that the MVP award should be given based on what really happened, rather than on what would happen if you simulated the season a million times, or on who would provide the most future value, or whatever. Simply looking at WAR, which is built around the average relationship between runs and wins, divorces the MVP award from that reality.
Of course, due to a myriad of problems with the stat, no one relies on Win Shares for analysis. It sounds like Bill James wants to get back to having the nice correlation that Win Shares provided, without recognizing the inherent problem (again).
Great analysis, Jonathan!
It is my understanding that James argues against how WAR is used. WAR is an approximation of a player's value and his contribution to his team. Too many people don't understand that, which stems from a lack of understanding of the bigger picture. It all makes sense to me: those who don't understand the big picture want to simplify the analysis, and WAR does exactly that... in the wrong context, which is what James is pointing at. It is ironic that those with a partial understanding want to argue against someone who has spent his life building that deeper understanding.