Michael Wenz recently wrote a review of the openWAR system for Baseball Prospectus' "Caught Looking" column (read that article first). I really enjoyed his review of openWAR, which, imho, is the most thorough review of the openWAR system to date. He constructively points out some strengths as well as several areas of weakness pertaining to openWAR. Here I would like to respond to some of his comments. (Article excerpts are in bold; my comments in regular font).
This review will cover openWAR: An open source system for evaluating overall player performance in major league baseball, by Benjamin Baumer, Shane Jensen and Gregory Matthews in the June 2015 Journal of Quantitative Analysis in Sports.
Most baseball statistics are easy to define—a run scored is a run scored. Sometimes a bit of judgment goes into the definition—sacrifice flies appear in the denominator for on-base percentage but not batting average, for instance—but the definition is at least widely agreed on. Wins Above Replacement (WAR), however, is a statistic that involves much judgment and little agreement. Baseball Prospectus publishes a measure called WARP, and FanGraphs (fWAR) and Baseball-Reference (bWAR) have measures of their own. In a recent paper, Benjamin Baumer, Shane Jensen and Gregory Matthews have declared openWAR on the others.
I think the fact that there are many implementations of WAR is an important point to make when talking about Wins Above Replacement. As advanced baseball statistics have permeated the mainstream, it seems that many baseball writers refer simply to WAR as if there is an unambiguous computation for this quantity. The major versions of WAR, for the most part, agree on a rough ordering of players. However, the ingredients to these WAR implementations are often unknown (for proprietary reasons) and even changing as more research is done. The concept of WAR is more or less agreed upon; how to actually compute it in the "best" way is still a very open question.
Their paper, openWAR: An open source system for evaluating overall player performance in major league baseball, proposes a new manifestation of WAR that is different from the other measures in some important ways. They also emphasize reproducibility, and along with their paper, the authors make available an R software program that allows users to recreate their work. Reproducibility and transparency have become increasingly important topics in academic research in recent years, and meeting very exacting standards for reproducibility is one of the authors’ stated goals. This stands in contrast to existing methods that rely on proprietary methods and opaque calculations. Whether their method outperforms the other measures is, of course, an open question and a difficult one to answer.
I believe one of the major strengths of openWAR is that all of the ingredients are known. (You can download all the code here.) Anyone who is interested can follow every single calculation in the process of computing openWAR. No other implementation of WAR can make this claim about all the pieces of its formula. I am NOT, however, claiming that this makes the openWAR methodology superior to other implementations (though, as an author, I am biased in favor of openWAR). Rather, I am simply reiterating what I believe to be an important distinction.
It is useful, though, to review the approach of Baumer, Jensen and Matthews to see how it differs from WARP, fWAR and bWAR in philosophy and construction. First and foremost, this paper uses a conservation-of-runs framework based on changes in run expectancy at the plate appearance level. The gory details are in the paper, but essentially after every plate appearance, the change in run expectancy is apportioned to hitters, pitchers, fielders and baserunners. After some manipulation to deal with park and platoon effects, these changes are compared to what would have been expected from a replacement level player. Finally, run expectancy impacts are converted to a wins measure. In the openWAR approach, context matters.
One area that can be improved upon in future versions of openWAR is the conversion from runs to wins. I believe this is one of the weakest parts of the whole openWAR system. Our current approach is simply to consider 10 runs to be worth 1 win.
One possible future direction for openWAR is to apportion changes in win probability, rather than runs, directly to the players, avoiding the runs-to-wins conversion entirely.
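To make the conservation-of-runs idea and the current runs-to-wins conversion concrete, here is a minimal sketch in R. The run expectancy values below are made up for illustration; the actual matrix is estimated from MLBAM data, and the 10-runs-per-win divisor is the crude heuristic discussed above, not a derived quantity.

```r
# A minimal sketch of the conservation-of-runs idea, using made-up run
# expectancy values (the real matrix is estimated from MLBAM data).
re <- matrix(c(0.48, 0.25, 0.10,    # bases empty, 0/1/2 outs
               0.85, 0.50, 0.22),   # runner on first, 0/1/2 outs
             nrow = 2, byrow = TRUE,
             dimnames = list(c("empty", "runner_on_1st"), c("0", "1", "2")))

# A single off an empty-bases, 0-out state that leaves a runner on first:
runs_scored <- 0
delta_re <- re["runner_on_1st", "0"] - re["empty", "0"] + runs_scored

# Current (admittedly crude) runs-to-wins conversion: 10 runs ~ 1 win.
delta_wins <- delta_re / 10
delta_re
delta_wins
```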
Another advantage of openWAR is that the computation of all four components (pitching, hitting, fielding, and baserunning) is conceptually the same. This differs from other versions of WAR, which compute the different components in different ways and then sum the parts. openWAR uses the same basic concept for all of its components: attributing runs to players relative to some baseline level.
This contrasts with other methods that begin with a player’s season-long stat line and assign run values to different outcomes like walks or stolen bases. Context-neutral models like WARP are generally interested in constructing an estimate of a player’s underlying skill level or true talent, while openWAR is attempting to describe what impact a player actually had in a given season. There are reasons to prefer one approach over the other, but the differences in philosophy are important to keep in mind when making comparisons.
That's a good way to describe it: "A difference in philosophy". We don't believe that openWAR is necessarily better or worse than other methods. Just different. It's certainly different.
The data used in the openWAR formulation comes from scraping MLB Advanced Media’s GameDay feeds. MLBAM data allows the program to identify each change in base-out state and run expectancy throughout each game, and also identifies the location of each batted ball. This data is subject to the same kinds of measurement error that trouble many kinds of defensive metrics, but the authors argue that it provides a higher level of resolution for hit location than other available sources. Importantly, this data source allows the authors to assign defensive responsibility for runs to fielders at the plate appearance level in a reproducible way that is not possible with something like Ultimate Zone Rating.
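For readers who want to work with the same data, here is a brief sketch of pulling the MLBAM GameDay feeds with the openWAR R package. The function names reflect my recollection of the package on GitHub and may differ in the current version; treat this as a pointer rather than a definitive usage example.

```r
# A sketch of pulling MLBAM GameDay data with the openWAR package
# (https://github.com/beanumber/openWAR). Function names may have changed
# since this was written.
# install.packages("devtools"); devtools::install_github("beanumber/openWAR")
library(openWAR)

# Download plate-appearance-level data for a (short, for speed) date range.
ds <- getData(start = "2013-05-14", end = "2013-05-14")

# Each row is one plate appearance, with the base-out state before and after,
# runs scored, the batted ball location, and the players involved.
head(ds)
```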
Fielding is the most problematic aspect of the openWAR methodology. We are first limited by the data that we have. For instance, we don't know the starting position of each fielder, which limits our ability to correctly account for defensive shifts. If a fielder is correctly positioned and the ball is hit directly to him, it can appear as if he made a nearly impossible play on the ball, when in reality he may not have moved at all. (Note: When a team puts on a defensive shift, and a fielder makes a play as a direct result of being in the correct place, I think a compelling argument can be made that some part of the WAR should be attributed to the coaching staff for putting the players in a good location, but this needs to be thought about a lot more. Also, I'm not exactly sure how you define a replacement coaching staff….)
Second, we are currently not fully apportioning the value of each defensive player involved in a play. On a groundout to the shortstop, the shortstop gets all of the credit, and the first baseman, who is necessary for the play, gets no credit. Even worse, on a double play (say a 6-4-3), the shortstop gets all the credit even though TWO other players were involved in the play. While this is certainly not perfect, the vast majority of the time the first player to touch the ball is by far the most important fielder on the play.
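To make the limitation concrete, here is one hypothetical way the fielding credit on a 6-4-3 double play could be partitioned among the fielders who touch the ball rather than all going to the shortstop. The weights are invented purely for illustration; nothing like them is currently implemented in openWAR.

```r
# Hypothetical split of fielding credit on a 6-4-3 double play.
# openWAR currently gives all of the credit to the first fielder (the SS);
# the weights below are purely illustrative.
fielding_raa <- 0.85                            # total fielding credit on the play
weights <- c(SS = 0.60, `2B` = 0.25, `1B` = 0.15)
split_credit <- fielding_raa * weights
split_credit
```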
Another key element of the openWAR model is the careful definition of replacement level players. The authors are critical of the ad hoc approach used by bWAR and fWAR that involves calibrating the model in such a way that exactly 1,000 wins of WAR are distributed among all players, with the rest of the wins representing the replacement level contribution. The openWAR approach instead selects a number of players equal to the number of available roster slots—750—based on those with the most playing time, and assumes that everyone else represents the replacement pool. These fringe players form the baseline for comparison. This approach runs some risk of a star player who spends most of the season on the DL appearing in the replacement pool and may also lead to overrepresentation of fringe relievers relative to fringe starters, but it’s a reasonable approach.
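The playing-time rule described above is simple enough to sketch directly. The data frame below is a simulated stand-in for real season totals (plate appearances for batters plus batters faced for pitchers), so the numbers mean nothing; the logic is the part that matters.

```r
# Sketch of openWAR's replacement-level rule: the 750 players with the most
# playing time are treated as MLB-level; everyone else forms the replacement
# pool. The data frame is a simulated stand-in for real player totals.
set.seed(1)
players <- data.frame(
  playerId     = 1:1200,
  playing_time = rpois(1200, lambda = 250)  # PA for batters + BF for pitchers
)
n_roster_slots <- 750
cutoff <- sort(players$playing_time, decreasing = TRUE)[n_roster_slots]
players$is_replacement <- players$playing_time < cutoff
table(players$is_replacement)
```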
While we still believe this is a reasonable approach, we noticed some issues with the replacement level in 2015. For instance, the average fielding runs above average for all right fielders in the defined replacement pool was actually positive, even though conceptually replacement players should by definition be below average. This certainly warrants more investigation and is currently the subject of a research project with a graduate student.
Another important contribution of this paper is the attempt to put a variance around their estimate of how good a particular player was. In particular, the authors address player-outcome variations using a resampling strategy that simulates the player’s season by drawing (and replacing) runs-above-average values for each plate appearance from the player’s season. They use this method to create 3,500 simulated seasons for each player and generate a distribution of WAR that reflects what might have been expected to occur but for the random variation in outcomes in each plate appearance. This is a useful exercise for understanding how a player’s actual WAR might differ from what would be expected, given performance and context, but it should be made clear that this should not be interpreted as variance in an estimate of the player’s true talent or underlying skill.
True. It's something like the variability around the estimate of what each player actually contributed to his team. Imagine running the season some large number of times in independent parallel universes and then looking at the distribution of accumulated openWAR across those universes. That's essentially what we are trying to estimate, and it is definitely different from a player's true underlying skill, though the two are often highly correlated quantities. Further, one of the foundations of statistical analysis is the quantification of uncertainty, which is communicated much better by an interval estimate than by a simple point estimate that says nothing about uncertainty. It seems only natural to specify WAR with an interval estimate rather than a point estimate alone.
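Here is a stripped-down sketch of the resampling idea: draw a player's per-plate-appearance runs-above-average values with replacement, total them, and repeat to get a distribution of season WAR. The per-PA values are simulated stand-ins, and the crude 10-runs-per-win conversion is used only to keep the example short; the actual procedure in the paper is more involved.

```r
# Sketch of the resampling strategy: resample a player's per-plate-appearance
# runs-above-average values with replacement to get a distribution of
# season-total RAA (and hence WAR). The RAA values below are simulated.
set.seed(42)
pa_raa <- rnorm(600, mean = 0.04, sd = 0.25)   # stand-in for one player's 600 PA

sim_totals <- replicate(3500, sum(sample(pa_raa, replace = TRUE)))
sim_war <- sim_totals / 10                     # crude runs-to-wins conversion

# A central interval for the player's WAR rather than a single point estimate.
quantile(sim_war, probs = c(0.025, 0.5, 0.975))
```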
The authors do not estimate variance caused by model uncertainty, or imprecision in parameter estimates, as this is unlikely to be a significant source of variation relative to player outcome variation. It is probably even less of a concern in the event-based approach to computing WAR relative to a linear weights-based approach that relies more heavily on parameter estimates for valuing a particular outcome. They also do not estimate situational uncertainty that comes about due to different players having different opportunities—players who come to bat more often with men on base will get more impact from their hits than those who don’t. The openWAR approach makes no attempt to strip this sort of context out of the model; rather, the authors view context differences as a key feature of their approach.
This has been a major source of criticism of our approach, which I certainly understand. The argument against our approach is that if player A is on a good team and comes to the plate with the bases loaded more often than a similar player B on a weaker team, player A will have an advantage in openWAR simply by being on the better team. I would counter that argument by mentioning that while, yes, a player does get more credit for a home run with the bases loaded than a solo home run in openWAR, that same player A is also penalized more heavily for a strikeout with the bases loaded than player B is for a strikeout with the bases empty. But yes, we do not strip out the context of the plate appearance.
There are some noteworthy and curious choices in the openWAR model. First, like other systems, the authors make positional adjustments to reflect differences in defensive responsibility—a shortstop has a bigger fielding impact on the game than a left fielder and this needs to be reflected somehow. In the same vein, a pitcher in the National League probably will be much worse at the plate than the average replacement player. It is misleading to convert all of these negative batting runs into negative WAR when the average replacement player for that particular player will be a pitcher. Some adjustment is reasonable. However, the authors make the positional adjustment by scaling the hitting performance in each plate appearance based on the performance relative to other hitters who play the same position. In other words, the player’s position in the field determines in part how much credit they get for their batting accomplishments. This is counterintuitive, even if a positional adjustment is necessary. An alternative approach would be to calculate a position-based replacement level for each position and adjust accordingly. It’s not immediately clear how this would impact estimated WAR in practice, but it would be more internally consistent.
I believe that reasonable arguments can be made on both sides of this issue. On the one hand, one can argue that when someone is at the plate and we want to compare their hitting ability to another player's, the defensive position they play should not factor into the comparison. I think that is a reasonable argument. However, you can also argue that position IS important when evaluating a player's performance as a hitter. The most extreme example of this is a pitcher. If pitchers are compared to position players on a level playing field, the pitchers accumulate so much negative RAA that their value as players is much lower than would be expected. Further, this would greatly punish National League pitchers, as their American League counterparts do not have to bat (i.e., they don't rack up large amounts of negative batting RAA). As a result, I stand by the decision to account for players' positions offensively.
Another way to frame this that may be more appealing to many would be to further partition our current offensive RAA into two pieces, so that batting RAA + positional adjustment = offensive RAA. We are simply lumping the positional adjustment together with batting RAA to create the offensive RAA piece, but it can easily be partitioned as I mentioned. Since all of these components are added together in the end, you'll end up in the same place even with this proposed partitioning of offensive RAA. However, partitioning in this way would allow for more direct comparisons of batting ability across all players regardless of position, and it is certainly worth exploring in the future. The line of code related to this issue can be found here, and can be modified easily by simply removing batter position as a predictor in the linear model. In future versions of openWAR, we plan to add an option for the user to specify whether they want to include this adjustment.
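A stylized version of the modelling choice being discussed is below: include, or drop, the batter's fielding position as a predictor when computing batting runs above average. The data are simulated and the model is far simpler than the one in the package, but the switch is the same one the linked line of code controls.

```r
# Stylized version of the positional adjustment: batting RAA is modelled
# with or without the batter's fielding position as a predictor.
# Data are simulated; the real model in the package is richer than this.
set.seed(7)
d <- data.frame(
  delta_re = rnorm(5000, mean = 0, sd = 0.3),
  position = sample(c("P", "C", "SS", "1B", "LF"), 5000, replace = TRUE)
)
d$delta_re[d$position == "P"] <- d$delta_re[d$position == "P"] - 0.08  # pitchers hit worse

fit_with_pos    <- lm(delta_re ~ position, data = d)  # position-adjusted baseline
fit_without_pos <- lm(delta_re ~ 1, data = d)         # no positional adjustment

# Batting RAA relative to the chosen baseline is the residual from the model.
d$raa_with_pos    <- residuals(fit_with_pos)
d$raa_without_pos <- residuals(fit_without_pos)
```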
Also, the authors’ formula apportions credit or blame on the defensive side between fielders and pitchers by making use of a fascinating map of the field. Each park-specific pixel from the MLBAM feed corresponds with a potential hit location and a probability that a batted ball in that location will result in an out. As it stands, however, the weight of an outcome is assigned to pitchers or fielders based solely on the location of the batted ball. The less likely a ball is to be caught, the more responsibility given to the pitcher. This is problematic in the sense that fielders get very little credit for making difficult plays. An alternative solution would be to assign credit or blame based on whether an out was made. A deep drive in the gap would be heavily weighted toward the pitcher if it fell for a double, but heavily weighted toward the fielder if it were caught. A similar strategy of dividing credit for called balls and strikes between pitchers and catchers might be useful for measuring pitch-framing, something not currently captured in the model.
The reviewer is completely correct, and this needs to be fixed (a clear victory for open source!). We have already implemented this suggestion in our code on GitHub. It will be interesting to see how this changes players' measured fielding ability. I believe the practical implication is that fielders will look a bit better in most cases and pitchers will look a bit worse.
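A toy illustration of the change is below: responsibility is split between the pitcher and the fielder using the estimated out probability at the batted ball's location, conditioning on whether an out was actually recorded. This is one simple way to express the weighting the reviewer suggests, not necessarily the exact formula now in the GitHub code.

```r
# Toy illustration: split defensive responsibility between the pitcher and
# the fielder using the estimated out probability p_out at the batted ball's
# location, conditioning on whether an out was actually recorded.
split_credit <- function(p_out, out_made) {
  if (out_made) {
    # Out recorded: the harder the play (low p_out), the more credit the fielder gets.
    c(fielder = 1 - p_out, pitcher = p_out)
  } else {
    # Hit allowed: the easier the play (high p_out), the more blame falls on the fielder.
    c(fielder = p_out, pitcher = 1 - p_out)
  }
}

split_credit(p_out = 0.15, out_made = TRUE)   # diving catch: mostly fielder credit
split_credit(p_out = 0.15, out_made = FALSE)  # ball drops in the gap: mostly on the pitcher
```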
Finally, the steady improvement of fielding data is likely to provide interesting opportunities for improvement of the model. It’s not hard to envision a model that converts batted ball trajectories and exit velocities into a map of the probability of recording an out on the play. It’s equally easy to see an updated version of openWAR that incorporates this sort of data into improved measures of a player’s defensive contributions.
That is a fantastic idea. I believe this is a simple extension of the current kernel density estimation used in openWAR's fielding models to include the predictors the reviewer mentions (i.e., trajectory and exit velocity).
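As a rough sketch of what that extension could look like, the model below predicts out probability from batted ball location plus contact quality. A logistic regression stands in here for the two-dimensional smoother openWAR actually uses, and all of the data (including the coordinate system and coefficients) are simulated for illustration only.

```r
# Sketch of extending the out-probability model with batted ball trajectory
# and exit velocity. A logistic regression stands in for the smoother used in
# openWAR's fielding models; all data are simulated.
set.seed(99)
bb <- data.frame(
  x          = runif(2000, -150, 150),   # hit location (ft), hypothetical coordinates
  y          = runif(2000, 0, 400),
  exit_velo  = rnorm(2000, 90, 8),       # mph
  launch_ang = rnorm(2000, 12, 15)       # degrees
)
bb$out <- rbinom(2000, 1, plogis(2 - 0.02 * bb$exit_velo - 0.004 * bb$y))

fit <- glm(out ~ poly(x, 2) + poly(y, 2) + exit_velo + launch_ang,
           family = binomial, data = bb)

# Predicted out probability for each batted ball, given location and contact quality.
bb$p_out <- predict(fit, type = "response")
```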
While openWAR is unlikely to move the sabermetric community toward an agreed-upon measure of WAR, the authors of the model have set an admirable standard for transparency and reproducibility. There will still be those who prefer a measure stripped of all context, and the openWAR approach is perhaps better suited for MVP voting than forecasting. But for those who wish to take issue with some elements of their approach, Baumer, Jensen and Matthews have provided a framework and source code that can be built upon.
As a final note, even if you hate our implementation of openWAR (we hope you don't!), the R package that we have developed can still be useful for scraping and organizing the MLBAM data, which gives users data at the level of the plate appearance.
Gregory J. Matthews is an Assistant Professor of Statistics at Loyola University Chicago, and an advisor to Baseball Prospectus.
Could you please demonstrate (to the fullest extent possible) how one of the proprietary WAR measurements is NOT doing what you advocate in that paragraph? It's very hard to envision without a concrete example.
In Fangraphs/BR.com parlance, this is the RE24 value.
So, a bases loaded wild pitch with 0 outs would be counted much differently than a WP with a runner on 1B and 2 outs. The other systems basically assume it's a WP that occurred in a random situation in a random game.
Thanks for the reply! It's a great article.
I didn't write it in the original paper because it's not a well-constructed thought in my head, but one thing I can't quite work through is why (intuitively) do you need a position adjustment at all? You apportion blame or credit to fielders at each play, and in theory you'd assign more blame or credit to shortstops or center fielders because they are involved in more plays. That would give more weight to more important positions, no?
There's still the problem of fielders with overlapping territories and double plays, and maybe shortstops do more things on relay throws and such that hold baserunners rather than record outs, so maybe these unobservables have to be captured by position dummies, but it still feels weird.
Looking forward to following along!
Mike