Baseball Prospectus’ Director of Technology Harry Pavlidis will be chatting with readers Thursday at 1 p.m. ET. If you have any questions after reading this overview of Deserved Run Average, ask them here.
Introduction
Earned Run Average. Commonly abbreviated as ERA, it is the benchmark by which pitchers have been judged for a century. How many runs did the pitcher give up, on average, every nine innings that he pitched? If he gave up a bunch of runs, he was probably terrible; if he gave up very few runs, we assume he’s pretty good.
But ERA has a problem: it essentially blames (or credits) the pitcher for everything, simply because he threw the pitch that started the play. Sometimes, that is fair. If a pitcher throws a wild pitch, he can’t blame the right fielder for that. And if a pitcher grooves one down the middle of the plate, chances are that’s on him too. Not too many catchers request those.
However, most plays in baseball don’t involve wild pitches or gopher balls. Moreover, things often happen that are not the pitcher’s fault at all. Sometimes the pitcher throws strikes the umpire incorrectly calls balls. Other times they induce grounders their infielders aren’t adept enough to grab. And still other times, a routine fly ball leaves the park on a hot night at a batter-friendly stadium.
ERA doesn’t account for any of that. It just tells us, in summary fashion, how many runs were “charged” to the pitcher “of record.” And so, a starting pitcher who departs with a runner on first gets charged with that run even if the reliever walks the next three batters. The same starter would get charged if the reliever makes a good pitch, but the shortstop can’t turn a double play. And none of these runs count at all if they are “unearned,” an exclusion under which the home team’s scorer decides whether a fielder demonstrated “ordinary effort.”
The list of problems goes on. Pitchers who load the bases but escape are treated the same as pitchers who strike out the side. Pitchers with great catchers get borderline calls. Guys who can’t catch a break for months show immense “improvement.” Guys who are average one year wash out the next. ERA, in short, can be a bit of a mess, particularly when we have only a few months of data to consider.
The problem is this: We know which runs came across the plate, but we can’t tell, just from ERA, which runs were actually the pitcher’s fault. What we need is a reliable way to determine which runs the pitcher deserves to be charged with. That is the challenge we took on in creating Deserved Run Average (DRA).
The Search for an Alternative
Baseball researchers have spent the past few decades trying to figure out a better way to measure pitcher quality. Voros McCracken is popularly credited with discovering that pitchers have varying (and often little) control over the results of balls put in play. Running with that theme, Tom Tango proposed the metric of Fielding Independent Pitching, or FIP. FIP looks only at a pitcher’s home runs, strikeouts, hit batsmen, and walks. From these four statistics alone, FIP can account for almost 50 percent of the variance in runs allowed by pitchers each year. At the same time, most plays in baseball do not involve a strikeout or home run. And so, many researchers have tried to improve on FIP’s formula.
A few years ago, our former colleague, Colin Wyers (now employed by the Houston Astros) thought he had a better solution. Labeled Fair Run Average (abbreviated “FAIR RA” or “FRA”), Colin’s approach tried to adjust for, among other things, what he considered to be a “fair” number of actual innings pitched, and assigned a “fair” number of runs allowed for each pitcher as a result.
Unfortunately, Fair Run Average has not succeeded. While some of its assigned values make sense, others do not. Many researchers have noted what appears to be a bias in Fair Run Average against pitchers who generate a lot of groundballs—a skill generally thought to be desirable. Fair RA just has not caught on, and, more importantly, our understanding of the tools for measuring baseball performance has advanced since the time Fair RA was conceived.
Today, we are transitioning to a new metric for evaluating the pitcher’s responsibility for runs that crossed the plate. We call it Deserved Run Average, or DRA. Leveraging recent applications of “mixed models” to baseball statistics, DRA controls for the context in which each event of a game occurred, thereby allowing a more accurate prediction of pitcher responsibility, particularly in smaller samples. DRA goes well beyond strikeouts, walks, hit batsmen, and home runs, and considers all available batting events. DRA does not explain everything by any means, but its estimates appear to be more accurate and reliable than the alternatives. As such, DRA allows us to declare how many runs a pitcher truly deserved to give up, and to say so with more confidence than ever before.
Deserved Run Average
As you may have noticed, we are introducing DRA (and its underlying components) in two articles. This article provides an overview of these new statistics and is meant to be approachable for all readers. The second article, entitled DRA: An In-Depth Explanation, discusses in detail the inner workings of DRA for our readers who enjoy such things.
So, as an overview, here is what DRA does, step by step:
Step 1: Compile the individual value of all baseball batting events in a season.
When a batter steps into the box, a number of different events can ultimately occur. These range from a strikeout to a single to a double play to a home run. Over the course of a season, those events each, as a category, tend to result in an average number of additional (or fewer) runs. For example, a home run on average results in about 1.4 runs, because sometimes there are runners on base and sometimes there are not. By the same token, a double play tends to cost a team about three-quarters of a run. Although a double play can sometimes allow a run to score (such as when there happens to be a runner on third with no outs), it far more often ends the inning or empties the bases with no runs scored.
In the world of baseball statistics, the average seasonal value of these events is known as a “linear weight.” To understand the ultimate effect of the batting events, we first must assign the typical value of those events. So, DRA begins by collecting every single baseball batting event in a given season and assigning the average linear weight for the outcome of that play.
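As a rough illustration of what a linear weight is, the sketch below averages the run value of each event type over a handful of plays. The function name `linear_weights` and the sample run values are made up for illustration; they are not BP's actual data or code.

```python
from collections import defaultdict

def linear_weights(events):
    """Average run value for each event type.

    `events` is an iterable of (event_type, run_value) pairs, where
    run_value is how many runs that play was worth in its situation.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for event_type, run_value in events:
        totals[event_type] += run_value
        counts[event_type] += 1
    return {etype: totals[etype] / counts[etype] for etype in totals}

# Made-up sample: two home runs and two double plays with invented values.
sample = [
    ("home_run", 1.0),
    ("home_run", 2.0),
    ("double_play", -0.5),
    ("double_play", -1.0),
]
weights = linear_weights(sample)
```

With a full season of plays in place of the four invented ones, the same averaging is what would produce figures like the 1.4 runs for a home run quoted above.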
Step 2: Adjust each batting event for its context.
Once we have the average value of each play in a season, we start making our adjustments. Home runs depend, among other things, on stadium, temperature, and the quality of the opposing batter. Ball and strike calls tend to favor the home team. The likelihood of a hit depends on the quality of the opposing defense. The pitcher’s success depends on how far he is ahead in the count, and both a catcher’s framing ability and the size of the umpire’s strike zone help get him there.
So, DRA begins by adjusting for the average effect of these factors beyond the pitcher’s control in each plate appearance, using what is known as a linear mixed model. These environmental factors include:
- The overall friendliness of the stadium to run-scoring, accounting for handedness of the batter (using our park factors here at Baseball Prospectus);
- The identity of the opposing batter;
- The identity of the catcher and umpire;
- The effect of the catcher, umpire, and batter on the likelihood of a called strike (e.g., framing / umpire strike zone, from 1988 onward);
- The handedness of the batter;
- The number of runners on base and the number of outs at the time of the event;
- The run differential between the two teams at the time of the event;
- The inning and also the half of the inning during which the event is occurring;
- The quality of the defense on the field for each individual play (assessed through BP’s FRAA[1] metric);
- Whether the defense is playing in their home stadium or on the road;
- Whether the pitcher is pitching at home or away;
- Whether the pitcher started the game or is a reliever; and
- The temperature of the game at opening pitch (from 1998 onward).
There are two other aspects that affect how DRA scores pitchers.
First, rather than grade pitchers purely on the number of outs, like ERA does, DRA grades them on the basis of each plate appearance. Thus, pitchers who escape a bases-loaded jam are no longer treated the same as pitchers who retire all three batters they faced, simply because they both got three outs.
Second, DRA judges pitchers on the run expectancy of each play, rather than the runs that happen to cross the plate. If, for example, our hypothetical starter from earlier put a man on first and then was replaced, he would not be penalized the entire run if the reliever subsequently allowed that player to score. Rather, he would be penalized only by the likelihood that said player would have scored from first base on average, with the reliever getting charged the difference between that average likelihood and the full value of the run if it scores. Likewise, when a starter loads the bases, but the reliever gets the team out of it, the reliever doesn’t simply get credit for an out or two. Rather, he gets a bonus for all of the runs that were expected to score, on average, from a bases-loaded situation, but didn’t. In this regard, true “stopper” relievers get more fairly recognized for their accomplishments, and we more accurately forecast their “deserved” runs allowed.
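The starter-and-reliever example can be put into numbers with a small sketch. The function `split_inherited_run` and the 25 percent scoring probability are hypothetical; the real model works from the run expectancy of every play, so this is only the simplest one-runner case.

```python
def split_inherited_run(p_score, scored):
    """Split responsibility for one inherited runner.

    p_score: average probability that the runner scores from the
             situation the starter left behind (hypothetical input).
    scored:  whether the runner actually came around to score.
    Returns (starter_charge, reliever_charge) in runs.
    """
    starter_charge = p_score
    reliever_charge = (1.0 if scored else 0.0) - p_score
    return starter_charge, reliever_charge

# Assume a runner left on first scores 25 percent of the time on average.
runner_scores = split_inherited_run(0.25, scored=True)     # (0.25, 0.75)
runner_stranded = split_inherited_run(0.25, scored=False)  # (0.25, -0.25)
```

In both cases the two charges add up to the runs that actually scored; the reliever's negative charge when the runner is stranded is the "bonus" described above.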
The DRA component that emerges from all these adjustments is value/PA: the average value of each plate appearance which the pitcher completed during the season.
Step 3: Account for base-stealing activity.
Understanding the average weight of a batting event is essential, but run-scoring also depends on who happens to be on base at the time. Billy Hamilton is much more likely to score when on base than Billy Butler, all other things being equal. Certain pitchers also hold runners better than others. A runner who is afraid of being picked off will have fewer steal attempts. Runners who stay closer to the base should have a harder time scoring. And runners who are thrown out trying to steal are erased from the basepaths entirely.
To account for these situations, and provide some insight into the effect of baserunning on each event, we created two additional statistics: one looking at base-stealing success and one looking at the frequency with which baserunners attempt to steal bases. They are both (potentially) part of DRA, but are also useful in and of themselves.
We’ve also made an effort to make these statistics more approachable. Because we are looking at how pitchers compare to other pitchers in controlling baserunners, we are describing these stats as Swipe Rate Above Average (SRAA) and Takeoff Rate Above Average (TRAA).
Swipe Rate, as its name implies, judges each participant in a base-stealing attempt for his likely effect upon its success. Using a generalized linear mixed model, we simultaneously weight all participants involved in attempted steals against each other, and then determine the likelihood of the base ending up as stolen, as compared to the involvement of a league-average pitcher, catcher, or lead runner, respectively.
Stated another way, Swipe Rate allows us to evaluate how good Yadier Molina’s arm is while controlling for the inherent ability of his pitchers to hold runners and the quality of the runners he is facing on base. Likewise, we evaluate the ability of individual pitchers to hold runners while controlling for the possibility that they may be throwing to a catcher with a subpar arm. And for baserunners in particular, we now have something much more accurate to evaluate their base-stealing ability than base-stealing percentage.
Remember that base-stealing percentage, by itself, is not very useful: using straight percentages, an elite base-stealer who swipes 90 percent of his attempts and tries to steal 40 times a year ranks lower than a catcher who had one lucky steal all year (and therefore has a 100 percent base-stealing percentage). In the same way that Controlled Strikes Above Average (CSAA) controls for the effect of other factors on catcher framing, Swipe Rate Above Average regresses baserunners’ steal-success rates against both themselves and others to provide a more accurate assessment of each participant’s effect on the likelihood of a stolen base.
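To see why raw percentages mislead, here is a deliberately simplified stand-in for the regression idea: each runner's success rate is pulled toward a league rate, and the pull is stronger when there are fewer attempts. This is not the mixed model SRAA uses; the league rate of 70 percent and the weight of 20 attempts are assumptions chosen only to make the arithmetic visible.

```python
def regressed_rate(successes, attempts, league_rate, prior_attempts):
    """Shrink an observed success rate toward the league rate."""
    return (successes + league_rate * prior_attempts) / (attempts + prior_attempts)

LEAGUE_RATE = 0.70     # assumed league-wide steal success rate
PRIOR_ATTEMPTS = 20    # assumed strength of the pull toward average

# Elite base-stealer: 36 steals in 40 tries. Catcher: 1 steal in 1 try.
elite_raw = 36 / 40
catcher_raw = 1 / 1
elite_regressed = regressed_rate(36, 40, LEAGUE_RATE, PRIOR_ATTEMPTS)
catcher_regressed = regressed_rate(1, 1, LEAGUE_RATE, PRIOR_ATTEMPTS)
```

By raw percentage the catcher ranks ahead (100 percent against 90 percent). After regression the elite runner sits at about 83 percent and the catcher at about 71 percent, which is the ordering most of us would expect.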
The factors considered by the Swipe Rate are:
- The inning in which the runner was on base;
- The stadium where the game takes place;
- The underlying quality of the pitcher, as measured by Jonathan Judge’s cFIP statistic;
- The pitcher and catcher involved;
- The lead runner involved.
Because the statistic rates pitchers above or below average in preventing stolen bases, average is zero, and pitchers generate either positive (bad) or negative (good) numbers. In 2014, here were the pitchers who were hardest to steal a base on:
Name | Swipe Rate Above Average (SRAA) |
Hisashi Iwakuma | -3.86% |
Kyle Kendrick | -2.57% |
Corey Kluber | -2.53% |
Todd Redmond | -2.47% |
Madison Bumgarner | -2.45% |
Jake Odorizzi | -2.27% |
And here were the pitchers baserunners exploited the most last year:
Name | Swipe Rate Above Average (SRAA) |
Jake Arrieta | +2.84% |
Roberto Hernandez | +2.75% |
Phil Hughes | +2.24% |
Tom Wilhelmsen | +2.19% |
Yu Darvish | +2.14% |
Drew Hutchison | +2.08% |
The model for TRAA (Takeoff Rate Above Average) is similar to SRAA, but more complicated. With Takeoff Rate, we don’t care whether the baserunner actually succeeds in stealing the base; what we care about is that he made an attempt. Our hypothesis is that base-stealing attempts are connected with the pitcher’s ability to hold runners. When baserunners are not afraid of a pitcher, they will take more steps off the bag. Baserunners who are further off the bag are more likely to beat a force out, more likely to break up a double play if they can’t beat a force out, and more likely to take the extra base if the batter gets a hit.
Takeoff Rate stats consider the following factors:
- The inning in which the base-stealing attempt was made;
- The run difference between the two teams at the time;
- The stadium where the game takes place;
- The underlying quality of the pitcher, as measured by Jonathan Judge’s cFIP statistic;
- The SRAA of the lead runner;
- The number of runners on base;
- The number of outs in the inning;
- The pitcher involved;
- The batter involved;
- The catcher involved;
- The identity of the hitter on deck;
- Whether the pitcher started the game or is a reliever.
Takeoff Rate Above Average is also scaled to zero, and negative numbers are once again better for the pitcher than positive numbers. By TRAA, here were the pitchers who worried baserunners the most in 2014.
Name | Takeoff Rate Above Average (TRAA) |
Bartolo Colon | -6.09% |
Lance Lynn | -5.91% |
Hyun-jin Ryu | -5.82% |
Adam Wainwright | -5.75% |
T.J. McFarland | -5.17% |
Nathan Eovaldi | -5.17% |
And here were the pitchers who emboldened baserunners in 2014:
Name | Takeoff Rate Above Average (TRAA) |
Joe Nathan | 9.60% |
Tim Lincecum | 9.41% |
Drew Smyly | 8.80% |
Tyson Ross | 8.08% |
A.J. Burnett | 7.61% |
Juan Oviedo | 7.55% |
Current 2015 ratings for Takeoff Rate Above Average are on our leaderboards. We don’t have enough data yet to release Swipe Rate Above Average, but we expect to have enough to work with in another month or so.
Step 4: Account for Passed Balls / Wild Pitches.
Under baseball’s scoring rules, a wild pitch is assigned when a pitcher throws a pitch that is deemed too difficult for a catcher to control with ordinary effort, thereby allowing a baserunner (including a batter, on a third strike) to advance a base. A passed ball is assigned when a pitcher throws a pitch that a catcher ought to have controlled with ordinary effort, but which nonetheless gets away, also allowing a baserunner to move up a base. The difference between a wild pitch and a passed ball, like that of the “earned” run, is at the discretion of the official scorer. Because there can be inconsistency in applying these categories, we prefer to consider them together.
Last year, Dan Brooks and Harry Pavlidis introduced a regressed probabilistic model that combined Harry’s pitch classifications from PitchInfo with a With or Without You (WOWY) approach. RPM-WOWY measured pitchers and catchers on the number and quality of passed balls or wild pitches (PBWP) experienced while they were involved in the game.
Not surprisingly, we have updated this approach to a mixed model as well. Unfortunately, Passed Balls or Wild Pitches Above Average would be quite a mouthful. Again, we’re trying out a new term to see if it is easier to communicate these concepts. We’re going to call these events Errant Pitches. The statistic that compares pitchers and catchers in these events is called Errant Pitches Above Average, or EPAA.
Unfortunately, the mixed model only works for us from 2008 forward, which is when PITCHf/x data became available. Before that time, back to 1988, which is when pitch counts were first tracked officially, we will rely solely on WOWY to measure PBWP. For the time being, we won’t calculate EPAA before 1988 at all, and it will not play a role in calculating pitcher DRA for those seasons.
But, from 2008 through 2014, and going forward, here are the factors that EPAA considers:
- The identity of the pitcher;
- The identity of the catcher;
- The likelihood of the pitch being an Errant Pitch, based on location and type of pitch, courtesy of PitchInfo classifications.
Errant Pitches, as you can see, has a much smaller list of relevant factors than our other statistics.
In 2014, the pitchers with the best (most negative) EPAA scores were:
Name | Errant Pitches Above Average (EPAA) |
Carlos Carrasco | -0.405% |
Ronald Belisario | -0.403% |
Jesse Chavez | -0.392% |
Clay Buchholz | -0.380% |
Felix Doubront | -0.378% |
Daisuke Matsuzaka | -0.375% |
And the pitchers our model said were most likely to generate a troublesome pitch were:
Name | Errant Pitches Above Average (EPAA) |
Masahiro Tanaka | +0.611% |
Jon Lester | +0.541% |
Matt Garza | +0.042% |
Dallas Keuchel | +0.334% |
Drew Hutchison | +0.327% |
Trevor Cahill | +0.317% |
Step 5: Calculate DRA (Deserved Run Average).
We’ve now got our components, so it is time to calculate each pitcher’s DRA. Here are the steps we follow:
First, we put all of our identified components—value/PA, Swipe Rate Above Average, Takeoff Rate Above Average, and Errant Pitches Above Average—together into a new regression, this time looking for their combined effect on run expectancy.[2] We added two more variables that struck us as relevant: the percentage of each pitcher’s plate appearances that came as a starter versus as a reliever (we call this Starter Pitcher Percentage, or SPP) and the total number of batters faced. That gives us a total of six potential predictors for each pitcher to come up with their DRA for a season. We regress these using a method known as “MARS.” If the detail interests you, we invite you to enjoy the In Depth article, which discusses it further.
Second, to smooth out season-to-season variation, and to tease out the most accurate connection between these variables and runs allowed, we actually train our model on the previous three seasons. From this we derive the most accurate connection between our potential predictors and actual runs allowed by pitchers in the current run environment.
Finally, we take the connections determined by our model and use them to calculate each pitcher’s DRA for the current season: his Deserved Run Average per nine innings. DRA does not distinguish between earned and unearned runs, because that distinction can be arbitrary and over the course of a season it tends to obscure rather than reveal differences between pitchers. We therefore adjust DRA so it is on the scale of Runs Allowed per nine innings (RA/9) rather than Earned Run Average (ERA). We understand that ERA is what many of you are used to, but once you get over that, you’ll be much happier.
We do ensure that, in converting runs per plate appearance to runs per nine innings, we use each pitcher’s individual ratio of batters-faced to innings pitched, rather than just a league average. This allows us to credit the pitchers who are most efficient, and avoid over-crediting pitchers who are putting baserunners on and getting lucky with the outcome. Pitchers in the latter category do not “deserve” the lower runs-allowed numbers they might (temporarily) be putting up.
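The conversion can be sketched as follows. The function name, the 0.11 deserved runs per plate appearance, and the two workloads are invented for illustration; innings are derived from outs recorded so that partial innings are handled as true thirds.

```python
def runs_per_nine(runs_per_pa, batters_faced, outs_recorded):
    """Scale a per-plate-appearance run rate to nine innings,
    using the pitcher's own batters-faced-per-inning ratio."""
    innings = outs_recorded / 3
    bf_per_inning = batters_faced / innings
    return runs_per_pa * bf_per_inning * 9

# Two hypothetical pitchers with the same run value per plate appearance
# and the same 200 innings (600 outs), but different traffic on the bases.
efficient = runs_per_nine(0.11, batters_faced=800, outs_recorded=600)
traffic_heavy = runs_per_nine(0.11, batters_faced=900, outs_recorded=600)
```

The efficient pitcher comes out near 3.96 runs per nine and the traffic-heavy pitcher near 4.46, even though their per-batter values are identical. A single league-average ratio would have given them the same figure.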
What It Means
So there you have it: DRA, explained. Most of you really don’t care how we got there; you just care that DRA will be easy to look up and be a good evaluator of pitcher performance. In both respects, you are in luck.
As for the first issue, past DRA is available on our leaderboards right now. In-season DRA during 2015 will be calculated each night after the previous day’s games have concluded. You will be able to use DRA not only to put past pitching performances in context[3] but also to monitor the value of pitchers as we progress through the 2015 season, and beyond. As with our other statistics, DRA will be available for you to download and use for your own comparisons and work.
As for the second issue, rest assured that your time spent reading this article was not in vain. DRA does a very good job of measuring a pitcher’s actual responsibility for the runs that scored while he was on the mound, certainly better than any metric we are aware of in the public domain. And only DRA gives you the assurance that a pitcher’s performance is actually being considered in the context of the batter, the catcher, and the runners on base, as well as the stadium and stadium environment in which the baseball game occurred.
The detailed explanation of DRA’s effectiveness is saved for the accompanying In Depth article. But since you’ve made it this far, we’ll give you the Reader’s Digest version. There are two measures of accuracy that we pay particular attention to in evaluating a new metric.
First, we look at how close, mathematically, the metric’s prediction is to the actual number of runs allowed with the pitcher on the mound. If the pitcher actually allowed four runs per nine innings, we test our alternative metric by how close it comes to that same number. The most commonly used calculation that does this is called the Root Mean Square Error or RMSE.
The second test looks at how accurately the metric ranks the various pitchers relative to each other. Why do we care about rank? Because we know that all pitcher run estimates are a bit off from their actual runs allowed, and more so early in the season. So as a check, we test whether it is at least ranking the pitchers correctly relative to each other. In other words, if the metric can’t estimate runs allowed down to the exact decimal point, the least it can do is tell the difference between Max Scherzer and Ricky Nolasco. This second approach is called the Spearman Correlation.
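For readers who want to see the two tests side by side, here is a minimal sketch in plain Python. The four RA/9 values and both sets of estimates are invented; the Spearman function is the standard definition (Pearson correlation of the ranks, with tied values sharing an average rank), not BP's own code.

```python
import math

def rmse(predicted, actual):
    """Root Mean Square Error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 through j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

actual_ra9 = [2.5, 3.5, 4.5, 5.5]   # invented RA/9 for four pitchers
too_high = [3.0, 4.0, 5.0, 6.0]     # every estimate half a run high
swapped = [3.5, 2.5, 4.5, 5.5]      # exact values, first two swapped

high_rmse, high_rank = rmse(too_high, actual_ra9), spearman(too_high, actual_ra9)
swap_rmse, swap_rank = rmse(swapped, actual_ra9), spearman(swapped, actual_ra9)
```

The first set of estimates misses every pitcher by half a run (RMSE of 0.5) yet ranks them perfectly (Spearman of 1.0). The second has the right numbers but attaches two of them to the wrong pitchers, so its Spearman correlation falls to 0.8. That is why it helps to look at both measures.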
To judge DRA’s accuracy, we’ll compare it to the leading brand: FIP. We know FIP does a reasonable job of predicting a pitcher’s actual runs allowed in a season. Does DRA do a better job than FIP? It does.
We compared how well FIP and DRA predicted each pitcher’s RA/9 in each of the past four major-league seasons. We looked at their performance with all pitchers, and then two subsets: pitchers who faced at least 170 batters (about the workload of an established major-league reliever, or 40 IP), and pitchers who faced at least 660 batters (about 162 innings, which is a qualified major-league starter).
We then averaged the results over four seasons (2011–2014) to get a consistent (and recent) picture of each metric’s performance. Here is how it ended up:
Metric | Minimum BF | RMSE (lower is better) | Spearman Correlation |
FIP | 0 | 3.59 | 0.64 |
FIP | 170 | 1.14 | 0.70 |
FIP | 660 | 0.67 | 0.72 |
DRA | 0 | 2.65 | 0.76 |
DRA | 170 | 0.96 | 0.76 |
DRA | 660 | 0.54 | 0.78 |
DRA is consistently superior to FIP at all sample sizes. By accounting for the context in which the pitcher is throwing, DRA allows us to determine which runs are most fairly blamed on the pitcher. DRA is particularly effective with smaller samples. Even for pitchers with only a few batters faced, DRA is already separating the good pitchers from the bad with superior accuracy.
In the end, of course, we are not satisfied simply to have brought you DRA. In addition to being useful in and of itself, DRA has become the new foundation of Pitcher Wins Above Replacement Player (PWARP) here at Baseball Prospectus. By integrating DRA into WARP, we can do a better job than ever of evaluating how much value individual pitchers delivered to their teams, both during the current season and as compared to past pitchers in other seasons and eras. The new PWARP figures featuring DRA are also available on the leaderboards, under the column “DRA_PWARP.”
Just for fun, here are the 25 best qualified starters by DRA over the past 25 years. You’ll note that in some cases their DRA basically matches their RA/9; in others, it does not. Our position, of course, would be that when DRA and RA/9 disagree, you should go with DRA, as it tells you how well the pitcher really pitched. Without further ado:
Rank | Season | Name | DRA | RA/9 |
1 | 2000 | Pedro Martinez | 1.03 | 1.87 |
2 | 2004 | Jason Schmidt | 1.23 | 3.30 |
3 | 1997 | Pedro Martinez | 1.49 | 2.46 |
4 | 1995 | Greg Maddux | 1.55 | 1.62 |
5 | 2004 | Randy Johnson | 1.64 | 3.29 |
6 | 2009 | Zack Greinke | 1.80 | 2.54 |
7 | 2009 | Tim Lincecum | 1.87 | 2.79 |
8 | 2013 | Jose Fernandez | 1.89 | 2.48 |
9 | 2013 | Max Scherzer | 1.90 | 3.09 |
10 | 2013 | Matt Harvey | 1.93 | 2.33 |
11 | 2013 | Clayton Kershaw | 1.93 | 2.14 |
12 | 2007 | Erik Bedard | 1.98 | 3.29 |
13 | 2011 | Justin Verlander | 2.01 | 2.66 |
14 | 1997 | Roger Clemens | 2.04 | 2.26 |
15 | 2004 | Johan Santana | 2.05 | 2.79 |
16 | 1992 | Curt Schilling | 2.11 | 2.63 |
17 | 1995 | Randy Johnson | 2.14 | 2.79 |
18 | 2011 | Josh Beckett | 2.15 | 3.07 |
19 | 2009 | Chris Carpenter | 2.17 | 2.32 |
20 | 2003 | Pedro Martinez | 2.17 | 2.52 |
21 | 2014 | Clayton Kershaw | 2.18 | 1.94 |
22 | 2009 | Josh Johnson | 2.20 | 3.38 |
23 | 1997 | Greg Maddux | 2.23 | 2.29 |
24 | 1992 | Juan Guzman | 2.27 | 2.84 |
25 | 2002 | Curt Schilling | 2.27 | 3.24 |
One caution: DRA is not (presently) adjusted for run-scoring across different eras. Rather, it is adjusted to the average runs-allowed by the league for that season. So, please don’t directly compare Pedro’s DRA of 1.03 in 2000 to somebody else’s DRA in 1985 or some other season.[4] A DRA metric that compares players across eras will be coming soon.
A second caution: DRA corrects for what is known as survival bias: the tendency of better pitchers to pitch more innings in a season. Applying the full DRA model early on can result in some extreme values. To avoid that, we will keep the model simple at first during the season, and model only value/PA to RE24. As we get further along, we’ll allow the full model to operate and achieve the best explanation of each pitcher’s performance.
Conclusion
We are excited about DRA, as well as the other statistics we have introduced: Swipe Rate Above Average (to measure base-stealing success), Takeoff Rate Above Average (to measure base-stealing attempts), and Errant Pitches Above Average (to measure passed balls and wild pitches).
Three final things to remember.
First, while DRA accounts for a great many things, DRA doesn’t need to be complicated for fans. DRA is on our leaderboards. Just look up the pitcher(s) that interest you, and you’ll have the best estimate of how good they’ve been in a particular season. If you want to leave the details to us, feel free.
Second, remember that DRA was created to evaluate past performance. If you want to project future performance of a pitcher, use PECOTA. And if you want to evaluate how talented the pitcher is regardless of his performance to date, use cFIP, which is also on our leaderboards. In fact, cFIP is in the same table as DRA so you can compare recent results with the likelihood of future improvement (or decline).
Finally, DRA is now the foundation for Pitcher Wins Above Replacement (PWARP) here at Baseball Prospectus. For the time being, if you want to see how many wins a pitcher has been worth in a particular season, check (you got it) our leaderboards and the column DRA_PWARP. (WARPs that appear on pitchers’ player pages remain, for now, the old, FRA-based WARPs.) We’ll change the description to plain old WARP once people have gotten used to the new idea.
We welcome your comments, and hope you find DRA as useful as we do.
Special thanks to Rob McQuown for research assistance; to Rob Arthur, Rob McQuown, and Greg Matthews for their collaboration; to Stephen Milborrow for modeling advice; and to Tom Tango and Brian Mills for their review and insights.
[1] Please note that FRAA (Fielding Runs Above Average) is different from FRA (Fair Run Average), which we are replacing for everyday purposes with DRA.
[2] We use RE24/PA: the average effect of the pitcher on run expectancy per batter faced over the course of a season.
[3] We have populated DRA back to 1953.
[4] In fact, don’t try to suggest Pedro’s 2000 season is comparable to what anyone else has ever done at any time. It’s probably very unfair to the other player.
You say that you take gametime temperature into account (measured at start of the game). Does the average time per pitch then play a role? If we take temperature to be an important factor, then it stands to reason that pitchers that take longer to throw the ball (and extend the game) will allow for greater changes in gametime temperature (which will have some effect on conditions, which will have some effect on DRA). This could have an effect in the ten thousandths (maybe. probably not).
Another comically trivial question is with regards to takeoff rate. You use cFIP as a general measure of pitcher skill, but another component that might not be measured (perhaps it is and I don't know) is speed of arsenal. Is that wrapped up in "the pitcher involved?" Seems like it's easier to steal on someone who throws 40% offspeed stuff than someone who throws 10% offspeed stuff, and it seems like someone who throws 98 mph is easier to steal off of (generally) than someone who throws 89 mph.
Thanks for this awesome work.
Certainly the temperature changes throughout a game and with more granular data we would have more information. But, simply accounting for the opening temperature makes a difference and there probably is some uniformity in how the temperature tends to decline over the course of a game, which might be captured in our by-inning controls.
As for velocity, that is certainly part of it. But that is part of who the pitcher is, and if he has a good or bad rating in the takeoff rate, his velocity is by nature accounted for, just like his pickoff move or other aspects of how he pitches.
Also, wind?
I would agree if you separate day/night games.
Unless bloodface's point is that just because more runs were scored in a season you're not sure to blame bad pitching, good hitting or a more offensive environment for the additional runs scored?
One thing I caught is that it's Root Mean Square Error. There is no such thing as Real Mean Square Error.
1) The DRA model doesn't quite match the RA/9 distribution, and so the fitting gets a little weird on the tails.
2) Starters get a little bit of a bonus relative to relievers (I might have read that in the details article?), and so in general starters' DRA will skew lower than their RA/9.
You will definitely see a skew like that at first because DRA presumes everyone is average (~4.1 RA/9) until they prove they are not. The more innings you pitch, the more you prove you are something different from average. Relievers haven't pitched many innings yet so DRA is conservative with them.
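To make the "average until proven otherwise" idea concrete, here is a minimal sketch of shrinkage toward the league mean. It is not the DRA model itself: the weighted-average formula and the `prior_innings` constant are assumptions chosen for illustration, and only the ~4.1 RA/9 league average comes from the comment above.

```python
LEAGUE_AVG_RA9 = 4.1  # approximate league average cited in the comment above


def shrunk_ra9(runs_allowed, innings, prior_innings=60.0, league_avg=LEAGUE_AVG_RA9):
    """Blend a pitcher's observed RA/9 with the league average.

    prior_innings is a hypothetical constant: the number of league-average
    innings mixed into every pitcher's record. It is NOT a DRA parameter.
    """
    if innings < 0 or prior_innings <= 0:
        raise ValueError("innings must be >= 0 and prior_innings must be > 0")
    # Runs that a league-average pitcher would allow over the prior innings.
    prior_runs = league_avg * prior_innings / 9.0
    return 9.0 * (runs_allowed + prior_runs) / (innings + prior_innings)


# Two pitchers with the same raw RA/9 (1.2) but different workloads.
reliever = shrunk_ra9(runs_allowed=2, innings=15)
starter = shrunk_ra9(runs_allowed=8, innings=60)
print(f"reliever: {reliever:.2f}, starter: {starter:.2f}")
```

Under these assumptions the low-inning reliever's estimate stays much closer to 4.1 than the starter's, which is the kind of early-season skew described above.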
Plus then you get to make "forgot about DRA" puns when pitchers regress.
What data (years) were used for the regression and, if cross validation was done, how was the data divided into subsets for cross validation?
A best-fitting regression model is notorious for having limited predictive ability for new data sets outside those used to develop the model.
Am I wrong?
I thought the whole point of WAR/WARP was to be predictive?
So I guess the question is... What is the point of WAR/WARP?
One (I think) significant point and one trivial one.
First, I don't agree with "RMSE (lower is better)" in your table given the context. If my pitching statistic eliminates fewer of the problems with RA/9 and yours eliminates more, mine might well be more similar to RA/9 as a result, and therefore have a lower RMSE. So, I think instead of "RMSE (lower is better)" this column should be called "Similarity to RA/9 (interpretation is ambiguous)" -- RA is both what you're trying to get away from AND what you're trying to approximate. DRA might well be better than FIP (almost certainly it depends on what exactly you're trying to do) but this table doesn't begin to make that case since by this metric, RA/9 itself would obviously be the best choice.
Second, (and this is the trivial one), it's possible to do a little double counting if you have a park factor and a temperature effect since parks have different average temperatures. Can't imagine this makes any difference though.
So then the question is, what are you trying to measure and how can you tell if you successfully measured it? I think the answer would be pretty hard to come up with. You are trying to measure how many fewer runs your performance should have resulted in, compared to an average performance in the same context. If we took the "should have resulted" runs from every pitcher in the league and added them up, should that equal the "actually resulted" runs? No, because we are laying some blame on the fielders and the catcher, and the hitters, and so on. So I guess that means that it should really be compared to RA/9 - (sum of all non-pitching sources of runs)/9, assuming that you trust those other metrics (like FRAA, BRR, and BRAA).
The whole idea is that some of the probability of a run scoring is under the pitcher's control, and some is not. The system uses every relevant variable to predict the number of runs, then subtracts out all the parts the pitcher has no control of and calls what's left the DRA. But I think when they compare "DRA" to RA/9, they really mean the runs predicted using all the variables, not just the pitcher-controlled variables.
If I'm right, lower RMSE is indeed better.
I think you make arguments in favor of focusing on error or the correlation. We included both since DRA seems to be a better fit regardless of how you slice it.
Thanks for reading and for the thoughts.
On the parks, the model explains more variance when both plain old park and raw temperature are included than when only park is included. An interaction between raw temperature and stadium was not adding much value. We feel fine with just letting temperature and park work off each other in the way that they do, for the time being.
http://www.baseballprospectus.com/player_search.php?search_name=Pedro+Martinez
Let's assume that's the case, and the DRA beats FIP when using only last year to predict this year. I would claim that one could quite easily use a modified xERA (adjusted for park and some pitcher qualities) that crushes FIP in accuracy for 1-year predictions. [Source: 10 years as a pro gambler]. I.e., you chose an incredibly easy target in FIP, since it relies on the absurdly noisy HR allowed stat. Don't get me wrong: I'm all in favor of this research. But your validity test should never be something as trivial as beating raw single-season FIP. Beat PECOTA. Beat Steamer. Beat something that a few people actually use for prediction.
But this is just conjecture on my part -- I hope you get an answer, but I think they've already moved on from this article!
The run differential between the two teams at the time of the event;
Is there proof that runs are more or less likely to be scored depending upon the run differential of the two teams? Seems like a weird thing to adjust for, unless I'm missing the point, which I probably am.
It makes sense to me -- the losing team will play small ball late in the game if it's close -- I don't know if the expectation value of runs goes up or down, but it certainly shifts some of the probability from more runs to one run.
I can't get over the thought that it's a bit strange. Maybe just ironic?
It is traditional to score metrics by how well they "predict" or "account" for run-scoring. But the "perfect" metric by that measure will always be run-scoring itself. So, an error rate of basically 0 or correlation of nearly 1 would be meaningless and silly. I think similarity to RA/9 is an important benchmark but I would place an awful lot of weight on whether sound methods are being followed. We'd like to think we can demonstrate both, which is why we feel good about DRA.
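For anyone who wants to see why a zero error is uninformative, here is a small sketch of the RMSE calculation. The numbers are made up for illustration and `metric_a` is a hypothetical pitching metric; neither comes from the article's table.

```python
import math


def rmse(estimates, actuals):
    """Root mean square error between two equal-length sequences."""
    if len(estimates) != len(actuals) or not estimates:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(e - a) ** 2 for e, a in zip(estimates, actuals)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up values, for illustration only.
ra9 = [3.2, 4.5, 5.1, 2.8]        # observed RA/9 for four pitchers
metric_a = [3.6, 4.3, 4.6, 3.3]   # a hypothetical metric's estimates

self_score = rmse(ra9, ra9)         # RA/9 scored against itself
metric_score = rmse(metric_a, ra9)  # the hypothetical metric scored against RA/9
print(self_score, metric_score)
```

RA/9 scored against itself always has an RMSE of exactly zero, so a low RMSE alone cannot show that a metric has isolated the pitcher's contribution.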
E.g., pitchers who have been good will mostly continue to be good.
But to judge how well it works, you see how well it fits with RA/9. What if, for example, FIP is farther from RA/9 because the luck/other factors are actually a bigger part of RA/9 than DRA says?
You could test it against future RA/9 but you mention it isn't meant to be predictive.
So, I think it's important to consider both whether sound methods are being followed and the statistic's ability to account for run expectancy. I feel like we are able to do both.
And it's definitely not a trivial variable when predicting run scoring.
A pitcher who has a higher BABIP is held responsible for that, after we account for defense, park, framing, the umpire strike zone, and all the other factors that we think most likely explain what else besides luck could be the cause of it.
...and...
DRA is an interesting hybrid because we are using data from the past three seasons to "predict" the value of CURRENT events as they happen. Over the course of a season, those events move into the past. We do this in part because otherwise we would not be able to offer in-season predictions: we would have to wait until every season was over before we could fit its events.
But if you're confused, that's probably where it arises from. DRA is "fitting" past events; cFIP is "fitting" anticipated future events. Both are predictions, but only one has significant utility for the future. I hope this helps.