From time to time—if not at all times—organizations must examine their own operations and ask some difficult questions.
The answers often reveal a range of things done right and things done wrong. Healthy organizations can handle those answers in more than one way; there are many routes to success, and even more to failure. But one hallmark of organizational integrity, to borrow from James Collins, is looking in the mirror when assigning blame and out the window when giving praise.
Here at BP we’ve been presented with an opportunity to ask ourselves some questions, and we’ve decided to grapple with the answers, even though in some cases we don't like them. In short, we have work to do in order to live up to our own high expectations. Despite our pride in much of the progress Baseball Prospectus has made, now is not the time to rest on our laurels. And some recent events make that abundantly clear.
After the 2010 season, Colin Wyers wrote about replacement level and how he was improving its integration with the rest of the component stats at Baseball Prospectus.
This is something of a culmination of work I’ve been doing over the past few months—taking a menagerie of stats available here at Baseball Prospectus and merging them together under the heading of “Wins Above Replacement Level.” We’ve had WARP for quite a while—and its close sibling, VORP, as well—but it has been rather distinct from the rest of our offerings. That’s coming to an end.
The goal of making WARP play well with the component statistics left behind at BP by previous staffers was worthwhile, but the implementation caused problems: we inadvertently raised replacement level for 2011 and 2012. Summing the WARP or VORP values for those two seasons produced league totals that weren't in line with pre-2011 data; they were much lower. By implication, replacement level was much higher, meaning a “replacement level team” would win more games than the data had indicated for previous seasons.
At any point starting in about May of 2011, it should have been clear to anyone looking closely at the stats that something was different, and not just because Colin had re-engineered (read: greatly improved) some of the WARP formulae or because offense was down in 2011.
For the record, we know that these re-engineered formulae work. The chart below shows league-wide WARP totals by year since 2000, along with the winning percentage of a notional “replacement level team” (really, it's just a subtraction of WARP from wins, so there's some noise there for a variety of good reasons, but it's close enough to give a good idea).
| Year | WARP | BWARP | PWARP | WARP per Team | Rep. Wins per Team | Rep. Win% |
|------|------|-------|-------|---------------|--------------------|-----------|
| 2000 | 884  | 598   | 286   | 29.5          | 51.5               | 0.318     |
| 2001 | 913  | 588   | 326   | 30.4          | 50.5               | 0.312     |
| 2002 | 921  | 576   | 345   | 30.7          | 50.1               | 0.310     |
| 2003 | 939  | 585   | 354   | 31.3          | 49.7               | 0.307     |
| 2004 | 927  | 578   | 349   | 30.9          | 50.0               | 0.309     |
| 2005 | 941  | 577   | 364   | 31.4          | 49.6               | 0.306     |
| 2006 | 940  | 592   | 348   | 31.3          | 49.6               | 0.306     |
| 2007 | 907  | 588   | 319   | 30.2          | 50.8               | 0.314     |
| 2008 | 906  | 596   | 310   | 30.2          | 50.7               | 0.314     |
| 2009 | 887  | 596   | 291   | 29.6          | 51.4               | 0.317     |
| 2010 | 912  | 602   | 309   | 30.4          | 50.6               | 0.312     |
| 2011 | 838  | 563   | 275   | 27.9          | 53.0               | 0.328     |
| 2012 | 891  | 573   | 318   | 29.7          | 51.3               | 0.317     |
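For anyone who wants to check the arithmetic, the last three columns follow directly from the league WARP totals. Here's a minimal sketch in Python, assuming a 30-team league and a 162-game schedule (so the average team wins 81 games); small rounding differences from the chart are expected:

```python
# Reproduce the replacement-level columns from the league WARP totals.
# Assumes a 30-team league and a 162-game schedule (true for 2000-2012),
# so the average team wins 81 games.
TEAMS, GAMES, AVG_WINS = 30, 162, 81

def replacement_line(league_warp):
    warp_per_team = league_warp / TEAMS
    rep_wins = AVG_WINS - warp_per_team  # wins left after subtracting WARP
    return round(warp_per_team, 1), round(rep_wins, 1), round(rep_wins / GAMES, 3)

for year, warp in [(2010, 912), (2011, 838), (2012, 891)]:
    print(year, replacement_line(warp))
# 2010 (30.4, 50.6, 0.312)  -- matches the chart above, within rounding
# 2011 (27.9, 53.1, 0.328)
# 2012 (29.7, 51.3, 0.317)
```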
Voila! Exactly the results we'd hoped to get.
Except…
One of the steps we take to improve the speed of queries—and thus to expand the scope of subjects we are able to research—is to put the seasonal replacement level for each event into our events database. In that process, we allowed some bad data to be introduced in 2011. We didn't catch it. It really was that simple, the data equivalent of a typo. We’ve corrected the data, and Baseball Prospectus WARP values for 2011 and 2012 are now representative of the theory we meant for them to represent.
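For readers curious what that step looks like in practice, here is a hypothetical sketch of the denormalization, plus the kind of audit query that would have flagged the bad rows. The table and column names (events, season_baselines, rep_level) are invented for illustration and are not BP's actual schema:

```python
import sqlite3

conn = sqlite3.connect("bp_events.db")  # hypothetical events database

# Denormalize: stamp each event row with its season's replacement level,
# so per-event queries don't need a join against the baselines table.
conn.execute("""
    UPDATE events
    SET rep_level = (SELECT rep_level
                     FROM season_baselines b
                     WHERE b.season = events.season)
""")
conn.commit()

# Audit query, run independently after any load: every event must carry
# exactly the canonical baseline value for its season. A typo like the
# 2011 one would show up here as a nonempty result.
bad = conn.execute("""
    SELECT e.season, COUNT(*)
    FROM events e
    JOIN season_baselines b ON b.season = e.season
    WHERE e.rep_level <> b.rep_level
    GROUP BY e.season
""").fetchall()
assert not bad, f"seasons with stale replacement levels: {bad}"
```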
Two additional things need to be pointed out about the scope of this problem. First, VORP was also affected, though FRAA and BRR were not; this was entirely an “at the plate” and “on the mound” problem. Second, slight adjustments to some previous-season WARP values were made, because some of our calculations rely on a multi-year smoothing of baseline data, even including forward-looking data when available.
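To see why a one-season fix can ripple into neighboring seasons, consider a toy version of that smoothing: a weighted average of the seasonal baseline over a window centered on the season in question, using later years when they exist. The three-year window and 1-2-1 weights below are illustrative assumptions, not BP's published method:

```python
def smooth_baseline(baselines, year, weights=(1, 2, 1)):
    """Weighted 3-year moving average of a seasonal baseline.

    Pulls in the prior and following seasons when present (which is why
    correcting 2011's data can nudge 2010's values), and falls back to
    whatever neighbors exist at the edges of the data.
    """
    window = zip((year - 1, year, year + 1), weights)
    pairs = [(baselines[y], w) for y, w in window if y in baselines]
    return sum(v * w for v, w in pairs) / sum(w for _, w in pairs)

baselines = {2009: 0.317, 2010: 0.312, 2011: 0.328}  # made-up values
print(smooth_baseline(baselines, 2010))  # blends 2009, 2010, and 2011
```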
While we're on the subject of evaluating our data, we've decided, after extensive testing, that the 10-year projections just weren't producing the results we desired. It's difficult to evaluate long-term projections, and we intend to make that evaluation a more standardized, easily repeatable process in the future. But we hold our work to a certain standard, and in this case we didn't feel that standard was being met. Rather than put out an inferior product, we've sent the design team back to the drawing board to get 10-year projections and UPSIDE correctly formulated and out to the public in a timely manner. We will be releasing the PECOTA percentiles soon, and that will conclude our pre-season projections releases.
It's not enough to fix these problems. We will be addressing these issues at their root—with a hard look at and overhaul of our internal processes and quality control.
But we also want to regain your trust. So we’re going to open the kimono and make our work transparent. Not only will this create a wealth of knowledge for everyone involved—readers and writers alike—but it will give BP the opportunity to leverage the wisdom of crowds.
We've recently named Harry Pavlidis our Director of Data Analysis. His first responsibility is to lead this effort. It will be a team undertaking, with all hands on deck. We will be sharing our progress and plans as they develop. But right now we're looking in the mirror. Looking hard.
Harry's first task is to conduct a full audit of our systems and stats. In essence, we're making him do his "day job"—assessing our systems and developing a plan to move forward. Harry will be bringing a process-driven approach to the effort, with the ultimate goal of improving our stat offerings. The experience he has in this area ranges from tiny start-ups to large, publicly traded companies. We'll all be working together to find the best-fitting tools and processes to bring BP up to the level of operational excellence we all expect.
Finally, I want to personally apologize for any inconvenience we may have caused our readers. The people we employ at BP are perfectionists. They spend more hours than anyone knows to get things done right and in a timely manner. They love this game and this company with a passion and will gladly fall on their swords if it means building a bigger and better Baseball Prospectus in the future. But if something goes awry at BP, it’s my fault and mine alone. I’m ultimately in charge, and I take full responsibility for any and all of our shortcomings. I’ve made mistakes and deserve any criticisms I receive. I may hold Baseball Prospectus to a high standard, but I hold myself to an even higher one. I’m sincerely sorry, and I promise you that I will continue to devote my blood, sweat, and tears to making BP the best it can possibly be.
Thank you for reading.
As a pitchfork-and-torch salesman to the PECOTA mob the past few years, I've been very disappointed in the contrast between confidence level expressed and actual performance. (When Wayne Causey is your best comp for Bryce Harper, you're doing it wrong; predicted results for teams should average fewer than 86 wins.)
I hope that things go well on this front, and I wish y'all the best. I agree that there's a significant process problem. There are some good things BP is doing (Scoresheet Draft Aid!), and there are good articles. I'm glad to see that efforts are being made to fix some product rather than just have BP go to pure cash-cow mode.
Shorter ramble: Humility - while often a vice - is good if it leads to fixing.
--JRM
One question: Dave Cameron and Sean Forman are rumored to be trying to come together on an agreed-upon replacement level. Even though their WAR stats are computed differently, the idea is that people will have more faith in them if they start from the same place. Has BP given any thought to joining the discussion? The idea of an industry consensus on replacement level is highly intriguing.
P.S. I do play in fantasy keeper leagues, and a useful UPSIDE number seems like a great way to evaluate players for the (very) long term. Thanks for taking a new look at that and trying to come up with a more meaningful number. I hope that forward-looking GMs may look at it too (are you listening, Ruben Amaro Jr.?).
Lloyd Cole
PHILADELPHIA (land of short-term baseball thinking)
I'm sure Harry will do a great job with that task. But I have seen efforts to improve a development process lead to one or more of the following outcomes:
1. Being overly ambitious and never actually getting implemented
2. Having too great an impact on delivery of the product
3. Being too general a solution and not solving the original problem
I am less concerned that a mistake was made and more concerned that lessons are learned. Whatever you can share would be appreciated, especially if you have some confidence you can actually improve the process.
You said something like this is ultimately your fault, Joe... Did Colin and his team tell you the product was susceptible to errors if they didn't improve the development process? And did you ignore said advice? What mistakes did you make? What would you have done differently? What are you going to do differently? Are you proposing changes that will affect the product delivery timeline? Or subscriber fees?
As was mentioned, we all appreciate your hard work and effort. Most of us understand mistakes can be made. The apology in the last paragraph makes me uncomfortable, though. I'm not sure of the right way to put this, other than to ask: if you are apologizing emphatically for a problem that may be a natural consequence of developing a complex system, why would I think even more of your blood, sweat, and tears is going to help?
Thanks.
Logic tells me that since offense and defense play an equal part in scoring, the fielding component should be half the difference between BWARP and PWARP, but there may be reasons why that isn't true, so I am curious as to the numbers here.
Thanks for writing. It's not a constant. It changes over time, mostly as a function of BIP (balls in play) rates: the more TTO (three true outcomes: walks, strikeouts, and home runs), the fewer balls in play, so fielding matters less and pitching WARP grows as a percentage of total WARP.
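A toy model of the effect, with invented numbers (the 70/30 split below is for illustration only, not our actual apportionment): give pitchers full credit on TTO events and a fixed share of credit on balls in play, with fielders taking the rest. As TTO rises, the ball-in-play pool shrinks, and pitching's share of the total grows:

```python
# Toy model: pitchers get full credit on TTO events and a fixed share of
# credit on balls in play; fielders get the remainder. The 0.7 split is
# invented for illustration.
def pitching_share(tto_rate, pitcher_bip_share=0.7):
    bip_rate = 1 - tto_rate
    pitching = tto_rate + pitcher_bip_share * bip_rate
    fielding = (1 - pitcher_bip_share) * bip_rate
    return pitching / (pitching + fielding)

print(round(pitching_share(0.28), 3))  # 0.784: lower-TTO season
print(round(pitching_share(0.33), 3))  # 0.799: higher-TTO season
```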
Okay, so fielding WARP is relative to pitching WARP. (As an aside, note that if you mouse over PWARP you get "Wins above replacement level as a BATTER" - really)
The question then is, if we break BWARP into its components, does batting WARP = pitching WARP + fielding WARP?
If it doesn't, what causes the variance between offense and defense? Is there more or less BWARP based on increased (or decreased) offense, or vice versa? Just trying to pin things down here.
I'm just trying to figure out what the chart is. I see a drop in WARP between 2010 and 2011 of about 8%, with a little over 6% being recovered in 2012. Is this significant drop due to the "typo", as you put it? Was the 2012 "correction" due to your fixing the "mistake"? Or are these the numbers that are the end result of all the efforts you've made to get things where they should be? If these are the "correct" numbers, then what causes that kind of severe drop in a relative statistic, followed by a significant return in the other direction? Even if there is a drop in offense, the relative WARP should be fairly stable, unless the numbers inherently differ between higher and lower scoring eras. A year-to-year variance like that in the quality of replacement player seems quite odd, if that was the case.
You are correct that the numbers are inherently different based on league offensive rates (among other things). While this chart seems to indicate that 2011 had a high level of wins for a 'replacement team' (a nebulous concept at best), that's a quirk of starting the chart at the year 2000; going back further, 2011 is well within the historical range, not the outlier it appears to be when 2000 is the starting year.
So let's take a look at 2002 and 2011. Between those seasons, run scoring declined approximately 7%. At the same time, strikeouts increased 10% and HR declined 10%, both moving the TTO mix in the pitchers' favor, as did the decline in walks, also 7%. You said in your previous post that the more TTO, the more PWARP increases as a percentage of total WARP. Yet despite these changes in TTO, all in the pitchers' favor, the percentage of PWARP dropped from 37.4% of total WARP to 32.8%. How did this happen?
Also, what caused the 8% drop in total WARP in 2011 and the subsequent 6% increase in 2012? Was it a great year in the Pacific Coast League?
It's been over 40 years since I took a college math course, and I freely admit that when confronted with a series of formulas and equations, my eyes glaze over. But before I dive into any series laying bare the entrails of WARP, I still want answers to my questions.
You see, Rob answered my first question with a very logical construct: TTO goes up, and pitcher impact, stated as PWARP as a percentage of total WARP, increases. Makes absolute sense. Unfortunately, the numbers don't support it; they are the equivalent of dropping a rock and having it float skyward. Now I would wonder why something like that happened, especially if I was in the rock-dropping business. I expect your numbers people asked and answered that question, and I would like to hear it. If they don't have an explanation, then I have little interest in seeing the layers peeled away; I don't want to cook with that onion.
Similarly, the second question seems like an obvious one. An 8% drop in WARP one year followed by a 6% rise the next: if I were involved in the analysis of that metric, my first question would be why this happened. There may be a simple answer, one for each year. I just want to hear it, because another question should follow that one: what is the effect of this drop/rebound? Was there an 8% drop in WARP across the board? Or was some class or group of WARP scores changed more than others? This goes to the heart of the reliability of the metric itself.
I need answers.
Out of curiosity, full disclosure-wise: when did BP become aware that bad data had entered the system?