Yesterday, Dave Pease talked about how the PECOTA forecasts have been generated in the past. Now I’m here to talk about how PECOTA has fared, and where it’s headed.
First, let’s look at the hitter forecasts for this season. We have four test candidates:
- The PECOTA spreadsheets available to subscribers on April 4 (that is, Opening Day),
- The PECOTA forecasts published in the Baseball Prospectus 2010 annual,
- The Marcel forecasts, published by Tom Tango, and
- The CHONE forecasts available on February 28, published by Sean Smith.
There are a lot of other forecasting systems in the wild; I chose to look at Marcel and CHONE in comparison because they’ve fared well historically and they have sound underpinnings.
Let’s look at how well each system forecast the overall offensive level of the league as a whole. We’ll use OPS, since it’s a “good enough” offensive estimate for the sort of study we’re doing, and since nearly every forecaster publishes it, it’s a very transparent way to compare systems. Looking only at players common to all four projection sets, here are the weighted averages of OBP, SLG, and OPS for each system (a quick sketch of the computation follows below):
| System         | OBP   | SLG   | OPS   |
| -------------- | ----- | ----- | ----- |
| Obs.           | 0.332 | 0.414 | 0.746 |
| PECOTA (April) | 0.342 | 0.427 | 0.769 |
| PECOTA (book)  | 0.344 | 0.430 | 0.774 |
| Marcels        | 0.340 | 0.437 | 0.777 |
| CHONE          | 0.339 | 0.430 | 0.769 |
This was a down year for offense on the whole; the most recent PECOTAs (the April set) and the CHONE forecasts were a shade closer on projecting the league offensive environment, but even they came in 23 points of OPS above the observed result.
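For concreteness, here is a minimal sketch of the league-level comparison above. The use of plate appearances as the weights is my assumption; the table says only “weighted average.”

```python
def weighted_mean(values, weights):
    """Weighted average of a rate stat (e.g. OPS) across players."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Toy inputs: OPS forecasts and observed plate appearances for three
# players common to all the projection sets. Numbers are invented.
ops_forecast = [0.750, 0.810, 0.695]
plate_appearances = [620, 540, 480]

print(round(weighted_mean(ops_forecast, plate_appearances), 3))
```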
So I adjusted each set of forecasts to line up with the lower offensive environment, and looked at the root mean square error of each forecast from the observed result, weighted by the number of plate appearances each player had. (Root mean square error represents the standard error: in other words, roughly 68 percent of the time, you should expect outcomes to fall within that distance of the forecast.) A sketch of both steps appears below; after that, the most boring chart I have ever had the honor of showing you.
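Here is a minimal sketch of those two steps: rescaling each system’s forecasts to the observed league environment, then taking the PA-weighted root mean square error. The multiplicative rescaling is my assumption; the exact form of the adjustment isn’t specified above.

```python
import math

def adjust_to_environment(projected, observed, weights):
    """Rescale forecasts so their weighted mean matches the observed league level.

    Multiplicative rescaling is an assumption here; an additive shift
    would be a reasonable alternative.
    """
    total_w = sum(weights)
    proj_mean = sum(p * w for p, w in zip(projected, weights)) / total_w
    obs_mean = sum(o * w for o, w in zip(observed, weights)) / total_w
    return [p * obs_mean / proj_mean for p in projected]

def weighted_rmse(projected, observed, weights):
    """PA-weighted root mean square error of forecast versus outcome."""
    sq = sum(w * (p - o) ** 2 for p, o, w in zip(projected, observed, weights))
    return math.sqrt(sq / sum(weights))
```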
| System         | OBP RMSE | SLG RMSE | OPS RMSE |
| -------------- | -------- | -------- | -------- |
| PECOTA (April) | 0.032    | 0.061    | 0.069    |
| PECOTA (book)  | 0.032    | 0.062    | 0.069    |
| Marcels        | 0.032    | 0.062    | 0.070    |
| CHONE          | 0.032    | 0.061    | 0.069    |
That’s, uh, not a lot of difference. Strictly speaking, it’s no difference at all. Now let’s take a look at your standard rotisserie categories in fantasy baseball:
| System         | AVG RMSE | R RMSE | RBI RMSE | HR RMSE | SB RMSE |
| -------------- | -------- | ------ | -------- | ------- | ------- |
| PECOTA (April) | 0.030    | 22.5   | 22.3     | 6.9     | 5.9     |
| PECOTA (book)  | 0.031    | 27.3   | 27.8     | 7.8     | 6.3     |
| Marcels        | 0.032    | 23.7   | 24.2     | 6.8     | 6.1     |
| CHONE          | 0.030    | 27.7   | 27.3     | 7.5     | 6.7     |
We see a bit more separation here, but not much. PECOTA and CHONE were tops at predicting batting average, the Marcels were best at predicting home runs, and PECOTA led in runs scored, RBI, and stolen bases. (PECOTA looks better on these projections because we use our Depth Charts to model a player’s specific role and playing time; that has little to no impact on his OPS, but it has a big effect on his counting stats, and thus his fantasy value. The toy example below shows why.)
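The playing-time point is easy to see with a toy example: a counting-stat forecast is, roughly, a rate forecast times a playing-time forecast, so role-aware playing time moves home runs, runs, RBI, and steals without touching OPS. The numbers below are invented for illustration.

```python
hr_per_pa = 0.045        # projected home-run rate (invented)
pa_generic = 550         # a generic playing-time guess
pa_depth_chart = 420     # role-aware playing time, e.g. a platoon job

# The rate forecast is identical either way, so rate-stat accuracy (OPS)
# is unaffected, but the counting-stat forecast differs by ~6 home runs.
print(round(hr_per_pa * pa_generic))      # about 25 HR
print(round(hr_per_pa * pa_depth_chart))  # about 19 HR
```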
Now, let’s look at pitchers, using ERA as our measurement. First, the predicted versus observed ERAs, as a group:
| System         | ERA  |
| -------------- | ---- |
| Obs.           | 4.06 |
| PECOTA (April) | 4.12 |
| PECOTA (book)  | 4.50 |
| Marcels        | 4.18 |
| CHONE          | 4.24 |
Again, the run environment this year was lower than any of these forecasts expected it to be. After adjusting the forecasts to account for the difference between the expected and actual run environments (the same rescaling sketched earlier), here’s how each forecasting system did:
| System         | ERA RMSE |
| -------------- | -------- |
| PECOTA (April) | 1.17     |
| PECOTA (book)  | 1.21     |
| Marcels        | 1.21     |
| CHONE          | 1.18     |
We see a little more separation in the pitcher projections than we did in the hitter projections, but not a lot. Looking at the other roto categories for pitchers (CHONE doesn’t project saves, so it received no score in that category):
| System         | WHIP RMSE | SV RMSE | W RMSE | SO RMSE |
| -------------- | --------- | ------- | ------ | ------- |
| PECOTA (April) | 0.20      | 4.7     | 3.2    | 33.6    |
| PECOTA (book)  | 0.22      | 5.4     | 3.6    | 38.1    |
| Marcels        | 0.36      | 5.1     | 3.6    | 36.5    |
| CHONE          | 0.20      | —       | 3.7    | 40.2    |
PECOTA runs the table here, tying CHONE in WHIP and leading in every other category. Again, PECOTA gets ahead here largely through the playing time forecasts driven by the Depth Charts.
These are not especially surprising results. This has been the state of forecasting for a while now—you have a pretty tight bunching of the advanced forecasting systems. (That’s what Nate Silver found back in ’07, for instance; even in ’04, hitting forecasts were pretty tightly bunched.) It’s been a while since PECOTA’s main competitors were a handful of fantasy touts and the Favorite Toy, and these results reflect that.
Now, of course, PECOTA has always done more than simply project a player’s basic stat line; we have a lot of other things going on, like the 10-year forecasts, the percentiles, and the upside/downside ratings. That extra depth is one of PECOTA’s main attractions, but it shouldn’t be allowed to become its major downside as well.
One of the drawbacks of PECOTA’s additional complexity is simply how long it takes to produce the forecasts. But that was also a consequence of using Excel to generate them. We cut that dependency a while ago, and we’re continuing to work on integrating PECOTA more closely with our other statistical offerings. That’s important to you because it means you get your forecasts sooner; the word “fore” is, of course, a major component of “forecasting.”
But it’s also important for the accuracy of any individual forecast. I can take one hitter’s forecast and substitute any number of outlandish findings for him, and that on its own won’t move the needle on those RMSE figures I showed you; it takes a systemic problem affecting a lot of forecasts to show up in that sort of test. (The quick sketch below illustrates why.)
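A quick sketch of the dilution effect, with invented numbers: spread a typical season’s worth of OPS misses across a few hundred hitters, then make one forecast wildly wrong, and the PA-weighted RMSE barely moves.

```python
import math
import random

random.seed(1)
n_players = 400
# Simulated forecast errors with a typical OPS miss of about .069 (see table above).
errors = [random.gauss(0, 0.069) for _ in range(n_players)]
pa = [500] * n_players   # equal playing time, for simplicity

def weighted_rmse(errs, weights):
    return math.sqrt(sum(w * e * e for e, w in zip(errs, weights)) / sum(weights))

print(round(weighted_rmse(errors, pa), 4))   # baseline, roughly 0.069
errors[0] = 0.300                            # one outlandish, Wieters-sized miss
print(round(weighted_rmse(errors, pa), 4))   # moves by only a point or two
```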
And PECOTA is a computer program—essentially, a list of instructions. It will follow those instructions unerringly, regardless of whether those instructions are correct. It takes a human to write instructions for the computer to follow, and as we all know, humans make mistakes now and then.
Some of you may remember the PECOTA forecast for Matt Wieters’ debut season. It struck a lot of people as outlandish; I was certainly one of them. In this case, PECOTA was a victim of its own complexity: because the forecasts took so long to produce, there wasn’t enough time to properly proof them.
Forecasts for minor league players depend heavily on methods of putting their stats in terms of expected major league performance; in this case, that means the Davenport Translations. The foundation of that process is a set of league difficulty ratings that establish how each league compares to the majors.
What seems to have happened is that, when spitting out translations for the two leagues Wieters played in (and only those two leagues, mind you), the league difficulty factors came out significantly inflated from what they should have been: the Double-A Eastern League was rated not only above the other two Double-A leagues, but above both Triple-A leagues, and the High-A Carolina League placed above both of the other Double-A leagues as well.
For most players, that wasn’t going to make a noticeable impact—very few players who are expected to be anywhere close to the majors have only one year of stats split between the Eastern and Carolina Leagues. But one is enough to produce the Wieters forecast.
For last year’s book, we had a fairly involved proofing process for the PECOTAs. (Notably, we neglected to do that for the first run of the Depth Charts. There’s a lesson to be learned there, and we’ve learned it: proof everything before publishing.) That’s good, but we want to do better. So in addition to having humans proof the PECOTAs, we’re building a set of unit tests to run alongside them, testing each element to make sure it’s functioning properly. That means the output of the PECOTAs is going to be checked at several steps along the process, to ensure that everything is working correctly.
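As a hypothetical illustration of the kind of check involved (the factor values, names, and structure below are mine, not the actual test suite), a unit test over the translation difficulty ratings could have flagged the Wieters problem automatically:

```python
import unittest

# Hypothetical difficulty ratings, keyed by league; higher means tougher.
LEAGUE_DIFFICULTY = {
    "International (AAA)": 0.86,
    "Pacific Coast (AAA)": 0.85,
    "Eastern (AA)": 0.79,
    "Southern (AA)": 0.78,
    "Texas (AA)": 0.78,
    "Carolina (A+)": 0.70,
}

def level(name):
    """Extract the classification tag, e.g. 'AAA', from a league name."""
    return name[name.index("(") + 1 : name.index(")")]

class TestLeagueDifficulty(unittest.TestCase):
    def test_levels_are_ordered(self):
        """No Double-A league should rate above any Triple-A league, and no
        High-A league above any Double-A league."""
        by_level = {}
        for league, factor in LEAGUE_DIFFICULTY.items():
            by_level.setdefault(level(league), []).append(factor)
        self.assertLess(max(by_level["AA"]), min(by_level["AAA"]))
        self.assertLess(max(by_level["A+"]), min(by_level["AA"]))

if __name__ == "__main__":
    unittest.main()
```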
We’re also using these tests to make sure that when changes are made to the PECOTAs, they actually improve the underlying accuracy of the product. And we will notify subscribers when the methods change between PECOTA updates.
Of course, PECOTA has had some infamously mistaken forecasts that no amount of proofing would have caught. Most of them have been of Ichiro Suzuki. Tomorrow, we’ll address how PECOTA missed the boat on Ichiro, and what we’ve learned from those mistakes.
I'd rather put in my league's parameters and have you tell me how valuable your projections think a player will be in that league (perhaps via VORP?), not something based on an (I assume) arbitrary value where the total allocation defaults to 260, 180 of which is assigned to hitting.
((-9.8*LN(PICK))+57.8)
Kinda the opposite of what you're looking for, but maybe the math whizzes can take a shot at it with that knowledge.
Is it fair to normalize every projection to the league offensive environment? My instinct says no; the projections are the projections. However, I'd like to hear some arguments from both sides on this one as I'm not too sure what is most fair.
That said - it's not like doing it that way helped PECOTA in those tests. It was the leader in identifying the average OPS (tied with CHONE) and ERA.
I think it's understandable that it tweaked a few people and good to know that it wasn't meant that way.
If the intent was limited to a quasi joke, why do you have it here:
http://www.baseballprospectus.com/subscriptions/
"Complete depth charts and forecasts for AL and NL pitchers and hitters using Baseball Prospectus' deadly-accurate PECOTA projection system--the same one used in MLB front offices."
Publisher: "How should we describe this PECOTA thingamajig of yours?"
Goldman: "Possibly Above-Average, and Possibly Below Marcel"
Publisher: "Nah, too long, it won't fit on the cover of the book. Anything else?"
Goldman: "Uh, how about deadly accurate?"
Publisher: "I like it! That's gold!"
And since it's a big joke, there's no need to support the claim, nor defend it from others as they object to it.
Therefore, kudos to Steven for the genius of it all. You got me.
I guess that I am just bemused by the amount of invective that the "deadly accurate" marketing slogan has generated. The purpose of a marketing slogan is to generate interest in your product. The cover of the annual isn't designed to attract buyers from the sabermetric community, it's designed to generate interest from the public at large.
My sense is that others doing great analysis have taken the "deadly accurate" thing as a slight on their work. If that is the case, I have to disagree. The slogan exists to promote BP's product (and it isn't like they were selling snake oil). I don't see any obligation to compare their product to competing products, or even to acknowledge that there are competing products.
I'm all for accountability, and I salute Colin's efforts in this regard. But…
The point I was trying to make is that I don't want to discourage marketing of advanced analysis. I get tired of explaining to every Tiger fan I know that Austin Jackson is more likely to hit .260 next year than hit .300 again. And have them look at me like I'm nuts.
So I get frustrated when I see any discord among the sabermetric community (real or imagined by me). Because I'd like to see everyone tugging on the same rope and getting the word out. My introduction to sabermetrics went a little something like Rob Neyer-->Baseball Prospectus-->Hardball Times-->Bill James (yes, he came 4th to me)-->Tom Tango-->Fangraphs. I had to get started somewhere, and for me it was ESPN.com in 1998.
If it takes a person reading the BP2010 annual for the first time another three years to realize that CHONE, PECOTA, and Marcel all tell a pretty similar story, that's OK with me; at least they're reading part of the story, and hopefully their curiosity is piqued to read further.
I guess I'm saying that even if the marketing is imperfect, at least the marketing is happening. I think in the long run it benefits the entire sabermetric community, not just BP. I view this as a good thing.
(And now I'll be quiet.)
As long as we can get away from "mine is better than yours," and into "mine works best here, and yours works best there," that would go a long way. The exception, of course, is those situations where something is genuinely deficient and should be supplanted.
So CHONE, Marcel, ZiPS, and PECOTA can all live happily together, each having its own strengths, with none deserving to be discarded.
Great job Randy, great job.
Back in the day when I played competitive sports, we used to give ourselves nicknames for all the same reasons. Well, except for the marketing angle.
My major concern is with the use of standard-error comparisons to demonstrate anything useful or important! It says to me only that the other major projections are about the same over the set of all players. I am more interested in the projections for certain groupings, e.g. "NL," "AL," "Stars," "Bums," "Everyday," and so on.
My reason for using Prospectus is to help brighten my team's future performance by improving my drafting. It is the distribution of projection error within groups that is most important, I think.
Obviously, this would be a bit more labor-intensive, and you couldn't cross-compare to other systems (which don't provide percentile bands), but it would be VERY interesting to know whether, for example, approximately 10% of players actually meet their 90th-percentile forecast. Does PECOTA accurately assess the overall number of "breakouts" and "collapses"?
From there, we could sort by stat, position, league, etc. to determine if there are specific trends or categories we might want to pay attention to in the next iteration.
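In the same spirit, here's a minimal sketch of the calibration check being suggested, assuming you had per-player 90th-percentile forecasts and observed outcomes in hand (neither of which is public; the function is hypothetical):

```python
def percentile_hit_rate(p90_forecasts, outcomes):
    """Fraction of players whose actual result met or beat their
    90th-percentile forecast. A well-calibrated system should come
    out near 0.10."""
    hits = sum(1 for p90, actual in zip(p90_forecasts, outcomes) if actual >= p90)
    return hits / len(p90_forecasts)
```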
Appreciate the honest and transparent analysis, but boy this one hurts.
I recognize that this would be a huge undertaking, you might not have access to all the right data required, and you are surely busy with other stuff, but it would be cool to have.
I have a database table with, for every player who played in the majors from '50 to '09, what his PECOTA baseline projection (minus aging) would have been for the next season. I have another table with each player's Marcel forecast. Then a comparison similar to the one presented here is run. Once we finish work on the revised age adjustments, those will go into the test suite as well.
(Obviously, with some things, like minor league stats, we're not going to be able to do 60 years' worth of tests, but the concept is similar.)
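Sketching the harness being described, with an invented in-memory stand-in for those database tables (the keys, stats, and numbers are all placeholders, not the actual schema):

```python
# (player_id, season) -> projected OPS for the following season
pecota_baseline = {("player_a", 1987): 0.742, ("player_b", 1987): 0.801}
# (player_id, season) -> (observed OPS, plate appearances) that following season
results = {("player_a", 1987): (0.731, 612), ("player_b", 1987): (0.765, 544)}

def backtest_rmse(projections, outcomes):
    """PA-weighted RMSE over every player-season present in both tables."""
    num = den = 0.0
    for key, proj in projections.items():
        if key in outcomes:
            actual, pa = outcomes[key]
            num += pa * (proj - actual) ** 2
            den += pa
    return (num / den) ** 0.5

print(round(backtest_rmse(pecota_baseline, results), 4))
```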
For those interested, tomorrow we're going to be presenting the baseline forecasts (again, without age adjustments) for Ichiro using the current PECOTA methodology.
A) Will be 25 years old in 2011.
B) Was an uber prospect based largely on scouting and college performance - not his 2009 PECOTAs.
There's still a pretty good chance he emerges as one of the best catchers in baseball. Let's not write his obituary just yet. Obviously, watching him recently, he doesn't look like an impact bat, but the tools are still there, and no one should be shocked in the slightest if he turns it around.
Most places would gloss over issues like the ones you had to deal with at the beginning of the year; you guys have overcome them. I'm still a bit burned over the extremely low BABIPs listed for Oakland pitchers in their Depth Charts, though.
I never could get a straight answer as to what ended up causing it…