Let’s talk percentiles.
It’s probably the most famous thing about PECOTA-the fact that we provide a range of forecasts instead of just a single point estimate. Earlier this week, I talked about the accuracy of the weighted mean forecasts. But what about the percentiles?
First, some notes about the percentiles. They are derived based upon the overall unit of production (TAv for hitters, ERA for pitchers), not the underlying components. This is important, because a hitter who hits more home runs than we expect (I hesitate to call it luck-he may have been underestimated, or he may have found a way to improve his talent) isn’t necessarily going to improve his rate of hitting singles by the same amount, or at all.
What this means is that you can’t look at a single stat (say, hits or strikeouts) and think that’s the range of expectations PECOTA has for that skill. The percentiles are supposed to reflect what we know about the distribution of a player’s skill, but they are in essence the average batting line we should expect from that player if he puts up that level of performance in that season. There are a lot of different shapes that performance could take, however, and that means there’s more variance in any single component than is reflected in the percentiles. So the correct test of the percentiles is the overall level of performance, not the underlying components.
The other thing to note is that the observed performance of any individual player is a function of his playing time-the less playing time a player has, the more variance we expect in his overall performance. Things have a tendency to even out over time (although a tendency is not the same thing as a guarantee), and so the spread of observed performance goes down as playing time goes up. If a player is projected for a full season’s worth of playing time, and only ends up playing 50 games or so, the percentiles are going to be too tight. That’s not a bug-it’s impossible to make one set of percentiles that functions across any amount of playing time.
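To make that concrete, here is a minimal sketch (not PECOTA's actual machinery) of how the spread of an observed rate stat shrinks as playing time grows; the .330 "true rate" and the PA counts are made-up assumptions:

```python
# A minimal sketch (not PECOTA's method): the spread of an observed rate stat
# shrinks roughly with the square root of playing time. The .330 "true rate"
# and the PA counts are illustrative assumptions.
import random

def observed_rate_sd(true_rate, pa, seasons=5_000):
    """Standard deviation of the observed rate across many simulated part-seasons."""
    rates = []
    for _ in range(seasons):
        successes = sum(1 for _ in range(pa) if random.random() < true_rate)
        rates.append(successes / pa)
    mean = sum(rates) / seasons
    return (sum((r - mean) ** 2 for r in rates) / seasons) ** 0.5

for pa in (150, 300, 600):
    print(pa, round(observed_rate_sd(0.330, pa), 4))
# The spread at 150 PA is roughly twice the spread at 600 PA, which is why one
# set of full-season percentiles can't also describe a 50-game season.
```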
Let’s start off with the hitters. Looking at only players with at least 300 PA, here’s how the distribution of players looks:
|         | DIFF20 | DIFF40 | DIFF60 | DIFF80 |
| Overall | 23.9%  | 34.9%  | 49.2%  | 63.5%  |
| Up      | 17.6%  | 24.7%  | 30.8%  | 36.7%  |
| Down    | 6.4%   | 10.3%  | 18.4%  | 26.8%  |
Going from left to right-DIFF20 refers to the percentage of players between their 40th and 60th percentiles, through to DIFF80, which represents the percentage of players between their 10th and 90th percentiles. The second row represents those players above the 50th percentile; the third row represents players below the 50th percentile. Adding up plus down gives you the overall percentage.
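For what it's worth, a table like this could be tabulated along the lines of the sketch below, assuming you already know each player's realized percentile within his own PECOTA distribution; the band edges and the handling of exactly-50th-percentile outcomes are my assumptions, not necessarily how the table above was built:

```python
# A rough sketch of how the table could be tabulated, given each player's realized
# percentile (0-100) within his own PECOTA distribution. The band edges and the
# handling of exactly-50th outcomes are my assumptions.
def diff_table(realized_percentiles):
    bands = {"DIFF20": 10, "DIFF40": 20, "DIFF60": 30, "DIFF80": 40}
    n = len(realized_percentiles)
    rows = {}
    for name, half_width in bands.items():
        lo, hi = 50 - half_width, 50 + half_width
        up = sum(1 for p in realized_percentiles if 50 < p <= hi)     # above the 50th
        down = sum(1 for p in realized_percentiles if lo <= p <= 50)  # at or below it
        rows[name] = {"Up": up / n, "Down": down / n, "Overall": (up + down) / n}
    return rows

# Hypothetical input: three players near the middle, one big over-performer.
print(diff_table([48, 55, 62, 95]))
# With calibrated percentiles and a large sample, "Overall" tends toward 20/40/60/80%.
```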
What we should want to see is DIFF20 equal to 20 percent, etc. We don’t quite see it, though. It may be a bit more helpful to look at a histogram:
[Histogram: hitters with at least 300 PA, binned by the decile of their own PECOTA forecast that their actual performance fell into]
The first thing that sticks out should be the fact that most players are in the 50th to 60th percentiles, by a large margin. Why? Fundamentally, players who perform above their expectations are more likely to get playing time than players who perform below their expectations. This isn't something that should surprise us-this is why we have the weighted means forecasts for PECOTA, which explicitly take this fact into account. (This is also probably the explanation for why DIFF20 exceeds 20 percent.)
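A hedged little simulation of that selection effect, with every number invented purely for illustration:

```python
# A hedged simulation of that selection effect; every number here is invented.
import random

random.seed(1)
players = []
for _ in range(10_000):
    pct = random.uniform(0, 100)              # realized percentile vs. the forecast
    pa = 250 + 3 * pct + random.gauss(0, 60)  # assumed: better outcomes earn more PA
    players.append((pct, pa))

kept = [pct for pct, pa in players if pa >= 300]  # mimic the 300 PA cutoff
above = sum(1 for pct in kept if pct > 50) / len(kept)
print(f"share of the >=300 PA sample above the 50th percentile: {above:.2f}")
# Even though the full population is uniform by construction, the filtered sample
# comes out well above 50% -- the same lean the real table shows.
```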
But there’s also more variation in observed performance than what the percentiles expect. Let’s consider the reasons we see variation from what our projections expect. The first point I want to make is that forecasting is not mathamancy; there’s no such thing as a perfect forecast, except in hindsight. PECOTA utilizes a two-stage process:
- As described earlier this week, we generate a baseline forecast based on a player’s past performance, and
- We adjust for our expectation of how a player will age, using baseline “forecasts” for comparable players to create a custom aging curve-what Nate Silver would refer to as the “career path adjustment.”
Both of those estimates are subject to a measure of uncertainty. The third source of variation is simply randomness. We use the observed variation of the performance of the comps to model this variance.
Not all forecasts have the same expected variance, though-it seems as though some players have more variance in their baseline forecasts than their comparables do. This is a relatively simple fix-the uncertainty in a forecast is largely a function of the amount of data you have on a player. (It’s also something of a function of a player’s skill set, among other things.) When we build a player’s baseline forecast, we can compare the uncertainty in the forecast to the uncertainty of the comps’ forecasts and figure out how much additional variance we need to add to the percentiles.
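One way that adjustment could look, sketched under a normality assumption; this is a guess at the mechanics, not PECOTA's actual code:

```python
# One way that adjustment could look, sketched under a normality assumption.
# This is my guess at the mechanics, not PECOTA's actual code.
from statistics import NormalDist

def widened_percentiles(mean, comp_spread_sd, baseline_sd, comp_baseline_sd):
    """Widen the percentile bands by whatever variance the comps can't account for."""
    extra_var = max(0.0, baseline_sd ** 2 - comp_baseline_sd ** 2)
    total_sd = (comp_spread_sd ** 2 + extra_var) ** 0.5
    dist = NormalDist(mu=mean, sigma=total_sd)
    return {p: round(dist.inv_cdf(p / 100), 3) for p in (10, 30, 50, 70, 90)}

# Hypothetical inputs: a .330 TAv forecast, a .025 spread among the comps, and a
# baseline that is noisier (.020) than the comps' baselines were (.012).
print(widened_percentiles(0.330, 0.025, 0.020, 0.012))
# Roughly .292/.330/.368 at the 10th/50th/90th -- wider than the comps' spread alone.
```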
We’ve also been treating the uncertainty of a forecast as symmetrical-apparently there’s more uncertainty on the downside than the upside. This is something we can build into our model as well.
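A split normal with a fatter downside is one simple way to build in that asymmetry; this is an illustration of the idea, not a description of how PECOTA will do it:

```python
# One simple way to let the spread be asymmetric: a split normal with a fatter
# downside. An illustration of the idea, not a description of what PECOTA will do.
from statistics import NormalDist

def split_normal_percentile(mean, sd_down, sd_up, p):
    """p in (0, 100); use the downside sigma below the median, the upside sigma above."""
    z = NormalDist().inv_cdf(p / 100)
    return mean + z * (sd_down if z < 0 else sd_up)

print({p: round(split_normal_percentile(0.330, 0.035, 0.025, p), 3) for p in (10, 50, 90)})
# {10: 0.285, 50: 0.33, 90: 0.362} -- the 10th sits farther from the median than the 90th.
```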
Now let’s take a look at our pitchers, minimum of 70 IP:
|         | DIFF20 | DIFF40 | DIFF60 | DIFF80 |
| Overall | 18.0%  | 29.0%  | 37.3%  | 50.5%  |
| Up      | 13.6%  | 19.4%  | 22.4%  | 29.4%  |
| Down    | 4.4%   | 9.6%   | 15.0%  | 21.1%  |
I should clarify “down” and “up” in this context-up is an ERA below the forecast, down is an ERA above the forecast.
What we see is something similar to the hitters, but much more pronounced. Let’s examine it from a slightly different angle, and look at FIP as a stand-in for ERA:
|         | DIFF20 | DIFF40 | DIFF60 | DIFF80 |
| Overall | 27.9%  | 42.7%  | 53.4%  | 65.4%  |
| Up      | 23.3%  | 29.9%  | 35.5%  | 38.8%  |
| Down    | 4.6%   | 12.7%  | 18.0%  | 26.6%  |
That’s a lot closer to what we saw with the hitters (and of course, everything I said about those applies equally here).
What it comes down to, I suppose, is how you define performance for a pitcher. There are three elements to preventing or allowing runs:
- The pitcher’s ability to affect the batter-pitcher matchup directly (walks, strikeouts, home runs),
- The ability of a pitcher and his defense to prevent hits on balls in play, and
- The sequence these events occur in
I've talked in the past about how those figure into a player's value. Suffice it to say that the range on the PECOTA percentiles is largely focused on the first element (which is where most of the variation in pitcher skill occurs, and thus the area most relevant to forecasting).
So, lemme ask-what do you find most useful about the percentiles? Would you rather they reflect the extent to which we know pitchers have skill in preventing runs? Or would you rather the percentiles reflect the rather considerable noise in measuring a pitcher's performance (really, the performance of a pitcher and his teammates at preventing runs)? Drop me a line in the comments and let me know.
Or you could talk to me about that-or anything else related to PECOTA, or baseball stats in general-in a few hours, when I chat live starting at 1 ET, as the finale of PECOTA week. And again-this is the beginning, not the end, of a long conversation about PECOTA. Thanks for being a part of it.
Additionally, as this exchange with us continues and develops, could you keep us aware of the schedule you are working with? I hope/anticipate your projections (and all PECOTA-related data) will be available much earlier next year.
Thanks again.
You might be thinking of, say, somewhere between 1.0 and 1.5 standard deviations (a scale of SD, not percentile), and in that case, you would be correct.
It almost looks like PECOTA might be picking up process variance ok (that is, the variability caused by the fact that not every player will hit their true expectation every year) but may not be adequately accounting for parameter variance (that is, the variance caused by the fact that your estimate of each player's expectation is likely wrong as well). Obviously, that's not anything I can say for sure, but I've seen percentile graphs like that before, and many times it was from not capturing parameter variance enough.
So, it's two failures: a failure on the true estimate, and a failure on the performance.
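In code, that decomposition is just an addition of variances; the numbers here are purely illustrative:

```python
# In code, that decomposition is just an addition of variances; the numbers here
# are purely illustrative.
process_var = 0.018 ** 2    # season-to-season luck around a known true talent
parameter_var = 0.015 ** 2  # uncertainty about what that true talent actually is

total_sd = (process_var + parameter_var) ** 0.5
print(round(total_sd, 4))   # ~0.0234, noticeably wider than the 0.018 process piece alone
# Percentiles built from process variance only would be too tight, which is one way
# to end up with fat 0-10 and 90-100 bins.
```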
I mean, I can see a rookie starter maybe having a higher variance than Mariano Rivera, but in most cases, the relievers should be much more variable around the mean.
If the projection system was perfect, i.e., had a connection to a wormhole in space that allowed perfect knowledge of what the player was going to do in the following year, every player would be at exactly his 50th percentile and the distribution would look like a delta function. If it overestimated EVERY hitter's performance such that they all performed at their 25th percentile, and underestimated pitcher's performances by the corresponding amount, you'd have two delta functions, one at 25 and the other wherever the pitchers would fall. (Incidentally, it is NOT required that that delta function would be at 75%. Things are more complicated than that.)
Put differently, in the actual distribution, most of the guys whose projections match their performance at roughly the 50% level would be in the top quintile (or quartile or whatever) of prediction accuracy, and by definition, exactly one fifth (for quintile; one fourth for quartile, and so on) of the projections would fall into that bin. The ones whose projections were way, way off -- in EITHER direction -- would be in the bottom quintile. But that isn't the binning that this histogram is showing. This one is only about *player* performance relative to prediction, not *prediction* performance. Clear?
Player 1: mean forecast=.330, 90th percentile=.370, 10th percentile=.290, actual performance=.375
This player would count in the 90-100 bin.
Are we agreed so far?
Player 2: mean forecast=.330, 90th percentile=.370, 10th percentile=.200, true performance=.260
Player 3: mean forecast=.330, 90th percentile=.400 (note that there is no requirement for PECOTA's 90th-percentile projections all to differ from the mean by the same delta-tAv, far from it), 10th percentile=.260, true performance=.265
Player 4: mean forecast=.330, 90th percentile=.400, 10th percentile=.260, true performance=.405
Player 5: mean forecast=.330, 90th percentile=.400, 10th percentile=.260, true performance=.350
Then players 1 and 4 would be binned in the 90-100 percentile bin in terms of how they did relative to PECOTA projections, i.e., they grossly overperformed what PECOTA expected; players 2 and 3 would be similarly in the 0-10 bin; and player 5 would be somewhere around his own 60th-percentile performance -- the exact percentile he achieved would be dependent upon more details of the PECOTA projections, but it would be somewhere above the 50th but well below the 90th.
HOWEVER: In terms of how well **PECOTA** performed, player 5 would be in (in fact, would *be*, exactly) the top quintile (because PECOTA nailed his performance compared to how it did on the others), player 1 would be in the next quintile (because PECOTA missed him by .045, which is worse than player 5 but better than all the others), player 3 would be in the middle quintile (missed him by .065), player 2 the fourth (missed by .070), and player 4 the bottom (missed by .075). THIS HAS NOTHING TO DO WITH HOW THE *PLAYERS* PERFORMED, which is the subject of the histogram that Colin displayed. It has to do with how *PECOTA* performed. Incidentally, in this particular example, it would show that PECOTA had a chronic tendency to underestimate how these guys hit.
Does THAT clear it up?
But this is not the subject of this article or that histogram. That histogram is about tracking the information in your paragraph here:
"Then players 1 and 4 would be binned in the 90-100 percentile bin in terms of how they did relative to PECOTA projections, i.e., they grossly overperformed what PECOTA expected; players 2 and 3 would be similarly in the 0-10 bin; and player 5 would be somewhere around his own 60th-percentile performance -- the exact percentile he achieved would be dependent upon more details of the PECOTA projections, but it would be somewhere above the 50th but well below the 90th."
The counts would be:
n, percentile
2, 90-100
1, 50-60
2, 0-10
That's what the histogram would show from your example.
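For the curious, here is a small sketch of the binning itself, assuming we have a player's full set of published decile cut points (the three-point summaries above leave the exact bin edges ambiguous, which is why the counts are stated rather than computed):

```python
# A small sketch of the binning itself, assuming we have a player's full set of
# published decile cut points. The example card is hypothetical.
import bisect

def decile_bin(decile_cuts, actual):
    """decile_cuts: the values at the 10th..90th percentiles, ascending (9 numbers).
    Returns a bin label such as '50-60' or '90-100'."""
    cleared = bisect.bisect_right(decile_cuts, actual)  # how many cut points actual clears
    return f"{10 * cleared}-{10 * (cleared + 1)}"

# Hypothetical card: a .330 mean with cut points every ten points of TAv.
cuts = [.290, .300, .310, .320, .330, .340, .350, .360, .370]
print(decile_bin(cuts, .375))  # '90-100', like player 1 above
print(decile_bin(cuts, .335))  # '50-60'
```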
So what does that tell us about the actual histogram, and in turn, what that histogram says about PECOTA? The answer is that the histogram shows PECOTA doing very well with most players, i.e., the top quintile of PECOTA performance is populated with the guys whose *player* performance was right around their 52nd percentile or so. The next quintile will be ones whose performances were somewhere between their own 45th and 50th, or 55th and 60th, percentiles -- don't take those numbers too literally, they're a SWAG, but probably about right. And so on, with, as it turns out, the lowest quintile occupied by the guys for whom PECOTA missed a collapse.
And that is what is interesting about the histogram. If PECOTA works right, the results *should* cluster around 50th-percentile predictions, and indeed, they do. The width of the nearly-Gaussian distribution centered on that "Schwerpunkt" -- German has a better description for this than the English "centroid" -- is a measure of how imperfect the PECOTA predictions are. If the predictions were perfect, the Gaussian would be arbitrarily narrow (a delta function). If they were purely random, the Gaussian would be infinitely wide. Laying aside the collapses, the message is that PECOTA works pretty well.
Why is that distribution Gaussian? Well, it isn't necessarily *really* Gaussian, but a Gaussian shape is what you expect if players' under- or overperformance is a matter of luck -- getting screwed or helped on BABIP, etc. It is also consistent with the hypothesis that the set {players for whom the information used to form the predictions is exactly correct} is larger than the set {players for whom the information is completely bogus}, with the obvious gradations of correctness in between those two extremes. In other words, these guys do their homework -- but if they did it even better, the distribution would be narrower, except for the "predictions" given to players who collapse.
No. For it to work right, the percentiles should remain the same, BUT the estimate of the percentile levels should be much narrower.
For example, PECOTA would give this:
Pujols:
10th .290
50th .330
90th .370
(Or whatever).
IDEALLY, the best forecasting system would give something like this:
10th .310
50th .330
90th .350
That is, the estimate of each level is as tight to the 50th as possible.
However, the histogram *must* show 10% of players (of whatever the population it's based on) in each 10 percentile grouping.
***
You seem to be saying that we should keep it like this:
10th .290
50th .330
90th .370
And then be happy that 95% of the data falls between the 10th and 90th points.
Well, from that standpoint, why not set the percentile ranges so wide to ensure that 95% of the data falls between the 45th and 55th points?
***
I think you are conflating the issue of accuracy with the issue of bias. The histograms here speak only to the issue of bias. They say nothing about the accuracy of the mean forecasts; they only say something about the "accuracy" of setting appropriate ranges.
BillJohnson, you do recognize that PECOTA itself comes with a list of percentiles, and that is what we are evaluating, right? I want to make sure there's no confusion on that. So PECOTA is telling us that it thinks something will happen 10% of the time, so we want to check to see if that does happen 10% of the time.
Now, if we were just using the mean PECOTA forecast, then yes that could very well be Gaussian and would be tighter if PECOTA were better. But that is not what we are talking about here. We are not evaluating how well PECOTA projects mean performance, we are evaluating the very percentiles (really deciles) that PECOTA is giving us.
For the projection system to "work right" is one thing; for it to be "useful" is another, quite different thing, and harder to achieve. What you're saying is that an optimally "useful" PECOTA would have the gap between any player's 10th and 90th percentile projections be as small as possible, AND players' performances against the projection system would continue to be clustered right around the 50th-percentile level or a bit higher (for reasons Colin describes). This is entirely fair and to the point.
If PECOTA had perfect knowledge of the forward-looking capabilities of every player -- not what they *will* do in the coming year, but what their skills will *allow* them to do -- then the gaps between 10th and 90th percentile predictions would be much narrower than they are. They still would be non-zero, because there is a random component to how players perform (e.g. fluctuations in BABIP) as well as sensitivity to things the players can't control (e.g. strike zones). For exactly the same reason, there would continue to be players who, even laying aside injury-driven collapse, fail to meet their 10th-percentile, or manage to exceed their 90th-percentile, projections.
What you are asking for, quite reasonably as a paying customer, is a PECOTA where both of these conditions are met: the prediction bands are narrower than they currently are (i.e., the algorithm is well informed), and the actual performances fall in as narrow a distribution around the 50% predictions as luck will allow (i.e., the algorithm "works right"). That's the holy grail of these prediction algorithms -- so perhaps the discussion should turn to how to get there.
This is plotting percentile results, and by the definition of percentiles, there should be 10% of the players in the 0-10 percentile range, 10% of players in the 10-20 percentile range, etc.
(Note to Colin: I definitely think you should update your chart to reflect the 0-10, and 90-100 numbers. I think this makes it far more clear, considering that the area above the 10% line has to equal the area below the 10% line.)
That's for hitters. For pitchers, it's even worse. 50% of the pitchers (min 70 IP) had an ERA outside the 10-90 percentile ranges, whereas we would have expected just 20% total. It's an alarming total. When Felix Hernandez's 90th percentile is 3.20, and he, for two years in a row, achieves an ERA below 2.50, then you know something is dreadfully wrong.
Now, Colin makes a good point that ERA includes sequencing, something we've talked about a lot here in the past few weeks. The equivalent to a hitter's TAv would be a pitcher's peripheral ERA (component ERA, BaseRuns ERA, or what have you). If we do that, we get something for pitchers similar to what we saw for hitters. Therefore, if the test is not going to be against ERA, but peripheral ERA, then the PECOTA percentile page should show the header as peripheral ERA.
Nonetheless, a huge issue.
Thanks Colin. You should be proud for doing the right thing.
I have to disagree with your assertion that in a perfect world you'd have 10% of all players in each 10% group, because we have selected a group of players who reached 300 PA. As Colin already said:
Fundamentally, players who perform above their expectations are more likely to get playing time than players who perform below their expectations.
So there. Also, the ERA distribution was explained by the fact that PECOTA was not accounting for uncertainty in defense/etc. and was just looking at true ability. I personally would prefer the percentiles to account for luck/defense/etc., but you haven't provided proof that the percentiles are not doing what they are supposed to be doing.
That is, what subpopulation of the 1000 batters in 2010 are we looking at?
And once you look at that subpopulation, do we also see 10% of them going below their 10th percentile forecasts?
I would bet that there is NO subpopulation that you can select where the percentiles come anywhere close to 8%-12% in each 10% bucket.
Variance is a PITA to estimate, and PECOTA and BP wouldn't be the first (or last) to have underestimated just how much noise there really is in the data.
But, you are ignoring the fact that players who are proven "starters" will be given much more leeway, even when they are failing to put up replacement-level offense (see: Aaron Hill and most of the Cardinals in 2010). So, we should expect a large cluster near the bottom while these players continue to get playing time.
10, 10, 10, 10, 10, 10, 10, 10, 10, 10
It would be:
4, 5, 6, 7, 8, 10, 12, 14, 16, 18
Or something.
And no way do we see anything like that.
But, the point still stands, if PECOTA is saying: "I expect this player to exceed his 90th percentile 10% of the time", then how are we to evaluate that?
Is it that we are to look at all 1000 batters in MLB, and have no PA minimum? Is it that the claim will only exist if the player is allowed to have 300 PA?
PECOTA is the one making the claim. Therefore, let's see what the conditions are in which we expect the 90th percentile to be exceeded, and let's test it based on that basis.
***
In any case, the #1 problem with the percentiles is that the uncertainty range has to be based almost entirely on the sample size of the player's past performance. And this is not at all what PECOTA has been doing.
Colin himself acknowledges it exactly:
"This is a relatively simple fix—the uncertainty in a forecast is largely a function of the amount of data you have on a player."
Love the attitude. Real comment to follow...
Look...90% of us are casually interested fans of sabermetrics and advanced statistical analysis. We don't have a horse in the race of which-system-is-better, nor do we take sides in the which-site-has-better-analysts debate.
That said, the nitpickiness, contradiction, and barely-veiled undermining from competitors and detractors on these comment boards are really annoying.
Someone said it well yesterday - aren't you all batting for the same team here? Aren't we all trying to raise usage of advanced baseball analysis? This infighting is just dumb. Stop driving the casual fans away with your annoying bickering about who is better than who.
If there are flaws in existing techniques, it is a good thing for the entire baseball analysis community for those flaws to be corrected. We all want to see the best possible analysis whether it comes from BP or somebody else. I think the main problems arise when a lack of transparency (perhaps coupled with some marketing hype) gives something the illusion of being the best it could be when it in fact is not. I applaud Colin for doing the dirty work to make the process much more transparent and therefore much more open to improvement.
Talking about the competition, and talking about the flaws in PECOTA is good for everyone. Colin has addressed a number of critical issues. Tango's clearly interested in furthering the field; on his own blog he often touts others' research (and sometimes critiques others' research.) It's not just a "My game is better," thing.
I have little doubt that the majority of readers aren't supremely interested in the fine details - but the ones who are are worth something to the system.
Tango's point in the comments is very well taken; I have always been curious about the failure to have bigger ranges for players with limited histories (though the issue isn't just with players with limited histories, obviously.)
For the most part, there's nothing offensive in this. But there's enough finger pointing, accusing, and comparisons going on to make it really annoying to a casual reader.
By no means stop suggesting improvements. Just be aware of how you might come across if you choose your words poorly, that's all.
The "minus"es to my original comment were expected. There's a feature I wouldn't mind seeing go away, since it amounts to an opinion popularity contest.
I think you should be more explicit by pointing to actual examples. I will grant you that as a casual reader who might be giving cursory views to comments, it may seem combative. But, once you go deep into it, we're all a happy sabre family.
Well, you should, because how do we get resolution to problems unless we see the problem. And maybe it's not a problem, but a misinterpretation. As it stands, you pointing "something" out means nothing at all, since we (I) have no idea what you are talking about.
Why are so many people falling off the charts? I guess what I'd like to get from the projections is a pretty good idea of a likely upcoming season from a player, and then the probability of a windfall or a disaster. Not sure if the percentiles are the best way to do that given that this chart seems to be showing so many actual performances in the outlier columns. This may also contribute to inaccuracies in the star/scrub graphic for each player, I don't know, but if so it is not a good thing.
That being said, from a 40,000 ft view, it is EXTREMELY difficult to forecast baseball statistics for a single season on a large scale (obviously!). The moral of my story is, while we definitely strive for a successful forecasting model while looking at the next immediate season, I believe PECOTA does a great job of identifying performances over 900-1200 at bats and 300-400 innings.
This may be the funniest thing I read all morning. Thank you.
Those in the first hump (players who vastly underperformed their projections) are those who missed significant playing time due to unanticipated injury or got sent to the minors.
The second hump are those PECOTA nailed really well.
The third hump are those who got unexpected playing time (non-top-prospects who got a lot of ABs or IP- like Thole or Leake; those who overcame past injuries to play a whole season).
Average these together, and PECOTA comes up roses.
Colin, others, what say you?
I highly doubt it. But even if that's the case, then what in the world does "10th percentile" forecast mean? If you want to say that 50 of the 250 (or whatever) players with at least 300 PA had a very down year, then why are you setting the benchmark so high that 20% of the players reached a level that you said only 10% of the players should reach?
That is, vastly underperforming, or sent to the minors, while still reaching 300 PA is not a phenomenon limited to the year 2010.
You are *starting* with the position that only 10% will reach some baseline (hence the 10th percentile). Then you have to ask: "how much below my mean will that be?" And if a group of such players that you thought should have had a .270 TAv in fact reached only .230, then that's where the 10th percentile forecast should have been set, and not the .240 or .250 level that IS being used, such that 20% (instead of 10%) get below that level.
(All numbers for illustration purposes only.)
It goes back to exactly what I am saying: once you decide on the parameters of your subpopulation, then it's at that point that you test for the 10th and 90th percentile.
As it is, we have no way to test, because we are not being told what subpopulation to test against.
With that, you should see players grouped around the 50th percentile, with gentle downward slopes in each direction. I would also expect to see an upward slope, or even a spike, as you near the 0-20% ranges as those players will begin to receive less playing time as a whole (and will not have time to have their numbers even out - getting the short end of the small numbers stick), and maybe, though I'm not certain here yet, a slight uptick at the 90th percentile for those players that exceed their projections (because no one saw them coming) or that the system has a hard time nailing down (Jose Bautista & Carlos Gonzalez from the first group and Ichiro, etc. from the second).
It's not a measure of player percentiles or normalization from the league projections, but of how an individual player is likely to perform. In a perfect projection world every player would be right at the 50th percentile and every other percentile would be empty.
Ok, so if every single player has a 10% chance of being worse than his 10th percentile, why shouldn't we expect 10% of all players to hit that mark?
Right -- but as a POPULATION, wouldn't you expect 10% of players to have exceeded their 90th percentile projection?
Taking your explanation: if each individual player has a 10% chance of being worse than his 10th percentile forecast, and a 90% chance of being better, then when you take the whole set of players, shouldn't 10% of players have done worse and 90% better?
If you are laying out percentile bands, you would expect a flat histogram. That is how percentiles work!
Now, there are some confounding factors in terms of sample size. Even 300 AB is a low enough cut-off that there will be some noise. Plus the selection bias of "bad players lose playing time." But if every projected player were allowed to play a theoretical season of exactly 1,000 PA, the numbers should converge to exactly 10% falling within each 10-percentile band, right?
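That flat-histogram expectation is just the probability integral transform; here is a quick simulated check, using no real data:

```python
# A quick simulated check of the flat-histogram claim; no real data involved.
import random

random.seed(0)
counts = [0] * 10
for _ in range(100_000):
    p = random.random()                 # a calibrated forecast's realized percentile
    counts[min(int(p * 10), 9)] += 1    # is uniform, by definition
print([round(c / 100_000, 3) for c in counts])  # ~0.10 in every 10-percentile band
```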
Ok, say a player has an AWFUL start. Further assume he's young (or old, but not prime). He may well find his (MLB) season over. Down to the minors or riding pine. Had he kept playing, he might have improved enough that his overall stats ended up being closer to his projection, but the team couldn't/wouldn't wait for him to make the adjustment (or his luck to change). So he ends up with 300 ABs instead of, say, 600.
Or let's consider the player who, due to injury, plays a partial season - and plays spectacularly well. I'm thinking of Robbie Cano's 2006 here. He missed a month or so with a hammy injury. He hit .342/.365/.525. Had he gotten more plate appearances, I would have expected him to come down to earth a bit. But he didn't get them - he was on the DL.
Does any of that help explain how a system (any system, whether it's PECOTA, CHONE, etc) might miss?
There is no way you'd get this by looking at the PECOTA cards. If the percentiles don't mean anything for the broken-out components, don't make it look like they do. Among other things, it just makes the system look bad, when, just for one example, a guy like Mauer, who's essentially hitting his 50th percentile TAv projection right on the nose, has a number of home runs this season that the system _appears_ to say is nigh impossible.
Just my 2 cents. Enjoying the series.
It has long been my suspicion that the percentiles did not account for enough variability in performance (Felix Hernandez being just one prominent example), and this data proves that it is, indeed, a problem. I am in full agreement with TT that the fact that 36.5% of hitters are falling outside the 10/90 percentile bars is an alarming result.
"what do you make of the large clustering of outcome in the middle decile? "
I responded:
I just took one guy to see what the shape looks like. This is ARod
90th: 0.323
80th: 0.315
70th: 0.309
60th: 0.298
50th: 0.288
40th: 0.282
30th: 0.280
20th: 0.277
10th: 0.273
Look at the gap between 50th and 70th: 21 points. That's way wider than anywhere else.
So, the reason that PECOTA is capturing so many players in the 50-70 range is because it provides such wide latitude at the 50-70 range.
It won't catch much in the 30-40 range, because, well, look, there's almost no gap there.
I don't know if ARod is an example or an exception.
But, given that I've seen funny stuff, like Felix having a WORSE forecast at the 90th level than the 80th level, I think there is a serious programming bug as well.
The reversal of EqERA between King Felix's 80th and 90th percentile predictions is way down in the noise and not necessarily significant all by itself; the reason why one is the 80th and the other the 90th has to do with the number of innings pitched, which differs considerably between the two, so that VORP/WARP/etc. also differs by more (and in the right direction...) than EqERA implies. It is noteworthy, however, that his real-life performance far exceeds BOTH the 90th-percentile EqERA and the 90th-percentile number of innings pitched. Note also that fellow studs Doc Halladay, Adam Wainwright, Josh Johnson, etc., also exceeded their 90th-percentile EqERA by significant margins (although not always the 90th-percentile IP). So yeah, it sure looks to me like you've found a real bug here when it comes to exceptional performances.
"How much wider should the 70-80 band be, compared to the 60-70 band?" is an interesting and difficult analytical question. Whether or not it should be wider is not.
Incidentally, I've often wondered whether, in modeling pitchers' performance, it would make more sense to use a Poisson distribution rather than a Gaussian.
It's like saying this:
50th: .300
40th: .290
30th: .280
20th: .270
10th: .260
0th: .255
It's not going to happen. This is what it would look like:
50th: .300
40th: .290
30th: .278
20th: .262
10th: .242
0th: .212 (or .000 technically)
All numbers for illustration purposes only.
Incidentally, if you're talking about using a Poisson to model the number of runs allowed (or scored), it's better than a Gaussian but still not right. 0 occurs too frequently (IIRC) relative to a Poisson distribution.
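To put a rough number on that, under a Poisson model with a league-average mean of about 4.5 runs per team per game (an assumed, approximate figure), scoreless games should be quite rare:

```python
# Poisson check on scoreless games, using an assumed league-average scoring rate.
import math

mean_runs = 4.5
p_zero = math.exp(-mean_runs)   # Poisson: P(X = 0) = e^(-lambda)
print(round(p_zero, 4))         # ~0.011, i.e. roughly 1% of team-games
# The commenter's recollection is that shutouts happen noticeably more often than
# that, which is the sense in which runs scored are over-dispersed at zero.
```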
the spread of percentiles is too narrow, meaning any big miss is going to end up outside those 10%/90% goalposts...
Percentile   ERA    EqERA
90th         3.23   3.31
80th         3.22   3.30
70th         3.27   3.35
60th         3.30   3.39
50th         3.54   3.63
40th         3.57   3.66
30th         3.68   3.77
20th         3.75   3.85
10th         3.87   3.97
Three points:
1. PECOTA is already giving us "EqERA", which is the peripheral or component or luck-free ERA we've been talking about.
2. In addition to that PECOTA is giving regular ERA (which should be much wider because it includes more luck from sequencing events, etc).
3. Look at Felix's forecast at the 80th and 90th levels. Obviously wrong. Look how wide it is at the 50-60 level, and then, how tight it is everywhere else. You are naturally going to capture more players in the 50-60 level if you are putting in estimates that are much wider at those levels.
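A sanity check along the lines of point 3 is easy to script; the EqERA values below are the ones quoted in the table above:

```python
# Flag percentile levels where the forecast gets worse as the percentile gets better.
def non_monotonic_levels(card):
    """card: {percentile: EqERA}. Returns percentiles where EqERA rises versus the level below."""
    pcts = sorted(card)
    return [hi for lo, hi in zip(pcts, pcts[1:]) if card[hi] > card[lo]]

felix_eqera = {10: 3.97, 20: 3.85, 30: 3.77, 40: 3.66, 50: 3.63,
               60: 3.39, 70: 3.35, 80: 3.30, 90: 3.31}
print(non_monotonic_levels(felix_eqera))  # [90] -- the 80th/90th reversal
```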
What is your overall opinion of the percentile forecasts? Are they
1) Useful and potentially possible to calculate accurately, once all the bugs are worked out?
or 2) Useful, but probably impossible to calculate
or 3) Not worth doing
Also, this is probably the best series of discussions in a long time. I was on the fence about subscribing next year, but articles like this keep me coming back.
Colin is accepting the position I've held, and MGL reiterated, and, really, what any stats professor would tell you, and that is that the uncertainty of your estimate is based on the size of your observed sample. What has been frustrating for me is that this is so obvious and commonly accepted that I was getting push back on it (not from Colin). Now, Colin is going to be novel about it, and add more to the uncertainty by looking at the kind of player you have (maybe there's more uncertainty in the mean of old players, or fast players, or whatever). That's good, but more important is to get the basics down, which is what he is going to do.
Now, is it necessary to publish the 10th and 30th and 80th percentiles? Why not just say:
Pujols .330 +/-.030 (where that's one standard deviation)
Why does this help? Because you can then do this for Pujols' PA:
Pujols 610 +/- 70
The way the percentiles are currently laid out, it tries to give you both, but it doesn't really. As Colin noted, it "infers" all the component stats based on the TAv stat.
Why not do:
Pujols
(K/PA): .07 +/- .02
(BB/PA): .18 +/- .03
And so on.
Wouldn't that convey far more information, while using up the same amount of real estate?
(Note: not all things are symmetrical. You can get away with that on the rate stats, but not on playing time. On that one, and that one alone, I would LIKE to see the percentile forecasts.)
You can also follow the thread at my site, where MGL made a good point.
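For what it's worth, under a normal assumption the mean-plus-SD form and the decile list carry the same information; here is a hedged sketch of the conversion, using Tango's illustrative .330 +/- .030 numbers:

```python
# Converting a mean and one SD back into decile cut points, under a normal assumption.
from statistics import NormalDist

def deciles_from_mean_sd(mean, sd):
    dist = NormalDist(mu=mean, sigma=sd)
    return {p: round(dist.inv_cdf(p / 100), 3) for p in range(10, 100, 10)}

print(deciles_from_mean_sd(0.330, 0.030))
# 10th ~ .292, 50th = .330, 90th ~ .368. As the note above says, the symmetry
# assumption is the catch for playing time, where an explicit percentile list helps.
```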
Your suggestion of:
Why not do:
Pujols
(K/PA): .07 +/- .02
(BB/PA): .18 +/- .03
is totally valid, but not very digestible as an end product for mass consumption. We need a final stat line for the year!
If you are talking about rookies and guys with limited playing time, sure.... but bench players you don't care about, and all the rookies will have such wide ranges as to be useless as well.
Same thing for relievers... they'll all have similar ranges.
So, I see no practical use for a Fantasy player for the ranges.
What you DO want to have the ranges for is playing time. That's where the value is.
I strongly disagree here too. Wouldn't the range for guys like Adam Dunn or Ichiro be significantly smaller than the range for a guy like Aubrey Huff? As a fantasy player, you want to make decisions based on risk-reward, and there are times when you might prefer mediocre but reliable performance to going for all or nothing.
I'm not a fantasy guy, so no comment on the practicality question, but for one who just loves the game and strives to understand it, the ranges are nice to see -- if they work.
and
"If you want to say that "most" starting regulars should have fairly similar ranges, I'd agree. But not "all" or anything close to it."
In reality, you are right. Insofar as what the data can possibly tell us, our ESTIMATES will have their ranges virtually all similar (beyond whatever their past number of PA would indicate). Only cases like Ben Sheets or other players with injuries will be exceptions.
Otherwise, I would be shocked that the 90th percentile of every player is not something like mean TAv + 1.15 to 1.20 and the 10th percentile is not TAv -1.25 to -1.30. Something along those lines.
If someone is arguing that you are going to have some players at TAv +1.10 and others at TAv +1.40, then I don't think your expectations are going to be reasonable.
(Again, presuming we are looking at similar past PA for the players in question, and injuries notwithstanding.)
You might get a skew based on age, but again, that would apply across the board to everyone at that age.
Anyway, let's see what Colin will discover with the refreshed PECOTA, let him make his claim, and then just test it.
1) How reliable is the baseline? A player with a long, consistent MLB performance history has a more reliable baseline than Matt Wieters in April 2009.
2) How reliable is the similar player pool? A player who has a lot of similar past players to compare would have a more reliable number than Ichiro.
3) How much variance is there in the projections? Based on the similar players available, how much variety in performance has there been? Presumably certain types of players have a smaller range of potential performances than others.
I don't know if it's possible to assign numbers in this fashion, but if so, then it would also be easier to test the accuracy of the system by comparing apples to apples. This may also reveal particular weaknesses/biases of the system.
I would think that publishing the numbers with a ± interval is systemically wrong. You say that not all things are symmetrical... I would imagine that very few really are. The assumption of normality that is also implied when you start using "standard deviation" is also a big leap. Publishing percentile data shows at least a rough outline of what each player's probability distribution looks like, which then allows factors like "breakout" or "collapse" to be included in the model based on the comps.
The number one thing I want from a projection system is some sort of assessment of the player's potential or ability. How good is this guy now, and how good can he be in the future?
Once you have that information, then you can do things like run thousands of simulations of the next season and present a range of outcomes that we may see from that player.
In essence, the first part is rate stats, and the second part is to apply those rate stats to produce actual projections of things the typical fantasy player cares about: counting stats like homers, steals, strikeouts, etc.
My feeling is that PECOTA tries to sort of do both of these things at once which I think is a mistake. Perhaps it's not actually doing both things at once in terms of modeling, but at a minimum the presentation feels that way, and I think it muddies the water and is very confusing.
I would much prefer a very clear and distinct assessment of player ability or potential(without regard to projected playing time). That is information that I can't easily provide or create myself(at least in the comprehensive way that PECOTA can). What I can do on my own is get a handle on playing time from various internet sources. The competitive advantage that we as PECOTA users could enjoy I believe comes from applying our personal knowledge/hunches about personnel situations combined with PECOTA's knowledge of what a player might do if given a chance. But if you only present the data through the prism of how BP projects playing time, it makes it harder for me to get the information I really want from PECOTA.
If the sample Colin chose is representative (300 PA minimum), then the percentiles are busted. Absotively busted. Or, at the very least, they're not going by the right name.
In this sample, PECOTA actually does a pretty good job at capturing the center - 23.9% of the sample fell within the 40-60th percentiles. Not too shabby at all. But, as Colin points out, and Tom and others expound upon, the percentiles above and below this midpoint are under-predicted, until you get to the way-way outliers.
It's pretty straightforward to see that the overall spread is too tight, and would need to be widened to re-capture the true 0-10% and 90-100% ranges and distribute them into the troughs.
As several commenters have done, it's easy and kind of fun to rationalize why a player might perform under his 10th percentile, or over his 90th. But, at the end of the day, only 10% of the overall population should fall into those categories if they are really percentiles at all (and if this sample is representative).
As I said, this is amazing - to see this information analyzed and published for "review." I almost say "peer review," but that would imply something I'm not willing to accept.
Kudos to Colin, Kevin, and the rest of the team. I look forward to the offseason developments with excitement.
Again, kudos. Have a great weekend watching the regular season come to a close! Best, Burr
In the context of this observation from Colin: "... apparently there’s more uncertainty on the downside than the upside. This is something we can build into our model as well;" I wonder if our initial observations were incorrect.
It might be worth revisiting.