
Any forecasting system is only as good as the inputs that go into it. A forecast can certainly end up far worse than its data, but the quality and amount of data you have is a fundamental constraint.

So if you want to beat a forecaster, one fundamental question you can ask is, “What does the forecaster know and what doesn’t he know?” You’re far more likely to beat a competent forecaster on the second point than the first point.

One thing PECOTA hasn’t traditionally known is who was playing hurt and who wasn’t. Injuries can mean a number of things—sometimes you think a player who was nursing an injury is due for a bounceback season. Other times you think they’re likely to do worse than you’d otherwise expect, due to lingering injury effects.

For this to be useful to PECOTA, there needs to be a way to systematically capture this sort of information, study it objectively, and quantify the effect.

So what we’ve done is taken a publicly accessible injury database, created by Josh Hermsmeyer of RotoBase, and worked on proofing it and improving it for incorporation into PECOTA. (Once we’ve finished updating the database, we will be releasing it at some point during the offseason, for other researchers to use.) This tells us when a player goes on the disabled list, how long he’s there, and what he’s on there for.

As an example, let’s consider hitters who went to the disabled list with an injury to the lower arm (hand, wrist, or forearm). It’s widely accepted in baseball that wrist injuries have a lingering impact on a hitter’s ability to hit for power. This gives us 77 hitters to study, with 32,763 total plate appearances the following season.

Using the same method we used to look at Ichiro Suzuki yesterday, we can come up with an expected batting line for these hitters. As a group, weighted by playing time, they were expected to hit .266/.333/.427 the following year. Instead, they hit .270/.344/.439.  So we can see that these hitters as a group tended to exceed their baseline forecasts.
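As a rough illustration of what "weighted by playing time" means here, the sketch below averages projected and actual slugging across a group, weighting each hitter by plate appearances. The three hitters and their numbers are invented for the example; they are not rows from the injury database, and this is not PECOTA's actual code.

```python
def weighted_mean(values, weights):
    """Average of values, with each value counted in proportion to its weight."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

# HYPOTHETICAL hitters: (plate appearances, projected SLG, actual SLG)
hitters = [
    (600, 0.450, 0.470),
    (300, 0.400, 0.390),
    (100, 0.380, 0.420),
]

pa = [h[0] for h in hitters]
projected = weighted_mean([h[1] for h in hitters], pa)
actual = weighted_mean([h[2] for h in hitters], pa)

# A positive difference means the group beat its baseline forecast.
difference = actual - projected
```

Weighting by plate appearances keeps a part-time player with an extreme line from moving the group average as much as an everyday player does.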

Digging down to the component lines that form the “guts” of PECOTA, what we see is a significant effect on home runs on contact: projected to have a .039 HR/CON rate, they instead had a .047 rate. We also see an increase in unintentional walk rate (per plate appearance, excluding intentional walks and hit by pitch), from .083 to .086. Statistically speaking, that is less likely to be significant than the finding on home runs on contact. But given the significance of the home runs on contact, I’m inclined to think the walk rate finding is a result with practical significance. (My feeling is that the causal relationship is that pitchers are more likely to challenge hitters whose power has been sapped by wrist injuries.)

On one hand, this isn’t a particularly interesting finding; it pretty much confirms our expectations. What is interesting is that now we have a way to quantify what our intuition tells us about player injuries, and incorporate it into the forecasts in a systematic way.

What we can do from there is take the component batting lines, as well as the projections, and come up with the difference. We then regress those differences to come up with a set of adjustments to the baseline projections.
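One way to picture that step, using the group rates from the wrist-injury example: take the difference between observed and projected component rates, keep only part of each difference, and add the result to a player's baseline. The fractions in `KEEP` and the baseline hitter are made-up placeholders, and the function names are hypothetical; the article does not say how the regression is actually carried out.

```python
# Group-level component rates from the wrist-injury example above.
projected = {"hr_per_contact": 0.039, "ubb_rate": 0.083}
observed = {"hr_per_contact": 0.047, "ubb_rate": 0.086}

# ASSUMPTION: fraction of each raw difference kept after regression.
# These values are placeholders, not PECOTA's.
KEEP = {"hr_per_contact": 0.5, "ubb_rate": 0.25}

def adjustments(projected, observed, keep):
    """Regressed difference (observed minus projected) for each component."""
    return {k: (observed[k] - projected[k]) * keep[k] for k in projected}

def apply_adjustments(baseline, adj):
    """Add each component's adjustment to a player's baseline projection."""
    return {k: baseline[k] + adj.get(k, 0.0) for k in baseline}

adj = adjustments(projected, observed, KEEP)

# HYPOTHETICAL hitter coming off a lower-arm injury.
baseline = {"hr_per_contact": 0.030, "ubb_rate": 0.090}
adjusted = apply_adjustments(baseline, adj)
```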

We can also use this record of how much time a player has missed to injury to figure out which players are most likely to miss playing time with an injury down the road. Let’s face it, some injuries are a product of circumstance, but there are some players who are more likely to get injured than others. And now we have the data to see who those players are.

This is also something we can deploy in-season; when a player goes on the disabled list, we can search for players in the database with similar injuries. We can then use that information to estimate how long he'll be missing and update his rest-of-season forecast accordingly, using data of how players with similar injuries have been affected upon their return.
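A minimal sketch of that in-season lookup, assuming the database can be reduced to a list of (body part, days on the disabled list) records. The records below are invented, and matching on body part alone is a simplification; a real search would presumably use more detail about the injury.

```python
from statistics import median

# HYPOTHETICAL disabled-list stints: (body part, days on the DL)
stints = [
    ("wrist", 30), ("wrist", 45), ("wrist", 21),
    ("hamstring", 15), ("hamstring", 28),
    ("elbow", 60),
]

def expected_days_missed(stints, body_part):
    """Median DL stay among past stints for the same body part, or None if there are none."""
    days = [d for part, d in stints if part == body_part]
    if not days:
        return None
    return median(days)

estimate = expected_days_missed(stints, "wrist")
```

The median is used here so that one unusually long stint does not dominate the estimate; that choice is an assumption of this sketch.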

PECOTA week continues Friday with our final article, and then at 1 p.m. Eastern I’ll be fielding your questions in a live chat. That’s not the end of the discussion, though—we’ll be talking about PECOTA more leading into the offseason and all the way through to the start of next season.

dawblack
9/30
Taking this one step further, it will be interesting to see the variations and expectations after different injuries to pitchers, such as the statistical difference between having Tommy John surgery and the repair of a torn labrum.
TangoTiger1
9/30
Fantastic!
crperry13
9/30
I remember Will saying once that the reason he didn't offer an injury database was because there were player privacy issues that prevented it. Has this changed, or have I just gotten it all wrong? Because it would be a great addition.
dianagramr
9/30
Yes ... my convos with Will around this issue always came down to HIPAA

http://www.hhs.gov/ocr/privacy/hipaa/understanding/index.html

So, how will this new database protect privacy, while allowing for research?
cwyers
9/30
I don't know what Will's data source was or what restrictions it came with, so I can't speak to those issues.

What we're doing doesn't involve disseminating information, but collating it. Most of the data comes from MLB's transaction reports, which are made available to the public. We're filling in some gaps largely from contemporary newspaper reports. But we don't have any confidential information to protect, by definition. This is all stuff that has been published in multiple sources by the time it gets to us.
mikefast
9/30
IIRC, Will had access to some actuarial/insurance books from MLB, though I don't remember if he specified the exact source.
dtonisson
9/30
AFAIK, BP wouldn't be a covered entity under the HIPAA rules, so the rules would not limit BP's use. I suspect that Will's difficulties arose from trying to obtain information from people who were covered entities, not as much from publishing the information.
brownsugar
9/30
Sample size question: how many of a particular type of injury do you require before deciding that an observed difference in the statistics is significant?

Phrased another way, how does the system decide whether a third baseman that takes a grounder off the family jewels is or is not due for a breakout the following year?
cwyers
9/30
I don't like making a binary distinction between significant and not significant when it comes to sample size. Larger sample sizes are obviously more useful than smaller sample sizes, of course.

So you have to look at two things - the size of the sample and the magnitude of the effect. A large magnitude in a small sample can tell you something important, you just can't treat it as being as important as an effect of that magnitude over a larger sample.

So you make the applied effect a function of the magnitude of the observed effect and your number of observations.
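One common way to write such a function is a shrinkage weight of n / (n + k), which moves toward 1 as observations accumulate. The sketch below uses that form purely as an illustration; the formula and the `ballast` constant are assumptions, not something stated in this thread.

```python
def applied_effect(observed_effect, n_obs, ballast=50):
    """Scale an observed effect by how much evidence backs it.

    ASSUMPTION: weight = n / (n + ballast). The ballast value is a
    placeholder chosen for illustration.
    """
    weight = n_obs / (n_obs + ballast)
    return observed_effect * weight

# The same observed effect counts for more when it comes from a larger sample.
small_sample = applied_effect(0.010, n_obs=50)
large_sample = applied_effect(0.010, n_obs=450)
```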
jessehoffins
9/30
I take it you're going to employ a dummy variable with information about specific injuries and apply them to projection totals. That seems like a great approach, but would it be conceivable to not just capture injury as a dummy with a constant but rather integrate the injury data into all player data and let injury comparables play out on an individual basis? I don't know if what you're doing so closely resembles OLS, but if you are including a dummy variable for injuries, one problem is that the effect of the injuries on past players that is captured in the dummy is already (but in a hidden way unless you're marking it) included in the data. That is, guys like Jose Reyes who are super speedy have comps that had hammy issues that affected their performance. The system knows that they declined. It doesn't know why, but this info is already a factor in projecting Jose Reyes. It always was a factor, and the system is already capable of noticing his reduced playing time which looks a lot like old comparables. But if you gave it the full set of data on injuries, it would be able to (if there was enough data) make inferences about how different sized and skilled players were responding to their different injuries. Even your non-significant information about walk rates should get included under that scheme, and it won't introduce bias.

Actually, the thing about walk rates has another good example. Let's say you decided to include that dif-in-dif estimate as a dummy on recovering wrist injuries. Like you said, there was a change coming out of their injury that seems borderline significant. However, there are probably also relationships between their pre-injury walk rates and their injured year. Using a single dummy won't necessarily capture all that. What could be most interesting would be if the system began to detect changes that suggested impending injury.

Anyway, this seems really cool. Can't wait to see the data.

jessehoffins
9/30
and apologies if setting up the system to include injury information like i proposed is just ludicrously hard. you guys do great work.
cwyers
9/30
For me, Ordinary Least Squares and similar techniques are tools of last resort. This isn't to say that last resorts don't come around from time to time or that it's not useful when they do, but I try to avoid them when there are clearer alternatives.

What we're doing is very much of-a-piece with the rest of PECOTA - we're looking for players with similar injuries and using that to adjust the way we think about his future performance. We can't integrate this with the rest of the comps process, for two reasons:

1) We only have injury data going back about 8 years. That's not enough to do the historical comps for the full career path adjustment.

2) Comparing injuries means largely comparisons of qualitative data, not quantitative data. It's very easy to tell PECOTA that a guy with a speed score of 9 is more similar to a guy with a speed score of 8 than a guy with a speed score of 3 - it's all Pythagorean distance. It is robustly harder to figure out the Pythagorean distance between an ankle and a thigh.
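To make the contrast in point 2 concrete: with numeric traits, the distance falls straight out of the numbers, while for body parts somebody has to decide by hand how alike two injuries are. The speed scores echo the example above; the similarity table and its values are entirely hypothetical.

```python
import math

def distance(a, b):
    """Euclidean ("Pythagorean") distance between two numeric trait vectors."""
    return math.dist(a, b)

# One-dimensional example using speed scores.
d_near = distance((9.0,), (8.0,))  # speed score 9 vs. 8
d_far = distance((9.0,), (3.0,))   # speed score 9 vs. 3

# HYPOTHETICAL hand-assigned similarities between body parts. Nothing in
# the data says what these values should be, which is the difficulty.
INJURY_SIMILARITY = {
    frozenset(["ankle", "thigh"]): 0.5,
    frozenset(["wrist", "hand"]): 0.8,
}

def injury_similarity(a, b):
    """Look up a hand-assigned similarity; identical parts score 1.0, unknown pairs 0.0."""
    if a == b:
        return 1.0
    return INJURY_SIMILARITY.get(frozenset([a, b]), 0.0)
```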
vertumnus
9/30
"1) We only have injury data going back about 8 years. That's not enough to do the historical comps for the full career path adjustment."

I would have to think you're going to have to be careful with using historical injury data in general, because of advances in medical technology and technique.

An injury may not have the same ramifications in 2010 as it did in 2000.
joelefkowitz
9/30
I'm confused. Wrist injuries hamper power numbers. Old-PECOTA didn't know about injuries. Shouldn't their old-PECOTA projected numbers therefore be higher than their actual numbers?
nateetan
9/30
The numbers shown were for the season after, i.e., projections based on the stats from the season that was hampered by the injury.
joelefkowitz
9/30
Oh man, forgive me if I'm missing the obvious here, but I'm still not getting it...

For the season that was "hampered" by the injury PECOTA expected them to slug .427, but they instead slugged .439, correct? Why would these players, dealing with injuries (that PECOTA doesn't know about) which are widely accepted to lower power numbers, slug higher than PECOTA thought they would?

If PECOTA didn't know about these power-sapping injuries, why did it predict a larger power-sap than what actually occurred?
cwyers
9/30
We're not looking at the season where they dealt with injuries, we're looking at the season after. In other words, during the season they were hampered, they put up lower power numbers due to the effects of their wrist injury. PECOTA sees those lower power numbers and, not knowing the reason for them, projects that player to have less power.

Meanwhile the player has (presumably) recovered from the wrist injury, and his power is restored to some extent, and thus he will tend to exceed PECOTA's expectations. By being able to point out to PECOTA which seasons were hampered by injuries and what kinds of injuries they were, we can be smarter about predicting what they'll do.

(I should also note that this is just one example - we're looking for similarly injured players and finding what they do. It doesn't have to be this sort of improvement - they could well do worse than we expect, depending on the nature of the injury.)
joelefkowitz
9/30
Ok thanks a lot. I was confused by "the season after". I thought I had remembered from Will's Teachings, that the effects of a wrist injury can often extend into the next season. I was imagining a player who was injured late in season Y, and was feeling the effect on his numbers in season Y+1, rather than being injured and realizing the effects in the same season, Y, and then rebounding in Y+1.
cephyn
9/30
No. PECOTA predicted they'd slug .427 because they slugged .408 LAST season, when they were injured. Had they slugged their usual .430, then PECOTA would have maybe pegged them to slug around .430 again. But because PECOTA saw a decline in performance, it reduced the prediction. It didn't know that decline was from injury, not from skill degradation.
metty5
9/30
Colin, this is amazing. During implementation, will PECOTA have an understanding of the time frame of the injury, or just that it exists? My fear is that PECOTA will not understand the difference between the effect of an April injury in year Y and a September injury in year Y on performance in April of year Y+1.

Also, will the cards (or whatever display method you use in the future) tell us if and what injuries are being applied?
leites
9/30
I thought Nate Silver's comment in yesterday's post was interesting:

"The key difference in Pecota, the forecasting system that I developed eight years ago to predict the performance of baseball players, was not that it did better than its competition, on average (it did in most years, but only by a tiny bit). Rather, it was that it looked at the uncertainty in the forecast as a feature rather than a bug.

For example, it didn’t just tell you how many home runs Derek Jeter would hit on average, but what a best-case scenario looked like and what a worst-case scenario looked like. This not only made the forecasting system more honest, but also provided a lot more information to the reader."

http://fivethirtyeight.blogs.nytimes.com/2010/09/29/the-uncanny-accuracy-of-polling-averages-part-i-why-you-cant-trust-your-gut/#more-1545
mikefast
9/30
The problem with that is that, as far as I know, the accuracy of the projection percentiles was never tested. Tango has, in fact, offered some good reasons to believe they were problematic. I don't know if Tango's right or Nate's right, but without evidence I don't have a lot of confidence in the percentile forecasts.

Presumably this is something that Colin is going to address at some point.
phnath
9/30
So all this self-disclosure about PECOTA is interesting and all, but I've got to be honest, I'm still waiting for the article that says, "We realize you are paying $40/year for a subscription to our website. PECOTA is a big part of the value of a BP subscription and we are committed to providing you with accurate, useful projections and completed player cards by February 1st, 2011." Until you make such a commitment, the discussion of new features is just noise.

devine
10/01
Maybe it's noise to you. You are stepping up to represent one (probably very important) segment for BP: fantasy baseball players (I'm presuming here, but I bet I'm right). But there are other segments, some of which don't play fantasy baseball, and have been clamoring for PECOTA to be a more transparent, open system for years.