With the end of April looming, we can begin to shed some of our fears regarding small sample size. Statistics like strikeout and walk rates have passed critical thresholds on their march toward stabilization, and so we are beginning to get a first look at how well individual players will perform. The requisite early-season loss of ~20% of each team’s starting rotation to the failure of a certain crucial ligament has taken its toll, resulting in a clearer picture of who will make each team’s starts.
All of which is to say, we can begin to turn our attention to matters larger than individual players. Since the ultimate goal of every team is to win a championship—and the best way to win a championship is simply to field a very good team—the question of utmost importance is simply: How good is my team?
In light of this question, I examine here how quickly team quality stabilizes over the course of a season. At a fundamental level, good teams are defined by 1) scoring lots of runs, and 2) not allowing the other team to score many runs. Therefore, I take as my measurements of quality runs scored per game and runs allowed per game.
While there is a simple relationship between the number of runs scored/allowed and wins (via the Pythagorean expectation), that relationship is quite noisy. First and foremost, the noise results from sequencing, or the luck a team has in apportioning its runs to individual games. A bad team may thus end the season with an excellent record and a playoff berth, despite an underlying lack of quality. Nevertheless, all else being equal, good teams (those that score many runs and don’t allow many runs) are more likely to make the playoffs and win championships than bad teams.
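For concreteness, here is a minimal sketch of the Pythagorean expectation; the exponent of 2 is the classic choice, though values near 1.83 are also commonly used in practice.

```python
def pythagorean_win_pct(runs_scored: float, runs_allowed: float, exponent: float = 2.0) -> float:
    """Expected winning percentage implied by a team's runs scored and allowed."""
    rs, ra = runs_scored ** exponent, runs_allowed ** exponent
    return rs / (rs + ra)

# A team that scores 750 runs and allows 700 projects to roughly a .534
# winning percentage, or about 86-87 wins over a 162-game season.
print(round(pythagorean_win_pct(750, 700), 3))  # 0.534
```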
Estimating Quality
A simple estimate of the quality of your team’s offense is the number of runs it has scored so far in the season divided by the number of games it has played (RS/games). Naturally, the accuracy of that estimate will improve as the season progresses and more data becomes available, but just how quickly does it become accurate? I used Retrosheet game logs from 2000-2013 to examine this question.
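As a sketch of how that running estimate can be computed, assume the Retrosheet game logs have been reshaped into one row per team-game with hypothetical columns season, team, game_number, and runs_scored:

```python
import pandas as pd

def running_rs_per_game(games: pd.DataFrame) -> pd.DataFrame:
    """Add a season-to-date RS/G column to a frame of team-game rows.

    Assumed (hypothetical) columns: season, team, game_number
    (a 1-based running count of games played), and runs_scored.
    """
    games = games.sort_values(["season", "team", "game_number"]).copy()
    games["rs_per_game"] = (
        games.groupby(["season", "team"])["runs_scored"].cumsum() / games["game_number"]
    )
    return games
```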
I plot here the root mean squared error (RMSE), a measurement of how accurate a prediction is, for the 420 team-seasons in my dataset, over the course of a season. Each line represents the error of an individual team’s season, which decreases as the season progresses (since the error is calculated relative to the final RS/game over the full season).
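The curves were built in roughly this spirit: at every game number, measure how far the running RS/G sits from the team's final full-season RS/G, then pool across the 420 team-seasons. A sketch, reusing running_rs_per_game from above:

```python
import pandas as pd

def rs_error_by_game(games: pd.DataFrame) -> pd.DataFrame:
    """Deviation of the running RS/G from the final full-season RS/G, per team-game."""
    games = running_rs_per_game(games)
    final_rs_g = games.groupby(["season", "team"])["rs_per_game"].transform("last")
    games["abs_error"] = (games["rs_per_game"] - final_rs_g).abs()
    return games

# One way to draw the pooled summary line: RMSE across team-seasons at each game number.
# rmse_curve = (rs_error_by_game(games)
#               .assign(sq_error=lambda d: d["abs_error"] ** 2)
#               .groupby("game_number")["sq_error"].mean() ** 0.5)
```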
The red line is the RMSE over all 420 team-seasons. The blue dashed lines bracket the range of games that teams have played so far this season, between 20 and 30. You can see that if you guessed using runs scored per game at this point in the season, you would tend to fall within ~.5 runs per game of the final value. While .5 sounds small in absolute terms, it is quite large in terms of runs per game; it is the difference between the 2013 Detroit Tigers’ offensive output and the 2013 Toronto Blue Jays’, for example.
I have shaded the area representing the 90 percent confidence interval of the error (orange lines delineate the boundary of this confidence interval). This interval implies that while one could predict the RS/G to within .5 runs, one could also be off by as much as ~1.25 runs per game or as little as ~.1 runs per game. Variability in the predictive accuracy is huge this early in the season. A good offense could collapse and become terrible, or a terrible one could improve and become great.
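The shaded band can be recovered from the same per-game error distribution, for instance by taking the 5th and 95th percentiles of the deviations across team-seasons at each game number (again a sketch built on the hypothetical frame above):

```python
import pandas as pd

def rs_error_band(games: pd.DataFrame, low: float = 0.05, high: float = 0.95) -> pd.DataFrame:
    """Percentile envelope of the RS/G prediction error at each game number."""
    errors = rs_error_by_game(games)  # from the sketch above
    grouped = errors.groupby("game_number")["abs_error"]
    return pd.DataFrame({"lower": grouped.quantile(low), "upper": grouped.quantile(high)})
```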
Let’s take a look at the same graph, but for runs allowed, with the hope that the situation is not as variable.
The shape of the runs allowed prediction curve is almost exactly the same. In retrospect, the similarity was probably to be expected, given that every run scored is a run allowed for another team. Approximately the same dynamics hold for this graph: you could guess the RA/G of a team within about a half run, but you could just as easily be exactly right as be off by a run and change.
There are some minor differences between the two prediction accuracy curves. If you look very closely and compare them side-by-side (or statistically), you’ll note that the Runs Allowed curve tends to have a few more outliers over the course of the season. By this, I mean that the extreme deviations from the final RA/G, beyond the limits of the confidence interval, are a little more extreme for allowed runs than scored runs. That reflects what we know about pitching, and specifically the tendency of pitchers to get injured more often or just demonstrate fluky runs of unsustainable success or failure.
In both cases, we see that accurate prediction of a team’s offense or defense is severely limited this early in the season. While it might still be beneficial to know a team’s RS/RA within .5 runs or so, the true accuracy of the guess can fluctuate wildly, depending on the particular team. It might be tempting to conclude that we ought to simply forgo prediction until later in the year, when the sample sizes are larger still. But we can do better.
The Primacy of PECOTA
There’s another category of information we can bring to bear in our task: preseason projections. Such projections are based on the individually predicted attributes of each player on a team, and so constitute an orthogonal source of information. It’s reasonable to believe that by integrating what we previously knew about players, we could dramatically improve our predictions for teams.
Around these parts, the projections of choice are none other than PECOTA. I took PECOTA’s preseason projections for the 2013 season and contrasted the accuracy of PECOTA’s single-point estimate of each team’s RS with the increasingly accurate average RS/G over the course of a season.
This graph is similar to the above graphs, but using PECOTA projections and runs scored. PECOTA alone, with no updating to account for the season in progress, is quite accurate. In fact, using the PECOTA projections is more accurate than season-to-date RS/RA until about the 30th game—a point no teams have reached in 2014. However, PECOTA is still off by the familiar .5 runs per game margin to which we have become accustomed.
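The comparison itself is simple: PECOTA's error is fixed for the whole season, so the interesting quantity is the game number at which the season-to-date curve first dips below it. A sketch, assuming a hypothetical dict pecota_rs_g mapping team codes to preseason RS/G projections:

```python
import pandas as pd

def pecota_crossover(games: pd.DataFrame, pecota_rs_g: dict):
    """Find the first game number at which season-to-date RS/G beats the fixed PECOTA projection."""
    errors = rs_error_by_game(games)  # from the sketch above
    final_rs_g = errors.groupby(["season", "team"])["rs_per_game"].transform("last")
    errors["pecota_error"] = (errors["team"].map(pecota_rs_g) - final_rs_g).abs()
    by_game = errors.groupby("game_number")[["abs_error", "pecota_error"]].mean()
    better = by_game[by_game["abs_error"] < by_game["pecota_error"]]
    return by_game, (better.index.min() if not better.empty else None)
```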
There is hope for prediction accuracy yet, however. I also built a combined linear model that integrated PECOTA’s projections and the season-to-date RS numbers. I trained the model on 2012’s projections and runs-per-game numbers and then applied it to 2013’s data. As expected, this combined model outperforms either source of data alone. In fact, the season-to-date stats don’t approach the accuracy of the combined model until around the 100th game of the season.
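The article doesn't spell out the exact form of the combined model, but one reasonable sketch is a separate two-feature linear regression at each game number, trained on one season (2012) and applied to the next (2013); the column names rs_per_game, pecota_rs_g, and final_rs_g are assumptions:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_combined_models(train: pd.DataFrame) -> dict:
    """Fit one PECOTA + season-to-date blend per game number on a training season."""
    models = {}
    for game_no, chunk in train.groupby("game_number"):
        X = chunk[["rs_per_game", "pecota_rs_g"]].to_numpy()
        y = chunk["final_rs_g"].to_numpy()
        models[game_no] = LinearRegression().fit(X, y)
    return models

def predict_combined(models: dict, rs_per_game: float, pecota_rs_g: float, game_no: int) -> float:
    """Blend a team's current RS/G with its preseason projection at a given game number."""
    return float(models[game_no].predict([[rs_per_game, pecota_rs_g]])[0])
```

If the fitted coefficients behave as expected, the weight on the season-to-date term grows with the game number while the weight on PECOTA shrinks, which is consistent with the intuition below that the blended forecast lands somewhere between the two inputs.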
Between games 20-25, where most teams presently lie, the combined model is accurate to within ~.2-.25 runs per game, for both RS and RA. It displays another excellent characteristic as well: the maximum deviation for this model over this game range was no more than .4-.45 runs per game. To put it differently, the maximum deviation of the combined model was less than the average deviation of the rolling RS/G model alone.
With that said, let’s run the model on 2014’s data (Runs Scored):
In some ways, the model’s takeaway is unsurprising. In general, a given team’s projected RS number is going to be somewhere between PECOTA’s projection and the RS number the team has accrued so far. For fans of offensively overperforming teams, like the Twins and the White Sox, that’s going to be a bit of a downer. For fans of the underperforming teams, such as (most prominently) the Diamondbacks, this projection may offer some slim solace (but they are still a very long shot for the playoffs).
In a broader sense, this research illustrates that incorporating an orthogonal source of predictive information can improve a model’s accuracy rather drastically. That was the case with pitch velocity, it’s the case with team-level projections, and it’s probably the case with a lot of other things as well. In the case of predicting a team’s quality, one can forecast runs scored and allowed to within less than half a run per game, and more precisely still as the season proceeds.
As I noted before, there’s still a layer of complexity between the runs scored/allowed and who actually wins games. While scoring and preventing runs is the raw output of good teams, the timing of those scored and prevented runs determines whether a good team will also be a winning team (and by extension, a playoff team). Still, there’s a lot of baseball yet to be played this season, and it may cheer the fans of a few good teams to know that while victory is elusive, your team is probably better than its record says.
Why would it ever be anything else? How, for instance, do we account for the Orioles, who are outperforming their PECOTA so far but are projected to finish below that level?
Generally, I would take the above numbers with a grain of salt. They are provided for illustrative purposes, rather than as definitive predictions. The point of this article was to show how quickly RS/RA stabilized, and to demonstrate that preseason predictions still carry some weight (and will until ~ game 100). If people are interested in maximally accurate predictions, maybe I can do a follow-up with some more sophisticated models.
(With that said, it's also possible I just made a mistake in entering the numbers in the table. I'll go back and check to make sure.)
If only we had some pitching to go along with our projection-annihilating offense...
If PECOTA says 4.6 and actual to date is 4.4, I'm suspicious of a prediction that is 4.2, and when it happens to nearly half the teams (13 out of 30) it suggests a bug.
BTW the issue happens in both directions (actual lower than PECOTA and PECOTA lower than actual). For instance:
Baltimore: 4.64 actual, 4.34 PECOTA, 4.23 projected.
Toronto: 4.35 actual, 4.48 PECOTA, 4.24 projected.
Somewhat dubious.
There has been some prior work suggesting that if you pro-rate the PECOTA winning percentage to 69 games and add a team's current record to that 69-game PECOTA record, the result is close to the best predictor of the team's record going forward. That might serve as a good check on what you have done.
Since the model's strength is its simplicity, I'm curious what weights it's using. Are you using one set of weights derived to optimize the predictive value at all points along the curve? Would it be better (or close) to just use [PECOTA * 162 + RS * (GP / 162)] / 2?
Very interesting that teams scored more of their runs early in the season, so improving on PECOTA's projection requires even more RS. If that's really a trend across all seasons, it has a lot of implications for fantasy...
A lot of good questions and suggestions here, and in the comments above (as usual). I will look into some of these for the next article.