The new Yankee Stadium has received a lot of press this spring for the large number of homeruns hit there so far. On April 21, 2009, Buster Olney wrote at ESPN (http://sports.espn.go.com/mlb/news/story?id=4080195): “The New York Yankees might have a serious problem on their hands: Beautiful new Yankee Stadium appears to be a veritable wind tunnel that is rocketing balls over the fences…including 17 in the first three games in the Yankees’ first home series against the Indians. That’s an average of five home runs per game and, at this pace, there would be about 400 homers hit in the park this year, or an increase of about 250 percent. In the last year of old Yankee Stadium, in 2008, there were a total of 160 homers.”
The first mistake in Olney’s analysis is to take the homerun rate of five games and extrapolate that over a full season, and the second is to refer to how many were hit in the old Yankee Stadium last year, without considering if there might be different players on the field. The accepted method of measuring park factors, on any statistic, is to compare the home totals of both the batters and pitchers to those compiled on the road, where playing in fifteen or more different parks minimizes the effect of any one park. The factors then allow us to estimate how these players would perform in a neutral environment.
As of this writing on May 20, the Yankees have played 19 games at home, which have seen a total of 71 homeruns, 37 by Yankees hitters, 34 by their opposition. They’ve played 21 games on the road, with 49 homeruns, 27 by Yankees hitters and 22 by their opposition. 71 homers at Yankee Stadium divided by 49 in the Yankees road games gives a factor of 1.45, indicating the new Yankee Stadium inflates homerun rates 45%. The Yankees have played two more games on the road than at home, so let’s instead find the ratio of the home HR% (hr/(ab-so)) of .064 to their road game rate of .043, which is 1.48: slightly higher, but basically the same.
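As a minimal sketch of the arithmetic above, the two versions of the factor can be computed as follows. The function and variable names are illustrative and not from any published tool, and the article does not give the underlying at-bat and strikeout totals, so the rate-based version starts from the rounded rates of .064 and .043. Those rounded inputs give a ratio just under 1.49, so the 1.48 quoted above presumably comes from unrounded rates.

```python
# Figures quoted in the article, as of May 20, 2009.
HOME_HR = 37 + 34   # Yankees hitters + opponents at Yankee Stadium
ROAD_HR = 27 + 22   # Yankees hitters + opponents in Yankees road games


def hr_rate(hr, ab, so):
    """Homeruns per ball put in play or over the fence: hr / (ab - so)."""
    contact = ab - so
    if contact <= 0:
        raise ValueError("ab - so must be positive")
    return hr / contact


def ratio_factor(home, road):
    """Park factor expressed as a simple home/road ratio."""
    return home / road


# Raw counts: ignores the fact that home and road game totals differ.
count_factor = ratio_factor(HOME_HR, ROAD_HR)

# Rates: uses the rounded HR% values from the text (an assumption, since
# the exact at-bat and strikeout totals are not given).
rate_factor = ratio_factor(0.064, 0.043)

print(f"count-based factor: {count_factor:.3f}")
print(f"rate-based factor:  {rate_factor:.3f}")
```

Working from rates rather than counts is what removes the effect of the two extra road games.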
Is 20 games, a quarter of a season, enough of a sample size to get a reliable factor? After two exhibitions and three regular season games, Olney calculates an increase of 250%. After 19 regular season home games, I calculate an increase of 45%. What is it likely to be by the end of the season?
From 1985 to 1991, a period of seven seasons, there were no changes in the National League in either ballparks or schedule. I ran a series of one-year, two-year and three-year factors to find out how much each varied from the seven-year ‘true’ value at each park. The chart below shows the standard deviation of the results for each category at each sample size. After one year all categories are fairly close to two-decimal-place accuracy, except homeruns, which take three years, and triples, which take even longer.
If Yankee Stadium still has a homerun factor of 1.45 at the end of the year, with a SD of .149, that means there’s a 70% chance the ‘true’ value is between 1.30 and 1.60, and a 95% chance of it being between 1.15 and 1.75. After 19 games it is still possible that Yankee Stadium could turn out to be an average park.
Sample  SDT   XBH   SI    DO    TR    HR    BB    SO
1 Yr    .039  .083  .044  .091  .292  .149  .069  .044
2 Yr    .023  .057  .025  .060  .207  .085  .054  .030
3 Yr    .018  .046  .020  .045  .161  .060  .041  .023
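The ranges quoted for a 1.45 homerun factor are simply the estimate plus or minus one and two standard deviations. A short sketch, assuming the errors are roughly normally distributed (which is what reading one SD as about 70% and two SD as about 95% implies), shows how the range tightens with the two-year and three-year SDs from the chart. The names below are illustrative.

```python
# Homerun-factor standard deviations by sample size, taken from the chart.
HR_SD = {1: 0.149, 2: 0.085, 3: 0.060}


def interval(estimate, sd, n_sd):
    """Return (low, high) for estimate +/- n_sd standard deviations."""
    return (estimate - n_sd * sd, estimate + n_sd * sd)


estimate = 1.45  # the observed factor, assumed to hold for the full sample
for years, sd in HR_SD.items():
    low1, high1 = interval(estimate, sd, 1)   # roughly a 68% range
    low2, high2 = interval(estimate, sd, 2)   # roughly a 95% range
    print(f"{years} yr: 1 SD {low1:.2f} to {high1:.2f}, "
          f"2 SD {low2:.2f} to {high2:.2f}")
```

With the one-year SD of .149, the two-SD range still reaches down to 1.15, which is why a single season cannot separate an extreme park from a moderately friendly one.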
A stadium having a factor of 1.45 tells us that plays in that park will be increased by 45% over normal rates. We can use this number to normalize the performance of batters and pitchers to what they would have done in a ‘neutral’ park. Each team is scheduled to play half its games at home and the other half at the various road parks. If we assume that the road parks average out to 1.00, then the ‘team’ factor applied to the season’s stats would be (home+road)/2, or in this case (1.45+1.00)/2, which is 1.22. Yankees hitters would be normalized by having their homerun percentage reduced by 22%, and the pitchers increased by 22%. However, with interleague play and unbalanced schedules, we cannot assume that a team’s road parks average 1.00. The Pirates play division games in Great American Ballpark, Miller Park, Wrigley Field and Minute Maid Park, all of which are among the easiest to homer in. The Rockies play division games in Petco Park, Dodger Stadium and AT&T Park, which are among the hardest. After the initial calculation of each park’s factors, use those to normalize each team’s road statistics and rerun to generate a new version of factors. A third time is even better, but more than that doesn’t add any meaningful accuracy.
The chart shows that it takes at least three years to get a fairly accurate set of factors, but before that time has gone by a new stadium has likely been constructed, which means the road parks have changed. Assuming Yankee Stadium’s HR factor remains higher than that of the park it replaced, the factor for Fenway Park will go down because Red Sox hitters can be expected to hit more homers on the road. In 1978, Fenway was the fourth easiest park in the AL to homer in, but by 1999 it had dropped to ninth. Fenway hadn’t changed; it was all the other parks that changed. Can we legitimately say, “It used to be a hitter’s park, but now it’s a pitcher’s park”? It would make sense for each park’s factors to remain constant as long as there had not been any changes in that park. To find each team’s factors, multiply how many times they play in each park by each park’s factors, then divide the sum by the total number of games. The team factor can change each year with a different mix of road parks for each team, while the factors for each park do not change as long as the park hasn’t changed. When play-by-play data is available, team factors to adjust a season total are not needed. Instead, how each player performed in each ballpark can be normalized with that park’s factors, and then summed into an adjusted season total.
1978 American League                    1999 American League
ParkID  Name                  HRpf      ParkID  Name                   HRpf
SEA02   Kingdome              1.55      DET04   Tiger Stadium          1.21
DET04   Tiger Stadium         1.21      BAL12   Camden Yards           1.13
TOR01   Exhibition Stadium    1.06      STP01   Tropicana Field        1.12
BOS07   Fenway Park           1.02      TOR02   SkyDome                1.10
MIN02   Metropolitan Stadium  1.00      ARL02   Ballpark at Arlington  1.10
CLE07   Cleveland Stadium     1.00      SEA02   Kingdome               1.09
ARL01   Arlington Stadium     0.94      NYC16   Yankee Stadium         1.07
OAK01   Oakland Coliseum      0.93      KAN06   Kauffman Stadium       1.05
NYC16   Yankee Stadium        0.92      BOS07   Fenway Park            1.02
ANA01   Anaheim Stadium       0.86      ANA01   Anaheim Stadium        1.01
MIL05   County Stadium        0.85      MIN03   Metrodome              0.98
CHI10   Comiskey Park         0.79      CHI12   Comiskey Park II       0.98
KAN06   Kauffman Stadium      0.77      OAK01   Oakland Coliseum       0.97
BAL11   Memorial Stadium      0.76      CLE08   Jacobs Field           0.95
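The team factor described above, games in each park times that park’s factor, divided by total games, can be sketched in a few lines. The schedule below is hypothetical: the 1.45 home factor and the four road-park HR factors come from figures in this article, but the game counts and the "OTHER" bucket at 1.00 are assumptions made only for illustration.

```python
def team_factor(games_by_park, park_factors):
    """Weighted mean of park factors, weighted by games played in each park."""
    total_games = sum(games_by_park.values())
    if total_games == 0:
        raise ValueError("schedule has no games")
    weighted_sum = sum(games * park_factors[park]
                       for park, games in games_by_park.items())
    return weighted_sum / total_games


# Simple case from the text: half the games at home (1.45), and road parks
# assumed to average 1.00.
simple = team_factor({"HOME": 81, "ROAD": 81},
                     {"HOME": 1.45, "ROAD": 1.00})

# Hypothetical unbalanced schedule: game counts are invented for illustration.
park_factors = {"HOME": 1.45, "BOS07": 1.02, "BAL12": 1.13,
                "TOR02": 1.10, "STP01": 1.12, "OTHER": 1.00}
schedule = {"HOME": 81, "BOS07": 9, "BAL12": 9,
            "TOR02": 9, "STP01": 9, "OTHER": 45}
unbalanced = team_factor(schedule, park_factors)

print(f"balanced assumption:   {simple:.3f}")
print(f"hypothetical schedule: {unbalanced:.3f}")
```

Because the park factors themselves stay fixed, only the schedule dictionary needs to change from one season to the next.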
In calculating long term park factors, I first made a list of ballpark ‘versions’. Three Rivers Stadium opened in Pittsburgh in 1970, so that’s version 1. In 1975, an inner wooden fence was constructed, about 6 feet shorter, creating version 2, which lasted until its closing after the 2000 season. Version 2 of Veterans Stadium in Philadelphia existed from 1972 to 2003. Three Rivers v2 and Veterans v2 both existed from 1975 to 2000. For those 26 seasons, compare the Pirates’ and Phillies’ stats in Pittsburgh with the same two teams’ stats in Philadelphia. Repeat for every combination of ballpark versions, then compare the total home to road stats for the entire range of years.
I’ve spoken mainly of homeruns in this article, as that category is the one that varies the most between ballparks, ranging from 1.65 for the Polo Grounds 1954-1963 to 0.48 for the Astrodome 1977-1984. Other than the mile-high Coors Field with its BABIP factor of 1.15, base hits range from Kansas City’s Municipal Stadium at 1.08 to Milwaukee’s County Stadium at 0.92. Candlestick Park in San Francisco had the highest SO factor at 1.11, while Coors Field is the hardest place to fan at 0.85. The bottom of the SO factor list is populated by the various incarnations of fields in Denver, Kansas City, Atlanta, Pittsburgh, Chicago and St. Louis, almost all of the major league cities away from the coasts and a thousand or more feet above sea level. The theory is that breaking pitches don’t move as much at higher altitudes, where the air is thinner, resulting in higher contact rates, but that’s another article.
In summary
- Don’t expect more than two decimal places of accuracy.
- It takes three seasons to get a good homerun factor.
- Park Factors should not change if the park does not change.
- Team factors are the weighted mean of park factors, and they can be applied to individual players’ statistics.
NAME                          ParkID  Ver  Since  Games  SDT   XBH   SI    DO    TR    HR    BB    SO
Angel Stadium of Anaheim      ANA01     4  1997     812  1.00  0.96  1.02  0.99  0.76  1.01  1.00  0.99
Rangers Ballpark in Arlington ARL02     1  1994    1027  1.03  1.02  1.02  1.02  1.35  1.10  1.00  0.95
Turner Field                  ATL02     1  1997     810  1.01  0.94  1.03  0.95  1.03  0.96  1.00  0.99
Oriole Park at Camden Yards   BAL12     1  2002    1100  0.98  0.89  1.01  0.89  0.70  1.13  1.02  0.97
Fenway Park                   BOS07     7  1956    3965  1.07  1.15  1.03  1.27  1.01  1.02  1.00  0.98
Wrigley Field                 CHI11     7  1956    4006  1.02  0.98  1.02  1.01  0.98  1.19  1.02  0.99
U.S. Cellular Field           CHI12     2  2001     569  0.99  0.96  1.00  0.97  0.80  1.26  1.02  0.97
Great American Ballpark       CIN09     1  2003     406  0.97  0.99  0.97  1.01  0.50  1.24  0.97  0.99
Progressive Field             CLE08     1  1994    1008  1.01  1.02  1.00  1.05  0.78  0.95  1.03  1.00
Coors Field                   DEN02     2  2005     244  1.10  0.97  1.11  1.03  1.24  1.09  0.98  0.85
Comerica Park                 DET05     2  2003     324  1.00  0.93  1.02  0.86  1.56  0.87  0.95  0.94
Minute Maid Park              HOU03     1  2000     648  1.02  1.00  1.02  0.98  1.39  1.18  0.96  1.00
Kauffman Stadium              KAN06     4  2004    7323  1.04  1.08  1.01  1.11  1.21  0.83  1.04  0.92
Dodger Stadium                LOS03     6  2001    7567  0.99  0.89  1.03  0.91  0.61  1.08  1.03  1.03
Land Shark Stadium            MIA01     2  1994    1017  1.00  0.99  1.01  0.95  1.36  0.92  1.06  1.05
Miller Park                   MIL06     1  2001     570  0.98  1.03  0.97  1.02  0.92  1.13  1.04  1.01
Hubert H. Humphrey Metrodome  MIN03     2  1983    1836  1.03  1.09  1.00  1.11  1.28  0.98  1.00  1.04
Shea Stadium                  NYC17     3  1985    1744  0.98  0.95  1.00  0.94  0.90  0.93  0.97  1.02
Yankee Stadium                NYC16     7  1988    1420  0.99  0.94  1.01  0.95  0.73  1.07  0.96  0.99
Oakland Coliseum              OAK01     6  1996     885  0.96  1.01  0.96  0.98  0.89  0.97  0.97  0.96
Citizens Bank Park            PHI13     1  2004     324  1.01  0.96  1.03  0.97  0.96  1.23  0.89  0.97
Chase Field                   PHO01     1  1998     729  1.05  1.06  1.03  1.07  1.60  1.11  1.03  0.92
PNC Park                      PIT08     1  2001     565  1.03  1.01  1.03  1.08  0.77  0.89  0.95  0.92
PetCo Park                    SAN02     2  2006     162  0.94  0.86  0.99  0.77  1.07  0.90  1.00  1.08
AT&T Park                     SFO03     2  2004     325  1.05  0.98  1.05  1.00  1.24  0.87  0.96  0.94
Safeco Field                  SEA03     1  1999     650  0.96  0.96  0.97  0.94  0.76  0.93  1.09  1.07
Busch Stadium III             STL10     1  2006     161  1.01  0.90  1.05  0.91  0.82  0.82  0.97  0.90
Tropicana Field               STP01     2  2001     561  0.99  1.01  0.99  0.97  1.29  0.98  0.98  1.02
SkyDome                       TOR02     1  1989    1320  1.00  1.10  0.96  1.10  1.11  1.10  1.02  1.01
Robert F. Kennedy Stadium     WAS10     3  1971     324  0.97  0.94  0.99  0.90  0.98  0.77  0.88  1.01
Brian Cartwright has one heck of a voice!
Every Week One article was good, but Brian tackled an issue currently consuming the baseball writers' attention and he offered unique value to the readers of BP. Perhaps he wrote at a level that could have been misunderstood by some baseball fans, but he wrote at a level easily understood by the readers of BP and he offered that for which we seek, insight missing from sites such as ESPN.com, MLB.com, or our local Sunday newspapers.
I gave just one thumbs up this week. Congratulations, Brian: you got my vote. Superb analysis for working on so short a deadline!
I have sung backup to two different Grammy Award winners.
I think the first half of the article did a good job at explaining the basics to a new reader... I like succinct lines like: "71 homers at Yankee Stadium divided by 49 in the Yankees road games gives a factor of 1.45" and "A stadium having a factor of 1.45 tells us that plays in that park will be increased by 45% over normal rates." I do feel that, besides a change in schedule, there should've been a mention about park factors changing possibly because of stadium renovations or a humidor. Overall, I liked the article.
I'd count that as a positive feature.
Instead, he has weekly deadlines and word limits.
But as far as the complexity of your analysis goes, don't change a thing. Personally, I like reading BP articles that challenge me; that's what attracted me to the site in the first place. This is Baseball Prospectus, not the FOX game of the week.
Your article seems to be less about park factors per se and more about how and when park factors can be used as a reliable statistic. Couldn't all that be simplified by referring to the "larger sample sizes are needed" mantra? Does it really matter (to the reader) exactly how large, and exactly what kind, of a sample size is needed to properly evaluate park factors? Moreover, is a new BP reader going to care about the reliability of one park factors figure versus another? That might be a petty critique, but you might want to think about not just what your topic is, but why you're writing about it (as opposed to other things).
Nonetheless, you have fantastically impressive statistical analysis skills and I'd LOVE to read several more articles by you.
Thumbs up!
"The chart below shows the standard deviation...If Yankee Stadium still has a homerun factor of 1.45 at the end of the year, with a SD of .149, that means there's a 70% chance the 'true' value is between 1.30 and 1.60, and a 95% chance of it being between 1.15 and 1.75"
'true' being the real underlying value we are estimating, the result we would get with an infinite amount of data.
Richard said:
there should've been a mention about park factors changing possibly because of stadium renovations
"Three Rivers Stadium opened in Pittsburgh in 1970...In 1975, an inner wooden fence was constructed, about 6 feet shorter, creating version 2"
I also missed where you said "standard deviation" but I see it now. Next time, capitalize "Standard Deviation" or annotate it like "standard deviation (SD)" to make it more obvious what the abbreviation SD refers to.
Fenway Park *did* change between 1978 and 1999. Around 1989, (I might be off by a year) they built what was then called the "600 club" a second deck behind home plate that added about 600 (get it?) new seats set, bizarrely, completely behind plexiglass. Mike Greenwell and other Red Sox outfielders were all quite clear that this addition to the park caused balls to not carry as well as they had before it was constructed. It cut off the prevailing west-southwesterly wind that pushed balls out toward center and left center field in warm weather.
Fenway Park did change.
From 1967-1987, Batter Park Factors* at Fenway ranged from 118 (1977) to 99 (1986/87) with a median of 107.
From 1988-2008, Batter Park Factors* at Fenway ranged from 111 (2007) to 97 (1997) with a median of 105.
There's a lot of talk about how the change affected air currents, and I guess that there's maybe a little change, but Fenway is still a hitter's park most years. In terms of the article, though, the HRpf hadn't changed: it was 1.02 both seasons cited.
* source Baseball Reference
Maybe my book's wrong but that's a pretty big difference.
To be fair and objective, I just went back to my database and created a new version for Fenway from 1988 on.
Base hits (babip) was unchanged at 1.07, but all the extra base hits dropped - it does appear that the ball did not carry as well. Doubles went from 1.30 to 1.22, triples from 1.03 to 0.97, and homers from 1.07 to 0.92.
I thought it was a good example, and although now struck down, the point stays the same - if the park hasn't changed, the park factor shouldn't. It's the team factor, the weighted mean of all the parks each team plays in, that can change from year to year as the schedule or any one park changes.
My bigger point is that, unless I'm missing something, this method gets the wrong answer, doesn't it? Say every stadium is exactly the same, except for Coors which dramatically increases HRs. Team A plays a disproportionate number of games in Coors so their stadium ends up with a park factor of 0.9. Team B ends up with 0.97 (#s are illustrative only - not sure they really make sense). Team A and Team B _should_ end up with the same park factor. But this would never happen. Because A and B are exactly the same, they (on average) have the same numbers of HRs hit in them. But, after your adjustment, it looks like Stadium A has 0.9/0.97 as many HRs as Stadium B. The park factors should definitely converge quickly, but they're converging to the wrong numbers. Am I off-base?
I can think of some straightforward ways to get the right numbers (which, honestly, I've always just assumed were used to get PFs) so I'm wondering why this method was used. Thanks!
I stepped through this in Excel to make sure everything worked as I expected.
I created four teams: A and B in Division 1, C and D in Division 2. Each team plays the other team in its division 36 times at home and 36 on the road, and plays the two teams in the other division 18 games each at home and on the road, for a total of 72 home and 72 road.
Let's assume we have perfect knowledge. Teams A, B, and C have a home park home run rate of .040 while Team D's home park rate is .060. The mean of all four parks is .045. In real life, we do not know these numbers; all we know is how many home runs were hit by each batter off each pitcher in each ballpark. Traditional factors are expressed as home/road ratios. I am fairly alone in trying to determine 'normal' rates at each park, which is the rate at which a stat will occur if a league average selection of players played there over a long period of time. There's more math than I can ask you to wrap your head around right now.
In our test case, in round one of calculating factors, A and B are both .040 at home and .045 on the road for a factor of .89. C's home rate is .040, but it plays twice as many games at D, so its road rate is .050 for a factor of .80. D's home rate is .060 and its road rate is .040, so its factor is 1.50.
C has the exact same ballpark as A and B, so it should have the same factor (.89), not .80. In round 2, each team's expected road rates are calculated by multiplying the number of games against each opponent by the opponent's home park rate divided by their round 1 factor. A and B don't change, as they are in the other division. C's expected road rate is now .043 for a factor of .94, D's road rate is .048 for a factor of 1.26. In round 1 (raw), C's factor was too low and D's was too high. The new estimate is on the other side of the true value, but closer.
Round 3, C's road rate is .046 (true .045), factor .86 (true .89), D's road rate .044, factor 1.37 (true 1.33).
Round 4, C's road rate is .044, factor .90, D's road rate .046, factor 1.32.
One last time, Round 5, C's road rate is .045, factor .88, D's road rate .045, factor 1.34.
D has a home park rate of .060. If a batter there had an observed rate of .060, a raw (round 1) factor would normalize that batter to .040, but we know league average is .045. After round 2 the batter is rated at .048, round 3 .044, round 4 .046, round 5 .045. Three rounds gets the results to within .001, which is close enough, so let's not waste any more time waiting for the computer to do the extra calculations.
In the end, all four teams had an expected road park rate of .045, the same as the mean of the four home parks. You might ask, why not skip this exercise and just use this league average for the road rates? I assigned these values for this test, but in real life we do not know it. Two teams may have identical parks, but A has a lot of boppers while B has all slap hitters, which hides the truth of the park from us. This process is to strip out the players and show us the park.
This particular test shows that the iteration works. Team C played a disproportionate number of road games in ballpark D, causing it to have a different factor than A or B, when we knew that the true value should be the same. Each step of correcting for the road rates brought C's factor closer and closer to A and B.
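The four-team test can be reproduced with a short script. This is a sketch of the procedure as described in the rounds above, not the original spreadsheet: the park rates and schedule are the ones given for the test, while the function and variable names are illustrative. Each round divides every opponent's home park rate by that park's previous factor, takes the games-weighted mean as the expected road rate, and recomputes home/road.

```python
# Known home park homerun rates for the test (perfect knowledge assumed).
HOME_RATE = {"A": 0.040, "B": 0.040, "C": 0.040, "D": 0.060}

# ROAD_GAMES[team][park] = road games that team plays in that park.
ROAD_GAMES = {
    "A": {"B": 36, "C": 18, "D": 18},
    "B": {"A": 36, "C": 18, "D": 18},
    "C": {"D": 36, "A": 18, "B": 18},
    "D": {"C": 36, "A": 18, "B": 18},
}


def next_factors(factors):
    """One round: normalize road parks by the previous factors, then recompute."""
    new_factors = {}
    for team, schedule in ROAD_GAMES.items():
        total_games = sum(schedule.values())
        road_rate = sum(games * HOME_RATE[park] / factors[park]
                        for park, games in schedule.items()) / total_games
        new_factors[team] = HOME_RATE[team] / road_rate
    return new_factors


def iterate(rounds):
    """Return the factors after each round; round 1 uses the raw road rates."""
    factors = {team: 1.0 for team in HOME_RATE}   # 1.0 means no adjustment yet
    history = []
    for _ in range(rounds):
        factors = next_factors(factors)
        history.append(factors)
    return history


history = iterate(30)
for round_number in range(1, 6):
    factors = history[round_number - 1]
    print(round_number, {team: round(value, 2) for team, value in factors.items()})
```

In this setup the estimates for C and D alternate above and below the true values of .89 and 1.33, with the error roughly halving each round, which matches the pattern in the rounds listed above.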
Also, we may think that A, B and C are 'average', while D is the outlier, which is to say that league average is .040, not .045. A, B and C should then have a factor of 1.00, while D's is 1.50. In a real life case where there are many more teams, ballparks and seasons, I believe that the long term league average would approach .040 and A, B and C would come out close to 1.00.
I'm guessing you're not responsible for this methodology, but I still don't think it's optimal. Admittedly, I'm surprised that - holding team quality constant - the park factors converge to the correct numbers. I'm not entirely convinced this holds generally, but I'll take it as given for now. I'm less surprised that if you allow heterogeneity in team quality but force all parks to be the same, that this method works fine as well. However, I'm pretty sure that this gets the wrong answer once you let teams have different HR rates and stadia have different park factors.
Instead of iterating, there's a pretty straightforward way to get park factors. The problem with unbalanced schedules is that they weight some parks more than others. Just eliminate the implicit weights. Using your example above, I don't need to iterate. Instead of using (A vs C in your example) A=[.04*4]/[2*.04+.04+.06]=.89 and C=[.04*4]/[.04+.04+2*.06]=.8 and then iterating to get C to converge to A, you can calculate the values automatically by just "unweighting" the balanced schedule (and including the park's own HR rate in both the numerator and denominator): A=[.04*4]/[.04+.04+.04+.06]=.89 and C=[.04*4]/[.04+.04+.04+.06]=.89.
In other words, you don't want to use a team's aggregate road numbers and aggregate home numbers. Each teamA-teamB matchup is an observation - calculate the ratio and don't weight any matchup more than any other. Just find the park factor for each team-team matchup and aggregate.
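The unweighted arithmetic in this comment can be checked with a few lines. This sketch only reproduces the calculation using the known park rates from the four-team test; it does not attempt the per-matchup estimation from real data that the comment goes on to propose, and the names are illustrative.

```python
# Park rates from the four-team test (known only because the test assigns them).
PARK_RATE = {"A": 0.040, "B": 0.040, "C": 0.040, "D": 0.060}


def unweighted_factor(park):
    """Park rate divided by the simple mean of all park rates, own park included."""
    mean_rate = sum(PARK_RATE.values()) / len(PARK_RATE)
    return PARK_RATE[park] / mean_rate


for park in PARK_RATE:
    print(park, round(unweighted_factor(park), 2))
```

Every park is counted once regardless of how often a team visits it, so A and C get the same factor by construction.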
I'm afraid I haven't explained my point well. I don't think the advantage of what I'm proposing is just to eliminate iterating. By using one aggregate ratio, you can't separately (correctly) identify the team effect and the stadium effect. Instead, you need to separate things by team and by park - this separately identifies each one, allowing you to identify the park factor.
Again, thanks for responding! This is a helpful discussion.
This wasn’t so artful as some of the more glib writers, nor was this as basic as requested, but I’m heavily rooting to see more articles from Brian Cartwright.
This isn't specific to Brian's article, so please don't take this as criticism of him, but if someone can't be introduced to sabermetrics in a way that's comfortable and even easy, many won't. The reason more people listen to Joe Morgan and Steve Phillips is because they seem authoritative and easily understandable, even if demonstrably wrong. Phillips made a point about comparing Mauer's swing to a figure skater last night that went against the laws of physics, but millions heard it and it makes just enough sense that I bet we'll hear it again (especially in Minnesota!)
I never would have found BP if it weren't for Rob Neyer and I think Neyer is the gold standard for the "make something hard seem easy"/Basics niche. If you don't make converts, the revolution never moves forward.
That being said, I assumed 95% of what's published on BPro isn't intended for newbies, the exception being these Basics articles. People who are already here are willing to take the time to do more research (which can simply be asking a question) when they encounter a new idea.
This is why The Basics was a strange topic for week one of BP Idol. These authors all have something unique and mostly original to provide to baseball writing, and asking them to mostly put that aside right away handcuffs them. Yes, it's a good test of writing, but 90% of this contest should be about content, not writing. I like the topic, but maybe save it until after we've had a chance to figure out the authors' approaches for a few weeks.
Here's something along the lines of what I was expecting. It discusses why batting average isn't good enough, how we can do better, what OPS is, and gives meaning to the OPS scale. On a slightly advanced note, it shows how various measures of offense correlate to run scoring, and that OPS is better than other options with similar complexity.
http://www.redreporter.com/story/2007/7/13/0523/81591
Anyway, ending with a table isn't enough to make me not vote, but I didn't feel like the text above the table was strong enough to just give me some data and a pat on the back.
I didn't think Brian was particularly mean to Olney. Olney wrote something and Brian politely pointed out that it was false. BP writers shouldn't trash mainstream writers. But they probably should be pointing out inaccuracies.
Why shouldn't the park factor change if the park doesn't change? As long as the park changes relative to the average major-league park, that should be enough to change the park factor, shouldn't it?
I get it, I get it, but it's too much for Basics. You also introduce park factor numbers and show us a chart of them before giving us the nut graf about how the stat actually works: "A stadium having a factor of 1.45 tells us that plays in that park will be increased by 45% over normal rates..." -- that's the wrong order.
As with many of these articles, this is actually a pretty good piece. Its failings are that, in my opinion, it's just not Basic enough, and makes too many assumptions about a reader's knowledge (standard deviations).
Try imagining your mom when you write these Basics articles. Or, if you prefer, some dude sitting next to you at the ballpark who has been watching baseball for years but knows nothing about sabermetrics and thinks that Batting Average is the bee's knees.
For example, these two sentences really bother me: "After the initial calculation of each park's factors, use those to normalize each team's road statistics and rerun to generate a new version of factors. A third time is even better, but more than that doesn't add any meaningful accuracy."
I would've loved to hear more about why this is the accepted method since (as I commented above), I would assume this actually gets the wrong answer. Even if it's not a Basics article, it's always good to start at the beginning and explain why the community does something in a certain way. Instead, this article just glosses over a pretty major adjustment when there was plenty of opportunity to discuss it in detail.
General question for anyone: is there an explanation/justification for this method somewhere?
In any case, I thought that this article actually told me something new, and in a convincing way. However, I voted for this article because I thought that it was the best article, in a vacuum. At the same time, it is probably a few levels above a "Basics" article. One of my main problems with American Idol is that people vote based on each week, forgetting past results and future upside. I would rather vote on potential, at least this early in the competition, rather than how closely the week's prompt was followed. I think that Brian is capable of having a Hanley Ramirez-type breakout season, as soon as his manager stops asking him to bunt the runner over. At this point, I'd rather vote for that than the solid singles hitter with limited upside.
Constructive criticism:
1. Don't use an acronym without introducing it first (at least not an acronym for something that wouldn't be found in the standard box scores). I wasn't so concerned with SD for standard deviation as I was with something like SDT in the table. I'm guessing it is singles+doubles+triples, but I'm not 100% sure. One tip to make it clearer too so that people don't miss your abbreviations is to use full words the first time and also include your abbreviation there. So something like "The chart below shows the standard deviation (SD) of the results for each category at each sample size". Now it is crystal clear in the next paragraph that SD is standard deviation.
2. I know a lot of people have moved into component park factors instead of aggregate park factors. However, the guy in the ballpark who is slightly more refined in statistics than a true basic reader, but not yet a full Brian/Tango/MGL type, may have thought about park factors only in terms of runs. Or hitters park versus pitchers park. It would be nice to either include the run factor of each park and/or explain why components are more important than runs. But touching on it in the tables would be good (unless I missed it and that is what SDT is).
3. I think using Buster's quote as an opening is fair, but one way to keep the tone from seeming too arrogant or snarky would be to credit Olney with intentionally exaggerating for effect and then have you come along to refine the argument to show how it is done for real. I think Olney knows that Yankee Stadium is unlikely to have 400 HR hit there this season, even if he maybe couldn't run a bunch of multiyear regressions.
Overall though, the piece is easily better than the average BP article (which I think is the proper strict bar for BP Idol).
2) Runs and home runs are probably the two most quoted factors, and they are the ones that vary the most from park to park. Home runs were the 'hook' for this article, and being a component it's a number you can turn around and use for another computation. It's hard to do that with a runs factor and that's mainly why I don't track them, but a runs factor is easy to comprehend.
3) I'm not really familiar with Olney, and I don't assume that he was intentionally exaggerating. Maybe so, but to my reading he showed all the wrong ways to use numbers.
a) "Based on the first three games at Yankee Stadium, we can expect 400 home runs to be hit there this year."
b) "Is 17 homers in three games a lot? Well, of course it is, but to put it in context, this rate, if sustained over the course of a whole season, would mean 400 home runs would be hit there this season. By way of comparison . . ."
The first, as you point out, would be an ignorant use of statistics. But the second way is a reasonable, if not optimal, way to present the information. But the sentence after the Olney quote was labeled a "mistake" without any supporting statements, and a later passage attributes a 250% increase as if Olney had calculated a park factor himself.
If Olney in another article called a player a "malingerer" or "clubhouse cancer" based on an interpretation of a quote when a more innocent interpretation existed, he would rightfully be excoriated by BP readers. Calling Olney's analysis a mistake isn't as strong a statement because it isn't a personal attack, but the logic still applies.
Overall... odd mix of detail and generality. Nothing grabbed me in the prose, much as I love making fun of Buster Olney. C+
Putting aside one's feelings with regard to math-heavy analysis, Brian seems to have ignored the rules for this week's entries. The most basic requirement is to "craft an article around one statistic or concept and explain it." If the concept of "Park Factors" is explained in this article, the explanation eludes me.
"Yankees hitters would be normalized by having their homerun percentage reduced by 22%, and the pitchers increased by 22%."
If the new Yankee Stadium increases home runs by 22% relative to other parks, in order to normalize the hitters' home run percentages, you should reduce them by 22%. However, to penalize pitchers by increasing their home runs allowed by 22% because they pitch in a home run-inflating ballpark is clearly adjusting the numbers in the wrong direction. Why should we penalize the Kevin Millwoods of the world for pitching in Texas? Isn't it bad enough being a Kevin Millwood?
I liked parts of this article a lot, but I am a stickler for accuracy on the basics. At first glance (and only having read 2 or 3 others), it doesn't get a vote from me.
-----------------------
"As of this writing on May 20, the Yankees have played 19 games at home, which have seen a total of 71 homeruns, 37 by Yankees hitters, 34 by their opposition. They've played 21 games on the road, with 49 homeruns, 27 by Yankees hitters and 22 by their opposition. 71 homers at Yankee Stadium divided by 49 in the Yankees road games gives a factor of 1.45—indicating the new Yankee Stadium inflates homerun rates 45%. The Yankees have played two more games on the road than at home, so let's instead find the ratio of the home HR% (hr/(ab-so)) of .064 to their road game rate of .043, which is 1.48—slightly higher, but basically the same.
Is 20 games, a quarter of a season, enough of a sample size to get a reliable factor?"
------------------------------
This is the wrong question. The problem with the 1.48 number is not that the sample size may be too small, BUT THAT IT IS WRONG. Park factors are calculated using the SAME SET OF OPPONENTS at home and on the road. The Yankees have not played the same set of opponents at home and on the road. Therefore, the calculation is simply wrong.
I was hoping to read entries that were chosen exclusively by the contestant according to his or her strengths and interests, instead of forced-theme columns.
It'll definitely be a challenge if what we are requested to do is each write the type of piece that appears in BP. I could see this week's topic, "Fantasy," being difficult if someone has really not played Fantasy Baseball at all (or only played it very little).
But hey, them's the rules.
I really, really don't understand the point of the themes. But whatever.
As far as what kind of writer the BP staff wants in the long run, it's honestly a bit hard to tell. Judging by the finalist selection, most of the initial entries had some grounding in statistics. We do know from Will's comments that there were comedy pieces and other non-statistics pieces submitted as well. They apparently _really_really_ wanted someone who conducted an external interview. In the end, though, their finalist selection biased who we would be able to evaluate. Some people agree or disagree with the finalists chosen, some are agreeing with the judge's comments and others aren't... I think overall, they did a good job in selecting the kinds of people we would be interested in.
So I think at this point, though BP selected the initial group, they merely want a writer that we paying citizens find entertaining/engaging/insightful/thought-provoking.
Furthermore, it wouldn't be a fair contest, if the writers didn't have to come up with something new each week. Some have articles stashed away. I suppose they could find a way to tweak what they may have already written and adapt it to a theme, but that's OK. A good research article most likely takes more than a week to corral and crunch the data.