“With me, being a hard thrower … no matter what, they’re defending that heater, man. So the more confidence I have to throw that [changeup] in any count, I’m going to throw it. I’m just going to. I don’t care anymore. It’s going to help me and I realize that.”
—A.J. Burnett on his pitch selection. PITCHf/x has confidence in his fastball as well.
We’re now a week and a half into the new season and, if you’re anything like me, you’re basking in the wall-to-wall baseball that only this time of year offers. From the flip-flop in the AL Central with the Royals and Brian Bannister looking good and the Tigers…well, not…to the surprising Orioles, the not-so-surprising Giants, and the hot starts of the Brewers and Cardinals, there are plenty of story lines to keep us occupied.
But while MLB.TV may be a more or less constant companion, our attention turns to other matters as well, and so this week we’ll close the book (for now anyway) on measuring historical infield defense with Simple Fielding Runs (SFR) and open the book on PITCHf/x for 2008.
SFR in the Infield, One More Time
Before moving on to PITCHf/x in 2008, let’s first revisit a topic from the previous couple of weeks related to infield defense and SFR.
I mentioned last week that in making changes to the algorithm I also took the time to include pitcher handedness in the context that SFR uses to create its baseline matrix. I speculated that it probably wouldn’t make much difference in the overall results but I wanted to be sure, and so this week I re-ran the numbers for 2003 through 2006 (the period for which we also have UZR data with which to compare). As before, I ran two simple correlations for SFR vs. UZR with the results shown in Tables 1 and 2. For comparison, you can refer back to a previous column where I also ran correlations for both the seasonal and aggregate numbers.
Table 1. Correlation Coefficients for SFR vs. UZR, Seasonal 2003-2006 for >=50 Games Played
Pos Seasons r
All 549 0.79
1B 143 0.65
2B 132 0.78
SS 141 0.81
3B 133 0.82
Table 2. Correlation Coefficients for SFR vs. UZR, Aggregated 2003-2006 for >=162 Games
Pos Players r
All 156 0.86
1B 38 0.77
2B 40 0.88
SS 41 0.86
3B 37 0.89
Compared with the previous correlations, the changes are hardly noticeable; in fact, the correlations are just slightly lower at first base and unchanged everywhere else.
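For readers who want to run this kind of comparison themselves, here is a minimal sketch of the correlation calculation. The player-season values below are made up for illustration, and the helper names `pearson_r` and `r_by_position` are my own; only the Pearson formula itself is standard.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical (position, SFR, UZR) rows for a few player-seasons, not real data.
seasons = [
    ("SS", 12.0, 15.0),
    ("SS", -4.0, -1.0),
    ("SS", 3.0, 6.0),
    ("3B", 8.0, 5.0),
    ("3B", -10.0, -12.0),
    ("3B", 1.0, 4.0),
]

def r_by_position(rows):
    """Group rows by position and return {position: r}, plus an 'All' entry."""
    groups = {"All": ([], [])}
    for pos, sfr, uzr in rows:
        for key in ("All", pos):
            xs, ys = groups.setdefault(key, ([], []))
            xs.append(sfr)
            ys.append(uzr)
    return {key: pearson_r(xs, ys) for key, (xs, ys) in groups.items()}

results = r_by_position(seasons)
for pos, r in results.items():
    print(f"{pos}: r = {r:.2f}")
```

Running the same function on the seasonal rows and again on per-player aggregates would give the two flavors of correlation reported in Tables 1 and 2.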
My assumption remains that the essential information about hit distribution (which one would think pitcher handedness would most affect) was already captured in the combination of hit type and batter handedness, and so by adding pitcher handedness we didn’t really add any new information. That’s plausible, but there could also be an offsetting effect in play: the additional context would, in theory, have produced better results, but splitting each baseline bucket roughly in half introduces more variation and less reliable estimates, which cancels out the benefit of adding pitcher handedness.
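To get a rough feel for the size of that offsetting effect, consider a simplified sketch that treats each baseline bucket as an out rate estimated from independent trials. That binomial framing, the bucket size, and the out rate are all assumptions for illustration; the actual SFR baseline is not computed this way.

```python
from math import sqrt

def standard_error(p, n):
    """Standard error of a proportion p estimated from n independent trials."""
    return sqrt(p * (1 - p) / n)

# Hypothetical bucket: 400 ground balls converted into outs 75% of the time.
out_rate = 0.75
bucket_size = 400

se_full = standard_error(out_rate, bucket_size)
# Splitting by pitcher hand roughly halves the bucket (assuming an even split;
# in practice the right-handed half would be the larger one).
se_half = standard_error(out_rate, bucket_size // 2)

print(f"full bucket: +/- {se_full:.4f}")
print(f"half bucket: +/- {se_half:.4f}")
print(f"ratio: {se_half / se_full:.3f}")
```

Under these assumptions, halving a bucket inflates the uncertainty of its estimate by a factor of the square root of two, so any gain from the extra context has to be at least that large to show up in the results.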
On a second note, in looking at the data again for 1986-1987 and 2000-2002 I determined that it would be worth running the framework as is, since both contain the vast majority of fielder identifications, albeit being deficient with respect to hit types. Hit types are especially low for 2000-2002 but, of course, the changes made last week are designed to account for this and so should compensate to some degree.
The upshot of all of this is that you can now download a spreadsheet that contains SFR data for all major league infielders for the time periods that include 1957-1983, 1986-1998, and 2000-2007. Keep in mind that the results provided are based on a single algorithm that takes into account when there is missing hit type information and fills it in accordingly. This means that the results for 1988-1998 and 2003-2007 should be considered more accurate since they are based on essentially complete data (minus actual zone information of course) with 1957-1983, 1986-1987 and 2000-2002 in descending order of precision, and with 1984 and 1985 still out of the picture.
Finally, to finish this out, let’s take a look at the new overall leaders in Rate at each of the four infield positions when all seasons for which we have data are included. You’ll notice we’ve also upped the ante and are looking only at players who were assigned 2,000 or more balls in their virtual area of responsibility.
Table 3. Top and Bottom Shortstops by Rate, >= 2,000 Balls 1957-2007 (almost)
Name               Span       Balls     SFR  Rate
Adam Everett       2001-2007   2558    86.0  1.21
Bob Lillis         1958-1967   2053    55.8  1.20
Ernie Banks        1957-1961   2916    87.4  1.19
Rey Sanchez        1991-2005   3341    88.2  1.17
Mark Belanger      1965-1982   8468   198.3  1.15
-------------------------------------------------
Frank Taveras      1972-1982   4903  -106.7  0.89
Ruben Amaro        1958-1969   2909   -66.3  0.88
Ricky Gutierrez    1993-2004   3079   -82.2  0.88
Andujar Cedeno     1990-1996   2426   -74.6  0.87
Kurt Stillwell     1986-1996   2727   -75.8  0.87
Table 4. Top and Bottom Second Basemen by Rate, >= 2,000 Balls 1957-2007 (almost)
Name               Span       Balls     SFR  Rate
Dick Green         1963-1974   4281   102.0  1.18
Mark Ellis         2002-2007   2680    67.7  1.18
Mike Gallego       1986-1997   2117    50.5  1.16
Mark Lemke         1988-1998   3602    90.2  1.15
Jose Oquendo       1986-1995   2473    45.2  1.11
-------------------------------------------------
Bobby Richardson   1957-1966   4976  -109.9  0.87
Luis Rivas         2000-2007   2013   -46.9  0.87
Tony Taylor        1958-1976   5735  -152.6  0.86
Cookie Rojas       1962-1977   5594  -157.3  0.85
Jorge Orta         1972-1979   2567   -79.4  0.83
Table 5. Top and Bottom Third Basemen by Rate, >= 2,000 Balls 1957-2007 (almost)
Name               Span       Balls     SFR  Rate
Brooks Robinson    1957-1977   9686   293.0  1.26
Jim Davenport      1958-1970   2917    68.3  1.19
Eric Chavez        1998-2007   3488    73.9  1.16
Aurelio Rodriguez  1967-1983   6266   133.7  1.16
Scott Rolen        1996-2007   4423    95.9  1.16
-------------------------------------------------
Jim Presley        1986-1991   2111   -48.4  0.87
Howard Johnson     1982-1995   2009   -52.7  0.86
Bill Madlock       1973-1987   3520   -90.4  0.86
Harmon Killebrew   1957-1971   2256   -74.9  0.82
Dick Allen         1964-1972   2252  -118.4  0.74
Table 6. Top and Bottom First Basemen by Rate, >= 2,000 Balls 1957-2007 (almost)
Name               Span       Balls     SFR  Rate
Todd Helton        1997-2007   3202    55.1  1.18
John Olerud        1989-2005   4172    68.9  1.16
Pete O'Brien       1982-1993   2410    33.5  1.15
Wes Parker         1964-1972   2237    23.1  1.14
George Scott       1966-1979   4431    51.1  1.13
-------------------------------------------------
John Mayberry      1968-1982   3008   -23.8  0.93
Donn Clendenon     1962-1972   2584   -23.1  0.93
Mo Vaughn          1991-2003   2472   -27.7  0.91
Willie Montanez    1970-1982   2539   -29.7  0.90
Dick Stuart        1958-1969   2104   -59.4  0.79
PITCHf/x for 2008
With the new season upon us, we want to continue exploring the most recent data set made available to analysts–Sportvision’s PITCHf/x system. The advantage this time around is that we’ll have data for all 30 parks, beginning with the opening game in Washington’s new ballpark (sans the first two games at the Tokyo Dome) and hopefully taking us through the playoffs and World Series.
Beyond the larger and therefore more reliable samples that come with more data (last year the system collected 332,851 pitches, representing about half the schedule, while this year we should have more than 650,000), analysts everywhere are hopeful that some of the kinks in the system have been worked out, necessitating fewer adjustments. For example, while last year the point at which the pitch was being tracked was adjusted in mid-season, it appears that this season all pitches are being tracked starting at a point 50 feet from the plate, making adjustment for velocity unnecessary. In addition, Gameday operators (like yours truly) now have the ability to override pitch data that comes into the system. Unfortunately, it appears that in the XML data set these overrides simply appear as pitches without any data other than the operator’s x and y coordinates, making it impossible to know whether a pitch was simply missed by the system (i.e., the system wasn’t turned on or had some other problem) or whether it was truly overridden. As you might expect, early-season configuration and operator issues likely cause more of the former, and so hopefully the number of untracked pitches will be more a reflection of actual overrides as the season progresses. At this point (through games of April 6), 5% of the pitches have gone untracked.
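Counting untracked pitches comes down to looking for pitch records that carry the operator’s x and y coordinates and nothing else. The sketch below shows one way to do that; the XML fragment is hand-written, and the element and attribute names (`pitch`, `start_speed`, `pfx_x`, `pfx_z`) are assumptions for illustration rather than a documented schema.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written fragment standing in for a Gameday file. The element and
# attribute names are assumptions for illustration, not a documented schema.
SAMPLE_XML = """
<inning num="1">
  <atbat pitcher="1001">
    <pitch x="101.2" y="142.5" start_speed="93.1" pfx_x="-6.2" pfx_z="9.0"/>
    <pitch x="88.4" y="150.1"/>
    <pitch x="95.0" y="139.7" start_speed="84.0" pfx_x="2.1" pfx_z="3.3"/>
    <pitch x="110.9" y="160.3" start_speed="92.5" pfx_x="-5.8" pfx_z="8.4"/>
  </atbat>
</inning>
"""

# Attributes treated as evidence that the tracking system actually saw the pitch.
TRACKING_ATTRS = ("start_speed", "pfx_x", "pfx_z")

def count_untracked(xml_text):
    """Return (untracked, total) for all <pitch> elements in the document."""
    root = ET.fromstring(xml_text)
    total = 0
    untracked = 0
    for pitch in root.iter("pitch"):
        total += 1
        # A pitch with none of the tracking attributes has only operator data.
        if not any(pitch.get(attr) for attr in TRACKING_ATTRS):
            untracked += 1
    return untracked, total

untracked, total = count_untracked(SAMPLE_XML)
print(f"{untracked} of {total} pitches untracked ({untracked / total:.0%})")
```

Note that a count like this cannot separate pitches the system missed from pitches an operator overrode, which is exactly the limitation described above.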
What is probably most interesting to analysts, and what most fans have no doubt noticed, is that PITCHf/x is taking a stab at pitch type categorization, and displays it in the client, as shown in Figure 1.
Figure 1. Classifying Pitch Types
Several analysts, most notably Josh Kalk and John Walsh, have developed procedures for pitch identification, while I’ve been working off individual pitcher profiles when the need has arisen, and occasionally creating larger buckets for fastballs, changeups and breaking balls. Although the details of the algorithm used by Sportvision are not public at this point, they are classifying pitches into many categories, including changeups, curveballs, fastballs, four-seam fastballs, cut fastballs, split-fingered fastballs, knuckleballs, pitchouts, intentional balls, sinkers, sliders and unknown. Interestingly, they are also providing a confidence level for each pitch classified that ranges from 0 to 1 and is apparently a percentage.
To get a feel for the frequency, characteristics, and confidence with which their algorithms classify pitches, the number of pitches for each pitch type by pitcher hand along with velocity, movement, and confidence is shown in Tables 7 and 8 for games through April 7:
Table 7. PITCHf/x Pitch Classification for Left Handed Pitchers in 2008
Pitch        Throws   Conf   Vel  Horiz  Vert  Count  Pct
Change       L       0.681  78.8    7.0   6.2    668  13%
Curve        L       0.811  74.6   -4.7  -4.6    623  12%
Cutter       L       0.687  84.3   -4.3   6.0    195   4%
Fastball     L       0.848  89.1    6.6   9.3   3113  60%
Four-seamer  L         n/a   n/a    n/a   n/a      0  n/a
Intent Ball  L       1.000  68.6    2.5   8.7      9   0%
Knuckleball  L         n/a   n/a    n/a   n/a      0  n/a
Pitch out    L       1.000  83.0    5.7   9.6      2   0%
Sinker       L         n/a   n/a    n/a   n/a      0  n/a
Slider       L       0.662  82.0   -0.9   2.9    517  10%
Splitter     L       0.513  83.5    4.8   5.9     90   2%
Unknown      L       0.000  54.8   -0.9   6.2      1   0%
Table 8. PITCHf/x Pitch Classification for Right Handed Pitchers in 2008
Pitch        Throws   Conf   Vel  Horiz  Vert  Count  Pct
Change       R       0.665  82.8   -7.7   6.2   1858  11%
Curve        R       0.713  76.2    5.2  -4.7   1459   9%
Cutter       R       0.531  88.4    0.3   8.6    507   3%
Fastball     R       0.795  91.0   -6.6   8.9   8909  52%
Four-seamer  R       0.583  91.9  -10.9  11.4    142   1%
Intent Ball  R       1.000  71.2   -5.8   8.1     85   0%
Knuckleball  R       0.844  67.9    2.7   1.3     85   0%
Pitch out    R       1.000  81.7   -6.7   9.3     21   0%
Sinker       R       0.548  90.0  -11.9   6.6    667   4%
Slider       R       0.710  83.6    2.2   3.6   2812  16%
Splitter     R       0.493  83.8   -9.3   3.2    512   3%
Unknown      R       0.000  52.7   -4.8  12.2      1   0%
Note that these tables exclude some pitches, since my master list of players is not complete, and so for those pitchers handedness is not recorded.
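For anyone wanting to build a similar summary, the roll-up behind Tables 7 and 8 can be sketched as a simple group-and-average, with the percentage computed within each pitcher hand. The records below are hypothetical, and the function name `summarize` and the tuple layout are my own choices for illustration.

```python
from collections import defaultdict

# Hypothetical pitch records: (throws, pitch_type, confidence, velocity_mph).
pitches = [
    ("L", "Fastball", 0.90, 89.0),
    ("L", "Fastball", 0.80, 90.0),
    ("L", "Curve", 0.85, 75.0),
    ("L", "Change", 0.65, 79.0),
    ("R", "Fastball", 0.75, 92.0),
    ("R", "Slider", 0.70, 84.0),
]

def summarize(records):
    """Return {(throws, pitch_type): summary} with means and share by hand."""
    groups = defaultdict(list)
    hand_totals = defaultdict(int)
    for throws, pitch_type, conf, velo in records:
        groups[(throws, pitch_type)].append((conf, velo))
        hand_totals[throws] += 1

    summary = {}
    for (throws, pitch_type), rows in groups.items():
        count = len(rows)
        summary[(throws, pitch_type)] = {
            "conf": sum(c for c, _ in rows) / count,
            "vel": sum(v for _, v in rows) / count,
            "count": count,
            # Share of all pitches thrown by pitchers of the same hand.
            "pct": count / hand_totals[throws],
        }
    return summary

table = summarize(pitches)
for (throws, pitch_type), row in sorted(table.items()):
    print(f"{pitch_type:<10}{throws:>2}{row['conf']:>7.3f}{row['vel']:>6.1f}"
          f"{row['count']:>5}{row['pct']:>6.0%}")
```

Pitchers missing from the master list would simply have no `throws` value and drop out of both tables, which is the exclusion noted above.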
The algorithm thus far classifies almost all fastballs as simply “fastballs” and doesn’t use the four-seamer or cut-fastball designations nearly as often as those pitches are certainly thrown. The same argument probably also applies to sinkers and perhaps splitters as well. However, the distribution of changeups, curves, and sliders appears to be closer to what one might expect.
Also, the algorithm is evidently customized to some degree for each pitcher and at the very least incorporates velocity: when plotting movement by pitch type (relative to a non-spinning pitch, as shown in Figure 2 for left-handed pitchers from the perspective of the batter), the groups for the various pitches overlap significantly, so movement alone could not separate them.
Figure 2. Pitch Type Groupings for Southpaws, 2008
And just for kicks, let’s take a look at the pitchers and pitch types that PITCHf/x is most and least confident about in the small sample we have from the first week.
Table 9. Most and Least Confident by Pitch Type, 20 or more pitches
Pitcher              Pitch     Count   Conf
A.J. Burnett         Fastball     63  0.965
Jonathan Albaladejo  Fastball     31  0.962
Philip Hughes        Fastball     51  0.948
Hong-Chih Kuo        Fastball     42  0.942
Richard Hill         Curve        30  0.940
Kent Mercker         Fastball     20  0.940
Jeremy Affeldt       Curve        28  0.940
Erick Threets        Fastball     37  0.939
Manuel Parra         Fastball     53  0.938
Damaso Marte         Fastball     27  0.925
-------------------------------------------
Micah Owings         Cutter       24  0.501
Jason Bergmann       Fastball     21  0.498
Justin Verlander     Sinker       40  0.486
Brad Thompson        Splitter     25  0.481
Justin Verlander     Fastball     35  0.460
Joe Saunders         Change       20  0.460
James Shields        Splitter     24  0.438
Aaron Cook           Sinker       29  0.420
Jason Bergmann       Sinker       22  0.330
Mike Mussina         Splitter     22  0.313
Finally, to close out this week’s musings, let’s compare the pitch profile I created for Felix Hernandez last year with the pitch classification in use by PITCHf/x this season. So far PITCHf/x has recorded two Hernandez starts–in total, 194 pitches on April 1 versus Texas and April 6 at Baltimore. First, let’s take a look at his pitch frequency using my classification scheme from last season.
Table 10. King Felix in 2008 Pitch Classification
Pitch        Count  Start  Horiz  Vert   Conf  Other System?
Unknown          1   91.3   -1.8  -1.9    n/a  Slider
Changeup        32   86.2   -7.3   2.9    n/a  26 classified as splitter
Curve           29   82.1    5.1  -5.4    n/a  7 sliders, 21 curves, 1 sinker
Four-Seamer     40   96.3   -7.6   7.6    n/a  All fastballs
Two-Seamer      69   93.8   -7.2   6.9    n/a  1 split-fingered, 1 sinker
Slider          23   88.5    1.2  -1.0    n/a  All sliders
Table 11. King Felix in 2008 PITCHf/x Pitch Classification
Pitch        Count  Start  Horiz  Vert   Conf  Other System?
Unknown          0    n/a    n/a   n/a    n/a
Changeup         4   86.4   -8.6   6.4  0.530  all changeups
Curve           21   80.5    6.0  -6.6  0.785  all curves
Four-Seamer      0    n/a    n/a   n/a    n/a
Fastball       109   94.7   -7.3   7.2  0.854  2 changes, 40 four-seamers, 67 two-seamers
Split-finger    27   86.1   -7.1   2.2  0.558  26 changes, 1 two-seamer
Sinker           2   86.7    0.7  -1.9  0.502  1 curve, 1 two-seamer
Slider          31   88.1    1.3  -1.2  0.650  7 curves, 23 sliders, 1 unknown
Overall, the PITCHf/x algorithm does a pretty good job, as the two systems largely agree on curves, sliders, and fastballs. But you’ll notice that my algorithm identifies a significant number of his pitches as changeups while PITCHf/x sees them as split-fingered fastballs. Since Hernandez does not, to my knowledge, throw a splitter, the conclusion is that while the algorithm considers some information on other pitches thrown by the pitcher, the subset of pitches it chooses from is not restricted. As a result, there will be cases where the classification is incorrect, although of course Hernandez is probably one of the more difficult pitchers to look at since he throws at a higher average velocity than most pitchers.
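One way to put a number on that agreement is to lay the two classifications side by side as a cross-tabulation. The sketch below uses the counts from the last column of Table 11; the decision to treat my four-seamer and two-seamer labels as matching the generic PITCHf/x “Fastball” label is an assumption made for this comparison, and the names `crosstab`, `EQUIVALENT`, and `agreement_rate` are illustrative.

```python
from collections import Counter

# (PITCHf/x label, author's label) -> pitch count, transcribed from the
# "Other System?" column of Table 11.
crosstab = Counter({
    ("Changeup", "Changeup"): 4,
    ("Curve", "Curve"): 21,
    ("Fastball", "Changeup"): 2,
    ("Fastball", "Four-Seamer"): 40,
    ("Fastball", "Two-Seamer"): 67,
    ("Split-finger", "Changeup"): 26,
    ("Split-finger", "Two-Seamer"): 1,
    ("Sinker", "Curve"): 1,
    ("Sinker", "Two-Seamer"): 1,
    ("Slider", "Curve"): 7,
    ("Slider", "Slider"): 23,
    ("Slider", "Unknown"): 1,
})

# Assumption: both fastball varieties count as a match for the generic
# "Fastball" label, since PITCHf/x rarely separates them so far.
EQUIVALENT = {"Four-Seamer": "Fastball", "Two-Seamer": "Fastball"}

def agreement_rate(table):
    """Share of pitches on which the two systems assign the same label."""
    total = sum(table.values())
    matches = sum(
        count
        for (pfx_label, my_label), count in table.items()
        if EQUIVALENT.get(my_label, my_label) == pfx_label
    )
    return matches / total

total_pitches = sum(crosstab.values())
rate = agreement_rate(crosstab)
print(f"{total_pitches} pitches, {rate:.1%} agreement")
```

Under that fastball assumption the two systems match on 155 of the 194 pitches, or roughly 80 percent, and 26 of the 39 disagreements are the changeups that PITCHf/x labels as splitters.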
Be that as it may, the addition of pitch type confidence is a welcome one and should allow for a greater breadth of analyses while providing the benefit of being standardized.