Overview
This past week, I started talking to one of my classmates, Shabab Ahmed, about the possibility of doing a research project together. The one problem: it has to do with baseball. Shabab is a big cricket fan, which makes him more baseball-literate than the average person you’d encounter in academia, but the project will deal with some of the finer points of baseball strategy.
That’s a useful thing to realize early on, since discussing this project with people who aren’t familiar with baseball will require us to give them the basics quickly. The data that we have is also thinly documented, so we need to walk through it to figure out how best to use it. Thus, this post will be a combination of data exploration and an explanation of how pitching works in baseball.[1]
The Data
Background
The data that we want to use in this project comes from Major League Baseball, the pre-eminent baseball league in the United States (and arguably, the world). Beginning in the 2006 playoffs,[2] a three-camera system tracks every pitch thrown in every baseball game. These cameras report not only where pitches cross the plate, but also more advanced physical properties of the pitches, such as the amount of break, the speed, and (in theory) the spin rate.
Dataset
Private websites post this pitch-by-pitch data, but a Kaggle user made data from 2015-2019 available here for download. In addition, the publicly available sites are scrape-able, if we need additional data in the future. If we simply take the 2019 file, we have data that looks like this:
px | pz | start_speed | end_speed | spin_rate | spin_dir | break_angle | break_length | break_y | ax | ay | az | sz_bot | sz_top | type_confidence | vx0 | vy0 | vz0 | x | x0 | y | y0 | z0 | pfx_x | pfx_z | nasty | zone | code | type | pitch_type | event_num | b_score | ab_id | b_count | s_count | outs | pitch_num | on_1b | on_2b | on_3b |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.00 | 2.15 | 88.8 | 80.7 | placeholder | placeholder | 22.8 | 4.8 | 24 | -8.47 | 28.90 | -15.51 | 1.70 | 3.36 | placeholder | 5.28 | -128.95 | -6.89 | 116.97 | -1.42 | 180.81 | 50 | 6.07 | -5.07 | 9.98 | NA | placeholder | X | X | FF | 5 | 0 | 2019000001 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
0.34 | 2.31 | 89.9 | 81.8 | placeholder | placeholder | 22.8 | 3.6 | 24 | -7.10 | 28.85 | -12.99 | 1.80 | 3.55 | placeholder | 4.89 | -130.54 | -7.48 | 103.93 | -1.02 | 176.34 | 50 | 6.20 | -4.14 | 11.18 | NA | placeholder | C | C | FF | 8 | 0 | 2019000002 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
-0.05 | 2.03 | 85.7 | 79.6 | placeholder | placeholder | 9.6 | 6.0 | 24 | 3.65 | 22.07 | -22.64 | 1.59 | 3.55 | placeholder | 2.33 | -124.60 | -5.98 | 118.86 | -1.29 | 183.96 | 50 | 6.30 | 2.30 | 5.99 | NA | placeholder | S | S | SL | 9 | 0 | 2019000002 | 0 | 0 | 1 | 2 | 0 | 0 | 0 |
0.49 | 0.92 | 85.4 | 78.5 | placeholder | placeholder | 24.0 | 7.2 | 24 | -13.77 | 24.44 | -25.74 | 1.74 | 3.55 | placeholder | 7.83 | -123.74 | -6.78 | 98.15 | -1.56 | 214.03 | 50 | 5.85 | -8.87 | 4.14 | NA | placeholder | B | B | CH | 10 | 0 | 2019000002 | 0 | 1 | 1 | 3 | 0 | 0 | 0 |
-0.13 | 1.11 | 84.6 | 77.6 | placeholder | placeholder | 26.4 | 8.4 | 24 | -15.99 | 24.56 | -28.36 | 1.83 | 3.59 | placeholder | 6.79 | -122.67 | -5.73 | 121.81 | -1.57 | 208.77 | 50 | 5.89 | -10.51 | 2.51 | NA | placeholder | B | B | CH | 11 | 0 | 2019000002 | 1 | 1 | 1 | 4 | 0 | 0 | 0 |
0.09 | 3.41 | 90.9 | 82.4 | placeholder | placeholder | 27.6 | 3.6 | 24 | -8.27 | 30.12 | -13.15 | 1.59 | 3.47 | placeholder | 4.37 | -132.01 | -4.92 | 113.39 | -0.98 | 146.83 | 50 | 6.28 | -4.72 | 10.86 | NA | placeholder | X | X | FF | 12 | 0 | 2019000002 | 2 | 1 | 1 | 5 | 0 | 0 | 0 |
There are a lot of columns, and not a lot of explanations about what exactly they describe. Luckily, some work has been done on this already here. So if we list the columns from PITCHf/x with the definitions that are posted here, we can see what we’re still missing. I also added in some common sense definitions for the non-PITCHf/x situation variables.
Variable | Meaning |
---|---|
px | the left/right distance, in feet, of the pitch from the middle of the plate as it crossed home plate. The PITCHf/x coordinate system is oriented to the catcher’s/umpire’s perspective, with distances to the right being positive and to the left being negative. |
pz | the height of the pitch in feet as it crossed the front of home plate. |
start_speed | the pitch speed, in miles per hour and in three dimensions, measured at the initial point, y0. Of the two speeds, this one is closer to the speed measured by a radar gun and what we are familiar with for a pitcher’s “velocity” . |
end_speed | the pitch speed measured as it crossed the front of home plate. |
spin_rate | |
spin_dir | |
break_angle | the angle, in degrees, from vertical to the straight line path from the release point to where the pitch crossed the front of home plate, as seen from the catcher’s/umpire’s perspective. |
break_length | the measurement of the greatest distance, in inches, between the trajectory of the pitch at any point between the release point and the front of home plate, and the straight line path from the release point and the front of home plate, per the MLB Gameday team. John Walsh’s article “In Search of the Sinker” has a good illustration of this parameter. |
break_y | the distance in feet from home plate to the point in the pitch trajectory where the pitch achieved its greatest deviation from the straight line path between the release point and the front of home plate. |
ax | the acceleration of the pitch, in feet per second per second, in the x dimension, measured at the initial point. |
ay | the acceleration of the pitch, in feet per second per second, in the y dimension, measured at the initial point. |
az | the acceleration of the pitch, in feet per second per second, in the z dimension, measured at the initial point. |
sz_bot | the distance in feet from the ground to the bottom of the current batter’s rulebook strike zone. The PITCHf/x operator sets a line at the hollow of the knee for the bottom of the zone. |
sz_top | the distance in feet from the ground to the top of the current batter’s rulebook strike zone as measured from the video by the PITCHf/x operator. The operator sets a line at the batter’s belt as he settles into the hitting position, and the PITCHf/x software adds four inches up for the top of the zone. |
type_confidence | the value of the weight at the classification algorithm’s output node corresponding to the most probable pitch type, this value is multiplied by a factor of 1.5 if the pitch is known by MLBAM to be part of the pitcher’s repertoire. |
vx0 | the velocity of the pitch, in feet per second, in the x dimension, measured at the initial point. |
vy0 | the velocity of the pitch, in feet per second, in the y dimension, measured at the initial point. |
vz0 | the velocity of the pitch, in feet per second, in the z dimension, measured at the initial point. |
x | the horizontal location of the pitch as it crossed home plate as input by the Gameday stringer using the old Gameday coordinate system. I’m not sure what units are used or where the origin is located. Note that the y dimension in the old coordinate system is now called the z dimension in the new PITCHf/x coordinate system detailed below. |
x0 | the left/right distance, in feet, of the pitch, measured at the initial point. |
y | the vertical location of the pitch as it crossed home plate as input by the Gameday stringer using the old Gameday coordinate system. I’m not sure what units are used or where the origin is located. Note that the y dimension in the old coordinate system is now called the z dimension in the new PITCHf/x coordinate system detailed below. |
y0 | the distance in feet from home plate where the PITCHf/x system is set to measure the initial parameters. This parameter has been variously set at 40, 50, or 55 feet (and in a few instances 45 feet) from the plate at different times throughout the 2007 season as Sportvision experiments with optimal settings for the PITCHf/x measurements. Sportvision settled on 50 feet in the second half of 2007, and this value of y0=50 feet has been used since. Changes in this parameter impact the values of all other parameters measured at the release point, such as start_speed. |
z0 | the height, in feet, of the pitch, measured at the initial point. |
pfx_x | the horizontal movement, in inches, of the pitch between the release point and home plate, as compared to a theoretical pitch thrown at the same speed with no spin-induced movement. This parameter is measured at y=40 feet regardless of the y0 value. |
pfx_z | the vertical movement, in inches, of the pitch between the release point and home plate, as compared to a theoretical pitch thrown at the same speed with no spin-induced movement. This parameter is measured at y=40 feet regardless of the y0 value. |
nasty | |
zone | |
code | code representing the outcome of the pitch. |
type | a one-letter abbreviation for the result of the pitch: B, ball; S, strike (including fouls); X, in play. |
pitch_type | the most probable pitch type according to a neural net classification algorithm developed by Ross Paul of MLBAM. |
event_num | code describing what event transpired on a batted ball. |
b_score | score of the batter’s team |
ab_id | identifier to link the at-bat table. |
b_count | number of balls when pitch was thrown. |
s_count | number of strikes when pitch was thrown. |
outs | number of outs when pitch was thrown. |
pitch_num | counter describing the sequence of pitches in the at-bat (i.e. a value of 2 means that the pitch is the second in the at-bat.) |
on_1b | indicator for whether there is a runner on first base when the pitch is thrown. |
on_2b | indicator for whether there is a runner on second base when the pitch is thrown. |
on_3b | indicator for whether there is a runner on third base when the pitch is thrown. |
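As a quick sanity check on these definitions, the three initial velocity components can be combined into a single speed and compared with `start_speed`. This is only a sketch: the helper name `speed_mph` is hypothetical, the inputs are the first row of the sample above, and the only outside fact assumed is the unit conversion (5,280 feet per mile, 3,600 seconds per hour). The two numbers need not agree exactly, since the recorded speed may not be taken at exactly the same point on the trajectory as the velocity components.

```r
# Hypothetical helper: combine velocity components (ft/s) into a speed in mph
speed_mph <- function(vx, vy, vz) {
  fps <- sqrt(vx^2 + vy^2 + vz^2)  # magnitude of the velocity vector, in ft/s
  fps * 3600 / 5280                # convert ft/s to miles per hour
}

# First row of the sample table above
row1 <- list(vx0 = 5.28, vy0 = -128.95, vz0 = -6.89, start_speed = 88.8)

computed <- speed_mph(row1$vx0, row1$vy0, row1$vz0)
round(computed, 1)                     # speed implied by vx0, vy0, vz0
round(row1$start_speed - computed, 1)  # gap to the recorded start_speed
```

For this row, the implied speed comes out a little under the recorded 88.8 mph, which is close enough to suggest the definitions are at least roughly right.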
So to fully understand this data, there are a few “mini-investigations” that I undertake to ensure everything makes sense.
`code` vs. `type`
These two fields, to an eyeball test, look identical. We can easily find any discrepancies:
pitches_og[code != type | (is.na(type) & !is.na(code)) | (is.na(code) & !is.na(type))]
## Empty data.table (0 rows and 40 cols): px,pz,start_speed,end_speed,spin_rate,spin_dir...
So in all observations (at least for 2019), `code` and `type` are the same. We need not worry about using one vs. the other. Additionally, the Kaggle page defines the values that `code` takes, which is fantastic.
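The two extra `is.na()` terms in that filter matter because a plain `!=` returns `NA` when either side is missing, and a row whose condition evaluates to `NA` is not returned by the data.table filter. Here is a minimal sketch of the same logic in base R on made-up values (the `toy` data frame is purely illustrative and not taken from the dataset):

```r
# Toy data (made up): one clean match, one mismatch, and two one-sided NAs
toy <- data.frame(
  code = c("B", "S", NA,  "X"),
  type = c("B", "X", "B", NA),
  stringsAsFactors = FALSE
)

# A plain inequality returns NA whenever either side is missing
naive <- toy$code != toy$type

# Adding the two one-sided NA checks turns those NAs into TRUE
flagged <- naive |
  (is.na(toy$type) & !is.na(toy$code)) |
  (is.na(toy$code) & !is.na(toy$type))

sum(naive, na.rm = TRUE)  # rows caught by the plain inequality: 1
which(flagged)            # rows caught by the full condition: 2 3 4
```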
`spin_rate` and `spin_dir`
These seem valuable for our analysis, but they aren’t defined anywhere. If we take tables of both, we can possibly infer the units of the variables (they’re character, not numeric):
table(pitches_og$spin_dir, useNA = 'always')
##
## placeholder <NA>
## 6629 722161 0
table(pitches_og$spin_rate, useNA = 'always')
##
## placeholder <NA>
## 6629 722161 0
So we can safely ignore these variables, as every value is either `placeholder` (722,161 rows) or blank (6,629 rows).
`zone` and `nasty`
I suspect these are additional features that are built using the physics data, but I don’t think they’re populated in this dataset:
table(pitches_og$nasty, useNA = 'always')
##
## <NA>
## 728790
table(pitches_og$zone, useNA = 'always')
##
## placeholder <NA>
## 6629 722161 0
Again, these aren’t populated, so we needn’t worry about them.
Pitching in Baseball
For the uninitiated to America’s Pastime, the goal of the game is to end nine innings with more runs than your opponent. You score runs by getting hits, and the job of the pitcher is to prevent hits (this is obviously extremely simplistic, but more detail isn’t necessary at the moment). The pitcher prevents hits by throwing pitches that are difficult for the batter to hit. However, the pitcher faces a constraint: if, over the course of an at-bat, they throw four pitches that the umpire (referee) judges to be outside the strike zone, the batter is allowed to advance to first base for free (this is called a walk). Such an outcome is not desirable for the pitcher, so their goal is to throw pitches that are maximally difficult to hit while still landing in the strike zone, since a pitch judged to be in the strike zone counts as a strike even if the batter does not swing.
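To make the strike zone constraint concrete, here is a rough sketch of a rulebook-zone check built from the variables defined earlier (`px`, `pz`, `sz_bot`, `sz_top`). The helper name `in_zone` is hypothetical, and the check is deliberately crude: it only looks at the center of the ball, ignores the ball’s radius, and says nothing about what the umpire actually called.

```r
# Hypothetical helper: is the center of the ball inside the rulebook zone?
# px, pz, sz_bot and sz_top are all in feet, as in the variable table above.
in_zone <- function(px, pz, sz_bot, sz_top) {
  half_plate <- 17 / 12 / 2  # home plate is 17 inches wide, centered on px = 0
  abs(px) <= half_plate & pz >= sz_bot & pz <= sz_top
}

# Rows 2, 4 and 5 of the sample above (coded C, B and B respectively)
in_zone(px     = c(0.34, 0.49, -0.13),
        pz     = c(2.31, 0.92, 1.11),
        sz_bot = c(1.80, 1.74, 1.83),
        sz_top = c(3.55, 3.55, 3.59))
# TRUE FALSE FALSE
```

On these three pitches the crude check agrees with the recorded outcomes: the called strike is inside the box, and both balls are below it.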
To visualize an at-bat, I’m going to first merge all the data together to create the `full_data` dataset:
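# packages used throughout: data.table for fread() and filtering, dplyr for pipes and joins,
# ggplot2 for plotting, stringr for str_interp()
library(data.table)
library(dplyr)
library(ggplot2)
library(stringr)
abs_og <- fread('~/Google Drive/Raw Data/mlb_pitch_data/2019_atbats.csv') %>%
  mutate_at(vars(ends_with('_id')), list(~ as.character(.))) %>%
  data.table()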
names_og <- fread('~/Google Drive/Raw Data/mlb_pitch_data/player_names.csv') %>%
mutate_at(vars(ends_with('id')), list(~ as.character(.))) %>%
data.table()
games_og <- fread('~/Google Drive/Raw Data/mlb_pitch_data/2019_games.csv') %>%
mutate_at(vars(ends_with('_id')), list(~ as.character(.))) %>%
data.table()
full_data <- pitches_og %>%
left_join(abs_og,
by = 'ab_id') %>%
left_join(names_og %>%
select(pitcher_id = id,
pitcher_first_name = first_name,
pitcher_last_name = last_name),
by = 'pitcher_id') %>%
left_join(names_og %>%
select(batter_id = id,
batter_first_name = first_name,
batter_last_name = last_name),
by = 'batter_id') %>%
left_join(games_og,
by = 'g_id')
So what does an at-bat actually look like? Well, we can build a function using this data to visualize a given at-bat.
full_data_built <- full_data %>%
group_by(batter_id) %>%
mutate(sz_bot_mean = mean(sz_bot),
sz_top_mean = mean(sz_top)) %>%
ungroup() %>%
mutate(date = as.Date(date)) %>%
data.table()
visualize_ab <- function(ab_id_var, box = FALSE){
ab_data <- full_data_built[ab_id == ab_id_var]
# strike zone/plate dimensions
low_sz <- unique(ab_data$sz_bot_mean)
high_sz <- unique(ab_data$sz_top_mean)
plate_min = -17/12/2
plate_max = 17/12/2
# pitcher and batter name
pitcher <- unique(ab_data$pitcher_last_name)
batter <- unique(ab_data$batter_last_name)
date <- unique(ab_data$date)
# make url (need to figure out how to identify doubleheaders, probably just by date?)
home_team <- toupper(unique(ab_data$home_team))
url <- str_interp('https://www.baseball-reference.com/boxes/${home_team}/${home_team}${format(date,"%Y%m%d")}0.shtml')
# formatting vectors
values_vec <- c('B' = 'red',
'*B' = 'red',
'I' = 'red',
'P' = 'red',
'C' = 'green',
'S' = 'purple',
'W' = 'purple',
'M' = 'purple',
'Q' = 'purple',
'X' = 'blue',
'D' = 'blue',
'E' = 'blue',
'F' = 'orange',
'T' = 'orange',
'L' = 'orange',
'R' = 'orange',
'H' = 'grey')
labels_vec <- c('B' = 'Ball',
'*B' = 'Ball in dirt',
'I' = 'Intentional Ball',
'P' = 'Pitchout',
'C' = 'Strike (Called)',
'S' = 'Strike (Swinging)',
'W' = 'Strike (Swinging, Blocked)',
'M' = 'Missed Bunt',
'Q' = 'Swinging Pitchout',
'X' = 'In Play, Out(s)',
'D' = 'In Play, No Out',
'E' = 'In Play, Run(s)',
'F' = 'Foul',
'T' = 'Foul Tip',
'L' = 'Foul Bunt',
'R' = 'Foul Pitchout',
'H' = 'Hit By Pitch')
if(box == TRUE){
cat(url)
}
# plot
ggplot(ab_data) +
scale_x_continuous(limits = c(min(-1, min(ab_data$px) - .25/2),
max(1, max(ab_data$px) + .25/2)),
expand = expansion(mult = .03)) +
scale_y_continuous(limits = c(min(1, min(ab_data$pz) - .25/2),
max(4, max(ab_data$pz) + .25/2)),
expand = expansion(mult = .03)) +
ggforce::geom_circle(aes(x0 = px, y0 = pz, color = code, fill = code, r = .25/2),
alpha = .5) +
geom_text(aes(x = px, y = pz, label=pitch_num), hjust=.5, vjust=.5) +
geom_rect(xmin = plate_min,
xmax = plate_max,
ymin = low_sz,
ymax = high_sz,
alpha = 0,
color = 'black') +
scale_color_manual(values = values_vec,
labels = labels_vec) +
scale_fill_manual(values = values_vec,
labels = labels_vec) +
theme(legend.position = 'bottom',
legend.title = element_blank(),
axis.title = element_blank()) +
coord_fixed() +
labs(title = str_interp('${pitcher} vs ${batter} (${abs_og[ab_id == ab_id_var]$stand}), ${ifelse(abs_og[ab_id == ab_id_var]$top == 0, "Bottom", "Top")} ${abs_og[ab_id == ab_id_var]$inning}'),
subtitle = str_interp('Result: ${abs_og[ab_id == ab_id_var]$event}'))
}
The function is sort of long; here are the only somewhat interesting decisions that I made when building it (the rest is just base `ggplot`):
- The use of the `ggforce` package’s function `geom_circle` instead of `ggplot`’s base function `geom_point` to plot. The only reason I did this was to be able to make the circles the actual size of a baseball, to contextualize them better.
- The inclusion of a `box` parameter. I realized that I was spending a lot of time checking at-bats on Baseball Reference (https://www.baseball-reference.com), so I decided to make that easier by building the game’s URL in the function and printing it if the user asks for it.
- I plotted the strike zone for a batter by taking the average of all their `sz_top` and `sz_bot` observations in the data. The outer bounds of the plate are defined by the width of home plate (17 inches) and the fact that it’s centered on 0 in PITCHf/x data.
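The URL pattern from the function can be pulled out into a small standalone sketch. The helper name `box_url` and its `game_index` argument are hypothetical; `game_index = 0` simply mirrors the hard-coded `0` in `visualize_ab`, and treating other values as later games of a doubleheader is an assumption I haven’t verified. The team code and date in the example call are illustrative inputs, and the dataset’s team codes are assumed to match the ones Baseball Reference uses.

```r
# Hypothetical helper mirroring the URL pattern used inside visualize_ab().
# game_index is an assumption: 0 matches the hard-coded "0" in the function.
box_url <- function(home_team, date, game_index = 0) {
  team <- toupper(home_team)
  sprintf("https://www.baseball-reference.com/boxes/%s/%s%s%d.shtml",
          team, team, format(as.Date(date), "%Y%m%d"), as.integer(game_index))
}

# Illustrative inputs, not taken from the dataset
box_url("oak", "2019-03-28")
# "https://www.baseball-reference.com/boxes/OAK/OAK201903280.shtml"
```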
A visualization will look something like this:
visualize_ab('2019000030')
On this at-bat, we see that there were 8 pitches. The first was a called strike that looks like it was slightly off the plate. The next pitch was a low ball, the next a foul, and so on, until on the eighth pitch of the at-bat, Semien hits a home run on a pitch in the center of the plate. This at-bat actually motivates the discussion of pitching strategy below.
Basic Pitching Strategy
One of the most basic tenets of pitching philosophy is that throwing the ball in the center of the plate is a cardinal sin (as an example, see the at-bat above). The reason for this isn’t complicated – while pitches in the middle of the plate are most likely to be called strikes, they are also the easiest for the batter to hit. Using the 2019 data, we have the ability to assess whether this holds in practice (to be clear, it would be truly shocking if it did not, so we can safely have a strong prior about what this should show).
To do this (I won’t put a ton of effort into showing this), let’s just look at the number of bases that are produced on average by strikes in various parts of the plate. I’ll do this by creating a heat map of such pitches:
# get the average strike zone for reference
zone <- full_data %>%
distinct(batter_id, sz_bot, sz_top) %>%
summarise(mean_low = mean(sz_bot),
mean_high = mean(sz_top))
outcome_data <- full_data %>%
# filter to strikes
filter(code %in% c('C', 'S', 'W', 'M', 'Q', 'X', 'D', 'E', 'F', 'T', 'L', 'R')) %>%
# calculate bases on those strikes
mutate(bases = case_when(event == 'Single' & code %in% c('X', 'D', 'E') ~ 1,
event == 'Double' & code %in% c('X', 'D', 'E') ~ 2,
event == 'Triple' & code %in% c('X', 'D', 'E') ~ 3,
event == 'Home Run' & code %in% c('X', 'D', 'E') ~ 4,
TRUE ~ 0)) %>%
distinct(batter_id, batter_first_name, event, pitch_num, type, batter_last_name, ab_id, bases, px, pz) %>%
# each box is one square inch
mutate(px_box = floor(px * 12)/12,
pz_box = floor(pz * 12)/12) %>%
group_by(px_box, pz_box) %>%
summarise(mean_bases = mean(bases),
n = n()) %>%
ungroup() %>%
# somewhat arbitrarily throw out any box that has fewer than 50 pitches
filter(n >= 50, !is.na(mean_bases), !is.na(px_box), !is.na(pz_box)) %>%
data.table()
ggplot(data = outcome_data) +
geom_tile(aes(x = px_box, y = pz_box, fill = mean_bases)) +
scale_fill_gradient(low = 'red', high = 'green') +
geom_rect(xmin = -17/12/2,
xmax = 17/12/2,
ymin = zone$mean_low,
ymax = zone$mean_high,
alpha = 0,
color = 'black') +
theme(axis.title = element_blank()) +
guides(fill=guide_legend(title="Mean Bases"))
Basic takeaway: pitches over the center of the plate have worse outcomes for pitchers than those that aren’t, on average. Notice how the edges of the strike zone are much redder than the center – throwing in those locations is optimal for pitchers.
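For anyone curious about the binning step behind the heat map, here is a small sketch of what `floor(px * 12)/12` does, using `px` values from the sample rows near the top of the post. The helper name `bin_inch` is hypothetical. One caveat: `floor()` returns the lower edge of each one-inch box, while `geom_tile()` centers each tile on the coordinates it is given, so the half-inch shift shown here is an optional adjustment of mine and not something the plot above applies.

```r
# Same binning rule as the heat map code: inputs in feet, boxes one inch wide
bin_inch <- function(x) floor(x * 12) / 12

px <- c(0.34, -0.05, 0.49)       # px values from the sample rows
lower_edge <- bin_inch(px)       # left edge of each one-inch box, in feet
center <- lower_edge + 0.5 / 12  # assumed adjustment: move to the middle of the box

round(lower_edge * 12)           # box index in inches: 4 -1 5
```

Note that negative values round away from zero (`-0.05` feet lands in the box starting at -1 inch), which is what we want for boxes that tile the plane evenly.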
Another piece of pitching theory is that “hanging” a pitch is bad. Hanging a pitch is when the pitcher attempts to throw a breaking ball (i.e., a pitch that curves due to spin), but doesn’t get enough spin on it, resulting in a pitch that’s slow and normally high in the strike zone, which generally is a bad outcome for the pitcher. My high school coach used to ominously say “you hang, he bang”, which I always found kind of amusing.
To look at this at a high level, we can run the simplest linear regression we can think of on this data. In particular, let’s try
\[ bases = speed + |location_x| + |location_z| + angle_{break} + length_{break} + speed \times length_{break} \]
Here, I’m abusing notation somewhat, and taking the absolute value operator to mean absolute deviation from the center of the strike zone (0 horizontally, and the midpoint of the average strike zone vertically). Note that the only interaction we’re considering here is between speed and break length, although there are certainly other relevant ones. There’s also a somewhat important factor that we’re omitting, which is whether the batter and pitcher are both right-handed, both left-handed, or opposite-handed, but let’s ignore that for now. Looking just at these pitch characteristics, we can run a regression:
zone_mean <- (zone$mean_low + zone$mean_high)/2
reg_data <- full_data %>%
# filter to strikes
filter(code %in% c('C', 'S', 'W', 'M', 'Q', 'X', 'D', 'E', 'F', 'T', 'L', 'R')) %>%
# calculate bases on those strikes
mutate(bases = case_when(event == 'Single' & code %in% c('X', 'D', 'E') ~ 1,
event == 'Double' & code %in% c('X', 'D', 'E') ~ 2,
event == 'Triple' & code %in% c('X', 'D', 'E') ~ 3,
event == 'Home Run' & code %in% c('X', 'D', 'E') ~ 4,
TRUE ~ 0),
abs_px = abs(px),
abs_pz = abs(pz - zone_mean)) %>%
distinct(ab_id, pitch_num, bases, abs_px, abs_pz, px, pz, end_speed, break_angle, break_length)
basic_pitch_reg <- lm(bases ~ end_speed + abs_px + abs_pz + end_speed*break_length + break_angle + break_length, data = reg_data)
Variable | Estimate | SE | T-Statistic | P-Value |
---|---|---|---|---|
(Intercept) | 0.6768716 | 0.0420716 | 16.088577 | 0.0000000 |
end_speed | -0.0050291 | 0.0004941 | -10.177336 | 0.0000000 |
abs_px | -0.1230725 | 0.0025440 | -48.377068 | 0.0000000 |
abs_pz | -0.1085832 | 0.0021161 | -51.313695 | 0.0000000 |
break_length | -0.0564764 | 0.0040941 | -13.794469 | 0.0000000 |
break_angle | -0.0001956 | 0.0000800 | -2.446789 | 0.0144139 |
end_speed:break_length | 0.0007357 | 0.0000518 | 14.215595 | 0.0000000 |
So without digging too deeply into anything, we see that the data suggests a few things:
- Balls close to the center of the strike zone (small deviation from the calculated center) are hit for more bases (ceteris paribus)
- Faster pitches are hit for fewer bases than slower pitches (ceteris paribus)
- Pitches that break more are hit for fewer bases than pitches that break less (ceteris paribus)
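One caveat on the last two bullets: because the model includes the `end_speed:break_length` interaction, the coefficient on `end_speed` alone is the effect of speed when break length is zero, and vice versa. A rough sketch of the implied marginal effects, using the coefficients exactly as printed in the table above (the function and variable names are mine):

```r
# Coefficients as printed in the regression table above
b_speed <- -0.0050291  # end_speed
b_break <- -0.0564764  # break_length
b_inter <-  0.0007357  # end_speed:break_length

# Marginal effect of one variable on bases, holding the other fixed
effect_of_break <- function(end_speed)    b_break + b_inter * end_speed
effect_of_speed <- function(break_length) b_speed + b_inter * break_length

# Values at which each marginal effect changes sign
speed_cutoff <- -b_break / b_inter  # in mph
break_cutoff <- -b_speed / b_inter  # in inches

round(c(speed_cutoff = speed_cutoff, break_cutoff = break_cutoff), 1)
```

With these estimates, extra break lowers expected bases only for pitches that cross the plate at roughly 77 mph or slower, and extra speed lowers expected bases only when break length is under roughly 7 inches. So the speed and break takeaways hold over part of the range of the data, not everywhere, which is worth keeping in mind before reading too much into a regression this simple.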
Summary
So what should we take away from this basic review? Pitchers are trying to throw balls that are at the edges of the strike zone. In addition, they want to throw pitches that break a lot or are really fast (or in theory, both!)
There are a few other components that we’ll want to investigate as part of the conventional knowledge. One is the sequence of pitches – is a pitcher keeping a hitter “off-balance” by throwing pitches in varying speeds and locations? We also want to look at (and likely control for) the platoon advantage, which is the idea that a righty pitcher vs. a righty batter or a lefty pitcher vs. a lefty batter is an advantageous matchup (in general) for a pitcher, while a righty pitcher vs. a lefty batter or a lefty pitcher vs. a righty batter is an advantageous situation for the batter.
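As a sketch of how the platoon control might be built, the indicator is just a comparison of the batter’s and pitcher’s handedness. The rows below are made up; `stand` is the batter-side column already used in the plot title above, while `p_throws` is an assumed name for the pitcher’s throwing hand that would need to be checked against the at-bat file.

```r
# Made-up at-bats; `stand` appears in the at-bat data used above, while
# `p_throws` is an assumed name for the pitcher's throwing hand.
abs_toy <- data.frame(
  stand    = c("R", "L", "R", "L"),
  p_throws = c("R", "R", "L", "L"),
  stringsAsFactors = FALSE
)

# TRUE when batter and pitcher use the same hand (the matchup said to favor the pitcher)
abs_toy$same_hand <- abs_toy$stand == abs_toy$p_throws
abs_toy
```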
I’ll probably dig into that a little more in a future post, but this is a good starting point for the theory and intuition of basic pitching in baseball.
[1] As a quick explanation/apology: I use data.table and dplyr syntax together a lot, so if you only use one of these packages, a few things may look odd. I try to mostly use data.table for quick filtering and investigation, but just be aware that some of the code may not run without both packages.
[2] https://slate.com/culture/2007/08/pitch-f-x-the-new-technology-that-will-change-baseball-analysis-forever.html