Understanding Pitching Data

Overview

This past week, I started talking to one of my classmates, Shabab Ahmed, about the possibility of doing a research project together. The one problem: it has to do with the sport of baseball. Shabab is a big cricket fan, which makes him more baseball-literate than perhaps the average person that you’d encounter in academia, but the project will have to do with some of the finer points of baseball strategy.

That’s a useful thing to realize early on, though: discussing this project with people who aren’t familiar with baseball will require us to give them the basics quickly. The documentation for our data is also a little opaque, so we need to walk through it to figure out how best to use it. Thus, this post will be a combination of data exploration and explanation of how pitching works in baseball¹.

The Data

Background

The data that we want to use in this project comes from Major League Baseball, the pre-eminent baseball league in the United States (and arguably, the world). Since the 2006 playoffs², a three-camera system has tracked every pitch thrown in every baseball game. These cameras report not only where pitches cross the plate, but also more advanced physical properties of the pitches, such as the amount of break, the speed, and the spin rate (in theory).

Dataset

Private websites post this pitch-by-pitch data, but a Kaggle user made data from 2015-2019 available here for download. In addition, the publicly available sites are scrape-able, if we need additional data in the future. If we simply take the 2019 file, we have data that looks like this:

Table 1: Pitching Data
px	pz	start_speed	end_speed	spin_rate	spin_dir	break_angle	break_length	break_y	ax	ay	az	sz_bot	sz_top	type_confidence	vx0	vy0	vz0	x	x0	y	y0	z0	pfx_x	pfx_z	nasty	zone	code	type	pitch_type	event_num	ab_id	b_count	s_count	outs	pitch_num
0.00	2.15	88.8	80.7	placeholder	placeholder	22.8	4.8	24	-8.47	28.90	-15.51	1.70	3.36	placeholder	5.28	-128.95	-6.89	116.97	-1.42	180.81	50	6.07	-5.07	9.98	NA	placeholder	X	X	FF	5	2019000001	0	0	0	1
0.34	2.31	89.9	81.8	placeholder	placeholder	22.8	3.6	24	-7.10	28.85	-12.99	1.80	3.55	placeholder	4.89	-130.54	-7.48	103.93	-1.02	176.34	50	6.20	-4.14	11.18	NA	placeholder	C	C	FF	8	2019000002	0	0	1	1
-0.05	2.03	85.7	79.6	placeholder	placeholder	9.6	6.0	24	3.65	22.07	-22.64	1.59	3.55	placeholder	2.33	-124.60	-5.98	118.86	-1.29	183.96	50	6.30	2.30	5.99	NA	placeholder	S	S	SL	9	2019000002	0	0	1	2
0.49	0.92	85.4	78.5	placeholder	placeholder	24.0	7.2	24	-13.77	24.44	-25.74	1.74	3.55	placeholder	7.83	-123.74	-6.78	98.15	-1.56	214.03	50	5.85	-8.87	4.14	NA	placeholder	B	B	CH	10	2019000002	0	1	1	3
-0.13	1.11	84.6	77.6	placeholder	placeholder	26.4	8.4	24	-15.99	24.56	-28.36	1.83	3.59	placeholder	6.79	-122.67	-5.73	121.81	-1.57	208.77	50	5.89	-10.51	2.51	NA	placeholder	B	B	CH	11	2019000002	1	1	1	4
0.09	3.41	90.9	82.4	placeholder	placeholder	27.6	3.6	24	-8.27	30.12	-13.15	1.59	3.47	placeholder	4.37	-132.01	-4.92	113.39	-0.98	146.83	50	6.28	-4.72	10.86	NA	placeholder	X	X	FF	12	2019000002	2	1	1	5

There are a lot of columns, and not a lot of explanations about what exactly they describe. Luckily, some work has been done on this already here. So if we list the columns from PITCHf/x with the definitions that are posted here, we can see what we’re still missing. I also added in some common sense definitions for the non-PITCHf/x situation variables.

Table 2: Pitching Data Variables
Variable	Meaning
px	the left/right distance, in feet, of the pitch from the middle of the plate as it crossed home plate. The PITCHf/x coordinate system is oriented to the catcher’s/umpire’s perspective, with distances to the right being positive and to the left being negative.
pz	the height of the pitch in feet as it crossed the front of home plate.
start_speed	the pitch speed, in miles per hour and in three dimensions, measured at the initial point, y0. Of the two speeds, this one is closer to the speed measured by a radar gun and what we are familiar with for a pitcher’s “velocity”.
end_speed	the pitch speed measured as it crossed the front of home plate.
spin_rate
spin_dir
break_angle	the angle, in degrees, from vertical to the straight line path from the release point to where the pitch crossed the front of home plate, as seen from the catcher’s/umpire’s perspective.
break_length	the measurement of the greatest distance, in inches, between the trajectory of the pitch at any point between the release point and the front of home plate, and the straight line path from the release point and the front of home plate, per the MLB Gameday team. John Walsh’s article “In Search of the Sinker” has a good illustration of this parameter.
break_y	the distance in feet from home plate to the point in the pitch trajectory where the pitch achieved its greatest deviation from the straight line path between the release point and the front of home plate.
ax	the acceleration of the pitch, in feet per second per second, in the x dimension, measured at the initial point.
ay	the acceleration of the pitch, in feet per second per second, in the y dimension, measured at the initial point.
az	the acceleration of the pitch, in feet per second per second, in the z dimension, measured at the initial point.
sz_bot	the distance in feet from the ground to the bottom of the current batter’s rulebook strike zone. The PITCHf/x operator sets a line at the hollow of the knee for the bottom of the zone.
sz_top	the distance in feet from the ground to the top of the current batter’s rulebook strike zone as measured from the video by the PITCHf/x operator. The operator sets a line at the batter’s belt as he settles into the hitting position, and the PITCHf/x software adds four inches up for the top of the zone.
type_confidence	the value of the weight at the classification algorithm’s output node corresponding to the most probable pitch type. This value is multiplied by a factor of 1.5 if the pitch is known by MLBAM to be part of the pitcher’s repertoire.
vx0	the velocity of the pitch, in feet per second, in the x dimension, measured at the initial point.
vy0	the velocity of the pitch, in feet per second, in the y dimension, measured at the initial point.
vz0	the velocity of the pitch, in feet per second, in the z dimension, measured at the initial point.
x	the horizontal location of the pitch as it crossed home plate as input by the Gameday stringer using the old Gameday coordinate system. I’m not sure what units are used or where the origin is located. Note that the y dimension in the old coordinate system is now called the z dimension in the new PITCHf/x coordinate system detailed below.
x0	the left/right distance, in feet, of the pitch, measured at the initial point.
y	the vertical location of the pitch as it crossed home plate as input by the Gameday stringer using the old Gameday coordinate system. I’m not sure what units are used or where the origin is located. Note that the y dimension in the old coordinate system is now called the z dimension in the new PITCHf/x coordinate system detailed below.
y0	the distance in feet from home plate where the PITCHf/x system is set to measure the initial parameters. This parameter has been variously set at 40, 50, or 55 feet (and in a few instances 45 feet) from the plate at different times throughout the 2007 season as Sportvision experiments with optimal settings for the PITCHf/x measurements. Sportvision settled on 50 feet in the second half of 2007, and this value of y0=50 feet has been used since. Changes in this parameter impact the values of all other parameters measured at the release point, such as start_speed.
z0	the height, in feet, of the pitch, measured at the initial point.
pfx_x	the horizontal movement, in inches, of the pitch between the release point and home plate, as compared to a theoretical pitch thrown at the same speed with no spin-induced movement. This parameter is measured at y=40 feet regardless of the y0 value.
pfx_z	the vertical movement, in inches, of the pitch between the release point and home plate, as compared to a theoretical pitch thrown at the same speed with no spin-induced movement. This parameter is measured at y=40 feet regardless of the y0 value.
nasty
zone
code	code representing the outcome of the pitch.
type	a one-letter abbreviation for the result of the pitch: B, ball; S, strike (including fouls); X, in play.
pitch_type	the most probable pitch type according to a neural net classification algorithm developed by Ross Paul of MLBAM.
event_num	code describing what event transpired on a batted ball.
b_score	score of the batter’s team
ab_id	identifier to link the at-bat table.
b_count	number of balls when pitch was thrown.
s_count	number of strikes when pitch was thrown.
outs	number of outs when pitch was thrown.
pitch_num	counter describing the sequence of pitches in the at-bat (i.e. a value of 2 means that the pitch is the second in the at-bat.)
on_1b	indicator for whether there is a runner on first base when the pitch is thrown.
on_2b	indicator for whether there is a runner on second base when the pitch is thrown.
on_3b	indicator for whether there is a runner on third base when the pitch is thrown.

So to fully understand this data, there are a few “mini-investigations” that I’ll undertake to ensure everything makes sense.

`code` vs. `type`

These two fields, to an eyeball test, look identical. We can easily find any discrepancies:

pitches_og[code != type | (is.na(type) & !is.na(code)) | (is.na(code) & !is.na(type))]

## Empty data.table (0 rows and 40 cols): px,pz,start_speed,end_speed,spin_rate,spin_dir...

So in all observations (at least for 2019), code and type are the same. We need not worry about using one vs the other. Additionally, on the Kaggle page, the values that code takes are defined, which is fantastic.
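The filter above spells out the NA cases by hand because a plain `!=` returns NA (not TRUE) when either side is missing. As a minimal sketch of the same check in base R, here is a hypothetical helper, `differs()`, run on a made-up toy table (neither is part of the original analysis):

```r
# Base R sketch: flag rows where two columns disagree, treating NA as a value.
# `differs()` is a hypothetical helper, not part of the original post.
differs <- function(a, b) {
  both_present <- !is.na(a) & !is.na(b)
  # disagree if exactly one side is NA, or both are present but unequal
  xor(is.na(a), is.na(b)) | (both_present & a != b)
}

# made-up values for illustration only
toy <- data.frame(
  code = c("B", "C", NA, "X", NA),
  type = c("B", "S", "B", "X", NA),
  stringsAsFactors = FALSE
)

# rows 2 and 3 disagree; row 5 (both NA) counts as agreement
toy[differs(toy$code, toy$type), ]
```

An empty result from this kind of check is what tells us the two columns are interchangeable.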

`spin_rate` and `spin_dir`

These seem valuable for our analysis, but they aren’t defined anywhere. If we take tables of both, we can possibly infer the units of the variables (they’re character, not numeric):

table(pitches_og$spin_dir, useNA = 'always')

## 
##             placeholder        <NA> 
##        6629      722161           0

table(pitches_og$spin_rate, useNA = 'always')

## 
##             placeholder        <NA> 
##        6629      722161           0

So we can safely ignore these variables, as every value is either placeholder or blank.

`zone` and `nasty`

I suspect these are additional features that are built using the physics data, but I don’t think they’re populated in this dataset:

table(pitches_og$nasty, useNA = 'always')

## 
##   <NA> 
## 728790

table(pitches_og$zone, useNA = 'always')

## 
##             placeholder        <NA> 
##        6629      722161           0

Again, these aren’t populated, so we needn’t worry about them.
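Rather than tabulating each suspicious column one at a time, the same check can be run across the whole table. The sketch below is one way to do it in base R; `share_unusable()` is a hypothetical helper and the toy data frame is made up, but the idea (count NA, blank, and "placeholder" values per column) mirrors the tables above:

```r
# Hypothetical helper: share of values per column that are NA, blank,
# or the literal string "placeholder".
share_unusable <- function(df, junk = c("", "placeholder")) {
  vapply(df, function(col) {
    mean(is.na(col) | trimws(as.character(col)) %in% junk)
  }, numeric(1))
}

# made-up values for illustration only
toy <- data.frame(
  spin_rate = c("placeholder", "placeholder", "", "placeholder"),
  nasty = c(NA, NA, NA, NA),
  start_speed = c(88.8, 89.9, 85.7, NA),
  stringsAsFactors = FALSE
)

shares <- share_unusable(toy)
shares
names(shares)[shares == 1]  # columns with no usable values at all
```

Any column with a share of 1 can be dropped without losing information.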

Pitching in Baseball

For the uninitiated to America’s Pastime, the goal of the game is to end nine innings with more runs than your opponent. You score runs by getting hits, and the job of the pitcher is to prevent hits (this is obviously extremely simplistic, but more detail isn’t necessary at the moment). The pitcher prevents hits by throwing pitches that are difficult for the batter to hit. However, the pitcher faces a constraint: if they throw four pitches in an at-bat that the batter doesn’t swing at and that the umpire (referee) judges to be outside the strike zone, the batter is allowed to advance to first base for free (this is called a walk). Such an outcome is not desirable for the pitcher, so their goal is to throw pitches that are maximally difficult to hit while still being within the strike zone (if the batter does not swing at a pitch judged to be in the strike zone, it’s counted as a strike).

To visualize an at-bat, I’m going to first merge all the data together to create the full_data dataset:

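# packages used throughout this post
library(data.table)
library(dplyr)
library(ggplot2)
library(stringr)

# read ID columns as character so the join keys match across tables
# (pitches_og is assumed to have been loaded the same way)
abs_og <- fread('~/Google Drive/Raw Data/mlb_pitch_data/2019_atbats.csv') %>% 
  mutate(across(ends_with('_id'), as.character)) %>% 
  data.table()

# the key column here is just `id`, hence ends_with('id')
names_og <- fread('~/Google Drive/Raw Data/mlb_pitch_data/player_names.csv') %>% 
  mutate(across(ends_with('id'), as.character)) %>% 
  data.table()

games_og <- fread('~/Google Drive/Raw Data/mlb_pitch_data/2019_games.csv') %>% 
  mutate(across(ends_with('_id'), as.character)) %>% 
  data.table()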

full_data <- pitches_og %>% 
  left_join(abs_og,
            by = 'ab_id') %>% 
  left_join(names_og %>% 
              select(pitcher_id = id, 
                     pitcher_first_name = first_name, 
                     pitcher_last_name = last_name),
            by = 'pitcher_id') %>% 
  left_join(names_og %>% 
              select(batter_id = id, 
                     batter_first_name = first_name, 
                     batter_last_name = last_name),
            by = 'batter_id') %>% 
  left_join(games_og,
            by = 'g_id')
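One thing worth guarding against in a chain of left joins like this: if a key is duplicated in a right-hand table, the join silently multiplies pitch rows. Here is a minimal base R sketch of that guard, using a hypothetical `safe_left_join()` helper and made-up toy tables (it handles a single key column only):

```r
# Hypothetical helper: left join that fails loudly if the row count changes.
# `by` must name a single key column.
safe_left_join <- function(x, y, by) {
  stopifnot(!anyDuplicated(y[[by]]))  # right-hand key must be unique
  out <- merge(x, y, by = by, all.x = TRUE, sort = FALSE)
  stopifnot(nrow(out) == nrow(x))     # every pitch appears exactly once
  out
}

# made-up values for illustration only
pitches <- data.frame(ab_id = c("1", "2", "2", "2"), pitch_num = c(1, 1, 2, 3))
atbats  <- data.frame(ab_id = c("1", "2"), event = c("Groundout", "Single"))

joined <- safe_left_join(pitches, atbats, by = "ab_id")
joined
```

With dplyr, comparing `nrow(full_data)` to `nrow(pitches_og)` after the joins gives the same reassurance.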

So what does an at-bat actually look like? Well, we can build a function using this data to visualize a given at-bat.

full_data_built <- full_data %>% 
  group_by(batter_id) %>% 
  mutate(sz_bot_mean = mean(sz_bot),
         sz_top_mean = mean(sz_top)) %>% 
  ungroup() %>% 
  mutate(date = as.Date(date)) %>% 
  data.table()

visualize_ab <- function(ab_id_var, box = FALSE){
  
  ab_data <- full_data_built[ab_id == ab_id_var]
  
  # strike zone/plate dimensions
  low_sz <- unique(ab_data$sz_bot_mean)
  high_sz <- unique(ab_data$sz_top_mean)
  
  # home plate is 17 inches wide and centered on px = 0 (units are feet)
  plate_min <- -17/12/2
  plate_max <- 17/12/2
  
  # pitcher and batter name
  pitcher <- unique(ab_data$pitcher_last_name)
  batter <- unique(ab_data$batter_last_name)
  date <- unique(ab_data$date)
  
  # make url (need to figure out how to identify doubleheaders, probably just by date?)
  home_team <- toupper(unique(ab_data$home_team))
  url <- str_interp('https://www.baseball-reference.com/boxes/${home_team}/${home_team}${format(date,"%Y%m%d")}0.shtml')
  
  # formatting vectors
  values_vec <- c('B' = 'red',
                  '*B' = 'red',
                  'I' = 'red',
                  'P' = 'red',
                  'C' = 'green',
                  'S' = 'purple',
                  'W' = 'purple',
                  'M' = 'purple',
                  'Q' = 'purple',
                  'X' = 'blue',
                  'D' = 'blue',
                  'E' = 'blue',
                  'F' = 'orange',
                  'T' = 'orange',
                  'L' = 'orange',
                  'R' = 'orange',
                  'H' = 'grey')
  
  labels_vec <- c('B' = 'Ball',
                  '*B' = 'Ball in dirt',
                  'I' = 'Intentional Ball',
                  'P' = 'Pitchout',
                  'C' = 'Strike (Called)',
                  'S' = 'Strike (Swinging)',
                  'W' = 'Strike (Swinging, Blocked)',
                  'M' = 'Missed Bunt',
                  'Q' = 'Swinging Pitchout',
                  'X' = 'In Play, Out(s)',
                  'D' = 'In Play, No Out',
                  'E' = 'In Play, Run(s)',
                  'F' = 'Foul',
                  'T' = 'Foul Tip',
                  'L' = 'Foul Bunt',
                  'R' = 'Foul Pitchout',
                  'H' = 'Hit By Pitch')
  
  # optionally print the Baseball Reference box score URL
  if (box) {
    cat(url)
  }
  
  
  # plot
  ggplot(ab_data) + 
    scale_x_continuous(limits = c(min(-1, min(ab_data$px) - .25/2),
                                  max(1, max(ab_data$px) + .25/2)), 
                       expand = expansion(mult = .03)) + 
    scale_y_continuous(limits = c(min(1, min(ab_data$pz) - .25/2),
                                  max(4, max(ab_data$pz) + .25/2)),
                       expand = expansion(mult = .03)) +
    ggforce::geom_circle(aes(x0 = px, y0 = pz, color = code, fill = code, r = .25/2),
                         alpha = .5) + 
    geom_text(aes(x = px, y = pz, label=pitch_num), hjust=.5, vjust=.5) +
    geom_rect(xmin = plate_min, 
              xmax = plate_max,
              ymin = low_sz,
              ymax = high_sz,
              alpha = 0,
              color = 'black') +
    scale_color_manual(values = values_vec,
                       labels = labels_vec) + 
    scale_fill_manual(values = values_vec,
                      labels = labels_vec) + 
    theme(legend.position = 'bottom',
          legend.title = element_blank(),
          axis.title = element_blank()) + 
    coord_fixed() + 
    labs(title = str_interp('${pitcher} vs ${batter} (${abs_og[ab_id == ab_id_var]$stand}), ${ifelse(abs_og[ab_id == ab_id_var]$top == 0, "Bottom", "Top")} ${abs_og[ab_id == ab_id_var]$inning}'),
         subtitle = str_interp('Result: ${abs_og[ab_id == ab_id_var]$event}'))
}

The function is sort of long – here are the only somewhat interesting decisions that I made when building it (the rest is just base ggplot):

The use of the ggforce package’s geom_circle instead of ggplot’s geom_point. The only reason I did this was so that I could draw the circles at the actual size of a baseball, which contextualizes them better.
The inclusion of the box argument. I realized that I was spending a lot of time checking at-bats against https://www.baseball-reference.com, so I decided to make Baseball Reference easier to reach by building the game’s URL in the function and printing it if the user asks for it.
I plotted the strike zone for a batter by taking the average of all their sz_top and sz_bot observations in the data. The outer bounds of the plate are defined by the size of home plate (17 inches) and the fact that it’s centered in PITCHf/x data on 0.
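To make the geometry behind these choices concrete: the plotted circles have a diameter of 0.25 feet (3 inches, a touch more than a regulation baseball), and the plate spans 17/12/2 ≈ 0.708 feet on either side of zero. The sketch below uses those two numbers to ask whether any part of the ball overlaps the zone; `touches_zone()` is a hypothetical helper, and unlike the plot it uses each pitch’s own sz_bot/sz_top rather than the batter’s average:

```r
# Dimensions in feet, matching the PITCHf/x units for px and pz.
plate_half_width <- 17 / 12 / 2  # half of the 17-inch plate
ball_radius <- 0.25 / 2          # radius used for the plotted circles

# Hypothetical helper: TRUE when any part of the ball overlaps the zone.
# Uses a simple box check, so it is slightly generous at the zone's corners.
touches_zone <- function(px, pz, sz_bot, sz_top) {
  abs(px) <= plate_half_width + ball_radius &
    pz >= sz_bot - ball_radius &
    pz <= sz_top + ball_radius
}

# first and fourth rows of Table 1 (codes X and B respectively)
in_zone <- touches_zone(px = c(0.00, 0.49), pz = c(2.15, 0.92),
                        sz_bot = c(1.70, 1.74), sz_top = c(3.36, 3.55))
in_zone
```

The first pitch sits comfortably inside the zone, while the fourth is well below it, which lines up with it being recorded as a ball.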

A visualization will look something like this:

visualize_ab('2019000030')

On this at-bat, we see that there were 8 pitches. The first was a called strike that looks like it was slightly off the plate. The next pitch was a low ball, the next a foul, and so on, until on the eighth pitch of the at-bat, Semien hits a home run on a pitch in the center of the plate. This at-bat actually motivates the discussion of pitching strategy below.

Basic Pitching Strategy

One of the most basic tenets of pitching philosophy is that throwing the ball in the center of the plate is a cardinal sin (as an example, see the at-bat above). The reason for this isn’t complicated – while pitches in the middle of the plate are most likely to be called strikes, they are also the easiest for the batter to hit. Using the 2019 data, we have the ability to assess whether this holds in practice (to be clear, it would be truly shocking if it did not, so we can safely have a strong prior about what this should show).

To do this (I won’t put a ton of effort into showing this), let’s just look at the number of bases that are produced on average by strikes in various parts of the plate. I’ll do this by creating a heat map of such pitches:

# get the average strike zone for reference
zone <- full_data %>% 
  distinct(batter_id, sz_bot, sz_top) %>% 
  summarise(mean_low = mean(sz_bot),
            mean_high = mean(sz_top))

outcome_data <- full_data %>% 
  # filter to strikes
  filter(code %in% c('C', 'S', 'W', 'M', 'Q', 'X', 'D', 'E', 'F', 'T', 'L', 'R')) %>% 
  # calculate bases on those strikes
  mutate(bases = case_when(event == 'Single' & code %in% c('X', 'D', 'E') ~ 1,
                           event == 'Double' & code %in% c('X', 'D', 'E') ~ 2,
                           event == 'Triple' & code %in% c('X', 'D', 'E') ~ 3,
                           event == 'Home Run' & code %in% c('X', 'D', 'E') ~ 4,
                           TRUE ~ 0)) %>% 
  distinct(batter_id, batter_first_name, event, pitch_num, type, batter_last_name, ab_id, bases, px, pz) %>% 
  # each box is one square inch (px and pz are in feet)
  mutate(px_box = floor(px * 12)/12,
         pz_box = floor(pz * 12)/12) %>% 
  group_by(px_box, pz_box) %>% 
  summarise(mean_bases = mean(bases),
            n = n()) %>% 
  ungroup() %>% 
  # somewhat arbitrarily throw out any box that has fewer than 50 pitches
  filter(n >= 50, !is.na(mean_bases), !is.na(px_box), !is.na(pz_box)) %>% 
  data.table()

ggplot(data = outcome_data) +
  geom_tile(aes(x = px_box, y = pz_box, fill = mean_bases)) +
  scale_fill_gradient(low = 'red', high = 'green') +
  geom_rect(xmin = -17/12/2,
            xmax = 17/12/2,
            ymin = zone$mean_low,
            ymax = zone$mean_high,
            alpha = 0,
            color = 'black') +
  theme(axis.title = element_blank()) +
  guides(fill=guide_legend(title="Mean Bases"))

Basic takeaway: pitches over the center of the plate have worse outcomes for pitchers than those that aren’t, on average. Notice how the edges of the strike zone are much redder than the center – throwing in those locations is optimal for pitchers.
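A small note on the binning: `floor(px * 12)/12` labels each one-inch box by its lower-left corner, while geom_tile centers each tile on the x and y it is given. As far as I can tell, that draws every tile half an inch left of and below the pitches it summarizes, which is negligible at this scale but easy to avoid. A sketch, where `bin_center()` is a hypothetical alternative that is not used in the plot above:

```r
# Corner of the one-inch box containing each coordinate (as in the post)
bin_floor <- function(v) floor(v * 12) / 12
# Hypothetical alternative: label each box by its center instead
bin_center <- function(v) (floor(v * 12) + 0.5) / 12

# px values taken from Table 1
px <- c(-0.05, 0.34, 0.49)
bins <- data.frame(px = px, px_box = bin_floor(px), px_mid = bin_center(px))
bins
```

Swapping `px_mid` (and the pz equivalent) into the `aes()` call would line the tiles up with the strike zone rectangle exactly.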

Another piece of pitching theory is that “hanging” a pitch is bad. Hanging a pitch is when the pitcher attempts to throw a breaking ball (i.e., a pitch that curves due to spin), but doesn’t get enough spin on it, resulting in a pitch that’s slow and normally high in the strike zone, which generally is a bad outcome for the pitcher. My high school coach used to ominously say “you hang, he bang”, which I always found kind of amusing.

To look at this at a high level, we can run the simplest linear regression we can think of on this data. In particular, let’s try

\[ bases = \beta_0 + \beta_1 \, speed + \beta_2 \, |location_x| + \beta_3 \, |location_z| + \beta_4 \, angle_{break} + \beta_5 \, length_{break} + \beta_6 \, (speed \times length_{break}) + \varepsilon \]

Here, I’m abusing notation somewhat, and taking the absolute value operator to mean absolute deviation from the center of the strike zone (the middle of the plate horizontally, and the midpoint of the average zone vertically). Note that the only interaction we’re considering is speed with break length, although there are certainly other relevant ones. We’re also omitting a somewhat important factor, which is whether the batter and pitcher are both right-handed, both left-handed, or opposite-handed, but let’s ignore that for now. Looking just at these pitch characteristics, we can run a regression:

zone_mean <- (zone$mean_low + zone$mean_high)/2

reg_data <- full_data %>% 
  # filter to strikes
  filter(code %in% c('C', 'S', 'W', 'M', 'Q', 'X', 'D', 'E', 'F', 'T', 'L', 'R')) %>% 
  # calculate bases on those strikes
  mutate(bases = case_when(event == 'Single' & code %in% c('X', 'D', 'E') ~ 1,
                           event == 'Double' & code %in% c('X', 'D', 'E') ~ 2,
                           event == 'Triple' & code %in% c('X', 'D', 'E') ~ 3,
                           event == 'Home Run' & code %in% c('X', 'D', 'E') ~ 4,
                           TRUE ~ 0),
         abs_px = abs(px),
         abs_pz = abs(pz - zone_mean)) %>% 
  distinct(ab_id, pitch_num, bases, abs_px, abs_pz, px, pz, end_speed, break_angle, break_length)

basic_pitch_reg <- lm(bases ~ end_speed + abs_px + abs_pz + end_speed*break_length + break_angle + break_length, data = reg_data)

Table 3: Basic Pitching Regression
Variable	Estimate	SE	T-Statistic	P-Value
(Intercept)	0.6768716	0.0420716	16.088577	0.0000000
end_speed	-0.0050291	0.0004941	-10.177336	0.0000000
abs_px	-0.1230725	0.0025440	-48.377068	0.0000000
abs_pz	-0.1085832	0.0021161	-51.313695	0.0000000
break_length	-0.0564764	0.0040941	-13.794469	0.0000000
break_angle	-0.0001956	0.0000800	-2.446789	0.0144139
end_speed:break_length	0.0007357	0.0000518	14.215595	0.0000000
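In case the formula syntax is unfamiliar: in an R formula, `a*b` expands to both main effects plus their interaction, which is why Table 3 has a single end_speed:break_length row even though break_length is also listed separately in the `lm` call. A minimal sketch on simulated data (the variable names and numbers here are made up and have nothing to do with the real pitches):

```r
# Simulated toy data, for illustration only
set.seed(1)
toy <- data.frame(speed = rnorm(200, 80, 4), brk = rnorm(200, 6, 2))
toy$bases <- 0.5 - 0.005 * toy$speed + rnorm(200, sd = 0.3)

# `speed * brk` is shorthand for `speed + brk + speed:brk`
fit_star <- lm(bases ~ speed * brk, data = toy)
fit_long <- lm(bases ~ speed + brk + speed:brk, data = toy)

names(coef(fit_star))
all.equal(coef(fit_star), coef(fit_long))
```

Both fits have the same four coefficients, so the redundant main-effect terms in the formula are harmless.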

So without digging too deeply into anything, we see that the data suggests a few things:

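Pitches close to the center of the strike zone (small deviation from the calculated center) are hit for more bases (ceteris paribus)
Faster pitches are hit for fewer bases than slower pitches at small-to-moderate break lengths; the positive end_speed:break_length interaction shrinks this effect as break length grows (ceteris paribus)
The break_length coefficient on its own is negative, but the same positive interaction offsets it as speed rises, so whether more break means fewer bases depends on the pitch’s speed (ceteris paribus)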
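Because the model includes an interaction, the effect of speed or break on bases is not a single coefficient: it is the main effect plus the interaction coefficient times the other variable. The worked arithmetic below uses only the point estimates from Table 3 (the helper function names are mine, and standard errors are ignored, so treat it as a rough reading rather than a formal test):

```r
# Coefficients copied from Table 3
b_speed <- -0.0050291
b_break <- -0.0564764
b_inter <-  0.0007357

# Marginal effect of one more inch of break, at a given end_speed (mph)
effect_of_break <- function(end_speed) b_break + b_inter * end_speed
# Marginal effect of one more mph, at a given break_length (inches)
effect_of_speed <- function(break_length) b_speed + b_inter * break_length

effect_of_break(c(75, 80, 85))
effect_of_speed(c(4, 6, 8))

# Values at which each marginal effect changes sign
break_even <- c(end_speed = -b_break / b_inter, break_length = -b_speed / b_inter)
break_even
```

The sign changes land at roughly 77 mph and 6.8 inches of break, both of which are within the range of the sample rows in Table 1, so the direction of each effect is less settled than the main-effect coefficients alone suggest.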

Summary

So what should we take away from this basic review? Pitchers are trying to throw pitches that are at the edges of the strike zone. In addition, they want to throw pitches that break a lot or are really fast (or in theory, both!)

There are a few other components that we’ll want to investigate as part of the conventional knowledge. One is the sequence of pitches – is a pitcher keeping a hitter “off-balance” by throwing pitches in varying speeds and locations? We also want to look at (and likely control for) the platoon advantage, which is the idea that a righty pitcher vs. a righty batter or a lefty pitcher vs. a lefty batter is an advantageous matchup (in general) for a pitcher, while a righty pitcher vs. a lefty batter or a lefty pitcher vs. a righty batter is an advantageous situation for the batter.

I’ll probably dig into that a little more in a further post, but this is a good starting point for the theory and intuition of basic pitching in baseball.

¹ A quick explanation/apology: I use data.table and dplyr syntax together a lot, so if you only use one of these packages, a few things may look odd. I try to mostly use data.table for quick filtering and investigation, but just be aware that some of the code may not run without both packages.
² https://slate.com/culture/2007/08/pitch-f-x-the-new-technology-that-will-change-baseball-analysis-forever.html

Lukas Hager

2021/02/22
