Understanding Pitching Data

Lukas Hager

2021/02/22

Overview

This past week, I started talking to one of my classmates, Shabab Ahmed, about the possibility of doing a research project together. The one problem: it has to do with the sport of baseball. Shabab is a big cricket fan, which makes him more baseball-literate than perhaps the average person that you’d encounter in academia, but the project will have to do with some of the finer points of baseball strategy.

This is worth recognizing early, since discussing the project with people who aren’t familiar with baseball will require us to get them up to speed on the basics quickly. The data that we have is also sparsely documented, so we need to walk through it to figure out how best to use it. Thus, this post will be a combination of data exploration and an explanation of how pitching works in baseball.[1]

The Data

Background

The data that we want to use in this project comes from Major League Baseball (MLB), the pre-eminent baseball league in the United States (and arguably, the world). Beginning in the 2006 playoffs,[2] a three-camera system called PITCHf/x has tracked every pitch thrown in every MLB game. These cameras report not only where pitches cross the plate, but also more advanced physical properties of the pitches, such as the amount of break, the speed, and the spin rate (in theory).

Dataset

Several third-party websites post this pitch-by-pitch data, and a Kaggle user made data from 2015-2019 available here for download. These sites can also be scraped if we need additional data in the future. If we simply take the 2019 file, we have data that looks like this:

Table 1: Pitching Data
px pz start_speed end_speed spin_rate spin_dir break_angle break_length break_y ax ay az sz_bot sz_top type_confidence vx0 vy0 vz0 x x0 y y0 z0 pfx_x pfx_z nasty zone code type pitch_type event_num b_score ab_id b_count s_count outs pitch_num on_1b on_2b on_3b
0.00 2.15 88.8 80.7 placeholder placeholder 22.8 4.8 24 -8.47 28.90 -15.51 1.70 3.36 placeholder 5.28 -128.95 -6.89 116.97 -1.42 180.81 50 6.07 -5.07 9.98 NA placeholder X X FF 5 0 2019000001 0 0 0 1 0 0 0
0.34 2.31 89.9 81.8 placeholder placeholder 22.8 3.6 24 -7.10 28.85 -12.99 1.80 3.55 placeholder 4.89 -130.54 -7.48 103.93 -1.02 176.34 50 6.20 -4.14 11.18 NA placeholder C C FF 8 0 2019000002 0 0 1 1 0 0 0
-0.05 2.03 85.7 79.6 placeholder placeholder 9.6 6.0 24 3.65 22.07 -22.64 1.59 3.55 placeholder 2.33 -124.60 -5.98 118.86 -1.29 183.96 50 6.30 2.30 5.99 NA placeholder S S SL 9 0 2019000002 0 0 1 2 0 0 0
0.49 0.92 85.4 78.5 placeholder placeholder 24.0 7.2 24 -13.77 24.44 -25.74 1.74 3.55 placeholder 7.83 -123.74 -6.78 98.15 -1.56 214.03 50 5.85 -8.87 4.14 NA placeholder B B CH 10 0 2019000002 0 1 1 3 0 0 0
-0.13 1.11 84.6 77.6 placeholder placeholder 26.4 8.4 24 -15.99 24.56 -28.36 1.83 3.59 placeholder 6.79 -122.67 -5.73 121.81 -1.57 208.77 50 5.89 -10.51 2.51 NA placeholder B B CH 11 0 2019000002 1 1 1 4 0 0 0
0.09 3.41 90.9 82.4 placeholder placeholder 27.6 3.6 24 -8.27 30.12 -13.15 1.59 3.47 placeholder 4.37 -132.01 -4.92 113.39 -0.98 146.83 50 6.28 -4.72 10.86 NA placeholder X X FF 12 0 2019000002 2 1 1 5 0 0 0

There are a lot of columns, and not many explanations of what exactly they describe. Luckily, some work has been done on this already here. So if we list the columns from PITCHf/x with the definitions that are posted here, we can see what we’re still missing. I also added some common-sense definitions for the non-PITCHf/x situation variables.

Table 2: Pitching Data Variables
Variable Meaning
px the left/right distance, in feet, of the pitch from the middle of the plate as it crossed home plate. The PITCHf/x coordinate system is oriented to the catcher’s/umpire’s perspective, with distances to the right being positive and to the left being negative.
pz the height of the pitch in feet as it crossed the front of home plate.
start_speed the pitch speed, in miles per hour and in three dimensions, measured at the initial point, y0. Of the two speeds, this one is closer to the speed measured by a radar gun and what we are familiar with for a pitcher’s “velocity”.
end_speed the pitch speed measured as it crossed the front of home plate.
spin_rate
spin_dir
break_angle the angle, in degrees, from vertical to the straight line path from the release point to where the pitch crossed the front of home plate, as seen from the catcher’s/umpire’s perspective.
break_length the measurement of the greatest distance, in inches, between the trajectory of the pitch at any point between the release point and the front of home plate, and the straight line path from the release point and the front of home plate, per the MLB Gameday team. John Walsh’s article “In Search of the Sinker” has a good illustration of this parameter.
break_y the distance in feet from home plate to the point in the pitch trajectory where the pitch achieved its greatest deviation from the straight line path between the release point and the front of home plate.
ax the acceleration of the pitch, in feet per second per second, in the x dimension, measured at the initial point.
ay the acceleration of the pitch, in feet per second per second, in the y dimension, measured at the initial point.
az the acceleration of the pitch, in feet per second per second, in the z dimension, measured at the initial point.
sz_bot the distance in feet from the ground to the bottom of the current batter’s rulebook strike zone. The PITCHf/x operator sets a line at the hollow of the knee for the bottom of the zone.
sz_top the distance in feet from the ground to the top of the current batter’s rulebook strike zone as measured from the video by the PITCHf/x operator. The operator sets a line at the batter’s belt as he settles into the hitting position, and the PITCHf/x software adds four inches up for the top of the zone.
type_confidence the value of the weight at the classification algorithm’s output node corresponding to the most probable pitch type. This value is multiplied by a factor of 1.5 if the pitch is known by MLBAM to be part of the pitcher’s repertoire.
vx0 the velocity of the pitch, in feet per second, in the x dimension, measured at the initial point.
vy0 the velocity of the pitch, in feet per second, in the y dimension, measured at the initial point.
vz0 the velocity of the pitch, in feet per second, in the z dimension, measured at the initial point.
x the horizontal location of the pitch as it crossed home plate as input by the Gameday stringer using the old Gameday coordinate system. I’m not sure what units are used or where the origin is located. Note that the y dimension in the old coordinate system is now called the z dimension in the new PITCHf/x coordinate system detailed below.
x0 the left/right distance, in feet, of the pitch, measured at the initial point.
y the vertical location of the pitch as it crossed home plate as input by the Gameday stringer using the old Gameday coordinate system. I’m not sure what units are used or where the origin is located. Note that the y dimension in the old coordinate system is now called the z dimension in the new PITCHf/x coordinate system detailed below.
y0 the distance in feet from home plate where the PITCHf/x system is set to measure the initial parameters. This parameter was variously set at 40, 50, or 55 feet (and in a few instances 45 feet) from the plate at different times throughout the 2007 season as Sportvision experimented with optimal settings for the PITCHf/x measurements. Sportvision settled on 50 feet in the second half of 2007, and this value of y0=50 feet has been used since. Changes in this parameter impact the values of all other parameters measured at the release point, such as start_speed.
z0 the height, in feet, of the pitch, measured at the initial point.
pfx_x the horizontal movement, in inches, of the pitch between the release point and home plate, as compared to a theoretical pitch thrown at the same speed with no spin-induced movement. This parameter is measured at y=40 feet regardless of the y0 value.
pfx_z the vertical movement, in inches, of the pitch between the release point and home plate, as compared to a theoretical pitch thrown at the same speed with no spin-induced movement. This parameter is measured at y=40 feet regardless of the y0 value.
nasty
zone
code code representing the outcome of the pitch.
type a one-letter abbreviation for the result of the pitch: B, ball; S, strike (including fouls); X, in play.
pitch_type the most probable pitch type according to a neural net classification algorithm developed by Ross Paul of MLBAM.
event_num code describing what event transpired on a batted ball.
b_score score of the batter’s team
ab_id identifier to link the at-bat table.
b_count number of balls when pitch was thrown.
s_count number of strikes when pitch was thrown.
outs number of outs when pitch was thrown.
pitch_num counter describing the sequence of pitches in the at-bat (i.e. a value of 2 means that the pitch is the second in the at-bat.)
on_1b indicator for whether there is a runner on first base when the pitch is thrown.
on_2b indicator for whether there is a runner on second base when the pitch is thrown.
on_3b indicator for whether there is a runner on third base when the pitch is thrown.

So to fully understand this data, there are a few “mini-investigations” that I undertook to ensure everything makes sense.

code vs. type

These two fields, to an eyeball test, look identical. We can easily find any discrepancies:

pitches_og[code != type | (is.na(type) & !is.na(code)) | (is.na(code) & !is.na(type))]
## Empty data.table (0 rows and 40 cols): px,pz,start_speed,end_speed,spin_rate,spin_dir...

So in all observations (at least for 2019), code and type are the same, and we need not worry about using one vs. the other. Additionally, the Kaggle page defines the values that code takes, which is fantastic.

spin_rate and spin_dir

These seem valuable for our analysis, but they aren’t defined anywhere. If we take tables of both, we can possibly infer the units of the variables (they’re character, not numeric):

table(pitches_og$spin_dir, useNA = 'always')
## 
##             placeholder        <NA> 
##        6629      722161           0
table(pitches_og$spin_rate, useNA = 'always')
## 
##             placeholder        <NA> 
##        6629      722161           0

So we can safely ignore these variables, as every value is either the string placeholder or blank.

zone and nasty

I suspect these are additional features that are built using the physics data, but I don’t think they’re populated in this dataset:

table(pitches_og$nasty, useNA = 'always')
## 
##   <NA> 
## 728790
table(pitches_og$zone, useNA = 'always')
## 
##             placeholder        <NA> 
##        6629      722161           0

Again, these aren’t populated, so we needn’t worry about them.
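
The checks in the last two sections all follow the same pattern, so they could be folded into one helper. Below is a minimal sketch in base R, assuming that "unpopulated" means every value is NA, an empty string, or the literal string placeholder. The function name find_unpopulated and the toy data frame are hypothetical and only mimic the structure seen above.

```r
# Hypothetical helper: flag columns that never carry real information.
# Assumption: a value is "empty" if it is NA, "", or the literal "placeholder".
find_unpopulated <- function(df, empty_values = c("", "placeholder")) {
  is_empty_col <- vapply(df, function(col) {
    all(is.na(col) | as.character(col) %in% empty_values)
  }, logical(1))
  names(df)[is_empty_col]
}

# Toy data mimicking the structure of the pitch table (not real pitches)
toy <- data.frame(
  px        = c(0.00, 0.34, -0.05),
  spin_rate = c("placeholder", "", "placeholder"),
  nasty     = c(NA, NA, NA),
  code      = c("X", "C", "S"),
  stringsAsFactors = FALSE
)

unpopulated <- find_unpopulated(toy)
print(unpopulated)
```

On the real data, the equivalent call would be find_unpopulated(pitches_og), which should report every column that can be dropped in one pass.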

Pitching in Baseball

For the uninitiated to America’s Pastime, the goal of the game is to end nine innings with more runs than your opponent. You score runs by getting hits, and the job of the pitcher is to prevent hits (this is obviously extremely simplistic, but more detail isn’t necessary at the moment). The pitcher prevents hits by throwing pitches that are difficult for the batter to hit. However, the pitcher faces a constraint: if, over the course of an at-bat, they throw four pitches that the batter does not swing at and that the umpire (referee) judges to be outside the strike zone, the batter is allowed to advance to first base for free (this is called a walk). Such an outcome is not desirable for the pitcher, so their goal is to throw pitches that are maximally difficult to hit while staying within the strike zone, since a pitch in the strike zone counts as a strike even if the batter does not swing.
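
To make the strike-zone constraint concrete, here is a rough sketch of how the columns defined in Table 2 could be used to test whether a pitch crossed inside the rulebook zone. The helper in_strike_zone is hypothetical, and it simplifies in two ways: it only looks at the center of the ball (ignoring its radius), and it ignores the umpire's judgment, which is what actually decides the call.

```r
# The plate is 17 inches wide; px is measured in feet from the plate's center
plate_half_width <- 17 / 12 / 2

# Hypothetical helper: TRUE if the center of the ball crossed inside the zone.
# Simplification: ignores the radius of the ball and the umpire's judgment.
in_strike_zone <- function(px, pz, sz_bot, sz_top) {
  abs(px) <= plate_half_width & pz >= sz_bot & pz <= sz_top
}

# Rows 1, 2, and 4 of Table 1 (px, pz, sz_bot, sz_top)
in_zone <- in_strike_zone(
  px     = c(0.00, 0.34, 0.49),
  pz     = c(2.15, 2.31, 0.92),
  sz_bot = c(1.70, 1.80, 1.74),
  sz_top = c(3.36, 3.55, 3.55)
)
print(in_zone)
```

The third of these pitches crossed well below the bottom of the zone, which lines up with its code of B (ball) in Table 1.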

To visualize an at-bat, I’m going to first merge all the data together to create the full_data dataset:

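# packages used throughout this post
# (ggforce is called as ggforce::geom_circle, so it only needs to be installed)
library(data.table)
library(dplyr)
library(ggplot2)
library(stringr)

abs_og <- fread('~/Google Drive/Raw Data/mlb_pitch_data/2019_atbats.csv') %>% 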
  mutate_at(vars(ends_with('_id')), list(~ as.character(.))) %>% 
  data.table()

names_og <- fread('~/Google Drive/Raw Data/mlb_pitch_data/player_names.csv') %>% 
  mutate_at(vars(ends_with('id')), list(~ as.character(.))) %>% 
  data.table()

games_og <- fread('~/Google Drive/Raw Data/mlb_pitch_data/2019_games.csv') %>% 
  mutate_at(vars(ends_with('_id')), list(~ as.character(.))) %>% 
  data.table()

full_data <- pitches_og %>% 
  left_join(abs_og,
            by = 'ab_id') %>% 
  left_join(names_og %>% 
              select(pitcher_id = id, 
                     pitcher_first_name = first_name, 
                     pitcher_last_name = last_name),
            by = 'pitcher_id') %>% 
  left_join(names_og %>% 
              select(batter_id = id, 
                     batter_first_name = first_name, 
                     batter_last_name = last_name),
            by = 'batter_id') %>% 
  left_join(games_og,
            by = 'g_id')
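
A chain of left joins like this one only preserves the row count of the pitch table if each join key is unique in the right-hand table. The following is a small sketch of that sanity check in base R; the helper key_is_unique and the toy tables are hypothetical stand-ins, not the real data.

```r
# Hypothetical helper: a left join keeps the left table's row count
# only if the join key is unique in the right-hand table.
key_is_unique <- function(df, key) {
  anyDuplicated(df[[key]]) == 0
}

# Toy stand-ins for the at-bat and player-name tables (not real data)
toy_abs <- data.frame(ab_id = c("1", "2", "3"),
                      pitcher_id = c("10", "10", "11"),
                      stringsAsFactors = FALSE)
toy_names <- data.frame(id = c("10", "11", "11"),
                        last_name = c("A", "B", "B2"),
                        stringsAsFactors = FALSE)

abs_ok <- key_is_unique(toy_abs, "ab_id")
names_ok <- key_is_unique(toy_names, "id")
print(c(abs_ok = abs_ok, names_ok = names_ok))
```

On the real tables, the analogous checks would be key_is_unique(abs_og, 'ab_id'), key_is_unique(names_og, 'id'), and key_is_unique(games_og, 'g_id'), followed by confirming that nrow(full_data) equals nrow(pitches_og).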

So what does an at-bat actually look like? Well, we can build a function using this data to visualize a given at-bat.

full_data_built <- full_data %>% 
  group_by(batter_id) %>% 
  mutate(sz_bot_mean = mean(sz_bot),
         sz_top_mean = mean(sz_top)) %>% 
  ungroup() %>% 
  mutate(date = as.Date(date)) %>% 
  data.table()

visualize_ab <- function(ab_id_var, box = FALSE){
  
  ab_data <- full_data_built[ab_id == ab_id_var]
  
  # strike zone/plate dimensions
  low_sz <- unique(ab_data$sz_bot_mean)
  high_sz <- unique(ab_data$sz_top_mean)
  
  plate_min <- -17/12/2
  plate_max <- 17/12/2
  
  # pitcher and batter name
  pitcher <- unique(ab_data$pitcher_last_name)
  batter <- unique(ab_data$batter_last_name)
  date <- unique(ab_data$date)
  
  # make url (need to figure out how to identify doubleheaders, probably just by date?)
  home_team <- toupper(unique(ab_data$home_team))
  url <- str_interp('https://www.baseball-reference.com/boxes/${home_team}/${home_team}${format(date,"%Y%m%d")}0.shtml')
  
  # formatting vectors
  values_vec <- c('B' = 'red',
                  '*B' = 'red',
                  'I' = 'red',
                  'P' = 'red',
                  'C' = 'green',
                  'S' = 'purple',
                  'W' = 'purple',
                  'M' = 'purple',
                  'Q' = 'purple',
                  'X' = 'blue',
                  'D' = 'blue',
                  'E' = 'blue',
                  'F' = 'orange',
                  'T' = 'orange',
                  'L' = 'orange',
                  'R' = 'orange',
                  'H' = 'grey')
  
  labels_vec <- c('B' = 'Ball',
                  '*B' = 'Ball in dirt',
                  'I' = 'Intentional Ball',
                  'P' = 'Pitchout',
                  'C' = 'Strike (Called)',
                  'S' = 'Strike (Swinging)',
                  'W' = 'Strike (Swinging, Blocked)',
                  'M' = 'Missed Bunt',
                  'Q' = 'Swinging Pitchout',
                  'X' = 'In Play, Out(s)',
                  'D' = 'In Play, No Out',
                  'E' = 'In Play, Run(s)',
                  'F' = 'Foul',
                  'T' = 'Foul Tip',
                  'L' = 'Foul Bunt',
                  'R' = 'Foul Pitchout',
                  'H' = 'Hit By Pitch')
  
  if(box == TRUE){
    cat(url)
  }
  
  
  # plot
  ggplot(ab_data) + 
    scale_x_continuous(limits = c(min(-1, min(ab_data$px) - .25/2),
                                  max(1, max(ab_data$px) + .25/2)), 
                       expand = expansion(mult = .03)) + 
    scale_y_continuous(limits = c(min(1, min(ab_data$pz) - .25/2),
                                  max(4, max(ab_data$pz) + .25/2)),
                       expand = expansion(mult = .03)) +
    ggforce::geom_circle(aes(x0 = px, y0 = pz, color = code, fill = code, r = .25/2),
                         alpha = .5) + 
    geom_text(aes(x = px, y = pz, label=pitch_num), hjust=.5, vjust=.5) +
    geom_rect(xmin = plate_min, 
              xmax = plate_max,
              ymin = low_sz,
              ymax = high_sz,
              alpha = 0,
              color = 'black') +
    scale_color_manual(values = values_vec,
                       labels = labels_vec) + 
    scale_fill_manual(values = values_vec,
                      labels = labels_vec) + 
    theme(legend.position = 'bottom',
          legend.title = element_blank(),
          axis.title = element_blank()) + 
    coord_fixed() + 
    labs(title = str_interp('${pitcher} vs ${batter} (${abs_og[ab_id == ab_id_var]$stand}), ${ifelse(abs_og[ab_id == ab_id_var]$top == 0, "Bottom", "Top")} ${abs_og[ab_id == ab_id_var]$inning}'),
         subtitle = str_interp('Result: ${abs_og[ab_id == ab_id_var]$event}'))
} 

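The function is fairly long, but most of it is ordinary ggplot2. The only somewhat interesting decisions, as can be read from the code, are that the strike zone is drawn from each batter’s average sz_bot and sz_top, that the plate is 17 inches wide, and that each pitch is drawn as a circle 3 inches across (roughly the size of a baseball).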

A visualization will look something like this:

visualize_ab('2019000030')

On this at-bat, we see that there were 8 pitches. The first was a called strike that looks like it was slightly off the plate. The next pitch was a low ball, the next a foul, and so on, until on the eighth pitch of the at-bat, Semien hits a home run on a pitch in the center of the plate. This at-bat actually motivates the discussion of pitching strategy below.

Basic Pitching Strategy

One of the most basic tenets of pitching philosophy is that throwing the ball in the center of the plate is a cardinal sin (as an example, see the at-bat above). The reason for this isn’t complicated – while pitches in the middle of the plate are most likely to be called strikes, they are also the easiest for the batter to hit. Using the 2019 data, we have the ability to assess whether this holds in practice (to be clear, it would be truly shocking if it did not, so we can safely have a strong prior about what this should show).

To do this (I won’t put a ton of effort into showing this), let’s just look at the number of bases that are produced on average by strikes in various parts of the plate. I’ll do this by creating a heat map of such pitches:

# get the average strike zone for reference
zone <- full_data %>% 
  distinct(batter_id, sz_bot, sz_top) %>% 
  summarise(mean_low = mean(sz_bot),
            mean_high = mean(sz_top))

outcome_data <- full_data %>% 
  # filter to strikes
  filter(code %in% c('C', 'S', 'W', 'M', 'Q', 'X', 'D', 'E', 'F', 'T', 'L', 'R')) %>% 
  # calculate bases on those strikes
  mutate(bases = case_when(event == 'Single' & code %in% c('X', 'D', 'E') ~ 1,
                           event == 'Double' & code %in% c('X', 'D', 'E') ~ 2,
                           event == 'Triple' & code %in% c('X', 'D', 'E') ~ 3,
                           event == 'Home Run' & code %in% c('X', 'D', 'E') ~ 4,
                           TRUE ~ 0)) %>% 
  distinct(batter_id, batter_first_name, event, pitch_num, type, batter_last_name, ab_id, bases, px, pz) %>% 
  # each box is one square inch
  mutate(px_box = floor(px * 12)/12,
         pz_box = floor(pz * 12)/12) %>% 
  group_by(px_box, pz_box) %>% 
  summarise(mean_bases = mean(bases),
            n = n()) %>% 
  ungroup() %>% 
  # somewhat arbitrarily throw out any box that has fewer than 50 pitches
  filter(n >= 50, !is.na(mean_bases), !is.na(px_box), !is.na(pz_box)) %>% 
  data.table()

ggplot(data = outcome_data) +
  geom_tile(aes(x = px_box, y = pz_box, fill = mean_bases)) +
  scale_fill_gradient(low = 'red', high = 'green') +
  geom_rect(xmin = -17/12/2,
            xmax = 17/12/2,
            ymin = zone$mean_low,
            ymax = zone$mean_high,
            alpha = 0,
            color = 'black') +
  theme(axis.title = element_blank()) +
  guides(fill=guide_legend(title="Mean Bases"))

Basic takeaway: pitches over the center of the plate have worse outcomes for pitchers than those that aren’t, on average. Notice how the edges of the strike zone are much redder than the center – throwing in those locations is optimal for pitchers.
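
The binning step in the heat map code (floor(px * 12)/12) is easy to misread, so here is a tiny worked example on three px values from Table 1. The helper name to_inch_box is hypothetical; the arithmetic is the same as in the code above.

```r
# Snap a coordinate in feet down to the lower edge of its one-inch box
to_inch_box <- function(feet) floor(feet * 12) / 12

px <- c(0.34, -0.05, 0.49)
px_box <- to_inch_box(px)

# Express the box edges in whole inches to make the bins easy to read
box_inches <- round(px_box * 12)
print(box_inches)
```

Because floor always rounds down, a pitch just left of center (px = -0.05) lands in the box that starts at -1 inch. Note also that geom_tile centers each tile on the coordinates it is given, so plotting the lower edges directly shifts every tile by half an inch; adding 0.5/12 to px_box and pz_box before plotting would center the tiles on their boxes.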

Another piece of pitching theory is that “hanging” a pitch is bad. Hanging a pitch is when the pitcher attempts to throw a breaking ball (i.e., a pitch that curves due to spin), but doesn’t get enough spin on it, resulting in a pitch that’s slow and normally high in the strike zone, which generally is a bad outcome for the pitcher. My high school coach used to ominously say “you hang, he bang”, which I always found kind of amusing.

To look at this at a high level, we can run the simplest linear regression we can think of on this data. In particular, let’s try

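\[ \text{bases} = \beta_0 + \beta_1\,\text{speed} + \beta_2\,|\text{location}_x| + \beta_3\,|\text{location}_z| + \beta_4\,\text{angle}_{\text{break}} + \beta_5\,\text{length}_{\text{break}} + \beta_6\,(\text{speed} \times \text{length}_{\text{break}}) + \varepsilon \]

Here, I’m abusing notation somewhat, and taking the absolute value operator to mean absolute deviation from the center of the strike zone (horizontally, the middle of the plate; vertically, the midpoint of the average strike zone). Note that the only interaction we’re considering here is between speed and break length, although there are certainly other relevant ones. There’s also a somewhat important variable that we’re omitting here, which is whether the batter and pitcher are both right-handed, both left-handed, or opposite-handed, but let’s ignore that for now. Looking just at these pitch characteristics, we can run a regression: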

zone_mean <- (zone$mean_low + zone$mean_high)/2

reg_data <- full_data %>% 
  # filter to strikes
  filter(code %in% c('C', 'S', 'W', 'M', 'Q', 'X', 'D', 'E', 'F', 'T', 'L', 'R')) %>% 
  # calculate bases on those strikes
  mutate(bases = case_when(event == 'Single' & code %in% c('X', 'D', 'E') ~ 1,
                           event == 'Double' & code %in% c('X', 'D', 'E') ~ 2,
                           event == 'Triple' & code %in% c('X', 'D', 'E') ~ 3,
                           event == 'Home Run' & code %in% c('X', 'D', 'E') ~ 4,
                           TRUE ~ 0),
         abs_px = abs(px),
         abs_pz = abs(pz - zone_mean)) %>% 
  distinct(ab_id, pitch_num, bases, abs_px, abs_pz, px, pz, end_speed, break_angle, break_length)

# main effects plus a single speed-by-break-length interaction
basic_pitch_reg <- lm(bases ~ end_speed + abs_px + abs_pz + break_length + break_angle + end_speed:break_length, data = reg_data)

Table 3: Basic Pitching Regression
Variable Estimate SE T-Statistic P-Value
(Intercept) 0.6768716 0.0420716 16.088577 0.0000000
end_speed -0.0050291 0.0004941 -10.177336 0.0000000
abs_px -0.1230725 0.0025440 -48.377068 0.0000000
abs_pz -0.1085832 0.0021161 -51.313695 0.0000000
break_length -0.0564764 0.0040941 -13.794469 0.0000000
break_angle -0.0001956 0.0000800 -2.446789 0.0144139
end_speed:break_length 0.0007357 0.0000518 14.215595 0.0000000

So without digging too deeply into anything, we see that the data suggests a few things:

  • Balls close to the center of the strike zone (small deviation from the calculated center) are hit for more bases (ceteris paribus)
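  • Faster pitches are hit for fewer bases than slower pitches, judging by the main-effect coefficient (ceteris paribus), although the positive interaction term offsets this as break length increases
  • Pitches that break more are hit for fewer bases than pitches that break less, again judging by the main-effect coefficient (ceteris paribus), although the positive interaction term offsets this as speed increases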
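
Because the model includes a speed-by-break-length interaction, the effect of either variable depends on the level of the other. The sketch below uses only the point estimates from Table 3 to compute those marginal effects; the function names are hypothetical, and the calculation ignores the standard errors, so it should be read as a rough guide rather than a test.

```r
# Point estimates copied from Table 3
b_speed       <- -0.0050291
b_break       <- -0.0564764
b_interaction <-  0.0007357

# Marginal effect of one more inch of break at a given end speed (mph)
effect_of_break <- function(end_speed) b_break + b_interaction * end_speed

# Marginal effect of one more mph at a given break length (inches)
effect_of_speed <- function(break_length) b_speed + b_interaction * break_length

# Values at which each marginal effect changes sign
speed_where_break_flips <- -b_break / b_interaction
break_where_speed_flips <- -b_speed / b_interaction

print(effect_of_break(c(70, 80, 90)))
print(c(speed_where_break_flips, break_where_speed_flips))
```

With these estimates, the sign of the break effect flips at an end speed of roughly 76.8 mph, and the sign of the speed effect flips at a break length of roughly 6.8 inches. Both values sit inside the range of the sample rows in Table 1, so the main-effect coefficients alone do not tell the whole story.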

Summary

So what should we take away from this basic review? Pitchers are trying to throw balls that are at the edges of the strike zone. In addition, they want to throw pitches that break a lot or are really fast (or in theory, both!)

There are a few other components that we’ll want to investigate as part of the conventional knowledge. One is the sequence of pitches – is a pitcher keeping a hitter “off-balance” by throwing pitches in varying speeds and locations? We also want to look at (and likely control for) the platoon advantage, which is the idea that a righty pitcher vs. a righty batter or a lefty pitcher vs. a lefty batter is an advantageous matchup (in general) for a pitcher, while a righty pitcher vs. a lefty batter or a lefty pitcher vs. a righty batter is an advantageous situation for the batter.
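
As a rough sketch of how the platoon advantage could be encoded as a control, the snippet below builds a same-handedness indicator. The stand column (the batter's side) is used in the plot title above; the p_throws column name for the pitcher's throwing hand is an assumption about the at-bat table, as is the "L"/"R" coding, and the helper same_handed and the toy rows are hypothetical.

```r
# Hypothetical helper: TRUE when pitcher and batter have the same handedness.
# Assumption: both are coded "L"/"R"; `p_throws` is an assumed column name.
same_handed <- function(stand, p_throws) stand == p_throws

# Toy rows covering all four matchups (not real at-bats)
toy <- data.frame(
  stand    = c("R", "L", "R", "L"),
  p_throws = c("R", "R", "L", "L"),
  stringsAsFactors = FALSE
)

# TRUE marks the matchups that conventional wisdom says favor the pitcher
toy$pitcher_platoon_edge <- same_handed(toy$stand, toy$p_throws)
print(toy)
```

An indicator like this could then be added to the regression as an extra term, on its own or interacted with pitch location.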

I’ll probably dig into that a little more in a future post, but this is a good starting point for the theory and intuition of basic pitching in baseball.


  1. As a quick explanation/apology: I use data.table and dplyr syntax together a lot, so if you only use one of these packages, a few things may look odd. I try to mostly use data.table for quick filtering and investigation, but just be aware that some of the code may not run without both packages.

  2. https://slate.com/culture/2007/08/pitch-f-x-the-new-technology-that-will-change-baseball-analysis-forever.html