Bandit Algorithms

Lukas Hager

2024-11-29

Question

  • We run a store
  • We can display an ad in two formats
  • We can afford to run 100 ads in total
  • How should we figure out which ad to run?

Which ad will be more effective?

Specifics

  • We’ll assume that ad 1 is better on average than ad 2
    • If we run ad 1, we get an average sale of 6 (\(\mu_1\)), as opposed to 5 (\(\mu_2\)) for ad 2
    • However, individual sales are noisy
  • In time period \(t\), choosing ad \(i\) gives us payoff \(y_{it} = \mu_i + \varepsilon_t\)
  • \(\varepsilon_t\) is distributed i.i.d. standard normal
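  • For concreteness, a minimal sketch of this data-generating process in R (one draw per ad; the variable names are just for illustration):

# one period's noisy sale for each ad: the ad's mean plus standard normal noise
y_1 <- 6 + rnorm(1)  # ad 1, mean sale of 6
y_2 <- 5 + rnorm(1)  # ad 2, mean sale of 5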

Best Expected Payoff

  • If we knew this ahead of time, we’d run ad 1 every time
  • Best (unobtainable) expected payoff:

\[ \begin{split} \mathbb{E}[\text{Revenue}] &= \mathbb{E}\left[\sum_{t=1}^{100}y_{1t}\right] \\ &= \sum_{t=1}^{100}\mathbb{E}[6 + \varepsilon_t] \\ &= 600 \end{split} \]
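
  • As a quick check (not in the original derivation), we can approximate this expectation by simulation; the Monte Carlo average should land very close to 600:

set.seed(1)
# average total revenue from always running ad 1, across 10,000 simulated 100-period runs
mean(replicate(10000, sum(6 + rnorm(100))))  # approximately 600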

Regret

  • We can compute the expected revenue for any algorithm we pick
  • Compare the expected revenue to the 600 calculated in the previous slide
  • The difference is the “regret” of an algorithm
  • Goal: pick an algorithm that minimizes regret
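  • For example (a benchmark case, not in the original slides), an algorithm that always runs ad 2 has regret

\[ 600 - \sum_{t=1}^{100}\mathbb{E}[5 + \varepsilon_t] = 600 - 500 = 100 \]

  • A sensible algorithm should do much better than that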

Explore-then-Commit Algorithm

  • Algorithm:
    • Pick some number of “trials” for each ad \(\ell\leq 50\)
    • Run each ad \(\ell\) times
    • Calculate sample mean of the \(\ell\) sales for each ad, \(\overline{y}_1\) and \(\overline{y}_2\)
    • Run the ad with the higher sample mean for the rest of time

Implementing ETC

simulate_etc <- function(l){
    # try each ad l times
    trials_1 <- 6 + rnorm(l)
    trials_2 <- 5 + rnorm(l)
    # compute the sample mean for each ad
    y_bar_1 <- mean(trials_1)
    y_bar_2 <- mean(trials_2)
    # commit to the better ad
    if(y_bar_1 > y_bar_2){
        return(c(trials_1, trials_2, 6 + rnorm(100-2*l)))
    }else{
        return(c(trials_1, trials_2, 5 + rnorm(100-2*l)))
    }
}

ETC Cumulative Reward Visualization

# packages used here and in the later code blocks (data wrangling and plotting)
library(dplyr)
library(ggplot2)

etc_sim_df <- bind_rows(
  lapply(
    c(1:100),
    function(id_val){
      data.frame(
        'id' = id_val,
        't' = c(1:100),
        'reward' = simulate_etc(5)
      )
    }
  )
) %>%
  group_by(id) %>% 
  mutate(cum_payoff = cumsum(reward),
         cum_regret = t*6-cum_payoff) %>% 
  ungroup()

ggplot(data = etc_sim_df) +
  geom_line(aes(x=t,y=cum_payoff,group=factor(id),color=factor(id))) +
  labs(x = 'Time', y = 'Cumulative Reward', title = 'ETC With l=5') + 
  theme_bw() + 
  theme(legend.position = 'none')

ETC Cumulative Reward Visualization

etc_sim_df <- bind_rows(
  lapply(
    c(1:100),
    function(id_val){
      data.frame(
        'id' = id_val,
        't' = c(1:100),
        'reward' = simulate_etc(40)
      )
    }
  )
) %>%
  group_by(id) %>% 
  mutate(cum_payoff = cumsum(reward),
         cum_regret = t*6-cum_payoff) %>% 
  ungroup()

ggplot(data = etc_sim_df) +
  geom_line(aes(x=t,y=cum_payoff,group=factor(id),color=factor(id))) +
  labs(x = 'Time', y = 'Cumulative Reward', title = 'ETC With l=40') + 
  theme_bw() + 
  theme(legend.position = 'none')

Tradeoffs

  • As we experiment more (higher \(\ell\)), we become more confident that we’re picking the better arm
  • However, that experimentation is costly – we incur regret as we play the worse ad more often
  • How do we assess which \(\ell\) is best? Regret!

ETC Expected Regret

  • Break it into individual parts:
    • When we explore ad 1, we incur zero expected regret
    • When we explore ad 2, we incur \(\ell\) expected regret (as \(6-5=1\))
    • After committing, if we pick ad 1, we incur zero expected regret, and if we pick ad 2, we incur \(100-2\ell\) regret
  • Thus, ETC regret is

\[ \underbrace{\ell}_{\text{Exploration Regret}} + \underbrace{(100-2\ell)\mathbb{P}(\text{Pick Wrong Arm}|\ell)}_{\text{Exploitation Regret}} \]
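
  • Under the i.i.d. standard normal noise assumed above, the mistake probability has a closed form (a step filled in here for completeness): each sample mean averages \(\ell\) draws with unit variance, so \(\overline{y}_1-\overline{y}_2\sim\mathcal{N}(1, 2/\ell)\) and

\[ \mathbb{P}(\text{Pick Wrong Arm}|\ell) = \mathbb{P}(\overline{y}_2 \geq \overline{y}_1) = \Phi\left(-\sqrt{\frac{\ell}{2}}\right) \]

  • Here \(\Phi\) is the standard normal CDF; in R this is pnorm(-sqrt(l/2))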

Expected Regret Using Simulation

etc_sim_df <- bind_rows(
  lapply(
    c(1:40000),
    function(id_val){
      data.frame(
        'id' = id_val,
        't' = c(1:100),
        'l' = ceiling(id_val/5000)*5,
        'reward' = simulate_etc(ceiling(id_val/5000)*5)
      )
    }
  )
) %>%
  group_by(id) %>% 
  mutate(cum_payoff = cumsum(reward),
         cum_regret = t*6-cum_payoff) %>% 
  ungroup() %>% 
  group_by(l,t) %>% 
  summarise(mean_cum_regret = mean(cum_regret)) %>% 
  ungroup()

ggplot(data = etc_sim_df) +
  geom_line(aes(x=t,y=mean_cum_regret,group=factor(l),color=factor(l))) +
  labs(x = 'Time', y = 'Mean Cumulative Regret', 
       title = 'ETC Cumulative Regret with Various l (5000 Simulations per l)') + 
  guides(color=guide_legend(title="l")) +
  theme_bw()

Computing Optimal \(\ell\) Analytically

  • Solve the problem (recalling that in our case, \(\mu_1-\mu_2=1\))

\[ \min_{\ell}\; \underbrace{\ell}_{\text{Exploration Regret}} + \underbrace{(100-2\ell)\mathbb{P}(\text{Pick Wrong Arm}|\ell)}_{\text{Exploitation Regret}} \]

  • Take a derivative with respect to \(\ell\) and set it equal to zero (beyond the scope of today; a quick numerical check is sketched below)
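
  • Alternatively, a minimal numerical sketch in R (it relies on the closed-form mistake probability \(\Phi(-\sqrt{\ell/2})\) derived earlier; the function name is just for illustration):

# expected ETC regret as a function of l, using P(Pick Wrong Arm | l) = pnorm(-sqrt(l/2))
etc_regret <- function(l){
  l + (100 - 2*l) * pnorm(-sqrt(l/2))
}
# evaluate over the allowed exploration lengths and pick the minimizer
l_grid <- 1:50
l_grid[which.min(etc_regret(l_grid))]  # a small l (single digits) under these assumptions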

Is ETC Good?

  • It turns out that we can do far better than ETC
  • Big concern: ETC’s expected regret grows linearly in \(t\)
    • If we doubled the number of ads we run, we would expect roughly double the regret
  • We want algorithms where \(\text{Regret}/t\to 0\) as \(t \to \infty\)

Upper Confidence Bound

  • Upper Confidence Bound (UCB) intuition:
    • I can construct a confidence interval around an ad’s average payoff so far
    • As I run an ad more often, that confidence interval shrinks
    • The upper end of that confidence interval for \(\overline{y}_i\) can be high either because I haven’t played that ad often (the interval is wide), or because that ad’s payoffs have genuinely been good
    • Either way: play that ad!

UCB Algorithm

  1. Play each ad once (periods 1 and 2)
  2. For each \(t\) from 3 to 100:
    • Denote by \(n_{it}\) the number of times that ad \(i\) has been played as of time \(t\)
    • Calculate \(a_{it} = \overline{y}_{it} + \frac{2}{\sqrt{n_{it}}}\)
    • Play the ad with the highest value of \(a_{it}\)

UCB in R

simulate_ucb <- function(){
  # initialization
  n_1 <- 1
  n_2 <- 1
  rewards_1 <- 6 + rnorm(1)
  rewards_2 <- 5 + rnorm(1)
  all_rewards <- c(rewards_1, rewards_2)
  all_actions <- c(1,2)
  for(t in c(3:100)){
    # calculate the upper confidence bound
    y_bar_1 <- mean(rewards_1)
    y_bar_2 <- mean(rewards_2)
    a_1 <- y_bar_1 + 2/sqrt(n_1)
    a_2 <- y_bar_2 + 2/sqrt(n_2)
    # whichever ad has the higher UCB, run it,
    # and record the results
    if(a_1 > a_2){
      n_1 <- n_1 + 1
      reward <- 6 + rnorm(1)
      rewards_1 <- c(rewards_1, reward)
      all_rewards <- c(all_rewards, reward)
      all_actions <- c(all_actions, 1)
    }else{
      n_2 <- n_2 + 1
      reward <- 5 + rnorm(1)
      rewards_2 <- c(rewards_2, reward)
      all_rewards <- c(all_rewards, reward)
      all_actions <- c(all_actions, 2)
    }
  }
  return(data.frame(all_actions, all_rewards))
}

Which Arms are we Playing?

set.seed(100)
simulate_ucb() %>% 
  mutate(t = c(1:100)) %>% 
  group_by(all_actions) %>% 
  mutate(y_bar = cumsum(all_rewards)/row_number(),
         ucb = y_bar + 2/sqrt(row_number())) %>% 
  ungroup() %>% 
  ggplot(data = .) +
  geom_point(aes(x=t,y=y_bar, color = factor(all_actions))) + 
  geom_errorbar(aes(x=t,ymax=ucb,ymin=y_bar, color = factor(all_actions)), alpha = .5) +
  labs(x = 'Time', y = 'Mean Payoff of Ad and UCB', title = 'The Algorithm is Optimistic') + 
  guides(color=guide_legend(title="Ad")) +
  theme_bw()

Expected Regret via Simulation

ucb_sim_df <- bind_rows(
  lapply(
    c(1:1000),
    function(id_val){
      simulate_ucb() %>% 
      mutate(id = id_val,
             t = c(1:100))
    }
  )
) %>%
  group_by(id) %>% 
  mutate(cum_payoff = cumsum(all_rewards),
         cum_regret = t*6-cum_payoff) %>% 
  ungroup() %>% 
  group_by(t) %>% 
  summarise(mean_cum_regret = mean(cum_regret),
            mean_cum_reward = mean(cum_payoff)) %>% 
  ungroup()

ggplot(data = ucb_sim_df) +
  geom_line(aes(x=t, y = mean_cum_regret))+
  labs(x = 'Time', y = 'Mean Cumulative Regret', 
       title = 'UCB Achieves Sublinear Regret (1000 Simulations)') + 
  theme_bw()

Conclusion

  • When experimentation is costly, use bandits
  • Use algorithms that display sublinear regret
    • UCB
    • Thompson Sampling (a Bayesian analogue of UCB, not covered today; a rough sketch follows below)
  • Interactive learning is cool!
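
For the curious, here is a rough sketch of Thompson Sampling for the same two-ad setting (not part of the talk; it assumes a \(\mathcal{N}(0, 100)\) prior on each ad's mean and treats the noise variance as known):

simulate_ts <- function(){
  mu <- c(6, 5)            # true average sales (unknown to the algorithm)
  prior_prec <- 1/100      # precision of the N(0, 100) prior on each mean
  n <- c(0, 0)             # times each ad has been played
  sum_y <- c(0, 0)         # running sum of observed rewards per ad
  all_rewards <- numeric(100)
  all_actions <- integer(100)
  for(t in 1:100){
    # Gaussian posterior for each ad's mean given the data so far
    post_prec <- prior_prec + n
    post_mean <- sum_y / post_prec
    # sample a plausible mean for each ad and play the one that looks best
    draws <- rnorm(2, mean = post_mean, sd = 1/sqrt(post_prec))
    i <- which.max(draws)
    reward <- mu[i] + rnorm(1)
    n[i] <- n[i] + 1
    sum_y[i] <- sum_y[i] + reward
    all_rewards[t] <- reward
    all_actions[t] <- i
  }
  return(data.frame(all_actions, all_rewards))
}

Its output has the same shape as simulate_ucb(), so the regret simulation above could be rerun with it unchanged.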