Bandit Algorithms

Lukas Hager

2024-11-29

Question

  • We run a store
  • We can display an ad in two formats
  • We can afford to run 100 ads in total
  • How should we figure out which ad to run?

Which ad will be more effective?

Specifics

  • We’ll assume that ad 1 is better on average than ad 2
    • If we run ad 1, we get an average sale of 6 (\(\mu_1\)), as opposed to 5 (\(\mu_2\)) for ad 2
    • However, individual sales are noisy
  • In time period \(t\), choosing ad \(i\) gives us payoff \(y_{it} = \mu_i + \varepsilon_t\)
  • \(\varepsilon_t\) is distributed i.i.d. standard normal
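  • For concreteness, a minimal sketch of this data-generating process in R (one draw per ad; the variable names are just for illustration):

# one period's noisy sale for each ad: the ad's mean plus standard normal noise
y_1 <- 6 + rnorm(1)  # ad 1, mean sale of 6
y_2 <- 5 + rnorm(1)  # ad 2, mean sale of 5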

Best Expected Payoff

  • If we knew this ahead of time, we’d run ad 1 every time
  • Best (unobtainable) expected payoff:

\[ \begin{split} \mathbb{E}[\text{Revenue}] &= \mathbb{E}\left[\sum_{t=1}^{100}y_{1t}\right] \\ &= \sum_{t=1}^{100}\mathbb{E}[6 + \varepsilon_t] \\ &= 600 \end{split} \]
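
  • As a quick check (not in the original derivation), we can approximate this expectation by simulation; the Monte Carlo average should land very close to 600:

set.seed(1)
# average total revenue from always running ad 1, across 10,000 simulated 100-period runs
mean(replicate(10000, sum(6 + rnorm(100))))  # approximately 600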

Regret

  • We can compute the expected revenue for any algorithm we pick
  • Compare the expected revenue to the 600 calculated in the previous slide
  • The difference is the “regret” of an algorithm
  • Goal: pick an algorithm that minimizes regret
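  • For example (a benchmark case, not in the original slides), an algorithm that always runs ad 2 has regret

\[ 600 - \sum_{t=1}^{100}\mathbb{E}[5 + \varepsilon_t] = 600 - 500 = 100 \]

  • A sensible algorithm should do much better than that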

Explore-then-Commit Algorithm

  • Algorithm:
    • Pick some number of “trials” for each ad \(\ell\leq 50\)
    • Run each ad \(\ell\) times
    • Calculate sample mean of the \(\ell\) sales for each ad, \(\overline{y}_1\) and \(\overline{y}_2\)
    • Run the ad with the higher sample mean for the rest of time

Implementing ETC

simulate_etc <- function(l){
    # try each ad l times
    trials_1 <- 6 + rnorm(l)
    trials_2 <- 5 + rnorm(l)
    # compute the sample mean for each ad
    y_bar_1 <- mean(trials_1)
    y_bar_2 <- mean(trials_2)
    # commit to the better ad
    if(y_bar_1 > y_bar_2){
        return(c(trials_1, trials_2, 6 + rnorm(100-2*l)))
    }else{
        return(c(trials_1, trials_2, 5 + rnorm(100-2*l)))
    }
}

ETC Cumulative Reward Visualization

# packages used here and in the later code blocks (data wrangling and plotting)
library(dplyr)
library(ggplot2)

etc_sim_df <- bind_rows(
  lapply(
    c(1:100),
    function(id_val){
      data.frame(
        'id' = id_val,
        't' = c(1:100),
        'reward' = simulate_etc(5)
      )
    }
  )
) %>%
  group_by(id) %>% 
  mutate(cum_payoff = cumsum(reward),
         cum_regret = t*6-cum_payoff) %>% 
  ungroup()

ggplot(data = etc_sim_df) +
  geom_line(aes(x=t,y=cum_payoff,group=factor(id),color=factor(id))) +
  labs(x = 'Time', y = 'Cumulative Reward', title = 'ETC With l=5') + 
  theme_bw() + 
  theme(legend.position = 'none')

ETC Cumulative Reward Visualization

etc_sim_df <- bind_rows(
  lapply(
    c(1:100),
    function(id_val){
      data.frame(
        'id' = id_val,
        't' = c(1:100),
        'reward' = simulate_etc(40)
      )
    }
  )
) %>%
  group_by(id) %>% 
  mutate(cum_payoff = cumsum(reward),
         cum_regret = t*6-cum_payoff) %>% 
  ungroup()

ggplot(data = etc_sim_df) +
  geom_line(aes(x=t,y=cum_payoff,group=factor(id),color=factor(id))) +
  labs(x = 'Time', y = 'Cumulative Reward', title = 'ETC With l=40') + 
  theme_bw() + 
  theme(legend.position = 'none')

Tradeoffs

  • As we experiment more (higher \(\ell\)), we become more confident that we’re picking the better arm
  • However, that experimentation is costly – we incur regret as we play the worse ad more often
  • How do we assess which \(\ell\) is best? Regret!

ETC Expected Regret

  • Break it into individual parts:
    • When we explore ad 1, we incur zero expected regret
    • When we explore ad 2, we incur \(\ell\) expected regret (as \(6-5=1\))
    • After committing, if we pick ad 1, we incur zero expected regret, and if we pick ad 2, we incur \(100-2\ell\) regret
  • Thus, ETC regret is

\[ \underbrace{\ell}_{\text{Exploration Regret}} + \underbrace{(100-2\ell)\mathbb{P}(\text{Pick Wrong Arm}|\ell)}_{\text{Exploitation Regret}} \]
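
  • Under the i.i.d. standard normal noise assumed above, the mistake probability has a closed form (a step filled in here for completeness): each sample mean averages \(\ell\) draws with unit variance, so \(\overline{y}_1-\overline{y}_2\sim\mathcal{N}(1, 2/\ell)\) and

\[ \mathbb{P}(\text{Pick Wrong Arm}|\ell) = \mathbb{P}(\overline{y}_2 \geq \overline{y}_1) = \Phi\left(-\sqrt{\frac{\ell}{2}}\right) \]

  • Here \(\Phi\) is the standard normal CDF; in R this is pnorm(-sqrt(l/2))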

Expected Regret Using Simulation

etc_sim_df <- bind_rows(
  lapply(
    c(1:40000),
    function(id_val){
      data.frame(
        'id' = id_val,
        't' = c(1:100),
        'l' = ceiling(id_val/5000)*5,
        'reward' = simulate_etc(ceiling(id_val/5000)*5)
      )
    }
  )
) %>%
  group_by(id) %>% 
  mutate(cum_payoff = cumsum(reward),
         cum_regret = t*6-cum_payoff) %>% 
  ungroup() %>% 
  group_by(l,t) %>% 
  summarise(mean_cum_regret = mean(cum_regret)) %>% 
  ungroup()

ggplot(data = etc_sim_df) +
  geom_line(aes(x=t,y=mean_cum_regret,group=factor(l),color=factor(l))) +
  labs(x = 'Time', y = 'Mean Cumulative Regret', 
       title = 'ETC Cumulative Regret with Various l (5000 Simulations per l)') + 
  guides(color=guide_legend(title="l")) +
  theme_bw()

Computing Optimal \(\ell\) Analytically

  • Solve the problem (recalling that in our case, \(\mu_1-\mu_2=1\))

\[ \min_{\ell}\; \underbrace{\ell}_{\text{Exploration Regret}} + \underbrace{(100-2\ell)\mathbb{P}(\text{Pick Wrong Arm}|\ell)}_{\text{Exploitation Regret}} \]

  • Take a derivative with respect to \(\ell\) and set it equal to zero (beyond the scope of today; a quick numerical check is sketched below)
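
  • Alternatively, a minimal numerical sketch in R (it relies on the closed-form mistake probability \(\Phi(-\sqrt{\ell/2})\) derived earlier; the function name is just for illustration):

# expected ETC regret as a function of l, using P(Pick Wrong Arm | l) = pnorm(-sqrt(l/2))
etc_regret <- function(l){
  l + (100 - 2*l) * pnorm(-sqrt(l/2))
}
# evaluate over the allowed exploration lengths and pick the minimizer
l_grid <- 1:50
l_grid[which.min(etc_regret(l_grid))]  # a small l (single digits) under these assumptions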

Is ETC Good?

  • It turns out that we can do far better than ETC
  • Big concern: ETC’s expected regret grows linearly in \(t\)
    • If we doubled the number of ads we run, we would expect roughly double the regret
  • We want algorithms where \(\text{Regret}/t\to 0\) as \(t \to \infty\)

Upper Confidence Bound

  • Upper Confidence Bound (UCB) intuition:
    • I can construct a confidence interval around an ad’s average payoff so far
    • As I run an ad more often, that confidence interval shrinks
    • The upper end of that confidence interval for \(\overline{y}_i\) can be high either because I haven’t played that ad often (the interval is wide), or because that ad’s payoffs have genuinely been good
    • Either way: play that ad!

UCB Algorithm

  1. Play each ad once (periods 1 and 2)
  2. For each \(t\) from 3 to 100:
    • Denote by \(n_{it}\) the number of times that ad \(i\) has been played as of time \(t\)
    • Calculate \(a_{it} = \overline{y}_{it} + \frac{2}{\sqrt{n_{it}}}\)
    • Play the ad with the highest value of \(a_{it}\)

UCB in R

simulate_ucb <- function(){
  # initialization
  n_1 <- 1
  n_2 <- 1
  rewards_1 <- 6 + rnorm(1)
  rewards_2 <- 5 + rnorm(1)
  all_rewards <- c(rewards_1, rewards_2)
  all_actions <- c(1,2)
  for(t in c(3:100)){
    # calculate the upper confidence bound
    y_bar_1 <- mean(rewards_1)
    y_bar_2 <- mean(rewards_2)
    a_1 <- y_bar_1 + 2/sqrt(n_1)
    a_2 <- y_bar_2 + 2/sqrt(n_2)
    # whichever ad has the higher UCB, run it,
    # and record the results
    if(a_1 > a_2){
      n_1 <- n_1 + 1
      reward <- 6 + rnorm(1)
      rewards_1 <- c(rewards_1, reward)
      all_rewards <- c(all_rewards, reward)
      all_actions <- c(all_actions, 1)
    }else{
      n_2 <- n_2 + 1
      reward <- 5 + rnorm(1)
      rewards_2 <- c(rewards_2, reward)
      all_rewards <- c(all_rewards, reward)
      all_actions <- c(all_actions, 2)
    }
  }
  return(data.frame(all_actions, all_rewards))
}

Which Arms are we Playing?

set.seed(100)
simulate_ucb() %>% 
  mutate(t = c(1:100)) %>% 
  group_by(all_actions) %>% 
  mutate(y_bar = cumsum(all_rewards)/row_number(),
         ucb = y_bar + 2/sqrt(row_number())) %>% 
  ungroup() %>% 
  ggplot(data = .) +
  geom_point(aes(x=t,y=y_bar, color = factor(all_actions))) + 
  geom_errorbar(aes(x=t,ymax=ucb,ymin=y_bar, color = factor(all_actions)), alpha = .5) +
  labs(x = 'Time', y = 'Mean Payoff of Ad and UCB', title = 'The Algorithm is Optimistic') + 
  guides(color=guide_legend(title="Ad")) +
  theme_bw()

Expected Regret via Simulation

ucb_sim_df <- bind_rows(
  lapply(
    c(1:1000),
    function(id_val){
      simulate_ucb() %>% 
      mutate(id = id_val,
             t = c(1:100))
    }
  )
) %>%
  group_by(id) %>% 
  mutate(cum_payoff = cumsum(all_rewards),
         cum_regret = t*6-cum_payoff) %>% 
  ungroup() %>% 
  group_by(t) %>% 
  summarise(mean_cum_regret = mean(cum_regret),
            mean_cum_reward = mean(cum_payoff)) %>% 
  ungroup()

ggplot(data = ucb_sim_df) +
  geom_line(aes(x=t, y = mean_cum_regret))+
  labs(x = 'Time', y = 'Mean Cumulative Regret', 
       title = 'UCB Achieves Sublinear Regret (1000 Simulations)') + 
  theme_bw()

Conclusion

  • When experimentation is costly, use bandits
  • Use algorithms that display sublinear regret
    • UCB
    • Thompson Sampling (a Bayesian analogue of UCB, not covered today; a rough sketch follows below)
  • Interactive learning is cool!
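
For the curious, here is a rough sketch of Thompson Sampling for the same two-ad setting (not part of the talk; it assumes a \(\mathcal{N}(0, 100)\) prior on each ad's mean and treats the noise variance as known):

simulate_ts <- function(){
  mu <- c(6, 5)            # true average sales (unknown to the algorithm)
  prior_prec <- 1/100      # precision of the N(0, 100) prior on each mean
  n <- c(0, 0)             # times each ad has been played
  sum_y <- c(0, 0)         # running sum of observed rewards per ad
  all_rewards <- numeric(100)
  all_actions <- integer(100)
  for(t in 1:100){
    # Gaussian posterior for each ad's mean given the data so far
    post_prec <- prior_prec + n
    post_mean <- sum_y / post_prec
    # sample a plausible mean for each ad and play the one that looks best
    draws <- rnorm(2, mean = post_mean, sd = 1/sqrt(post_prec))
    i <- which.max(draws)
    reward <- mu[i] + rnorm(1)
    n[i] <- n[i] + 1
    sum_y[i] <- sum_y[i] + reward
    all_rewards[t] <- reward
    all_actions[t] <- i
  }
  return(data.frame(all_actions, all_rewards))
}

Its output has the same shape as simulate_ucb(), so the regret simulation above could be rerun with it unchanged.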