We can compute the expected revenue for any algorithm we pick
Compare that expected revenue to the 600 calculated on the previous slide (always running the better ad)
The shortfall is the “regret” of the algorithm, illustrated in the sketch below
Goal: pick an algorithm that minimizes regret
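A minimal sketch of the computation, assuming (as in the simulations later) that the better ad pays 6 per period on average and the worse ad pays 5:

# Regret of one 100-round run: the benchmark 600 (always playing the ad
# with expected payoff 6) minus the revenue the algorithm actually earned
rewards <- 5 + rnorm(100)  # e.g., a run that always plays the worse ad
100 * 6 - sum(rewards)     # regret; about 100 in expectation here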
Explore-then-Commit Algorithm
Algorithm:
Pick some number \(\ell\leq 50\) of “trials” for each ad
Run each ad \(\ell\) times
Calculate the sample mean of the \(\ell\) sales for each ad, \(\overline{y}_1\) and \(\overline{y}_2\)
Run the ad with the higher sample mean for the remaining \(100-2\ell\) periods
Implementing ETC
simulate_etc <- function(l) {
  # try each ad l times
  trials_1 <- 6 + rnorm(l)
  trials_2 <- 5 + rnorm(l)
  # compute the sample mean for each ad
  y_bar_1 <- mean(trials_1)
  y_bar_2 <- mean(trials_2)
  # commit to the better ad
  if (y_bar_1 > y_bar_2) {
    return(c(trials_1, trials_2, 6 + rnorm(100 - 2 * l)))
  } else {
    return(c(trials_1, trials_2, 5 + rnorm(100 - 2 * l)))
  }
}
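For example, a single run (the seed is illustrative, not from the slides):

set.seed(1)                 # illustrative seed
rewards <- simulate_etc(5)  # 5 exploratory plays of each ad, then commit
sum(rewards)                # total revenue; compare to the benchmark of 600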
ETC Cumulative Reward Visualization
library(dplyr)    # bind_rows, group_by, mutate, and the %>% pipe
library(ggplot2)  # plotting

# simulate 100 independent ETC runs with l = 5, tracking cumulative payoff
etc_sim_df <- bind_rows(lapply(c(1:100), function(id_val) {
  data.frame(
    'id' = id_val,
    't' = c(1:100),
    'reward' = simulate_etc(5)
  )
})) %>%
  group_by(id) %>%
  mutate(
    cum_payoff = cumsum(reward),
    cum_regret = t * 6 - cum_payoff
  ) %>%
  ungroup()

ggplot(data = etc_sim_df) +
  geom_line(aes(x = t, y = cum_payoff, group = factor(id), color = factor(id))) +
  labs(x = 'Time', y = 'Cumulative Reward', title = 'ETC With l=5') +
  theme_bw() +
  theme(legend.position = 'none')
ETC Cumulative Reward Visualization
etc_sim_df <- bind_rows(lapply(c(1:100), function(id_val) {
  data.frame(
    'id' = id_val,
    't' = c(1:100),
    'reward' = simulate_etc(40)
  )
})) %>%
  group_by(id) %>%
  mutate(
    cum_payoff = cumsum(reward),
    cum_regret = t * 6 - cum_payoff
  ) %>%
  ungroup()

ggplot(data = etc_sim_df) +
  geom_line(aes(x = t, y = cum_payoff, group = factor(id), color = factor(id))) +
  labs(x = 'Time', y = 'Cumulative Reward', title = 'ETC With l=40') +
  theme_bw() +
  theme(legend.position = 'none')
Tradeoffs
As we experiment more (higher \(\ell\)), we become more confident that we’re committing to the better ad
However, that experimentation is costly: we incur regret because we play the worse ad more often
How do we assess which \(\ell\) is best? Expected regret!
ETC Expected Regret
Break it into individual parts:
When we explore ad 1, we incur zero expected regret
When we explore ad 2, we incur \(\ell\) expected regret (the gap is \(6-5=1\) per play)
After committing: picking ad 1 incurs zero further expected regret, while picking ad 2 incurs \(100-2\ell\) expected regret
Total expected regret is therefore \(\ell + (100-2\ell)\cdot\Pr(\overline{y}_2 > \overline{y}_1)\); to find the best \(\ell\), take a derivative with respect to \(\ell\) and set it equal to zero (beyond the scope of today, but checked numerically below)
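A numerical check, assuming the unit-variance Gaussian payoffs from simulate_etc() (so \(\overline{y}_1-\overline{y}_2\sim N(1, 2/\ell)\) and we commit to the wrong ad with probability \(\Phi(-\sqrt{\ell/2})\)):

# Expected ETC regret as a function of l: exploration costs l, and we
# commit to the wrong ad with probability pnorm(-sqrt(l / 2)), paying
# 1 per round for the remaining 100 - 2 * l rounds
l <- 1:50
expected_regret <- l + (100 - 2 * l) * pnorm(-sqrt(l / 2))
l[which.min(expected_regret)]  # numerically best l for this setup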
Is ETC Good?
It turns out that we can do far better than ETC
Big concern: for any fixed \(\ell\), ETC’s expected regret grows linearly in \(t\), because we commit to the worse ad with a fixed positive probability (see the check below)
Scaling concern: we want algorithms where \(\text{Regret}/t\to 0\) as \(t \to \infty\)
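A quick check of the linearity claim, under the same Gaussian assumptions as above:

# Approximate Regret / t for ETC with fixed l = 5 as the horizon grows:
# it levels off at pnorm(-sqrt(l / 2)) > 0 instead of shrinking to 0
l <- 5
t <- c(100, 1000, 10000)
(l + (t - 2 * l) * pnorm(-sqrt(l / 2))) / t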
Upper Confidence Bound
Upper Confidence Bound (UCB) intuition:
I can construct a confidence interval around an ad’s average payoff so far
As I run an ad more often, that confidence interval shrinks (quantified in the snippet after this list)
The upper end of the interval around \(\overline{y}_i\) can be high for two reasons: either I haven’t played that ad often (the interval is wide), or something really good happens when I play that ad (the mean is high)
Either way: play that ad!
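With the optimism bonus \(2/\sqrt{n}\) used in the UCB code that follows, the interval width halves every time the play count quadruples:

# Optimism bonus after n plays of an ad
n <- c(1, 4, 16, 64)
2 / sqrt(n)  # 2.0 1.0 0.5 0.25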
UCB Algorithm
Play both ads
For each \(t\) from 3 to 100:
Denote by \(n_{it}\) the number of times ad \(i\) has been played and by \(\overline{y}_{it}\) its sample mean payoff as of time \(t\)
Compute each ad’s upper confidence bound, \(\text{UCB}_{it} = \overline{y}_{it} + 2/\sqrt{n_{it}}\)
Play the ad with the higher \(\text{UCB}_{it}\)
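The plot below calls simulate_ucb(), which is defined elsewhere in the deck; here is a minimal sketch consistent with how its output is used (a 100-row data frame with columns all_actions and all_rewards) and with the ad payoffs from simulate_etc():

simulate_ucb <- function(horizon = 100) {
  means <- c(6, 5)                    # mean payoffs of ads 1 and 2, as in simulate_etc()
  all_actions <- integer(horizon)
  all_rewards <- numeric(horizon)
  n <- c(0, 0)                        # times each ad has been played
  total <- c(0, 0)                    # running sum of each ad's rewards
  for (t in 1:horizon) {
    if (t <= 2) {
      action <- t                     # play each ad once to start
    } else {
      ucb <- total / n + 2 / sqrt(n)  # sample mean plus optimism bonus
      action <- which.max(ucb)
    }
    reward <- means[action] + rnorm(1)
    n[action] <- n[action] + 1
    total[action] <- total[action] + reward
    all_actions[t] <- action
    all_rewards[t] <- reward
  }
  data.frame(all_actions = all_actions, all_rewards = all_rewards)
}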
set.seed(100)
simulate_ucb() %>%
  mutate(t = c(1:100)) %>%
  group_by(all_actions) %>%
  # running sample mean and upper confidence bound for each ad
  mutate(
    y_bar = cumsum(all_rewards) / row_number(),
    ucb = y_bar + 2 / sqrt(row_number())
  ) %>%
  ungroup() %>%
  ggplot(data = .) +
  geom_point(aes(x = t, y = y_bar, color = factor(all_actions))) +
  geom_errorbar(aes(x = t, ymax = ucb, ymin = y_bar, color = factor(all_actions)), alpha = .5) +
  labs(x = 'Time', y = 'Mean Payoff of Ad and UCB', title = 'The Algorithm is Optimistic') +
  guides(color = guide_legend(title = "Ad")) +
  theme_bw()