Regret Optimalities
===
Author
[Raj Ghugare](https://github.com/Raj19022000)
###### tags: `notes` `multi-armed bandits` `Regret` `UCB`
Topic of discussion:
Upper Confidence Bounds for multi-armed bandits
## Review and research notes
### Introduction
* Optimizing regret for a multi-armed bandit adds a new dimension of optimality or correctness beyond the naive asymptotic guarantees.
* It focuses on maximising the expected reward during the learning process itself, i.e. while we are still exploring to gain information about all the arms.
* Regret, in the sense of our problem, is the difference between the expected reward we could have obtained had we always chosen the best arm and the expected reward of the arm we actually chose.
* Therefore, for any time step $n$, the per-step regret is:
$z_n = q_*(a^*) - q_*(a_n)$
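As a quick illustration (the arm values are made up): suppose two arms have true values $q_*(a_1) = 0.8$ and $q_*(a_2) = 0.5$, so $a^* = a_1$. Pulling $a_2$ at step $n$ incurs per-step regret $z_n = 0.8 - 0.5 = 0.3$, while pulling $a_1$ incurs zero regret.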
### Notion behind UCB:
* Upper Confidence Bounds, as the name suggests, tries to determine an optimistic upper bound on the value of every arm, even if the arm hasn't been explored many times.
* We then choose the arm with the highest upper confidence bound.
* The upper confidence bound for an arm $j$ is:
$Q(j) + \sqrt{2\ln(n)/n_j}$
* Where:
$n$ is the total number of trials we've played,
$n_j$ is the number of times we've played arm $j$, and
$Q(j)$ is the empirical mean reward of arm $j$.
* The second term in the bound is the optimism bonus. It shrinks as $n_j$ grows because, as the number of times we try that arm increases, we become surer about its empirical value and no longer need to be over-optimistic about it. The $\ln(n)$ term grows slowly with the total number of plays, so arms that haven't been tried for a while eventually look worth revisiting (see the short numeric sketch after this list).
* The only question that remains is: how did we arrive at this magic term?
$\sqrt{2\ln(n)/n_j}$
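To get a feel for how this bonus behaves, here is a tiny sketch; the counts used are made-up illustration values, not part of the algorithm:

```python
import math

def ucb_bonus(n, n_j):
    """Exploration bonus sqrt(2*ln(n)/n_j) for an arm played n_j times out of n total trials."""
    return math.sqrt(2 * math.log(n) / n_j)

# Hypothetical counts: after n = 100 total trials, an arm played only
# 5 times gets a much larger bonus than an arm played 60 times.
print(ucb_bonus(100, 5))   # ~1.36
print(ucb_bonus(100, 60))  # ~0.39
```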
### UCB1:
* Initialization: play each arm once.
* Loop:
Play the arm $j$ that maximises
$Q(j) + \sqrt{2\ln(n)/n_j}$
##### Note
This analysis assumes the rewards are bounded in $[0, 1]$ (binary rewards are the special case used in many examples).
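Here is a minimal sketch of UCB1 in Python, assuming rewards in $[0, 1]$; the Bernoulli arm probabilities and the `pull` function below are invented purely for illustration:

```python
import math
import random

def ucb1(pull, num_arms, horizon):
    """Run UCB1 for `horizon` steps.

    pull(j) -> reward in [0, 1] for arm j.
    Returns the empirical means Q and play counts n_j of every arm.
    """
    counts = [0] * num_arms    # n_j: times each arm has been played
    values = [0.0] * num_arms  # Q(j): empirical mean reward of each arm

    # Initialization: play each arm once
    for j in range(num_arms):
        counts[j] = 1
        values[j] = pull(j)

    # Loop: play the arm maximising Q(j) + sqrt(2*ln(n)/n_j)
    for n in range(num_arms + 1, horizon + 1):
        j = max(range(num_arms),
                key=lambda a: values[a] + math.sqrt(2 * math.log(n) / counts[a]))
        r = pull(j)
        counts[j] += 1
        values[j] += (r - values[j]) / counts[j]  # incremental mean update

    return values, counts

# Usage with made-up Bernoulli arms (success probabilities are illustrative only)
probs = [0.3, 0.5, 0.7]
values, counts = ucb1(lambda j: 1.0 if random.random() < probs[j] else 0.0,
                      num_arms=len(probs), horizon=10_000)
print(counts)  # the 0.7 arm should end up with most of the plays
```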
### Reaching the algorithm:
#### Notations:
$C_{n,s} = \sqrt{2\ln(n)/s}$
$I_m(i)$ is the indicator function that takes the value $1$ if the $i$th action is played at time $m$, and $0$ otherwise.
Let's first formalize the problem: we need to mathematically define the regret and then bound it.
* Let $T_i(n)$ be a random variable specifying the number of times arm $i$ is played in the first $n$ trials.
* $\Delta_i = q_*(a^*)-q_*(i)$
Hence we can define the cumulative regret over the first $n$ trials as:
$Z_n = \sum_{i = 1}^{|A|} E[T_i(n)]\Delta_i$
That is, the expected number of times we play arm $i$ in the first $n$ trials, multiplied by the gap between its true value and that of the best arm, summed over all arms.
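For instance (with made-up values): take three arms with true values $0.7$, $0.5$ and $0.3$, so $\Delta = (0, 0.2, 0.4)$. If over $n$ trials we expect to play the two suboptimal arms $50$ and $30$ times respectively, the cumulative regret is $Z_n = 50 \cdot 0.2 + 30 \cdot 0.4 = 22$.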
We have to prove that, by playing according to our upper confidence values ($Q(a) + \sqrt{2\ln(n)/s}$), we can bound this regret.
It is evident from the above equation that we cannot calculate the regret directly, as the true action values are unknown and $T_i(n)$ is also unknown at the start of the problem.
To bound the regret, let's first try to bound $T_i(n)$.
Let's assume that we've played n trials according to our algorithm.
Hence for any arm $i$
$T_i(n) = 1 + \sum_{m = K+1}^{n} I_m(i)$
because we have to play each of the $K$ arms once during initialization. To obtain an upper bound, we write $T_i(n)$ in this manner:
$T_i(n) \leq l + \sum_{m = K+1}^{n} \mathbb{1}\{I_m(i) = 1 \text{ and } T_i(m-1) \geq l\}$ --(i)
We take an arbitrary number $l$, which we'll optimize later; don't worry about its exact value right now. Two cases arise.
Case 1: if, in our run, we play the $i$th arm more than $l$ times, then equality holds.
Case 2: if we play the arm fewer than $l$ times, then the inequality is strict.
Hence you should be able to see that the right-hand side is an overcount of $T_i(n)$. Now,
$T_i(n) \leq l + \sum_{m = K+1}^{n} \mathbb{1}\{Q(a^*) + C_{m-1,T_{a^*}(m-1)} \leq Q(i) + C_{m-1,T_i(m-1)} \text{ and } T_i(m-1) \geq l\}$ --(ii)
We know that, according to our algorithm, we choose the $i$th arm only when our upper confidence estimate for it is higher than that of every other arm. Let $E_2$ be the event that the upper confidence estimate for the $i$th arm is at least as large as that of the true best arm ($a^*$). It should now be clear that counting occurrences of $E_2$ overcounts the actual number of times we play arm $i$, hence the $\leq$ sign remains.
Now let's take a further overcount:
$T_i(n) \leq l + \sum_{m = K+1}^{n} \mathbb{1}\{\min_{0 < s < m}\big(Q_s(a^*) + C_{m-1,s}\big) \leq \max_{l \leq s_i < m}\big(Q_{s_i}(i) + C_{m-1,s_i}\big) \text{ and } T_i(m-1) \geq l\}$ --(iii)
Now we are counting every time the maximum upper confidence estimate of the $i$th arm, taken over all play counts $s_i$ from $l$ to $m-1$, is at least the minimum upper confidence estimate of $a^*$, taken over all play counts $s$ from $1$ to $m-1$. Convince yourself that the event on the RHS of (iii) occurs at least as often as the event on the RHS of (ii): whenever the event in (ii) has occurred, the event in (iii) must also occur.
A further overcount, which gets rid of the min, the max and the comparison signs by summing over all possible values of $s$ and $s_i$ (the condition $T_i(m-1) \geq l$ is absorbed by starting $s_i$ at $l$), is:
$T_i(n) \leq l + \sum_{m = K+1}^{n} \sum_{s=1}^{m-1} \sum_{s_i=l}^{m-1} \mathbb{1}\{Q_s(a^*) + C_{m,s} \leq Q_{s_i}(i) + C_{m,s_i}\}$
If, for some values of $s$ and $s_i$,
$Q_s(a^*) + C_{m,s} \leq Q_{s_i}(i) + C_{m,s_i}$
then at least one of the following three events must have occurred:
1) $Q_s(a^*) + C_{m,s} \leq q_*(a^*)$
This means that the optimistic upper confidence value for $a^*$ is wrong; the true value of $a^*$ is even higher.
2) $Q_{s_i}(i) - C_{m,s_i} \geq q_*(i)$
This means that the value of the $i$th action has been over-estimated by at least $C_{m,s_i}$.
3) $q_*(a^*) < q_*(i) + 2C_{m,s_i}$
This says that the $i$th arm is very close in value to the true best action, and hence mistakes are bound to happen.
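To see why at least one of these must hold, suppose all three fail. Chaining the negations gives
$Q_s(a^*) + C_{m,s} > q_*(a^*) \geq q_*(i) + 2C_{m,s_i} > Q_{s_i}(i) + C_{m,s_i}$,
which contradicts the inequality we started from.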
Using the Chernoff-Hoeffding bound:
* Probability of event 1:
$P(Q_s(a^*) \leq q_*(a^*) - C_{m,s}) \leq m^{-4}$
* Probability of event 2:
$P(Q_{s_i}(i) \geq q_*(i) + C_{m,s_i}) \leq m^{-4}$
* Now we'll choose $l$ in such a way that event 3 never occurs.
If we choose $l = \lceil 8\ln(n)/\Delta_i^2 \rceil$, then:
$q_*(a^*) - q_*(i) - 2C_{m,s_i} \geq q_*(a^*) - q_*(i) - 2\sqrt{2\ln(n)/l} \geq \Delta_i - \Delta_i = 0$ ; for $s_i \geq l$ and $m \leq n$
Hence, once the $i$th arm has been played at least $l$ times, event 3 cannot occur, and taking expectations over events 1 and 2 gives:
$E[T_i(n)] \leq l + \sum_{m = 1}^{\infty} \sum_{s=1}^{m-1} \sum_{s_i=l}^{m-1} 2m^{-4}$
because the occurrence of an event is counted as $1$ and its absence as $0$, so the expectation of each indicator is its probability of occurring ($m^{-4} + m^{-4} = 2m^{-4}$).
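Concretely, the inner double sum has fewer than $m \cdot m$ terms, each equal to $2m^{-4}$, and $l = \lceil 8\ln(n)/\Delta_i^2 \rceil \leq 8\ln(n)/\Delta_i^2 + 1$, so:
$l + \sum_{m = 1}^{\infty} \sum_{s=1}^{m-1} \sum_{s_i=l}^{m-1} 2m^{-4} \leq 8\ln(n)/\Delta_i^2 + 1 + \sum_{m = 1}^{\infty} m^2 \cdot 2m^{-4}$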
$E[T_i(n)] \leq 8\ln(n)/\Delta_i^2 + 1 + \sum_{m = 1}^{\infty} 2m^{-2}$
The step above is just the algebraic manipulation spelled out before it. The next step uses the result of the [Basel problem](https://en.wikipedia.org/wiki/Basel_problem): $\sum_{m=1}^{\infty} 1/m^2 = \pi^2/6$, so the remaining sum equals $\pi^2/3$.
$E[T_i(n)] \leq 8\ln(n)/\Delta_i^2 + 1 + \pi^2/3$
Multiplying both sides by $\Delta_i$ and summing over all actions, we obtain:
$Z_n = \sum_{i}E[T_i(n)]\Delta_i \leq \sum_{i : \Delta_i > 0} 8\ln(n)/\Delta_i + (1+\pi^2/3)\sum_{i \in A}\Delta_i$
Hence the expected cumulative regret of UCB1 is bounded, and it grows only logarithmically in the number of trials $n$.
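As a sanity check, here is a small sketch that evaluates the right-hand side of this bound on a made-up instance; the gap values are purely illustrative:

```python
import math

def ucb1_regret_bound(deltas, n):
    """Upper bound on expected cumulative regret after n trials:
    sum over suboptimal arms of 8*ln(n)/Delta_i, plus (1 + pi^2/3) * sum of Delta_i."""
    log_term = sum(8 * math.log(n) / d for d in deltas if d > 0)
    const_term = (1 + math.pi ** 2 / 3) * sum(deltas)
    return log_term + const_term

# Hypothetical 3-arm instance: Delta_i is each arm's gap to the best arm (best arm has gap 0)
deltas = [0.0, 0.2, 0.4]
for n in (1_000, 10_000, 100_000):
    print(n, round(ucb1_regret_bound(deltas, n), 1))
# The bound grows only logarithmically in n.
```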
##### Note:
Don't just read this proof; try to understand how the authors of the algorithm must have thought while coming up with it, and appreciate how much hard work it would take to come up with any one of these steps by yourself. This will give you a direction if you are trying to come up with an algorithm of your own.
Go through this excellent article by Jeremykun on [UCB](https://jeremykun.com/2013/10/28/optimism-in-the-face-of-uncertainty-the-ucb1-algorithm/) if you need further clarity or a different way of understanding it.