# Multi-armed Bandit (MAB) Problem
## Setting
- There are $K$ arms (actions) to pull repeatedly
- Each arm $a$ is associated with a reward distribution $p_a$ over numerical values (assumed to be supported in $[0,1]$ for illustration) with mean $q(a)$. Rewards are drawn independently across arms and time steps.
- At time $t$, if you choose action $A_t=a$, you receive a reward $r_t=r(a)$ drawn from $p_a$.
- The goal is to find the action $a^\star:=\arg\max_a q(a)$ that gives the highest expected reward $q^\star:=q(a^\star)$.
Strategy: maintain estimates $Q_t(a)$ of $q(a)$ at each time step $t$ and keep updating them based on experience:
- Exploiting: choose $\arg\max_a Q_t(a)$ each time (greedy)
- Exploring: choose a non-greedy action
Exploration is needed because the current estimates $Q_t(a)$ $(a=1,...,K)$ may be wrong, and blindly exploiting them leads to suboptimal actions; too much exploration, however, leads to poor performance in the long run. This is called the exploration-exploitation dilemma. An important part of bandit theory is balancing exploration and exploitation through the choice of $Q(a)$ and the action-selection strategy.
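To make the setting concrete, here is a minimal Python sketch (not part of the original notes; the names `BernoulliBandit` and `RunningEstimates` are illustrative) of a $K$-armed environment with Bernoulli rewards together with incrementally maintained sample-mean estimates $Q_t(a)$. The exploration strategies discussed below can all be written against this small interface.

```python
import numpy as np

class BernoulliBandit:
    """K-armed bandit whose arm a pays a Bernoulli(q(a)) reward, supported in [0, 1]."""
    def __init__(self, means, seed=None):
        self.means = np.asarray(means, dtype=float)   # true values q(a)
        self.rng = np.random.default_rng(seed)

    @property
    def k(self):
        return len(self.means)

    def pull(self, a):
        """Draw a reward r(a) ~ Bernoulli(q(a))."""
        return float(self.rng.random() < self.means[a])


class RunningEstimates:
    """Sample-mean estimates Q_t(a), updated incrementally after each pull."""
    def __init__(self, k):
        self.q = np.zeros(k)                  # Q_t(a)
        self.counts = np.zeros(k, dtype=int)  # T_t(a), number of pulls of arm a

    def update(self, a, r):
        self.counts[a] += 1
        self.q[a] += (r - self.q[a]) / self.counts[a]   # incremental sample mean
```

A purely exploiting agent would always play `np.argmax(estimates.q)`, while an exploring agent sometimes deviates from it.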
## Relation to Reinforcement Learning
The bandit problem is a simplification of reinforcement learning (Markov decision processes) that lets us focus on the trade-off between exploration and exploitation. An MAB can be viewed as an MDP with a single state and $K$ actions, where every action leads back to that state and has its own reward distribution, independent of the others. Conversely, an MDP can be viewed as a collection of bandit problems, one at each state, with the arms being the set of allowed actions and the reward being the MDP return; however, whenever an action is taken, we transition to a neighboring bandit.
In some cases an MDP can also be modelled as an MAB in different ways. For example, suppose we have a finite MDP that is unichain (i.e. every stationary policy induces a single recurrent class on the whole state space) and the return is defined as the undiscounted average reward. Then it is classical that each stationary policy $\pi$ yields a state-independent expected return $\rho_\pi$. This MDP corresponds to an MAB in which each stationary policy $\pi$ is an arm whose (random) reward is the return of the MDP under $\pi$ [3].
## How to measure performance of bandit algorithms?
Asymptotic convergence results are often unsatisfactory since in practice we only have access to a finite number of samples. There are two common ways to measure the performance of an algorithm:
### 1. Sample complexity
Given $\epsilon,\delta>0$, find the smallest number of steps/samples $T$ such that with probability at least $1-\delta$, the average error of the algorithm after $T$ steps is less than $\epsilon$ [1, 4]. A further distinction can be made based on whether a generative sampling model is accessible. In the bandit setting, the error would be $\mathbb{E}[q(a^\star)-Q_t(A_t)]$, where the expectation is taken with respect to all sources of randomness (the environment and the algorithm).
### 2. Regret
Unlike sample complexity which only captures the "final error", regret captures the cumulative error over the course of the algorithm due to the selection of suboptimal actions. More specifically, the regret after $n$ steps is defined as
$$R_n:= nq^\star - \mathbb{E}\left[\sum_{t=1}^n r(A_t)\right],$$ where again the expectation is taken with respect to all sources of randomness. Defining $T_n(a):=\sum_{t=1}^n\mathbb{1}_{A_t=a}$ and the gap $\Delta(a):=q^\star-q(a)$ for each action, we can rewrite the regret as [5]
$$R_n=nq^\star - \sum_{a}\mathbb{E}\left[ T_n(a)\right]q(a) = \sum_{a:\,\Delta(a)>0}\mathbb{E}\left[ T_n(a)\right]\Delta(a),$$ so we only need to control how often the algorithm selects suboptimal actions (the gaps are bounded since the rewards have bounded support).
Minimizing regret encodes the exploration-exploitation dilemma. In the worst case the regret grows linearly, $R_n=\Theta(n)$, which happens for example if one keeps exploring uniformly at random forever, while a good algorithm should have $R_n=o(n)$. In other words, the regret captures both how fast the algorithm learns an optimal strategy and how it performs in the long run.
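To make the gap decomposition concrete, here is a small Python helper (hypothetical, matching the environment sketch above) that computes the gap-based regret $\sum_a T_n(a)\Delta(a)$ of a single run; averaging it over independent runs approximates the expected regret $R_n$.

```python
import numpy as np

def pseudo_regret(counts, means):
    """Gap-based regret of one run: sum over arms of T_n(a) * Delta(a).

    counts[a] is T_n(a), the number of pulls of arm a during the run,
    and means[a] is the true mean q(a); averaging the result over many
    independent runs estimates the expected regret R_n.
    """
    means = np.asarray(means, dtype=float)
    gaps = means.max() - means                              # Delta(a) = q* - q(a)
    return float(np.dot(np.asarray(counts, dtype=float), gaps))
```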
## Exploration Strategies
We let $n$ denote the total number of trials (i.e. the total number of times an arm is pulled).
### 1. Explore-then-commit
- Fix $\tau<n$.
- (Explore) For steps $t=1,...,\tau$, select actions $A_t$ with equal probability (or in a round-robin fashion) and obtain reward $r_t$.
- Compute estimates $\hat{Q}(a):=\frac{1}{T_\tau(a)}\sum_{t=1}^\tau \mathbb{1}_{A_t=a}r_t$
- (Commit) For steps $t=\tau+1,...,n$, select the greedy action $\hat{a}^\star:=\arg\max_a \hat{Q}(a)$ (see the sketch below).
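Here is a minimal sketch of explore-then-commit against the hypothetical `BernoulliBandit` interface from the earlier sketch, using round-robin exploration for simplicity.

```python
import numpy as np

def explore_then_commit(bandit, n, tau):
    """Explore the arms round-robin for tau steps, then commit to the greedy arm."""
    k = bandit.k
    q_hat = np.zeros(k)                  # estimates, frozen after the exploration phase
    counts = np.zeros(k, dtype=int)      # T_t(a)
    rewards = []

    for t in range(n):
        if t < tau:
            a = t % k                    # explore: cycle through the K arms
        else:
            a = int(np.argmax(q_hat))    # commit: greedy arm w.r.t. the frozen estimates
        r = bandit.pull(a)
        counts[a] += 1
        if t < tau:
            q_hat[a] += (r - q_hat[a]) / counts[a]   # incremental sample mean
        rewards.append(r)
    return np.array(rewards), counts
```

For example, `explore_then_commit(BernoulliBandit([0.4, 0.6]), n=10_000, tau=500)` returns the reward sequence and the pull counts $T_n(a)$, which can be fed to `pseudo_regret` above.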
### 2. $\epsilon$-greedy
- Initialize $Q_0$ (to be zero for all actions, for example) and $\epsilon>0$
- At each step $t=1,...,n$, choose $A_t$ in an $\epsilon$-greedy way with respect to $Q_{t-1}$, obtain reward $r_t$, and calculate the new estimate $Q_t(a):=\frac{1}{T_t(a)}\sum_{i=1}^{t} \mathbb{1}_{A_i=a}r_i$
Since each action is selected infinitely often (almost surely), by the Law of Large Numbers $Q_t(a)$ converges almost surely to $q(a)$ for all $a$ as $t\to\infty$.
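A corresponding sketch of $\epsilon$-greedy, again written against the hypothetical `BernoulliBandit` interface:

```python
import numpy as np

def epsilon_greedy(bandit, n, eps, seed=None):
    """Play a uniformly random arm with probability eps, otherwise the greedy arm."""
    rng = np.random.default_rng(seed)
    k = bandit.k
    q = np.zeros(k)                      # Q_t(a)
    counts = np.zeros(k, dtype=int)      # T_t(a)
    rewards = []

    for t in range(n):
        if rng.random() < eps:
            a = int(rng.integers(k))     # explore: uniform over all arms
        else:
            a = int(np.argmax(q))        # exploit: greedy w.r.t. current estimates
        r = bandit.pull(a)
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]   # update the sample mean
        rewards.append(r)
    return np.array(rewards), counts
```

A common refinement is to let $\epsilon$ decay over time (e.g. $\epsilon_t\propto 1/t$), so that exploration tapers off as the estimates improve.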
### 3. Upper Confidence Bound (UCB)
The principle of *optimism in the face of uncertainty* says that whenever the value of an action is uncertain, consider its largest plausible value, and then select the best action. This encourages exploration in a guided way based on how uncertain our estimate for each action is. Specifically, we add an exploration bonus term on top of the estimates $Q_t(a)$ when selecting greedy actions. The bonus term is the radius of a confidence interval, which can be constructed via many methods (Bernstein, KL divergence, etc. [2, 3]).
- Initialize $Q_0$ and constant $C>0$
- Choose $\delta_t>0$ ($t=1,...,n$)
- At each step $t=1,...,n$
- Compute the upper confidence bounds $B_{t-1}(a):=Q_{t-1}(a)+C\sqrt{\frac{\log(2/\delta_t)}{2T_{t-1}(a)}}$ (with the convention $B_{t-1}(a):=+\infty$ when $T_{t-1}(a)=0$)
- Select $A_t:=\arg\max_{a} B_{t-1}(a)$ and obtain reward $r_t$
- Update the estimates $Q_t(a):=\frac{1}{T_t(a)}\sum_{i=1}^{t} \mathbb{1}_{A_i=a}r_i$
The first term of $B_t$ encourages exploitation while the second term encourages exploration; as more samples are collected for an arm, its exploration bonus shrinks.
If $\delta_t$ decreases to zero (e.g. $\delta_t=1/t$), the bonus of any arm that has been pulled only finitely often keeps growing, so each action is selected infinitely often, and hence by the Law of Large Numbers $Q_t(a)$ converges almost surely to $q(a)$ for all $a$ as $t\to\infty$.
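A sketch of the UCB rule above, with the convention that unpulled arms receive an infinite bonus; the argument `delta_fn` (an assumption of this sketch) supplies the schedule $\delta_t$.

```python
import numpy as np

def ucb(bandit, n, c=1.0, delta_fn=lambda t: 1.0 / t):
    """Select the arm maximizing B(a) = Q(a) + c * sqrt(log(2 / delta_t) / (2 T(a)))."""
    k = bandit.k
    q = np.zeros(k)                      # Q_{t-1}(a)
    counts = np.zeros(k, dtype=int)      # T_{t-1}(a)
    rewards = []

    for t in range(1, n + 1):
        # Exploration bonus; arms never pulled are maximally optimistic (infinite bonus).
        bonus = np.where(
            counts > 0,
            c * np.sqrt(np.log(2.0 / delta_fn(t)) / (2.0 * np.maximum(counts, 1))),
            np.inf,
        )
        a = int(np.argmax(q + bonus))    # optimism in the face of uncertainty
        r = bandit.pull(a)
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]   # update the sample mean
        rewards.append(r)
    return np.array(rewards), counts
```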
## Regret Analysis
A common approach to upper bound the regret is to separate two cases:
- The estimates $Q(a)$, being sample means of independent random variables, concentrate around the true values $q(a)$ with high probability, so most of the time the regret is not too large.
- The remaining low-probability event can be handled crudely, since the rewards are assumed to be bounded.
Recall (a form of) the classic Hoeffding's inequality [2]: if random variables $X_i\in[a_i,b_i]$ $(i=1,...,n)$ are independent and $Z_n:=X_1+\cdots+X_n$, then for $\epsilon>0$ we have $$\mathbb{P}(Z_n-\mathbb{E}Z_n \geq \epsilon) \leq \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).$$
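In the calculations below, Hoeffding is applied to the sample mean of $m$ i.i.d. rewards supported in $[0,1]$ (so $b_i-a_i\leq 1$; take $\epsilon=mw$ and add the symmetric bound for the lower tail):
$$\mathbb{P}\left(\left|\frac{1}{m}\sum_{i=1}^m X_i - \mathbb{E}X_1\right|\geq w\right)\leq 2\exp\left(-2mw^2\right),$$
with $m=\tau/K$ for $\hat{Q}(a)$ in explore-then-commit and $m=T_t(a)$ for $Q_t(a)$ in UCB.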
### Regret bound for explore-then-commit
In this setting we have $R_n = \sum_{t=1}^n \mathbb{E}\left[ q(a^\star)- r_t\right]\leq \tau + \sum_{t=\tau+1}^n \mathbb{E}[q(a^\star)- q(\hat{a}^\star)]$, where the expectation is taken with respect to the randomness from sampling and each of the first $\tau$ terms is at most $1$ because rewards lie in $[0,1]$. There is nothing to prove when $\hat{a}^\star$ is optimal, so assume the contrary.
Let $w(a)$ be a confidence-interval radius to be determined, let $S_a$ denote the event $|\hat{Q}(a)-q(a)|\leq w(a)$, and let $S:=\bigcap_a S_a$. On the event $S$ we have $$q(\hat{a}^\star)+w(\hat{a}^\star) \geq \hat{Q}(\hat{a}^\star)\geq\hat{Q}(a^\star)\geq q(a^\star)-w(a^\star),$$ where the second inequality follows from the definition of $\hat{a}^\star$. Thus $$q(a^\star) -q(\hat{a}^\star)\leq
w(a^\star) + w(\hat{a}^\star).$$ If the radii $w(a)$ are deterministic, then by the law of total expectation (using the boundedness of the rewards on $S^c$)
$$R_n\leq \tau + \sum_{t>\tau} [w(a^\star) + w(\hat{a}^\star) + \mathbb{P}(S^c)].$$ If we set $$w(a)=\sqrt{\frac{2K\log n}{\tau}},$$ then each arm receives about $\tau/K$ samples during the exploration phase, so Hoeffding gives $\mathbb{P}(S_a^c)\leq 2n^{-4}$ for every $a$ and $\mathbb{P}(S^c)\leq 2Kn^{-4}$ by the union bound. Hence
$$R_n\leq \tau + O\left(\sqrt{\frac{2K\log n}{\tau}} (n-\tau)\right) + O(n^{-3}) \leq \tau + O\left(\sqrt{\frac{2K\log n}{\tau}} n\right). $$ Setting $\tau = n^{2/3}(K\log n)^{1/3}$ we have $R_n\leq O\left(n^{2/3}(K\log n)^{1/3}\right) = \tilde{O}(n^{2/3})$.
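Up to constants, this choice of $\tau$ comes from balancing the two terms in the bound:
$$\tau \asymp \sqrt{\frac{K\log n}{\tau}}\, n \iff \tau^{3/2} \asymp n\sqrt{K\log n} \iff \tau \asymp n^{2/3}(K\log n)^{1/3}.$$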
### Regret bound for UCB
Choose $C=1$ and a constant confidence level $\delta_t\equiv\delta$. Let $E$ be the event that for all actions $a$ and times $t=1,...,n$ we have $|Q_t(a)-q(a)|\leq w_t(a):=\sqrt{\frac{\log(2/\delta)}{2T_t(a)}}$. By the union bound and Hoeffding, $\mathbb{P}(E)\geq 1-nK\delta$. On the event $E$, whenever a suboptimal arm $a$ is selected at step $t+1$ we have
$$q(a)+2w_t(a)\geq Q_t(a)+w_t(a) \geq Q_t(a^\star)+w_t(a^\star) \geq q(a^\star),$$ i.e. $2w_t(a)\geq\Delta(a)$, and therefore $T_n(a)\leq \frac{2\log(2/\delta)}{\Delta(a)^2}+1$ with probability at least $1-nK\delta$. If we set $\delta=n^{-2}$, the law of total expectation gives
$$\mathbb{E}[T_n(a)] \leq \frac{2\log(2/\delta)}{\Delta(a)^2} + 1 + n\cdot nK\delta \leq \frac{4\log(2n)}{\Delta(a)^2} + K + 1,$$ and summing over the suboptimal actions yields the regret bound
$$R_n = \sum_{a:\,\Delta(a)>0}\mathbb{E}[T_n(a)]\,\Delta(a) \leq \sum_{a:\,\Delta(a)>0}\frac{4\log(2n)}{\Delta(a)} + (K+1)\sum_{a:\,\Delta(a)>0}\Delta(a) = O\left(\sum_{a:\,\Delta(a)>0}\frac{\log n}{\Delta(a)} + K^2\right).$$ Also note that for any suboptimal action $a$ we have $T_n(a)/n \to 0$ in probability by Markov's inequality. For sharper upper bounds and lower bounds, see [4, 5, 6].
## References
[1] [Cathy Wu, Multi-armed bandits: Exploration-Exploitation Dilemma, 2022](https://web.mit.edu/6.7950/www/lectures/L15-2022fa.pdf)
[2] [Ramon van Handel, Probability in High Dimension, APC 550 Lecture Notes, Princeton University](https://web.math.princeton.edu/~rvan/APC550.pdf)
[3] [Ronald Ortner, Regret Bounds for Reinforcement Learning via Markov Chain Concentration, 2020](https://www.jair.org/index.php/jair/article/view/11316)
[4] [Auer et al., Finite-time Analysis of the Multiarmed Bandit Problem, 2002](https://link.springer.com/article/10.1023/A:1013689704352)
[5] [Lattimore & Szepesvári, Bandit Algorithms, 2020](https://tor-lattimore.com/downloads/book/book.pdf)
[6] [Osband and Van Roy, On Lower Bounds for Regret in Reinforcement Learning, 2016](https://arxiv.org/pdf/1608.02732.pdf)