---
title: Bandit Problems
tags: cs 593 rl
---

### Stochastic $k$-armed Bandit

A $k$-armed stochastic bandit is a tuple of probability measures $\nu = (P_1, P_2, \dots, P_k)$ over rewards $X$, with means $\mu_i = E_{P_i}[X]$. The optimal arm and the optimal mean reward are $i^* = \arg\max_i \mu_i$ and $\mu^* = \max_i \mu_i$ respectively.

At time $t$, the agent chooses arm $A_t$ and receives reward $x_t \sim P_{A_t}$; $A_t$ is $\mathcal{F}_t$-measurable with $A_t \sim \pi(\cdot \mid \mathcal{F}_{t-1})$.

We define instantaneous regret at time $t$ as the suboptimality of arm $A_t$. There are multiple ways of defining long-term regret:

- Random regret: $n\mu^*(\nu) - \sum_t x_t$
- Pseudo-regret: $n\mu^*(\nu) - \sum_t \mu_{A_t}$
- Counterfactual regret: $\int X \, dP_{A_t}$

We define the reward gap $\Delta_i = \mu^* - \mu_i$ and the pull count $T_i(t) = \sum_{s=1}^t I\{A_s = i\}$.

<!-- Lecture 4 -->

Theorem (regret decomposition): for any policy $\pi$ on a $k$-armed stochastic bandit $\nu$ with horizon $n$, we have regret

$R_n(\pi, \nu) = \sum_{i=1}^k\Delta_i E[T_i(n)]$

**Proof**

The return is $S_n = \sum_{t=1}^n\sum_{i=1}^k x_t I\{A_t = i\}$

$R_n(\pi, \nu) = n\mu^* - E[S_n] = \sum_{t=1}^n\sum_{i=1}^k E[(\mu^* - x_t) I\{A_t = i\}]$

$= \sum_{t=1}^n\sum_{i=1}^k E[E[(\mu^* - x_t) I\{A_t = i\} \mid \sigma(A_t)]]$

$= \sum_{t=1}^n\sum_{i=1}^k E[I\{A_t = i\} E[(\mu^* - x_t) \mid \sigma(A_t)]]$

$= \sum_{t=1}^n\sum_{i=1}^k E[I\{A_t = i\} \Delta_{A_t}]$

$= \sum_{t=1}^n\sum_{i=1}^k \Delta_i E[I\{A_t = i\}] \quad$ (on $\{A_t = i\}$, $\Delta_{A_t} = \Delta_i$)

$= \sum_{i=1}^k \Delta_i\sum_{t=1}^n E[I\{A_t = i\}]$

$= \sum_{i=1}^k\Delta_i E[T_i(n)]$

### Explore Then Commit (ETC)

Choose each arm sequentially $m$ times, then follow the arm with the best sample mean:

$A_t = (t \bmod k) + 1, \quad t \leq mk$

$\hat\mu_i = \frac{1}{T_i(\min \{mk, n\})}\sum_{t=1}^{\min \{mk, n\}} x_t I\{A_t = i\}$

$A_t \in \arg\max_i \hat\mu_i, \quad t > mk$

For $\nu$ with sub-Gaussian rewards (variance proxy $R$), the regret of ETC satisfies:

$R_n \leq \max\{m, \lceil n/k \rceil\} \sum_i \Delta_i + \max\{n - mk, 0\} \sum_i \Delta_i\exp\left(\frac{-m \Delta_i^2}{4R}\right)$

**Proof**

$E[T_i(n)] \leq \max\{m, \lceil n/k \rceil\} + (n - mk)^+P(A_{mk+1} = i)$

$\leq \max\{m, \lceil n/k \rceil\} + (n - mk)^+P(\hat\mu_i(mk) \geq \max_{j \neq i} \hat\mu_j(mk))$

$\leq \max\{m, \lceil n/k \rceil\} + (n - mk)^+P(\hat\mu_i(mk) \geq \hat\mu_{i^*}(mk))$

$\leq \max\{m, \lceil n/k \rceil\} + (n - mk)^+P(\hat\mu_i(mk) - \mu_i - (\hat\mu_{i^*}(mk) - \mu^*) \geq \Delta_i)$

Since the rewards are sub-Gaussian, $\hat\mu_i(mk) - \mu_i - (\hat\mu_{i^*}(mk) - \mu^*)$ is sub-Gaussian with variance proxy $2R/m$, so by Hoeffding:

$E[T_i(n)] \leq \max\{m, \lceil n/k \rceil\} + (n - mk)^+\exp\left(\frac{-m\Delta_i^2}{4R}\right)$

$R_n = \sum_i \Delta_i E[T_i(n)]$

$\implies R_n \leq \max\{m, \lceil n/k \rceil\} \sum_i \Delta_i + (n - mk)^+\sum_i\Delta_i\exp\left(\frac{-m\Delta_i^2}{4R}\right)$

Consider $k = 2$ with optimal arm $i = 1$:

$R_n \leq m\Delta_2 + n\Delta_2 \exp\left(\frac{-m\Delta_2^2}{4R}\right)$

Choose $m = \max\left\{1, \frac{4R}{\Delta_2^2}\log\left(\frac{n\Delta_2^2}{4R}\right)\right\}$

Substituting, we get $R_n \leq \Delta_2 + \frac{4R}{\Delta_2}\left(1 + \max\left\{0, \log \frac{n\Delta_2^2}{4R}\right\}\right) = O(\log n)$
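Below is a minimal simulation sketch of ETC under the setup above, assuming Gaussian (hence 1-sub-Gaussian) arms; the arm means, `m`, `n`, and the seed are illustrative choices, not from the notes:

```python
import numpy as np

def etc(means, m, n, seed=0):
    """Explore-Then-Commit: pull each arm m times, then commit to the
    arm with the highest sample mean (Gaussian rewards assumed here)."""
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k)              # T_i(t)
    sums = np.zeros(k)                # running reward sums per arm
    regret, mu_star = 0.0, max(means)
    for t in range(n):
        if t < m * k:
            a = t % k                             # exploration round
        else:
            a = int(np.argmax(sums / counts))     # commit to best sample mean
        x = rng.normal(means[a], 1.0)             # reward x_t ~ P_{A_t}
        counts[a] += 1
        sums[a] += x
        regret += mu_star - means[a]              # pseudo-regret increment
    return regret

# Example: two arms with gap Delta_2 = 0.2
print(etc(means=[0.5, 0.3], m=50, n=10_000))
```

Increasing `m` trades more exploration regret for a smaller probability of committing to a suboptimal arm, which is the trade-off behind the choice of $m$ above.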
### Upper Confidence Bound (UCB)

We choose $\delta$ such that (Hoeffding):

$P\left(|\hat\mu_i - \mu_i| \leq \sqrt{\frac{2 \log(1/\delta)}{T_i(t)}}\right) \geq 1 - \delta$

$UCB_i(t, \delta) = \hat\mu_i + \sqrt{\frac{2 \log(1/\delta)}{T_i(t)}}$

We choose $A_t = \arg\max_i UCB_i(t - 1, \delta)$ and update $UCB_i(t, \delta)$ accordingly.

UCB with 1-sub-Gaussian rewards ($r \in [0,1]$) achieves regret:

$R_n \leq 3 \sum_i \Delta_i + \sum_{\Delta_i > 0} \frac{16 \log n}{\Delta_i}$

**Proof**

Consider the following events:

$A: \min_{t \leq n} UCB_{i^*}(t, \delta) \leq \mu^*$ (underestimating arm $i^*$)

$B: \hat\mu_{i, u_i} + \sqrt{\frac{2\log(1/\delta)}{u_i}} \geq \mu^*, \quad i \neq i^*, u_i \leq n$ (overestimating arm $i$)

Let $G_i = A^c \cap B^c$; then under $G_i$, $T_i(n) \leq u_i$.

We prove this by contradiction: if $T_i(n) > u_i$, there exists $t \leq n$ such that $T_i(t - 1) = u_i$ and $A_t = i$. But then

$UCB_i(t - 1, \delta) = \hat\mu_i(t - 1) + \sqrt{\frac{2 \log(1/\delta)}{T_i(t - 1)}} < \mu^* \leq UCB_{i^*}(t - 1, \delta) \quad (T_i(t-1) = u_i)$

Hence arm $i$ would not be pulled at time $t$, a contradiction.

$E[T_i(n)] = \int_{G_i} T_i(n) \, dP_{\nu,\pi} + \int_{G_i^c} T_i(n) \, dP_{\nu,\pi}$

$= E[I\{G_i\}T_i(n)] + E[I\{G_i^c\}T_i(n)]$

$\leq u_i E[I\{G_i\}] + nE[I\{G_i^c\}]$

$= u_i P(G_i) + n P(G_i^c) \leq u_i + n P(G_i^c)$

$G_i^c = A \cup B \implies P(G_i^c) \leq P(A) + P(B)$

$P(A) = P\left(\min_{t \leq n} UCB_{i^*}(t, \delta) \leq \mu^*\right) \leq P\left(\min_{s \leq n} \hat\mu_{i^*, s} + \sqrt{\frac{2 \log (1/\delta)}{s}} \leq \mu^*\right)$

$\leq \sum_{s=1}^n P\left(\hat\mu_{i^*, s} + \sqrt{\frac{2 \log (1/\delta)}{s}} \leq \mu^*\right) \leq n\delta$ (Hoeffding)

Assume $u_i$ is such that $\Delta_i - \sqrt{\frac{2 \log(1/\delta)}{u_i}} \geq c\Delta_i$ for some $c \in (0, 1)$.

$P(B) = P\left(\hat\mu_{i, u_i} + \sqrt{\frac{2\log(1/\delta)}{u_i}} \geq \mu^*\right)$

$= P\left(\hat\mu_{i, u_i} - \mu_i \geq \Delta_i - \sqrt{\frac{2\log(1/\delta)}{u_i}}\right) \leq P(\hat\mu_{i, u_i} - \mu_i \geq c\Delta_i) \leq \exp\left(\frac{-u_i c^2\Delta_i^2}{2}\right)$ (Hoeffding)

$E[T_i(n)] \leq u_i + n\left(n\delta + \exp\left(\frac{-u_i c^2\Delta_i^2}{2}\right)\right)$

Let us choose $u_i = \left\lceil \frac{2 \log (1/\delta)}{(1 - c)^2\Delta_i^2} \right\rceil$ and $\delta = 1/n^2$:

$E[T_i(n)] \leq \frac{2 \log (1/\delta)}{(1 - c)^2\Delta_i^2} + 2 + n^{1 - \frac{2c^2}{(1 - c)^2}}$

Let $c = 1/2 \implies E[T_i(n)] \leq \frac{16 \log n}{\Delta_i^2} + 3$

$\therefore R_n(UCB, \nu) \leq 3\sum_i\Delta_i + \sum_{\Delta_i > 0} \frac{16 \log n}{\Delta_i} = O(\log n)$

If $\Delta_i \ll 1$, the bound above becomes very large; we can rewrite the regret to accommodate situations where we have almost-optimal arms with very small $\Delta_i$.

*Missing Notes*
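A matching simulation sketch of the index rule above, with $\delta = 1/n^2$ as in the analysis. Each arm is pulled once for initialization so that $T_i > 0$; the Gaussian arms, means, and seed are illustrative assumptions:

```python
import numpy as np

def ucb(means, n, seed=0):
    """UCB with confidence level delta = 1/n^2, as in the analysis above."""
    rng = np.random.default_rng(seed)
    k = len(means)
    delta = 1.0 / n**2
    counts = np.zeros(k)
    sums = np.zeros(k)
    regret, mu_star = 0.0, max(means)
    for t in range(n):
        if t < k:
            a = t                                  # pull each arm once first
        else:
            mu_hat = sums / counts
            bonus = np.sqrt(2 * np.log(1 / delta) / counts)
            a = int(np.argmax(mu_hat + bonus))     # A_t = argmax_i UCB_i(t-1, delta)
        x = rng.normal(means[a], 1.0)
        counts[a] += 1
        sums[a] += x
        regret += mu_star - means[a]
    return regret

print(ucb(means=[0.5, 0.3, 0.1], n=10_000))
```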
### UCB Lower Bound and Minimax Regret

The worst-case regret of a policy $\pi$ on a class of environments $\mathcal{E}$ is $R_n(\pi, \mathcal{E}) = \sup_{\nu \in \mathcal{E}} R_n(\pi, \nu)$

Let $\Pi$ be a set of policies and $\mathcal{E}$ be a set of environments. The minimax regret is defined as $R_n^*(\mathcal{E}) = \inf_{\pi \in \Pi}\sup_{\nu \in \mathcal{E}} R_n(\pi, \nu)$

<!-- Lecture 4 -->

Let $\mathcal{E}(k)$ be the subset of $k$-armed bandits with unit variance and mean reward $\mu \in [0, 1]^k$. Then there exists a constant $c > 0$ such that for $k > 1$, $n \geq k$, it holds that

$R_n^*(\mathcal{E}(k)) \geq c\sqrt{(k - 1)n}$

UCB achieves worst-case regret $R_n(UCB) = O(\sqrt{kn \log n})$, i.e. it is minimax optimal up to the logarithmic factor.

**Minimax Optimal Strategy in Stochastic $k$-Armed Bandits (MOSS)**

$UCB_i(t) = \hat\mu_{i, t - 1} + \sqrt{\frac{4}{T_i(t - 1)} \log\left(\max\left(1, \frac{n}{kT_i(t - 1)}\right)\right)}$

$R_n \leq 40\sqrt{kn} + \sum_{i=1}^k \Delta_i$ (see Bandit Algorithms, Ch. 9)

### Bernoulli Bandit

Consider a $k$-armed bandit with $\mu_i \in [0, 1]$ for all $i$ and $x_t \sim \mathcal{B}(\mu_{A_t})$.

The relative entropy (KL divergence) between Bernoulli distributions with parameters $p, q \in [0,1]$ is defined as

$d(p, q) = p \log \frac{p}{q} + (1 - p)\log\frac{1 - p}{1 - q}$

### KL-UCB

$A_t \in \arg \max_i \max\left\{\mu \in [0,1]: d(\hat\mu_i(t - 1), \mu) \leq \frac{\log(1 + t \log^2 t)}{T_i(t - 1)}\right\}$
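A sketch of how the KL-UCB index above can be computed in practice: since $d(p, \cdot)$ is increasing on $[p, 1]$, the largest feasible $\mu$ can be found by bisection. The clipping constant, tolerance, and example values are implementation choices, not from the notes:

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """d(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q)), clipped for numerical stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, t, T_i, tol=1e-6):
    """Largest mu in [mu_hat, 1] with d(mu_hat, mu) <= log(1 + t log^2 t) / T_i."""
    threshold = math.log(1 + t * math.log(t) ** 2) / T_i
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:                  # bisection: d(mu_hat, .) increases on [mu_hat, 1]
        mid = (lo + hi) / 2
        if kl_bernoulli(mu_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

# Arm with empirical mean 0.4 after 10 pulls, evaluated at round t = 100
print(kl_ucb_index(0.4, t=100, T_i=10))
```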
### Adversarial Bandit

A $k$-armed adversarial bandit is an arbitrary sequence of reward vectors $\nu = (X_1, \dots, X_n)$ where $X_t \in [0, 1]^k$.

Consider the worst-case regret $R_n(\pi) = \sup_{\nu \in [0,1]^{k \times n}} R_n(\pi, \nu)$

For a deterministic policy (e.g. UCB): $R_n(\pi) \geq n(1 - \frac{1}{k})$

### Exponential-Weight Algorithm for Exploration and Exploitation (Exp3)

Achieves a regret of $\leq 2\sqrt{nk\log k}$ (Bandit Algorithms, Ch. 11)

### Contextual Bandit

A $k$-armed bandit where at each timestep we are given an action set $D_t$ and a context $c_t$, with reward $X_t = f(A_t, c_t)$

### Linear Bandit

$X_t = \langle\phi(c_t, A_t), \theta\rangle + \eta_t$, where $\eta_t$ is noise and $\theta$ is an unknown parameter vector

Let $X_1, \dots, X_n$ be a sequence of random variables on $(\Omega, \mathcal{F}, P)$ and let $F = (\mathcal{F}_t)_t$ be a filtration. An $F$-adapted sequence of integrable random variables $(X_t)$ is an $F$-adapted **super-martingale** if $E[X_t | \mathcal{F}_{t-1}] \leq X_{t-1} \; a.s.$

A **sub-martingale** instead satisfies $E[X_t | \mathcal{F}_{t-1}] \geq X_{t-1} \; a.s.$

Maximal inequality: let $(X_t)$ be a super-martingale with $X_t \geq 0$ a.s. for all $t$; then for any $\delta > 0$,

$P(\sup_{t \in \mathbb{N}} X_t > \delta) \leq \frac{E[X_0]}{\delta}$

### Linear Regression

Consider a linear model $X_t = \langle A_t, \theta\rangle + \eta_t$ with $A_t, \theta \in \mathbb{R}^d$. The sequence $A_1, X_1, A_2, X_2, \dots$ is $F$-adapted with $\mathcal{F}_t = \sigma(A_1, X_1, \dots, A_{t-1}, X_{t-1}, A_t)$, and the noise satisfies

$E[\exp(\alpha\eta_t) | \mathcal{F}_t] \leq \exp(\alpha^2/2) \quad \forall \; \alpha \in \mathbb{R}, t \in \mathbb{N}$

i.e. $\eta_t$ is conditionally 1-sub-Gaussian.

Given the sequence $(A_t, X_t)_t$, how can we estimate $\theta$?

Ridge regression: $\min_{\theta \in \mathbb{R}^d} \sum_{s = 1}^t (X_s - A_s^T\theta)^2 + ||\theta||_V^2$, where $V$ is positive semidefinite

Note: $||\theta||_V^2 = \theta^T V \theta$; usually we set $V = \lambda I \implies ||\theta||_V^2 = \lambda ||\theta||^2$

Let $S_t = \sum_{s=1}^t \eta_s A_s$, $V_t = \sum_{s = 1}^t A_sA_s^T$, and $V_t(V) = V + V_t$; then

$\hat\theta_t = V_t(V)^{-1} \sum_{s=1}^t X_sA_s$

We define $M_t(\alpha) = \exp(\alpha^TS_t - \frac{1}{2}||\alpha||_{V_t}^2)$

Theorem: for any $\alpha \in \mathbb{R}^d$, the sequence $\{M_t(\alpha)\}$ is an $F$-adapted super-martingale

Proof: we want to show $E[M_t(\alpha) | \mathcal{F}_{t-1}] \leq M_{t-1}(\alpha)$:

$E[\exp(\alpha^TS_t - \frac{1}{2}||\alpha||_{V_t}^2) | \mathcal{F}_{t-1}] \leq$ *Missing Notes*

For a Gaussian measure $h$ with covariance $V^{-1}$ and zero mean, $\bar M_t = \int_{\mathbb{R}^d} M_t(\alpha) \, dh(\alpha)$ is a super-martingale (using Fubini's theorem).

Using the Radon-Nikodym derivative, we have

$\bar M_t = \frac{1}{(2\pi)^{d/2} \det(V^{-1})^{1/2}} \int_{\mathbb{R}^d} \exp\left(\langle\alpha, S_t\rangle - \frac{1}{2}||\alpha||_{V_t}^2 - \frac{1}{2}||\alpha||_{V}^2\right)d\alpha$

Completing the square:

$||S_t||_{(V + V_t)^{-1}}^2 - ||\alpha - (V + V_t)^{-1}S_t||_{V + V_t}^2 = 2\langle\alpha, S_t\rangle - ||\alpha||_{V_t}^2 - ||\alpha||_V^2$

*Missing notes*

Theorem: for $\delta \in (0, 1)$ and $V = \lambda I$, with probability at least $1 - \delta$, for all $t \in [n]$ we have:

$||\hat \theta_t - \theta^*||_{V_t(V)} \leq \sqrt{\beta_t(\delta)} = \sqrt \lambda\, ||\theta^*|| + \sqrt{2\log(1/\delta) + \log\frac{\det V_t(V)}{\det V}}$

*Missing notes*

<!-- Lecture 5 -->

Note: in adversarial bandits, $R_n(\pi, \nu) = \max_i\sum_t X_{t,i} - \sum_t X_{t, A_t}$

### Linear Upper Confidence Bound (LinUCB)

From linear regression:

$X_t = \langle\theta^*, A_t\rangle + \eta_t$

$V_t(V) = \sum_{s = 1}^t A_sA_s^T + V$

$S_t = \sum_{s=1}^t \eta_s A_s$

$|| \hat \theta_t - \theta^* ||_{V_t(V)}^2 \leq \beta_t(\delta)$

$C_t(\delta) = \{\theta: ||\theta - \hat \theta_t||_{V_t(V)}^2 \leq \beta_t(\delta)\}$

With $V = \lambda I$ and $||\theta^*|| \leq S$:

$\sqrt{\beta_t(\delta)} \leq \sqrt{\lambda}\, S + \sqrt{2 \log (1/\delta) + \log\frac{\det V_t(V)}{\lambda^d}}$

*Missing notes; see LinUCB proof in Bandit Algorithms*

The optimization in LinUCB (jointly over the action set and the confidence set $C_t(\delta)$) is NP-hard in general.

### Thompson Sampling

### Linear Thompson Sampling (LinTS)

See: https://arxiv.org/abs/1209.3352, https://arxiv.org/abs/1611.06534

### Linear Bandit Extensions

https://banditalgs.com/

https://arxiv.org/abs/2011.04020

https://banditalgs.com/2016/10/19/stochastic-linear-bandits/

Adversarial linear bandits: the adversary picks $\{\theta_1, \theta_2, \dots, \theta_n\}$ and $X_t = \langle A_t, \theta_t\rangle$

### Bayesian Regret

Frequentist bound: for a set of environments $\mathcal{E}$, $R_n(\pi) = \sup_{\nu \in \mathcal{E}} R_n(\pi, \nu)$

Bayesian bound: $\nu \sim P_\mathcal{E}$, $R_n(\pi) = E_{P_\mathcal{E}}[R_n(\pi, \nu)]$

If you have a frequentist regret bound for an algorithm, it is automatically a Bayesian bound since $\sup(X) \geq E(X)$; the converse is not always true.

### Posterior Sampling for Reinforcement Learning (PSRL)

Prior: $P_\mathcal{E}$

History: $H_{t-1} = \{(A_1, X_1), (A_2, X_2), \dots, (A_{t-1}, X_{t-1})\}$

Posterior: $Post(P_\mathcal{E}, H_{t-1})$

Opponent drew $\nu^* \sim P_\mathcal{E}$

for $t = 1$ to $n$:

$\qquad$ Draw an environment $\nu_t \sim Post(P_\mathcal{E}, H_{t-1})$

$\qquad$ Choose $A_t = i^*(\nu_t)$ (we know $(i^*, \mu^*)$ for the sampled $\nu_t$) and receive $X_t \sim \nu^*(A_t)$

$\qquad$ Update $Post$ according to $(A_t, X_t)$
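For a Bernoulli bandit, the PSRL loop above reduces to Thompson sampling with an independent Beta posterior per arm (conjugate to Bernoulli rewards). A minimal sketch, where the Beta(1, 1) prior, arm means, and seed are assumptions for illustration:

```python
import numpy as np

def thompson_bernoulli(means, n, seed=0):
    """PSRL / Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors.
    Each round: sample an environment from the posterior, play its optimal arm,
    then update the posterior with (A_t, X_t)."""
    rng = np.random.default_rng(seed)
    k = len(means)
    alpha = np.ones(k)                    # 1 + number of observed successes per arm
    beta = np.ones(k)                     # 1 + number of observed failures per arm
    regret, mu_star = 0.0, max(means)
    for _ in range(n):
        theta = rng.beta(alpha, beta)     # nu_t ~ Post(P_E, H_{t-1})
        a = int(np.argmax(theta))         # A_t = i*(nu_t)
        x = rng.binomial(1, means[a])     # X_t ~ nu*(A_t)
        alpha[a] += x                     # conjugate Beta-Bernoulli update
        beta[a] += 1 - x
        regret += mu_star - means[a]
    return regret

print(thompson_bernoulli(means=[0.5, 0.45, 0.3], n=10_000))
```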