# Multi-Armed Bandits: Exploration versus Exploitation
## Introduction
In the world of decision-making, we often face a fundamental dilemma: Should we exploit our current knowledge and choose options that have proven successful in the past? Or should we explore uncharted territory, potentially discovering even better alternatives? This trade-off between **exploitation** and **exploration** is central to many fields, from business strategy to machine learning.
## The Multi-Armed Bandit Problem (MAB)
The Multi-Armed Bandit (MAB) problem provides a mathematical framework for tackling this dilemma. Imagine a row of slot machines (the "arms"), each with an unknown payout distribution. The challenge is to figure out which machine to play to maximize your winnings over time. Should you stick to the machine that seems to be paying well now (exploitation), or try different ones in the hope of finding an even better one (exploration)?
### Formal Definition
* **Tuple Representation:** The MAB problem is formally represented as $(\mathcal{A}, \mathcal{R})$, where
* $\mathcal{A}$: A set of $m$ actions (corresponding to the slot machine arms).
* $\mathcal{R}^a$: A probability distribution over rewards for each action $a \in \mathcal{A}$. This distribution is initially unknown.
* **Agent's Task:** At each time step $t$, the agent selects an action $A_t$ from $\mathcal{A}$.
* **Environment's Response:** The environment generates a reward $R_t$ according to the distribution $\mathcal{R}^{A_t}$ of the chosen action.
* **Goal:** The agent aims to maximize the cumulative reward $\sum_{t=1}^T R_t$ over $T$ time steps.
#### MAB: Beyond Markov Decision Processes
While MAB problems share some similarities with MDPs, a key difference is the absence of an explicit state in the environment. However, the agent can maintain an internal state that summarizes past actions and rewards, influencing its action selection policy.
### The Concept of Regret: Measuring Performance
The efficacy of a MAB algorithm is gauged by its **regret**, which quantifies the opportunity loss incurred by not consistently choosing the optimal action.
* **Action Value (Q-value):** $Q(a)$ represents the expected reward $\mathbb{E}[r|a]$ for action $a$. It is unknown initially.
* **Optimal Value and Action:** $V^* = Q(a^*) = \max_{a \in \mathcal{A}} Q(a)$ is the maximum expected reward, attained by the optimal action $a^* = \mathop{\arg\max}_{a \in \mathcal{A}} Q(a)$.
* **Regret at step $t$:** $l_t = \mathbb{E}[V^* - Q(A_t)]$ is the expected loss for not choosing the optimal action at step $t$.
* **Total Regret:** $L_T = \sum_{t=1}^{T} l_t$ is the cumulative regret over $T$ time steps.
Minimizing total regret is the primary goal in MAB, as it aligns with maximizing cumulative reward.
### Counts and Gaps: Understanding Regret
Two key concepts aid in understanding and minimizing regret:
* **Counts:** $N_t(a)$ denotes how many times action $a$ was chosen up to step $t$.
* **Gaps:** $\Delta_a = V^* - Q(a)$ quantifies the difference in expected reward between the optimal action and action $a$.
Total regret can be expressed as:
$$
L_T = \sum_{a \in \mathcal{A}} \mathbb{E}[N_T(a)] \cdot \Delta_a
$$
Effective algorithms aim to minimize the counts of actions with larger gaps, as these contribute more to the overall regret.
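As a quick numerical illustration (a minimal sketch with hypothetical values, not from the book), the decomposition above can be evaluated directly once the gaps and expected counts are known:
```python
# Hypothetical true action values and expected pull counts after T = 100 steps.
q_values = {"a1": 1.0, "a2": 0.7, "a3": 0.4}
exp_counts = {"a1": 60, "a2": 30, "a3": 10}

v_star = max(q_values.values())
gaps = {a: v_star - q for a, q in q_values.items()}   # Delta_a = V* - Q(a)

# L_T = sum over actions of E[N_T(a)] * Delta_a
total_regret = sum(exp_counts[a] * gaps[a] for a in q_values)
print(total_regret)   # 30 * 0.3 + 10 * 0.6 = 15.0
```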
### Code Implementation
The provided Python code presents a base class (`MABBase`) for implementing MAB algorithms. It includes methods for simulating episodes, calculating expected rewards and regrets, tracking action counts, and plotting regret curves. This structure facilitates the development and comparison of different MAB algorithms, offering a practical way to test their performance and understand their theoretical properties.
```python=
from typing import Sequence, Tuple
from abc import ABC, abstractmethod
from rl.distribution import Distribution
from numpy import ndarray, mean, vstack, cumsum, full, bincount
class MABBase(ABC):
def __init__(
self,
arm_distributions: Sequence[Distribution[float]],
time_steps: int,
num_episodes: int
) -> None:
self.arm_distributions: Sequence[Distribution[float]] = \
arm_distributions
self.num_arms: int = len(arm_distributions)
self.time_steps: int = time_steps
self.num_episodes: int = num_episodes
@abstractmethod
def get_episode_rewards_actions(self) -> Tuple[ndarray, ndarray]:
pass
def get_all_rewards_actions(self) -> Sequence[Tuple[ndarray, ndarray]]:
return [self.get_episode_rewards_actions()
for _ in range(self.num_episodes)]
def get_rewards_matrix(self) -> ndarray:
return vstack([x for x, _ in self.get_all_rewards_actions()])
def get_actions_matrix(self) -> ndarray:
return vstack([y for _, y in self.get_all_rewards_actions()])
def get_expected_rewards(self) -> ndarray:
return mean(self.get_rewards_matrix(), axis=0)
def get_expected_cum_rewards(self) -> ndarray:
return cumsum(self.get_expected_rewards())
def get_expected_regret(self, best_mean) -> ndarray:
return full(self.time_steps, best_mean) - self.get_expected_rewards()
def get_expected_cum_regret(self, best_mean) -> ndarray:
return cumsum(self.get_expected_regret(best_mean))
def get_action_counts(self) -> ndarray:
return vstack([bincount(ep, minlength=self.num_arms)
for ep in self.get_actions_matrix()])
def get_expected_action_counts(self) -> ndarray:
return mean(self.get_action_counts(), axis=0)
def plot_exp_cum_regret_curve(self, best_mean) -> None:
import matplotlib.pyplot as plt
x_vals = range(1, self.time_steps + 1)
plt.plot(
self.get_expected_cum_regret(best_mean),
"b",
label="Expected Total Regret"
)
plt.xlabel("Time Steps", fontsize=20)
plt.ylabel("Expected Total Regret", fontsize=20)
plt.title("Total Regret Curve", fontsize=25)
plt.xlim(xmin=x_vals[0], xmax=x_vals[-1])
plt.ylim(ymin=0.0)
# plt.xticks(x_vals)
plt.grid(True)
# plt.legend(loc='upper left')
plt.show()
```
## Simple Bandit Algorithms
To effectively address the multi-armed bandit problem, algorithms typically estimate action values, denoted $\hat{Q}_t(a)$, as approximations of the true expected rewards $Q(a)$. A common and intuitive way to estimate these values is by averaging the observed rewards:
$$
\hat{Q}_t(a) = \frac{1}{N_t(a)} \sum_{s=1}^t R_s \cdot \mathbf{1}_{A_s=a}
$$
where $\mathbf{1}_{A_s=a}$ is an indicator function that equals 1 if action $a$ was chosen at step $s$, and 0 otherwise.
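In practice this average is usually maintained incrementally, which is exactly how the implementations later in this chapter update their estimates. A minimal sketch:
```python
from typing import Tuple

def update_estimate(q_hat: float, n: int, reward: float) -> Tuple[float, int]:
    """Fold one observed reward into the running sample-average for an action."""
    n += 1
    q_hat += (reward - q_hat) / n   # Q_n = Q_{n-1} + (R - Q_{n-1}) / n
    return q_hat, n

q_hat, n = 0.0, 0
for r in [1.0, 0.0, 1.0, 1.0]:      # hypothetical rewards observed for one action
    q_hat, n = update_estimate(q_hat, n, r)
print(q_hat)                        # 0.75, the plain average of the observed rewards
```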
Let's delve into a few simple, yet insightful, bandit algorithms:
### Greedy Algorithm
The greedy algorithm prioritizes **exploitation**. It consistently selects the action with the highest estimated value:
$$
A_t = \mathop{\arg\max}_{a \in \mathcal{A}} \hat{Q}_{t-1}(a)
$$
While straightforward, this approach can easily get stuck in suboptimal choices if it doesn't explore enough initially. This leads to a total regret that grows linearly over time.
### $\epsilon$-Greedy Algorithm
To strike a balance between exploration and exploitation, the $\epsilon$-greedy algorithm introduces randomness. At each time step $t$:
* With probability $1-\epsilon$, it selects the greedy action (exploitation).
* With probability $\epsilon$, it chooses a random action (exploration).
This ensures a minimum level of exploration, but it also incurs a minimum regret proportional to the mean gap:
$$
l_t \geq \frac{\epsilon}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} \Delta_a
$$
Therefore, the total regret still grows linearly with the number of time steps.
### Optimistic Initialization
Optimistic initialization promotes early exploration by setting the initial action values $\hat{Q}_0(a)$ to high values. This encourages the algorithm to try each action before potentially settling on a suboptimal choice. The update rule remains the same:
$$
\begin{split}
N_t(A_t) &= N_{t-1}(A_t) + 1 \\
\hat{Q}_t(A_t) &= \hat{Q}_{t-1}(A_t) + \frac{R_t - \hat{Q}_{t-1}(A_t)}{N_t(A_t)}
\end{split}
$$
Although optimistic initialization can improve performance in practice, it might still lead to linear total regret in the worst-case scenario.
### Decaying $\epsilon_t$-Greedy Algorithm
The decaying $\epsilon_t$-greedy algorithm addresses the limitations of the constant $\epsilon$-greedy approach by gradually reducing the exploration probability $\epsilon$ over time. A commonly used decay schedule is:
$$
\epsilon_t = \min(1, \frac{c|\mathcal{A}|}{d^2(t+1)})
$$
where $c$ is a positive constant and $d$ is the smallest positive gap between the optimal and suboptimal actions:
$$
d = \min_{a|\Delta_a > 0} \Delta_a
$$
With an appropriate decay schedule, this algorithm can achieve logarithmic total regret, significantly outperforming the linear regret of the previous algorithms. However, finding the optimal decay schedule usually requires knowledge of the gaps $\Delta_a$, which are typically unknown in real-world applications.
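Below is a small sketch of this schedule with hypothetical values of $c$ and $d$ (note that the `EpsilonGreedy` implementation that follows uses an exponential half-life decay instead, precisely because the gaps are unknown in practice):
```python
def decaying_epsilon(t: int, c: float, num_actions: int, min_gap: float) -> float:
    """epsilon_t = min(1, c * |A| / (d^2 * (t + 1)))."""
    return min(1.0, c * num_actions / (min_gap * min_gap * (t + 1)))

# With c = 1, 5 arms and smallest gap d = 0.5, exploration fades roughly as 1/t.
print([round(decaying_epsilon(t, c=1.0, num_actions=5, min_gap=0.5), 3)
       for t in (0, 10, 100, 1000)])
# [1.0, 1.0, 0.198, 0.02]
```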

### Code Implementation
The `EpsilonGreedy` class below covers all of the above variants: `epsilon=0` gives the greedy algorithm, `count_init` and `mean_init` provide optimistic initialization, and `epsilon_half_life` controls a decaying schedule (an exponential half-life decay rather than the gap-based schedule described above).
```python=
from typing import List, Callable, Tuple, Sequence
from rl.distribution import Distribution, Gaussian, Range, Bernoulli
from rl.chapter14.mab_base import MABBase
from operator import itemgetter
from numpy import ndarray, empty
class EpsilonGreedy(MABBase):
def __init__(
self,
arm_distributions: Sequence[Distribution[float]],
time_steps: int,
num_episodes: int,
epsilon: float,
epsilon_half_life: float = 1e8,
count_init: int = 0,
mean_init: float = 0.,
) -> None:
if epsilon < 0 or epsilon > 1 or \
epsilon_half_life <= 1 or count_init < 0:
raise ValueError
super().__init__(
arm_distributions=arm_distributions,
time_steps=time_steps,
num_episodes=num_episodes
)
self.epsilon_func: Callable[[int], float] = \
EpsilonGreedy.get_epsilon_decay_func(epsilon, epsilon_half_life)
self.count_init: int = count_init
self.mean_init: float = mean_init
@staticmethod
def get_epsilon_decay_func(
epsilon: float,
epsilon_half_life: float
) -> Callable[[int], float]:
def epsilon_decay(
t: int,
epsilon: float = epsilon,
epsilon_half_life: float = epsilon_half_life
) -> float:
return epsilon * 2 ** -(t / epsilon_half_life)
return epsilon_decay
def get_episode_rewards_actions(self) -> Tuple[ndarray, ndarray]:
counts: List[int] = [self.count_init] * self.num_arms
means: List[float] = [self.mean_init] * self.num_arms
ep_rewards: ndarray = empty(self.time_steps)
ep_actions: ndarray = empty(self.time_steps, dtype=int)
for i in range(self.time_steps):
max_action: int = max(enumerate(means), key=itemgetter(1))[0]
epsl: float = self.epsilon_func(i)
action: int = max_action if Bernoulli(1 - epsl).sample() else \
Range(self.num_arms).sample()
reward: float = self.arm_distributions[action].sample()
counts[action] += 1
means[action] += (reward - means[action]) / counts[action]
ep_rewards[i] = reward
ep_actions[i] = action
return ep_rewards, ep_actions
```
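As a usage sketch (assuming the `rl.distribution` and `rl.chapter14` modules imported above are available), the same class covers several of the algorithms discussed; two configurations are shown below.
```python
from rl.distribution import Gaussian
from rl.chapter14.epsilon_greedy import EpsilonGreedy

arms = [Gaussian(mu=m, sigma=s) for m, s in [(0., 2.), (5., 4.), (8., 3.)]]
best_mean = 8.0   # mean of the best arm, used for regret calculations

# Constant epsilon-greedy: the huge half-life keeps epsilon fixed at 0.2.
eps_greedy = EpsilonGreedy(
    arm_distributions=arms, time_steps=200, num_episodes=100,
    epsilon=0.2, epsilon_half_life=1e8
)

# Decaying epsilon-greedy: epsilon is halved every 50 time steps.
decaying = EpsilonGreedy(
    arm_distributions=arms, time_steps=200, num_episodes=100,
    epsilon=0.2, epsilon_half_life=50
)

eps_greedy.plot_exp_cum_regret_curve(best_mean)
decaying.plot_exp_cum_regret_curve(best_mean)
```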
## Fundamental Limits: The Lower Bound on Regret
A natural question arises when studying bandit algorithms: Is there a limit on how slowly total regret can grow, no matter how clever the algorithm? There is: no algorithm can achieve better than logarithmic growth in total regret, and its performance is inherently limited by how similar the reward distributions of the different arms are to each other.
### The Challenge of Hard Bandit Problems
Intuitively, bandit problems become more challenging when the reward distributions of different arms are very similar. In such cases, it becomes increasingly difficult to distinguish between the optimal arm (the one with the highest expected reward) and suboptimal arms. This makes the exploration process more costly, as the algorithm needs to gather more information to confidently identify the best choice.
### The Lai-Robbins Lower Bound
A landmark result by Lai and Robbins (1985) established a fundamental lower bound on the total regret achievable by any algorithm in the face of this challenge. This bound depends on two key factors:
1. **Gaps:** The differences between the expected rewards of the optimal arm and the suboptimal arms. Smaller gaps mean the problem is harder, and the regret will be higher.
2. **Kullback-Leibler (KL) Divergence:** A measure of how dissimilar the reward distributions of different arms are. Higher KL divergence indicates greater distinguishability between arms, potentially leading to lower regret.
**Theorem** (Lai and Robbins)
For any multi-armed bandit problem $(\mathcal{A},\mathcal{R})$, the total regret $L_T$ of any algorithm satisfies, asymptotically:
$$
\lim_{T \rightarrow \infty} L_T \geq \log T \sum_{a|\Delta_a > 0} \frac{\Delta_a}{KL(\mathcal{R}^a || \mathcal{R}^{a^*})}
$$
where $KL(\mathcal{R}^a || \mathcal{R}^{a^*})$ is the KL divergence between the reward distribution of action $a$ and the optimal action $a^*$.
### Key Insights from the Lower Bound
This lower bound has profound implications for the design and analysis of bandit algorithms:
* **Logarithmic Regret is the Best We Can Hope For:** Even the best possible algorithm cannot achieve total regret that grows more slowly than logarithmically in the number of time steps $T$. This means that some level of regret is unavoidable, especially in the long run.
* **The Impact of Gaps:** The smaller the gaps between the optimal and suboptimal actions, the higher the regret will be. This highlights the importance of careful exploration in problems where the differences in rewards are subtle.
* **The Role of KL Divergence:** The KL divergence serves as a measure of how easy or difficult it is to distinguish between different arms based on their reward distributions. Larger KL divergence implies easier distinguishability and potentially lower regret.
The Lai-Robbins lower bound acts as a crucial benchmark for evaluating the performance of bandit algorithms. It sets a realistic expectation for what is achievable and guides the development of algorithms that strive to approach this fundamental limit in the most favorable scenarios.
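To make the bound concrete, here is a small sketch (not from the book) that evaluates the Lai-Robbins expression for a Bernoulli bandit, where the KL divergence between arms with success probabilities $p$ and $q$ has the closed form $p \log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$:
```python
from math import log

def kl_bernoulli(p: float, q: float) -> float:
    """KL(Bernoulli(p) || Bernoulli(q))."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

probs = [0.3, 0.5, 0.7, 0.9]    # hypothetical arm success probabilities
p_star = max(probs)
T = 10_000

# Lai-Robbins lower bound: log T * sum over suboptimal arms of Delta_a / KL(R^a || R^a*)
bound = log(T) * sum((p_star - p) / kl_bernoulli(p, p_star)
                     for p in probs if p < p_star)
print(round(bound, 1))
```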
## Upper Confidence Bound (UCB) Algorithms
The Upper Confidence Bound (UCB) family of algorithms embraces an optimistic approach to exploration in the multi-armed bandit problem. Instead of solely relying on the estimated values of actions, UCB algorithms incorporate upper confidence bounds to guide their decision-making. This optimism encourages the exploration of actions whose potential rewards might be underestimated.
### UCB1 Algorithm
The UCB1 algorithm, a cornerstone of the UCB family, maintains an upper confidence bound $\hat{U}_t(a)$ for each action $a$ at time step $t$. The algorithm then selects the action that maximizes the sum of its estimated reward and its confidence bound:
$$
A_{t+1} = \mathop{\arg\max}_{a \in \mathcal{A}} \{\hat{Q}_t(a) + \hat{U}_t(a)\}
$$
This approach elegantly balances exploration and exploitation. When an action has been chosen infrequently (low $N_t(a)$), its confidence bound $\hat{U}_t(a)$ will be relatively large, making it more likely to be explored. As an action is chosen more often, its confidence bound gradually shrinks, shifting the focus towards exploitation if its estimated value is promising.
One of the most remarkable features of UCB1 is its theoretical guarantee: it achieves **logarithmic total regret** asymptotically. This means that as the number of time steps $T$ increases, the growth of regret is significantly slower than linear, making it a very attractive algorithm for long-term decision-making.
### Hoeffding's Inequality and Confidence Bounds
The calculation of upper confidence bounds is underpinned by Hoeffding's inequality, a powerful tool in probability theory:
**Theorem (Hoeffding's Inequality)**
Let $X_1, \dots, X_n$ be independent and identically distributed random variables in the range $[0, 1]$, and let $\bar{X}_n = \frac1n \sum_{i=1}^n X_i$ be their sample mean. Then for any $u \geq 0$,
$$
\mathbb{P}[\mathbb{E}[\bar{X}_n] > \bar{X}_n + u] \leq e^{-2nu^2}
$$
In the context of bandit algorithms with rewards bounded in $[0,1]$, we can apply Hoeffding's inequality to the rewards of each action. The estimated reward $\hat{Q}_t(a)$ acts as the sample mean, and we set the tolerance level $u$ equal to the upper confidence bound $\hat{U}_t(a)$:
$$
\mathbb{P}[Q(a) \geq \hat{Q}_t(a) + \hat{U}_t(a)] \leq e^{-2 N_t(a) \cdot \hat{U}_t(a)^2}
$$
By setting a desired probability $p$ (e.g., $p=t^{-\alpha}$ for some $\alpha>0$) we can solve for $\hat{U}_t(a)$:
$$
e^{-2 N_t(a) \cdot \hat{U}_t(a)^2}
= p
\Rightarrow \hat{U}_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}
$$
### Deriving the UCB1 Algorithm
Substituting the chosen probability $p=t^{-\alpha}$ into the expression for $\hat{U}_t(a)$ yields the upper confidence bound used in the UCB1 algorithm:
$$
\hat{U}_t(a) = \sqrt{\frac{\alpha \log t}{2N_t(a)}}
$$
Combining this with the estimated reward $\hat{Q}_t(a)$, we get the complete UCB1 algorithm:
$$
A_{t+1} = \mathop{\arg\max}_{a \in \mathcal{A}} \left\{\hat{Q}_t(a) + \sqrt{\frac{\alpha \log t}{2 N_t(a)}} \right\}
$$
### UCB1 Logarithmic Regret Guarantee
The effectiveness of UCB1 is backed by the following theorem:
**Theorem (UCB1 Logarithmic Total Regret)** As the number of time steps $T$ approaches infinity, the total regret $L_T$ of the UCB1 algorithm is bounded by:
$$
L_T \leq \sum_{a|\Delta_a > 0} \left( \frac{4\alpha \cdot \log T}{\Delta_a} + \frac{2\alpha \cdot \Delta_a}{\alpha-1} \right)
$$
This result proves that the total regret of UCB1 grows logarithmically with time, making it a powerful tool for balancing exploration and exploitation in many real-world applications.
### Code Implementation
The provided Python code demonstrates a basic implementation of the UCB1 algorithm within the `MABBase` framework, showcasing its practical application for solving multi-armed bandit problems.
```python=
from typing import Sequence, Tuple, List
from rl.distribution import Distribution
from rl.chapter14.mab_base import MABBase
from operator import itemgetter
from numpy import ndarray, empty, sqrt, log
class UCB1(MABBase):
def __init__(
self,
arm_distributions: Sequence[Distribution[float]],
time_steps: int,
num_episodes: int,
bounds_range: float,
alpha: float
) -> None:
if bounds_range < 0 or alpha <= 0:
raise ValueError
super().__init__(
arm_distributions=arm_distributions,
time_steps=time_steps,
num_episodes=num_episodes
)
self.bounds_range: float = bounds_range
self.alpha: float = alpha
def get_episode_rewards_actions(self) -> Tuple[ndarray, ndarray]:
ep_rewards: ndarray = empty(self.time_steps)
ep_actions: ndarray = empty(self.time_steps, dtype=int)
for i in range(self.num_arms):
ep_rewards[i] = self.arm_distributions[i].sample()
ep_actions[i] = i
counts: List[int] = [1] * self.num_arms
means: List[float] = [ep_rewards[j] for j in range(self.num_arms)]
for i in range(self.num_arms, self.time_steps):
ucbs: Sequence[float] = [means[j] + self.bounds_range *
sqrt(0.5 * self.alpha * log(i) /
counts[j])
for j in range(self.num_arms)]
action: int = max(enumerate(ucbs), key=itemgetter(1))[0]
reward: float = self.arm_distributions[action].sample()
counts[action] += 1
means[action] += (reward - means[action]) / counts[action]
ep_rewards[i] = reward
ep_actions[i] = action
return ep_rewards, ep_actions
```
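A quick usage sketch (assuming Bernoulli arms, so rewards are bounded in $[0, 1]$ and `bounds_range=1.0` is appropriate):
```python
from rl.distribution import Bernoulli
from rl.chapter14.ucb1 import UCB1

arms = [Bernoulli(p) for p in [0.2, 0.5, 0.8]]
ucb1 = UCB1(
    arm_distributions=arms,
    time_steps=500,
    num_episodes=200,
    bounds_range=1.0,   # width of the reward range [0, 1]
    alpha=4.0
)
ucb1.plot_exp_cum_regret_curve(best_mean=0.8)
```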
## Bayesian Bandits
Bayesian bandits offer a sophisticated framework for incorporating prior knowledge about reward distributions into the decision-making process. Unlike frequentist approaches that treat rewards as fixed but unknown, Bayesian methods model rewards as random variables with prior distributions. This nuanced approach enables a more comprehensive understanding of uncertainty and facilitates informed exploration strategies.
### Bayesian Approach: Posterior Distribution and Exploration
At the heart of Bayesian bandits is the computation of the posterior distribution of rewards, denoted $\mathbb{P}[\mathcal{R}|h_t]$, conditioned on the history of actions and rewards, $h_t= (A_1, R_1, \dots, A_t, R_t)$. This posterior encapsulates the agent's updated beliefs about the reward probabilities associated with each action after observing the accumulated data.
Leveraging this posterior, Bayesian bandits guide exploration through two primary algorithms:
1. **Bayesian Upper Confidence Bounds (Bayesian UCB):** Extends the UCB concept by integrating prior knowledge into the confidence bounds.
2. **Probability Matching (Thompson Sampling):** Samples directly from the posterior distribution to select actions.
### Bayesian UCB: Gaussian Rewards
In Bayesian UCB, a common assumption is that reward distributions follow a Gaussian form:
$$
\mathcal{R}^a(r) = \mathcal{N}(r;\mu_a,\sigma_a^2)
$$
where $\mu_a$ is the mean reward and $\sigma^2_a$ is the variance of action $a$.
Given the history $h_t$, we compute the Gaussian posterior distribution over $\mu_a$ and variance $\sigma_a^2$ for each action. The Bayesian UCB algorithm then selects the action maximizing the expected value of the mean plus a multiple (controlled by parameter $c$) of the standard deviation:
$$
A_{t+1} = \mathop{\arg\max}_{a \in \mathcal{A}}\ \mathbb{E}_{\mathbb{P}[\mu_a,\sigma_a^2|h_t]} \left[\mu_a + \frac{c \cdot \sigma_a}{\sqrt{N_t(a)}} \right]
$$
### Probability Matching and Thompson Sampling
Probability matching algorithms, such as Thompson Sampling, select actions randomly based on their probability of being optimal according to the posterior:
$$
\pi(A_{t+1}|h_t)
= \mathbb{P}_{\mathcal{D}_t \sim \mathbb{P}[\mathcal{R}|h_t]} \left[\mathbb{E}_{\mathcal{D}_t}[r| A_{t+1}] > \mathbb{E}_{\mathcal{D}_t}[r| a], \forall a \neq A_{t+1} \right]
$$
Thompson Sampling simplifies this by drawing samples of reward distributions $\mathcal{D}_t$ from the posterior $\mathbb{P}[\mathcal{R}|h_t]$. It then computes a sample action-value function:
$$
\hat{Q}_t(a) = \mathbb{E}_{\mathcal{D}_t}[r|a]
$$
Finally, it selects the action maximizing this sample function:
$$
A_{t+1} = \mathop{\arg\max}_{a \in \mathcal{A}}\ \hat{Q}_t(a)
$$
Thompson Sampling boasts strong theoretical guarantees and often exhibits exceptional performance. Notably, under certain conditions, it can achieve the Lai-Robbins lower bound, making it a benchmark for Bayesian bandits.
### Example: Thompson Sampling
#### Gaussian-InverseGamma Distributions
The provided Python code showcases a Thompson Sampling implementation for Gaussian reward distributions with unknown mean and variance, utilizing Gaussian-Inverse Gamma conjugate priors for efficient updates.
```python=
from typing import Sequence, Tuple, List
from rl.distribution import Gaussian, Gamma
from rl.chapter14.mab_base import MABBase
from operator import itemgetter
from numpy import ndarray, empty, sqrt
class ThompsonSamplingGaussian(MABBase):
def __init__(
self,
arm_distributions: Sequence[Gaussian],
time_steps: int,
num_episodes: int,
init_mean: float,
init_stdev: float
) -> None:
super().__init__(
arm_distributions=arm_distributions,
time_steps=time_steps,
num_episodes=num_episodes
)
self.theta0: float = init_mean
self.n0: int = 1
self.alpha0: float = 1
self.beta0: float = init_stdev * init_stdev
def get_episode_rewards_actions(self) -> Tuple[ndarray, ndarray]:
ep_rewards: ndarray = empty(self.time_steps)
ep_actions: ndarray = empty(self.time_steps, dtype=int)
bayes: List[Tuple[float, int, float, float]] = [
(self.theta0, self.n0, self.alpha0, self.beta0)
] * self.num_arms
for i in range(self.time_steps):
mean_draws: Sequence[float] = [
Gaussian(
mu=theta,
sigma=1 / sqrt(n * Gamma(alpha=alpha, beta=beta).sample())
).sample()
for theta, n, alpha, beta in bayes
]
action: int = max(enumerate(mean_draws), key=itemgetter(1))[0]
reward: float = self.arm_distributions[action].sample()
theta, n, alpha, beta = bayes[action]
bayes[action] = (
(reward + n * theta) / (n + 1),
n + 1,
alpha + 0.5,
beta + 0.5 * n / (n + 1) * (reward - theta) * (reward - theta)
)
ep_rewards[i] = reward
ep_actions[i] = action
return ep_rewards, ep_actions
```
#### Beta Probability Distributions
Another Python code example demonstrates Thompson Sampling for Bernoulli bandits, employing Beta distributions as conjugate priors for the underlying reward probabilities.
```python=
from typing import Sequence, Tuple, List
from rl.distribution import Bernoulli, Beta
from rl.chapter14.mab_base import MABBase
from operator import itemgetter
from numpy import ndarray, empty
class ThompsonSamplingBernoulli(MABBase):
def __init__(
self,
arm_distributions: Sequence[Bernoulli],
time_steps: int,
num_episodes: int
) -> None:
super().__init__(
arm_distributions=arm_distributions,
time_steps=time_steps,
num_episodes=num_episodes
)
def get_episode_rewards_actions(self) -> Tuple[ndarray, ndarray]:
ep_rewards: ndarray = empty(self.time_steps)
ep_actions: ndarray = empty(self.time_steps, dtype=int)
bayes: List[Tuple[int, int]] = [(1, 1)] * self.num_arms
for i in range(self.time_steps):
mean_draws: Sequence[float] = [
Beta(alpha=alpha, beta=beta).sample()
for alpha, beta in bayes
]
action: int = max(enumerate(mean_draws), key=itemgetter(1))[0]
reward: float = float(self.arm_distributions[action].sample())
alpha, beta = bayes[action]
bayes[action] = (alpha + int(reward), beta + int(1 - reward))
ep_rewards[i] = reward
ep_actions[i] = action
return ep_rewards, ep_actions
```
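A short usage sketch: the expected arm counts should concentrate on the arm with the highest success probability as Thompson Sampling's posterior sharpens.
```python
from rl.distribution import Bernoulli
from rl.chapter14.ts_bernoulli import ThompsonSamplingBernoulli

arms = [Bernoulli(p) for p in [0.3, 0.6, 0.9]]
ts = ThompsonSamplingBernoulli(
    arm_distributions=arms,
    time_steps=500,
    num_episodes=200
)
# Expected number of pulls per arm; most pulls should go to the p = 0.9 arm.
print(ts.get_expected_action_counts())
```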
## Gradient Bandits
Gradient Bandits offer an alternative approach to the multi-armed bandit problem, drawing inspiration from policy gradient methods in reinforcement learning. Instead of directly estimating the value of each action (like in Q-learning), they maintain a set of score parameters for each action, which determine the probability of selecting those actions. These scores are updated iteratively using gradient ascent on an expected reward objective function.
### Gradient Bandit Algorithms: Stochastic Gradient Ascent
The core idea behind gradient bandit algorithms is to optimize score parameters $s_a$ for each action $a \in \mathcal{A} = \{a_1, \dots, a_m \}$ using stochastic gradient ascent. The objective function to be maximized is the expected reward:
$$
J(s_{a_1}, \dots, s_{a_m}) = \sum_{a \in \mathcal{A}} \pi(a) \cdot \mathbb{E}[r|a]
$$
where $\pi: \mathcal{A} \rightarrow [0,1]$ represents the probability of taking each action, defined by the softmax function:
$$
\pi(a) = \frac{e^{s_a}}{\sum_{b \in \mathcal{A}} e^{s_b}}
$$
The scores essentially encode the relative preference for each action based on the observed rewards.
#### Gradient of Expected Reward
To perform gradient ascent, we need to calculate the gradient of the objective function $J(\cdot)$ with respect to the score parameters $s_a$. By applying the chain rule and leveraging the properties of the softmax function, we can derive this gradient:
$$
\begin{split}
\frac{\partial J}{\partial s_a} &= \frac{\partial}{\partial s_a} \left(\sum_{a' \in \mathcal{A}} \pi(a') \cdot \mathbb{E}[r|a']\right) \\
&= \sum_{a' \in \mathcal{A}} \frac{\partial \pi(a')}{\partial s_a} \cdot \mathbb{E}[r|a'] \\
&= \sum_{a' \in \mathcal{A}} \pi(a') \cdot \mathbb{E}[r|a'] \cdot \frac{\partial \log \pi(a')}{\partial s_a} \\
&= \mathbb{E}_{a' \sim \pi, r \sim \mathcal{R}^{a'}} \left[r \cdot \frac{\partial \log \pi(a')}{\partial s_a} \right]
\end{split}
$$
Now, we can further simplify the gradient by computing the derivative of the log probability:
$$
\frac{\partial \log \pi(a')}{\partial{s_a}}
= \frac{\partial}{\partial{s_a}} \left(\log \frac{e^{s_{a'}}}{\sum_{b \in \mathcal{A}} e^{s_b}}\right)
= \mathbb{1}_{a=a'} - \pi(a)
$$
Substituting this back into the gradient expression, we obtain:
$$
\frac{\partial{J}}{\partial{s_a}} = \mathbb{E}_{a' \sim \pi, r \sim \mathcal{R}^{a'}}[r \cdot (\mathbb{1}_{a=a'} - \pi(a))]
$$
#### Score Updates with Stochastic Gradient Ascent
In practice, we don't have direct access to the true expected reward, so we approximate the gradient using a single sample $(A_t, R_t)$ at each time step $t$. The update rule for the scores then becomes:
$$
s_{t+1}(a) = s_t(a) + \alpha \cdot R_t \cdot (\mathbb{1}_{a=A_t} - \pi_t(a))
$$
where $\alpha$ is the learning rate.
#### Baseline Trick
To reduce the variance of the gradient estimate and improve convergence, we can subtract a baseline $B$ (independent of the action) from the reward:
$$
s_{t+1}(a) = s_t(a) + \alpha \cdot (R_t - B) \cdot (\mathbb{1}_{a=A_t} - \pi_t(a))
$$
A common choice for the baseline is the average of the observed rewards:
$$
B = \bar{R}_t = \frac1t \sum_{s=1}^t R_s
$$
This update rule allows the gradient bandit algorithm to adapt its preferences for different actions based on the observed rewards, leading to improved performance over time.
### Code Implementation
The provided Python code showcases a basic implementation of a gradient bandit algorithm. You can experiment with different learning rates and decay schedules to observe their impact on the algorithm's convergence and overall performance.
```python=
from typing import Sequence, Tuple, List
from rl.distribution import Distribution, Categorical
from rl.chapter14.mab_base import MABBase
from numpy import ndarray, empty, exp
class GradientBandits(MABBase):
def __init__(
self,
arm_distributions: Sequence[Distribution[float]],
time_steps: int,
num_episodes: int,
learning_rate: float,
learning_rate_decay: float
) -> None:
super().__init__(
arm_distributions=arm_distributions,
time_steps=time_steps,
num_episodes=num_episodes
)
self.learning_rate: float = learning_rate
self.learning_rate_decay: float = learning_rate_decay
def get_episode_rewards_actions(self) -> Tuple[ndarray, ndarray]:
ep_rewards: ndarray = empty(self.time_steps)
ep_actions: ndarray = empty(self.time_steps, dtype=int)
scores: List[float] = [0.] * self.num_arms
avg_reward: float = 0.
for i in range(self.time_steps):
max_score: float = max(scores)
exp_scores: List[float] = [exp(s - max_score) for s in scores]
sum_exp_scores: float = sum(exp_scores)
probs: List[float] = [s / sum_exp_scores for s in exp_scores]
action: int = Categorical(
{i: p for i, p in enumerate(probs)}
).sample()
reward: float = self.arm_distributions[action].sample()
avg_reward += (reward - avg_reward) / (i + 1)
step_size: float = self.learning_rate * (i / self.learning_rate_decay + 1) ** -0.5
for j in range(self.num_arms):
scores[j] += step_size * (reward - avg_reward) * ((1 if j == action else 0) - probs[j])
ep_rewards[i] = reward
ep_actions[i] = action
return ep_rewards, ep_actions
```
## Example: A Horse Race of Bandit Algorithms
In this section, we'll pit various bandit algorithms against each other in a simulated "horse race." We'll compare their total regret over time and examine how frequently they choose different actions (arms).
### Gaussian Horse Race
Let's begin with a scenario where the rewards of each arm are drawn from Gaussian distributions. We'll simulate this race for 7 arms, each with a different mean reward and standard deviation. We'll run the simulation for 500 time steps and average the results over 500 episodes.
```python=
from operator import itemgetter
from rl.distribution import Gaussian, Bernoulli
from rl.chapter14.epsilon_greedy import EpsilonGreedy
from rl.chapter14.ucb1 import UCB1
from rl.chapter14.ts_gaussian import ThompsonSamplingGaussian
from rl.chapter14.ts_bernoulli import ThompsonSamplingBernoulli
from rl.chapter14.gradient_bandits import GradientBandits
from numpy import arange
import matplotlib.pyplot as plt
def plot_gaussian_algorithms() -> None:
means_vars_data = [
(0., 10.),
(2., 20.),
(4., 1.),
(6., 8.),
(8., 4.),
(9., 6.),
(10., 4.)]
mu_star = max(means_vars_data, key=itemgetter(0))[0]
steps = 500
episodes = 500
eps = 0.3
eps_hl = 400
ci = 5
mi = mu_star * 3.
ts_mi = 0.
ts_si = 10.
lr = 0.1
lr_decay = 20.
arm_distrs = [Gaussian(mu=m, sigma=s) for m, s in means_vars_data]
greedy_opt_init = EpsilonGreedy(
arm_distributions=arm_distrs,
time_steps=steps,
num_episodes=episodes,
epsilon=0.,
epsilon_half_life=1e8,
count_init=ci,
mean_init=mi
)
eps_greedy = EpsilonGreedy(
arm_distributions=arm_distrs,
time_steps=steps,
num_episodes=episodes,
epsilon=eps,
epsilon_half_life=1e8,
count_init=0,
mean_init=0.
)
decay_eps_greedy = EpsilonGreedy(
arm_distributions=arm_distrs,
time_steps=steps,
num_episodes=episodes,
epsilon=eps,
epsilon_half_life=eps_hl,
count_init=0,
mean_init=0.
)
ts = ThompsonSamplingGaussian(
arm_distributions=arm_distrs,
time_steps=steps,
num_episodes=episodes,
init_mean=ts_mi,
init_stdev=ts_si
)
grad_bandits = GradientBandits(
arm_distributions=arm_distrs,
time_steps=steps,
num_episodes=episodes,
learning_rate=lr,
learning_rate_decay=lr_decay
)
plot_colors = ['k', 'y', 'k--', 'y--', 'b-.']
labels = [
'Greedy, Optimistic Initialization',
'$\epsilon$-Greedy',
'Decaying $\epsilon$-Greedy',
'Thompson Sampling',
'Gradient Bandit'
]
exp_cum_regrets = [
greedy_opt_init.get_expected_cum_regret(mu_star),
eps_greedy.get_expected_cum_regret(mu_star),
decay_eps_greedy.get_expected_cum_regret(mu_star),
ts.get_expected_cum_regret(mu_star),
grad_bandits.get_expected_cum_regret(mu_star)
]
x_vals = range(1, steps + 1)
for i in range(len(exp_cum_regrets)):
plt.plot(x_vals, exp_cum_regrets[i], plot_colors[i], label=labels[i])
plt.xlabel("Time Steps", fontsize=20)
plt.ylabel("Expected Total Regret", fontsize=20)
plt.title("Total Regret Curves", fontsize=25)
plt.xlim(xmin=x_vals[0], xmax=x_vals[-1])
plt.ylim(ymin=0.0)
plt.grid(True)
plt.legend(loc='upper left', fontsize=15)
plt.show()
exp_act_counts = [
greedy_opt_init.get_expected_action_counts(),
eps_greedy.get_expected_action_counts(),
decay_eps_greedy.get_expected_action_counts(),
ts.get_expected_action_counts(),
grad_bandits.get_expected_action_counts()
]
index = arange(len(means_vars_data))
spacing = 0.4
width = (1 - spacing) / len(exp_act_counts)
hist_plot_colors = ['r', 'b', 'g', 'k', 'y']
for i in range(len(exp_act_counts)):
plt.bar(
index - (1 - spacing) / 2 + (i - 1.5) * width,
exp_act_counts[i],
width,
color=hist_plot_colors[i],
label=labels[i]
)
plt.xlabel("Arms", fontsize=20)
plt.ylabel("Expected Counts of Arms", fontsize=20)
plt.title("Arms Counts Plot", fontsize=25)
plt.xticks(
index - 0.3,
["$\mu$=%.1f, $\sigma$=%.1f" % (m, s) for m, s in means_vars_data]
)
plt.legend(loc='upper left', fontsize=15)
plt.tight_layout()
plt.show()
```


The plots reveal the following insights:
* **Regret Performance:** Thompson Sampling and the decaying $\epsilon$-greedy algorithm generally outperform the other methods, achieving lower regret over time. This aligns with their theoretical properties, as they adapt their exploration strategies more effectively.
* **Arm Selection:** We can see how the algorithms distribute their choices across the different arms. Thompson Sampling and the decaying $\epsilon$-greedy algorithm tend to focus more on the arms with higher expected rewards, while still exploring others to a reasonable extent.
### Bernoulli Horse Race
Next, let's consider a scenario where the rewards are binary, following Bernoulli distributions with varying probabilities of success. We'll simulate this race with 9 arms, each having a different probability of success. Again, we'll run the simulation for 500 time steps and average over 500 episodes.
```python=
def plot_bernoulli_algorithms() -> None:
probs_data = [0.1, 0.2, 0.4, 0.5, 0.6, 0.75, 0.8, 0.85, 0.9]
mu_star = max(probs_data)
steps = 500
episodes = 500
eps = 0.3
eps_hl = 400
ci = 5
mi = mu_star * 3.
ucb_alpha = 4.0
lr = 0.5
lr_decay = 20.
arm_distrs = [Bernoulli(p) for p in probs_data]
greedy_opt_init = EpsilonGreedy(
arm_distributions=arm_distrs,
time_steps=steps,
num_episodes=episodes,
epsilon=0.,
epsilon_half_life=1e8,
count_init=ci,
mean_init=mi
)
eps_greedy = EpsilonGreedy(
arm_distributions=arm_distrs,
time_steps=steps,
num_episodes=episodes,
epsilon=eps,
epsilon_half_life=1e8,
count_init=0,
mean_init=0.
)
decay_eps_greedy = EpsilonGreedy(
arm_distributions=arm_distrs,
time_steps=steps,
num_episodes=episodes,
epsilon=eps,
epsilon_half_life=eps_hl,
count_init=0,
mean_init=0.
)
ucb1 = UCB1(
arm_distributions=arm_distrs,
time_steps=steps,
num_episodes=episodes,
bounds_range=1.0,
alpha=ucb_alpha
)
ts = ThompsonSamplingBernoulli(
arm_distributions=arm_distrs,
time_steps=steps,
num_episodes=episodes
)
grad_bandits = GradientBandits(
arm_distributions=arm_distrs,
time_steps=steps,
num_episodes=episodes,
learning_rate=lr,
learning_rate_decay=lr_decay
)
plot_colors = ['k', 'y', 'k--', 'y--', 'r-.', 'c-.']
labels = [
'Greedy, Optimistic Initialization',
'$\epsilon$-Greedy',
'Decaying $\epsilon$-Greedy',
'UCB1',
'Thompson Sampling',
'Gradient Bandit'
]
exp_cum_regrets = [
greedy_opt_init.get_expected_cum_regret(mu_star),
eps_greedy.get_expected_cum_regret(mu_star),
decay_eps_greedy.get_expected_cum_regret(mu_star),
ucb1.get_expected_cum_regret(mu_star),
ts.get_expected_cum_regret(mu_star),
grad_bandits.get_expected_cum_regret(mu_star)
]
x_vals = range(1, steps + 1)
for i in range(len(exp_cum_regrets)):
plt.plot(x_vals, exp_cum_regrets[i], plot_colors[i], label=labels[i])
plt.xlabel("Time Steps", fontsize=20)
plt.ylabel("Expected Total Regret", fontsize=20)
plt.title("Total Regret Curves", fontsize=25)
plt.xlim(xmin=x_vals[0], xmax=x_vals[-1])
plt.ylim(ymin=0.0)
plt.grid(True)
plt.legend(loc='upper left', fontsize=15)
plt.show()
exp_act_counts = [
greedy_opt_init.get_expected_action_counts(),
eps_greedy.get_expected_action_counts(),
decay_eps_greedy.get_expected_action_counts(),
ucb1.get_expected_action_counts(),
ts.get_expected_action_counts(),
grad_bandits.get_expected_action_counts()
]
index = arange(len(probs_data))
spacing = 0.4
width = (1 - spacing) / len(exp_act_counts)
hist_plot_colors = ['r', 'b', 'g', 'k', 'y', 'c']
for i in range(len(exp_act_counts)):
plt.bar(
index - (1 - spacing) / 2 + (i - 1.5) * width,
exp_act_counts[i],
width,
color=hist_plot_colors[i],
label=labels[i]
)
plt.xlabel("Arms", fontsize=20)
plt.ylabel("Expected Counts of Arms", fontsize=20)
plt.title("Arms Counts Plot", fontsize=25)
plt.xticks(
index - 0.2,
["$p$=%.2f" % p for p in probs_data]
)
plt.legend(loc='upper left', fontsize=15)
plt.tight_layout()
plt.show()
```


Here's what we observe:
* **Regret Performance:** UCB1, Thompson Sampling, and the decaying $\epsilon$-greedy algorithm exhibit superior performance, with UCB1 demonstrating a slight edge in this particular scenario.
* **Arm Selection:** UCB1 and Thompson Sampling concentrate their choices more heavily on the arms with higher probabilities of success, reflecting their efficient exploration strategies.
## The Power of Information: MDPs in Information State Space
In the realm of bandit problems, exploration is not merely a random act but a deliberate quest for information that can lead to better decisions and higher rewards. A fundamental question arises: Can we quantify the worth of this information? Enter the concept of the **value of information (VOI)**.
Imagine having the opportunity to purchase additional knowledge before committing to a decision. The VOI represents how much you'd be willing to invest to acquire this information. In the context of bandits, the VOI can be defined as the difference between the expected long-term reward after gaining the information and the immediate reward attainable without it.
Intuitively, the VOI is higher in situations where uncertainty reigns supreme. When the outcome of an action is highly uncertain, the potential for learning and significant improvement is greater. This insight naturally suggests a guiding principle: Explore more in uncertain situations! By accurately estimating the VOI, we can strike a perfect balance between exploiting our current knowledge and exploring for potentially better options.
### Information State Space: A Sequential Decision Framework
We often think of bandit problems as one-step decision tasks. However, a more powerful perspective emerges when we reframe them as sequential decision problems. At each step, the agent possesses an **information state** $\tilde{s}$, which encapsulates the cumulative knowledge gleaned from past actions and their corresponding rewards. This information state can be any relevant statistic derived from the history, such as the number of times each action has been selected and the rewards observed.
Each action taken by the agent triggers a transition to a new information state $\tilde{s}'$, with a probability determined by the transition function $\tilde{\mathcal{P}}^a_{\tilde{s}, \tilde{s}'}$. This naturally leads to a Markov Decision Process (MDP) defined in the information state space:
$$
\tilde{M} = (\tilde{\mathcal{S}}, \mathcal{A}, \tilde{\mathcal{P}}, \mathcal{R}, \gamma)
$$
Here, $\tilde{\mathcal{S}}$ represents the set of all possible information states, $\mathcal{A}$ is the set of actions, $\tilde{\mathcal{P}}$ governs the state transitions, $\mathcal{R}$ defines the reward function, and $\gamma$ is the discount factor (which may not be relevant in all bandit scenarios).
### Illustrative Example: Bernoulli Bandits
Let's consider Bernoulli bandits to concretize the notion of information state. In this setting, each arm yields a reward of 1 with probability $\mu_a$ and 0 with probability $1-\mu_a$.
A natural choice for the information state in this case is the vector $(\alpha_{a_1}, \beta_{a_1}, \alpha_{a_2}, \beta_{a_2}, \dots, \alpha_{a_m}, \beta_{a_m})$, where $\alpha_a$ and $\beta_a$ respectively track the number of times arm $a$ has produced a reward of 1 and a reward of 0. As the agent interacts with the bandit, this information state evolves dynamically, reflecting the accumulating knowledge about the reward probabilities.
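A minimal sketch (hypothetical helper, not from the book) of how this information state evolves: each observed reward simply increments one of the two counters of the pulled arm.
```python
from typing import Dict, Tuple

# Per-arm information state (alpha_a, beta_a) = (# rewards of 1, # rewards of 0),
# started here from (0, 0); a Bayesian treatment would start from the prior counts.
InfoState = Dict[str, Tuple[int, int]]

def update_info_state(state: InfoState, arm: str, reward: int) -> InfoState:
    alpha, beta = state[arm]
    new_state = dict(state)
    new_state[arm] = (alpha + reward, beta + (1 - reward))
    return new_state

state: InfoState = {"a1": (0, 0), "a2": (0, 0)}
state = update_info_state(state, "a1", 1)   # arm a1 paid out
state = update_info_state(state, "a1", 0)   # arm a1 did not pay out
print(state)                                # {'a1': (1, 1), 'a2': (0, 0)}
```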
### Solving Information State Space Bandits
The information state space MDP, while providing a comprehensive framework, often leads to an infinitely large state space. This poses computational challenges for finding exact solutions. However, we can harness the power of reinforcement learning to tackle this problem:
* **Model-Free Reinforcement Learning:** Algorithms like Q-learning can be applied directly to the information state space, learning an optimal policy through experience.
* **Bayesian Model-Based Reinforcement Learning (Bayes-Adaptive RL):** This approach employs Bayesian inference to maintain a belief distribution over the reward probabilities. This belief is continuously updated as new information is gathered, guiding both exploration and exploitation in a principled manner.
### Bayes-Adaptive Bernoulli Bandits
In Bayes-adaptive Bernoulli bandits, we begin with a Beta prior distribution $\text{Beta}(\alpha_a, \beta_a)$ over the reward probability $\mu_a$ of each arm. With each action and observed reward, we update this prior using Bayes' rule, yielding a posterior distribution that serves as the revised reward model for that arm.
The information state $(\alpha_a, \beta_a)$ for each arm effectively encodes this evolving reward model. The transition function $\tilde{\mathcal{P}}$ of the Bayes-adaptive MDP describes how these models change over time as the agent interacts with the bandit.
### Gittins Indices and the Challenge of Tractability
Theoretically, the Bayes-adaptive MDP can be solved using dynamic programming. The optimal policy involves computing the Gittins index for each arm, which quantifies the value of continuing to play that arm. However, calculating these indices exactly is often computationally infeasible for complex bandit problems.
## Contextual Bandits
Contextual bandits extend the traditional multi-armed bandit problem by introducing the concept of context or state. This additional information allows the agent to make more informed decisions by tailoring its action choices to the specific situation it faces at each step.
### Formal Definition
A contextual bandit is formally defined as a three-tuple $(\mathcal{A}, \mathcal{S}, \mathcal{R})$ where:
* $\mathcal{A}$: A known set of $m$ actions (also known as "arms").
* $\mathcal{S}$: An unknown distribution over states (or "contexts"). Each state $s$ provides additional information that characterizes the environment at a given time.
* $\mathcal{R}^a_s(r)$: An unknown probability distribution over rewards, conditioned on both the current state $s$ and the chosen action $a$.
### The Contextual Bandit Process
At each time step $t$, the following sequence of events occurs:
1. **State Generation:** The environment samples a state $S_t$ from the unknown distribution $\mathcal{S}$
2. **Action Selection:** The agent observes the state $S_t$ and chooses an action $A_t$ from the set $\mathcal{A}$.
3. **Reward Generation:** The environment produces a reward $R_t$ based on the joint distribution $\mathcal{R}^{A_t}_{S_t}$, which is influenced by both the selected action and the current state.
### Objective: Maximizing Cumulative Reward Under Uncertainty
The agent's ultimate goal is to maximize the expected cumulative reward over $T$ time steps:
$$
\sum_{t=1}^T \mathbb{E}[R_t]
$$
This involves navigating a dynamic environment where the optimal action may vary depending on the context. The agent must strike a delicate balance between:
* **Exploration:** Trying out different actions in various states to learn their associated rewards.
* **Exploitation:** Choosing actions that have historically yielded high rewards in similar states.
### Extending Bandit Algorithms to Contextual Settings
To address contextual bandits, we extend the principles and algorithms developed for traditional bandits by incorporating state information. Instead of estimating the expected reward $Q(a)$ for each action in isolation, we now estimate context-dependent action values $Q(s,a)$. This enables us to make more tailored decisions that leverage the specific information available in each state.
Many established bandit algorithms, such as $\epsilon$-greedy, UCB, and Bayesian methods, can be adapted to the contextual setting by incorporating state into their decision-making mechanisms. This often involves maintaining separate estimates for each state-action pair or using models that capture the relationship between states, actions, and rewards.
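As a sketch of the simplest such adaptation (hypothetical code, not from the book): a tabular $\epsilon$-greedy agent that maintains a separate sample-average estimate $Q(s, a)$ for every state-action pair.
```python
import random
from collections import defaultdict

class ContextualEpsilonGreedy:
    """Tabular epsilon-greedy for a contextual bandit with a finite set of states."""
    def __init__(self, actions, epsilon: float):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.q = defaultdict(float)     # Q(s, a) sample-average estimates
        self.n = defaultdict(int)       # N(s, a) counts

    def select_action(self, state):
        if random.random() < self.epsilon:
            return random.choice(self.actions)                       # explore
        return max(self.actions, key=lambda a: self.q[(state, a)])   # exploit

    def update(self, state, action, reward: float) -> None:
        self.n[(state, action)] += 1
        self.q[(state, action)] += \
            (reward - self.q[(state, action)]) / self.n[(state, action)]

# Hypothetical usage: contexts and actions are just labels here.
agent = ContextualEpsilonGreedy(actions=["ad_1", "ad_2"], epsilon=0.1)
a = agent.select_action("young_user")
agent.update("young_user", a, reward=1.0)
```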
### Practical Applications: Contextual Bandits in the Real World
Contextual bandits have found widespread applications in various domains:
* **Personalized Recommendations:** Recommending news articles, products, or movies based on user preferences and behavior.
* **Online Advertising:** Displaying ads that are most likely to be clicked by a particular user based on their browsing history.
* **Clinical Trials:** Assigning treatments to patients based on their individual characteristics to maximize the effectiveness of the trial.
* **A/B Testing:** Determining the best website design or user interface by adapting to user feedback.
By effectively leveraging contextual information, these algorithms can achieve superior performance compared to traditional bandits that operate without such knowledge.
## Reference
- Chapter 15 of the [RLForFinanceBook](https://stanford.edu/~ashlearn/RLForFinanceBook/book.pdf)
- [Exploration versus Exploitation](https://github.com/coverdrive/technical-documents/blob/master/finance/cme241/Tour-Bandits.pdf) slides for CME 241: Foundations of Reinforcement Learning with Applications in Finance