# [Multi-Armed-Bandit problem](https://en.wikipedia.org/wiki/Multi-armed_bandit)

Author: [Raj Ghugare](https://github.com/Raj19022000)

###### tags: `notes` `multi-armed bandits`

### Introduction :

* Multi-armed bandits are a rich area of research in their own right, but the RL community often treats them as an immediate-reward RL problem: the setting strips away the more intricate details of the full RL problem while keeping its gist.
* The basic setup of a K-armed bandit problem is as follows:
    * There are $K$ arms, and $A = \{1, 2, \ldots, K\}$ is the arm set.
    * Each time an arm is selected, a reward is generated, sampled from a distribution specific to that arm. In other words, every arm has its own reward distribution.

Before we start off, let's get acquainted with the notation:

* $q_*(a)$ is the true value (expected reward) of arm $a$.
* $Q(a)$ is our empirical estimate of the true value of arm $a$.
* $K$ is the number of arms.
* $a^*$ is the true best arm.
* $A$ is the set of all arms.

### Solutions :

There are three ways in which we can think of solutions to a bandit problem:

1. ##### Asymptotic correctness :
   As the name suggests, this notion of a solution only guarantees that the agent eventually settles on the optimal arm. Classic and simple techniques such as epsilon-greedy and softmax exploration provide an asymptotic solution; a minimal sketch of an epsilon-greedy agent is given at the end of these notes.
2. ##### Regret optimality :
   Rather than just converging asymptotically, this adds another level of optimality by also caring about the reward collected while exploring. Algorithms like UCB and its variants try to maximize the expected reward (equivalently, minimize regret) during exploration; a sketch of UCB1 is also given at the end.
3. ##### PAC optimality :
   Given an $(\epsilon, \delta)$ pair, the agent should return an arm whose value is guaranteed to be within $\epsilon$ of the value of the true best arm with probability at least $1 - \delta$. In mathematical terms,
   $$\Pr[q_*(a) \geq q_*(a^*) - \epsilon] \geq 1 - \delta$$
   where $\epsilon, \delta \in [0, 1]$.

### Implementation :

Every algorithm was tested on 10 arms with Bernoulli reward distributions and

$$q_*(a) = [0.1, 0.5, 0.7, 0.73, 0.756, 0.789, 0.81, 0.83, 0.855, 0.865]$$

`NOTE`
> You can change the distribution to a Gaussian by using the `GaussianStationaryBandit` class imported from `bandits`.

### Results :

![](https://i.imgur.com/eLYVDfb.png)

This graph does not tell the complete story, as there are many other aspects of a multi-armed bandit problem, such as regret, noisy rewards, $(\epsilon, \delta)$ guarantees, etc. You can easily run these algorithms by importing my implementation and testing them out under various conditions.
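As a concrete starting point, here is a minimal, self-contained sketch of an epsilon-greedy agent on the 10-armed Bernoulli bandit described above. It is only an illustration and does not use the repository's `bandits` module or its API; the function and parameter names below are hypothetical stand-ins written with NumPy.

```python
import numpy as np

# True arm values from the experiment above (Bernoulli success probabilities).
Q_STAR = [0.1, 0.5, 0.7, 0.73, 0.756, 0.789, 0.81, 0.83, 0.855, 0.865]

def epsilon_greedy(q_star, epsilon=0.1, steps=10_000, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit and return the estimates Q(a)."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    Q = np.zeros(k)   # empirical value estimates Q(a)
    N = np.zeros(k)   # pull counts per arm
    total_reward = 0.0

    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit the current best estimate.
        if rng.random() < epsilon:
            a = int(rng.integers(k))
        else:
            a = int(np.argmax(Q))

        # Bernoulli reward specific to the chosen arm.
        reward = float(rng.random() < q_star[a])
        total_reward += reward

        # Incremental sample-average update of the estimate.
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]

    return Q, total_reward

if __name__ == "__main__":
    Q, total = epsilon_greedy(Q_STAR)
    print("Estimated values:", np.round(Q, 3))
    print("Greedy arm:", int(np.argmax(Q)), "| average reward:", total / 10_000)
```

With enough steps the greedy arm converges to the true best arm (index 9 here), which is exactly the asymptotic-correctness guarantee discussed above.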
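For regret optimality, a similarly hedged sketch of UCB1 follows; again this is a stand-alone illustration rather than the repository's implementation. Each step picks the arm with the largest upper confidence bound $Q(a) + c\sqrt{\ln t / N(a)}$, where the exploration constant $c$ is a tunable assumption here.

```python
import numpy as np

def ucb1(q_star, c=1.5, steps=10_000, seed=0):
    """Run UCB1 on a Bernoulli bandit with true values q_star."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    Q = np.zeros(k)  # empirical value estimates Q(a)
    N = np.zeros(k)  # pull counts per arm

    for t in range(1, steps + 1):
        if t <= k:
            a = t - 1  # play every arm once to initialise the estimates
        else:
            # Pick the arm with the highest upper confidence bound.
            a = int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

        reward = float(rng.random() < q_star[a])
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]

    return Q, N

Q, N = ucb1([0.1, 0.5, 0.7, 0.73, 0.756, 0.789, 0.81, 0.83, 0.855, 0.865])
print("Most-pulled arm:", int(np.argmax(N)))
```

Because the bonus term shrinks as an arm is pulled more often, UCB1 spends most of its pulls on arms whose estimates are still high, which is what keeps the reward accumulated during exploration close to optimal.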