# [Chapter 2 solutions](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)

Author [Raj Ghugare](https://github.com/Raj19022000)

###### tags: `Sutton and Barto` `Solutions`

### Chapter 2 - Multi-armed Bandits:

* ### Exercise 2.1:

    With two actions and $\epsilon = 0.5$, the greedy action is selected with probability $(1-\epsilon) + \epsilon/2 = 0.75$: it is chosen every time we exploit, and half the time we explore.

* ### Exercise 2.2:

    The first three actions are all consistent with greedy selection (ties broken arbitrarily), so exploration cannot be inferred from them. After the third action was taken, the sample-average estimates of all arms were:

    $Q(a_{1}) = -1$
    $Q(a_{2}) = -0.5$
    $Q(a_{3}) = 0$
    $Q(a_{4}) = 0$

    At this stage the greedy actions are $3$ and $4$, but action $2$ was selected at time step $4$, so this was definitely an $\epsilon$ (exploration) case.

    After the fourth action was taken, the estimates were:

    $Q(a_{1}) = -1$
    $Q(a_{2}) = \frac{1}{3}$
    $Q(a_{3}) = 0$
    $Q(a_{4}) = 0$

    Now action $2$ is the unique greedy action, yet action $3$ was taken at time step $5$, which was again definitely an $\epsilon$ case.

    At every other time step the $\epsilon$ case could possibly have occurred, because exploration selects uniformly among all $k$ arms, so even the greedy action can be chosen with probability $\epsilon/k$ while exploring. (A short sketch that replays the sequence and flags the definite exploration steps is given after the solutions.)

* ### Exercise 2.3:

    In the long run, say over millions of steps, $\epsilon = 0.01$ performs best. Once its estimates have identified the optimal arm, it selects that arm with probability $(1-\epsilon) + \epsilon/10 = 0.991$ on the 10-armed testbed, versus about $0.91$ for $\epsilon = 0.1$, so both its probability of choosing the best action and its average reward end up higher. A further improvement is to decay $\epsilon$ over time, exploring heavily at first and acting almost greedily later.
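
Below is a minimal sketch (Python with NumPy assumed) for Exercise 2.2. It replays the action-reward sequence given in the exercise ($A_1{=}1, R_1{=}-1$; $A_2{=}2, R_2{=}1$; $A_3{=}2, R_3{=}-2$; $A_4{=}2, R_4{=}2$; $A_5{=}3, R_5{=}0$) with incremental sample-average updates and flags the time steps at which the chosen action was not greedy, i.e. where an exploratory step must have occurred.

```python
import numpy as np

k = 4
Q = np.zeros(k)              # sample-average value estimates, initialised to 0
N = np.zeros(k, dtype=int)   # visit counts per arm

# (action, reward) pairs from the exercise; actions are 1-indexed as in the book
sequence = [(1, -1), (2, 1), (2, -2), (2, 2), (3, 0)]

for t, (action, reward) in enumerate(sequence, start=1):
    a = action - 1
    greedy = (np.flatnonzero(Q == Q.max()) + 1).tolist()  # arms tied for the best estimate
    definitely_explored = action not in greedy
    # incremental sample-average update: Q <- Q + (R - Q) / N
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]
    print(f"t={t}: took a{action}, greedy arms {greedy}, "
          f"definitely exploration: {definitely_explored}, Q={np.round(Q, 3)}")
```

Running this flags time steps $4$ and $5$ as definite exploration, matching the argument above.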
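
And a small sketch for Exercise 2.3: the long-run probability of selecting the optimal action under $\epsilon$-greedy, assuming the estimates have converged enough to identify the best arm and that exploration samples uniformly over all $k = 10$ arms of the Figure 2.2 testbed.

```python
k = 10  # arms in the 10-armed testbed

def asymptotic_optimal_prob(epsilon, k=k):
    """Long-run probability of selecting the optimal action under epsilon-greedy,
    once the value estimates have identified the true best arm."""
    # exploit with probability 1 - epsilon, plus the chance of hitting the
    # optimal arm by luck while exploring uniformly
    return (1 - epsilon) + epsilon / k

for eps in (0.01, 0.1):
    print(f"epsilon = {eps}: P(optimal action) -> {asymptotic_optimal_prob(eps):.3f}")
```

This prints $0.991$ for $\epsilon = 0.01$ and $0.910$ for $\epsilon = 0.1$, which is the gap referred to in the solution.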