# [Chapter 5 solutions](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)

Author [Raj Ghugare](https://github.com/Raj19022000)

###### tags: `Sutton and Barto` `Solutions`

## Chapter 5 - Monte Carlo Methods:

* ### Exercise 5.1:
    * Figure 5.1 shows the value function for the policy that sticks only when the player's sum is 20 or 21. If the player sticks on a sum of 20 or 21, most of the possible dealer outcomes lead to a reward of +1 for the player, so the values in the last two rows are close to 1.
    * For any player sum, the value drops slightly in the states where the dealer is showing an ace. This is because the dealer can count the ace as a usable ace, which increases the number of outcomes that go in the dealer's favour. For example, even if the player sticks at a sum of 20 while the dealer is showing an ace, the dealer wins if his hidden card is a 10, jack, queen or king, which would not have been the case otherwise.

* ### Exercise 5.2:

    An every-visit Monte Carlo algorithm would make no difference, because the state graph of this game is acyclic: within a single episode, a state can appear at most once.

* ### Exercise 5.3:

    The backup diagram starts from a state-action pair and ends when the agent reaches a terminal state.

* ### Exercise 5.4:

    The lists of returns are replaced by a visit count $N(s,a)$ and an incremental update of $Q$ (a runnable Python sketch of this update is given at the end of the page):

    Initialize:

    $\pi(s) \in A(s)$ (arbitrarily), for all $s \in S$

    $Q(s,a) \in \mathbb{R}$ (arbitrarily), for all $s\in S$, $a \in A(s)$

    $N(s,a) \leftarrow 0$ (visit count), for all $s\in S$, $a \in A(s)$

    Loop forever (for each episode):

    Choose $S_{0}\in S$, $A_{0}\in A(S_{0})$ randomly such that all pairs have probability $> 0$

    Generate an episode from $S_{0},A_{0}$ following $\pi$: $S_{0},A_{0},R_{1},S_{1},A_{1},R_{2},\dots,S_{T-1},A_{T-1},R_{T}$

    $G \leftarrow 0$

    Loop for each step of the episode, $t = T-1,T-2,\dots,0$:

    {

    $G \leftarrow \gamma G + R_{t+1}$

    Unless the pair $(S_{t},A_{t})$ appears in $S_{0},A_{0},S_{1},A_{1},\dots,S_{t-1},A_{t-1}$:

    {

    $Q(S_{t},A_{t}) \leftarrow \frac{N(S_{t},A_{t})\,Q(S_{t},A_{t}) + G}{N(S_{t},A_{t})+1}$

    $N(S_{t},A_{t}) \leftarrow N(S_{t},A_{t}) + 1$

    $\pi(S_{t}) \leftarrow \arg\max_{a} Q(S_{t},a)$

    }

    }

* ### Exercise 5.5:

    With $\gamma = 1$ and a reward of +1 on every transition, a return of 10 means the episode lasted 10 time steps: the agent stayed in the nonterminal state for the first 9 transitions and reached the terminal state on the 10th. (A short numerical check of the two estimates below is given at the end of the page.)

    #### First Visit:
    Since the nonterminal state is the starting state, the first-visit estimate after this single episode is simply the return from time 0:

    $V(s) = 10$

    #### Every Visit:
    In this episode the agent visited the nonterminal state 10 times (including the starting time), so the every-visit estimate is the average of the returns following each of these 10 visits:

    $V(s) = (10+9+8+7+6+5+4+3+2+1)/10$

    $V(s) = 5.5$

* ### Exercise 5.6:
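
Below is a minimal Python sketch of the incremental Monte Carlo ES update from Exercise 5.4. The tiny two-state environment (`toy_step`, `STATES`, `ACTIONS`) is a made-up example, not from the book, and exists only so the loop has something to run on; the `Q`/`N` update inside the episode loop is the part that mirrors the pseudocode.

```python
import random
from collections import defaultdict

GAMMA = 1.0
STATES = [0, 1]        # nonterminal states of a made-up two-state MDP
ACTIONS = [0, 1]       # 0 = "stop", 1 = "continue" (purely illustrative)


def toy_step(state, action):
    """One transition of the toy MDP; returns (reward, next_state), "T" = terminal."""
    if state == 0:
        # stop now for +1, or move on to state 1 for 0
        return (1, "T") if action == 0 else (0, 1)
    # state == 1: stop for 0, or stop for +2
    return (0, "T") if action == 0 else (2, "T")


def generate_episode(start_state, start_action, policy):
    """Roll out one episode from a forced (state, action) pair, then follow policy."""
    episode = []                       # list of (S_t, A_t, R_{t+1})
    state, action = start_state, start_action
    while True:
        reward, next_state = toy_step(state, action)
        episode.append((state, action, reward))
        if next_state == "T":
            return episode
        state, action = next_state, policy[next_state]


def mc_es_incremental(num_episodes=2000, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)             # action-value estimates
    N = defaultdict(int)               # visit counts, replacing the Returns(s,a) lists
    policy = {s: rng.choice(ACTIONS) for s in STATES}   # arbitrary initial policy

    for _ in range(num_episodes):
        # Exploring starts: every (state, action) pair has probability > 0.
        s0, a0 = rng.choice(STATES), rng.choice(ACTIONS)
        episode = generate_episode(s0, a0, policy)

        G = 0.0
        for t in range(len(episode) - 1, -1, -1):        # t = T-1, ..., 0
            s, a, r = episode[t]
            G = GAMMA * G + r
            # First-visit check: skip if (s, a) also appears earlier in the episode.
            if any((s, a) == (s_e, a_e) for s_e, a_e, _ in episode[:t]):
                continue
            # Incremental mean instead of appending to a list and re-averaging.
            Q[(s, a)] = (N[(s, a)] * Q[(s, a)] + G) / (N[(s, a)] + 1)
            N[(s, a)] += 1
            # Greedy policy improvement.
            policy[s] = max(ACTIONS, key=lambda act: Q[(s, act)])
    return Q, policy


if __name__ == "__main__":
    Q, policy = mc_es_incremental()
    print("greedy policy:", policy)    # should converge to action 1 in both states
    print({k: round(v, 2) for k, v in Q.items()})
```

The only change relative to the book's Monte Carlo ES pseudocode is that `Q` and `N` are updated in place, so no per-pair list of returns has to be stored or re-averaged.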
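
As a quick sanity check of the Exercise 5.5 arithmetic, the snippet below recomputes the first-visit and every-visit estimates from the single observed episode (10 steps, reward +1 per transition, $\gamma = 1$); the encoding of the episode as a reward list is my own.

```python
# One observed episode: the nonterminal state is occupied at t = 0..9,
# every transition yields reward +1, and gamma = 1.
rewards = [1] * 10
gamma = 1.0

# Returns following each time step, computed backwards: G_t = R_{t+1} + gamma * G_{t+1}
returns = []
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns.reverse()            # returns[t] is the return following time t

first_visit = returns[0]                    # only the first visit (t = 0) is used
every_visit = sum(returns) / len(returns)   # average over all 10 visits

print(first_visit)   # 10.0
print(every_visit)   # 5.5
```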