# [Chapter 3 solutions](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)
Author
[Raj Ghugare](https://github.com/Raj19022000)
###### tags: `Sutton and Barto` `Solutions`
## Finite Markov Decision Processes
* ### Exercise 3.4:
|s|a|s'|r|$p(s',r \mid s,a)$|
| ------------- |-------------|:-----:|:----:|:----:|
| high | search | high |$r_{search}$|$\alpha$|
| high | search | low |$r_{search}$|$1-\alpha$|
| high |wait | high |$r_{wait}$|1|
|low|search|high|-3|$1-\beta$|
|low|search|low|$r_{search}$|$\beta$|
|low|wait|low|$r_{wait}$|1|
|low|recharge|high|0|1|
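As a quick sanity check, the table can be encoded in Python and each $(s,a)$ row verified to sum to 1; the numbers chosen below for $\alpha$, $\beta$, $r_{search}$ and $r_{wait}$ are placeholders, not values from the book.

```python
# Recycling-robot dynamics p(s', r | s, a) from the table above.
# alpha, beta, r_search, r_wait are placeholder values for illustration.
alpha, beta = 0.8, 0.6
r_search, r_wait = 2.0, 1.0

# (s, a) -> list of (s', r, probability)
p = {
    ("high", "search"):   [("high", r_search, alpha), ("low", r_search, 1 - alpha)],
    ("high", "wait"):     [("high", r_wait, 1.0)],
    ("low",  "search"):   [("high", -3.0, 1 - beta), ("low", r_search, beta)],
    ("low",  "wait"):     [("low", r_wait, 1.0)],
    ("low",  "recharge"): [("high", 0.0, 1.0)],
}

# Each (s, a) row must sum to 1, as required by equation 3.3.
for (s, a), outcomes in p.items():
    assert abs(sum(prob for _, _, prob in outcomes) - 1.0) < 1e-12
```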
* ### Exercise 3.5:
Equation 3.3 states:
$\Sigma_{s' \in S}\Sigma_{r \in R}p(s',r|s,a) = 1$ for all $s,a$
This cannot hold for the terminal state, because the conditional probability $p(s',r|s,a)$ is only defined when the pair $(s,a)$ can actually occur; once the terminal state is reached the episode ends and no further action is taken. For episodic tasks the terminal state should therefore be excluded from the states $s$ for which the expression holds, while it may still appear as a possible next state $s'$:
$\Sigma_{s' \in S^{+}}\Sigma_{r \in R}p(s',r|s,a) = 1$ for all $s \in S, a \in A(s)$
where $S^{+}$ is the set of all states including the terminal state, and $S$ excludes it.
* ### Exercise 3.6:
An episode would naturally be one attempt to balance the pole. Let $N$ be the time-step at which a particular episode ends, so the only non-zero reward is $R_N = -1$, received on failure.
The return at any time step $n$ would be:
$G_n = \Sigma_{k=n+1}^{N}\gamma^{k-n-1}R_k = -\gamma^{N-n-1}$
In the continuing formulation the $-1$s keep arriving at every failure, so the return at any time step $n$ would be
$G_n = -\Sigma_{i:\,N_i>n} \gamma^{N_i-n-1}$
where $N_i$ is the time-step at which the $i$th failure occurs. The episodic return only reflects how soon the current attempt fails, whereas the continuing return accounts for all future failures.
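A minimal numeric check of the episodic formula, under the exercise's assumption that the only non-zero reward is $R_N = -1$ at the terminal step:

```python
# Return at step n when the only non-zero reward is R_N = -1 at failure.
gamma = 0.9

def episodic_return(n, N):
    """G_n = sum_{k=n+1}^{N} gamma^(k-n-1) * R_k, with R_N = -1 and all other rewards 0."""
    return sum(gamma ** (k - n - 1) * (-1 if k == N else 0) for k in range(n + 1, N + 1))

N = 10
for n in range(N):
    # The definition matches the closed form -gamma^(N-n-1).
    assert abs(episodic_return(n, N) - (-gamma ** (N - n - 1))) < 1e-12
```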
* ### Exercise 3.7:
Non-zero rewards are extremely sparse: the agent receives zero reward on every step until it escapes, so it gets no signal from which to prefer one behaviour over another. At first the agent behaves essentially randomly, and depending on the complexity of the maze it may never escape, or may take a very long time to escape even once. Even when it does escape and receives the $+1$, this only increases the values of states very close to the exit; many further episodes are needed for that information to propagate back to the start states (how long depends on the complexity of the maze and on the algorithm used).
A better way to distribute rewards would be to give a reward of $-1$ at every step until the agent escapes the maze. The agent can then differentiate states by how long it takes to escape when starting from them.
* ### Exercise 3.8:
$\gamma = 0.5$
For an episodic task with terminal time-step $T$, we know that:
$G_t = \Sigma_{i=t+1}^{T}\gamma^{i-t-1} R_i$
Using this formula we can easily find $G_4$ and then work backwards using the formula:
$G_t = R_{t+1} + \gamma G_{t+1}$
After solving we obtain the following values:
$G_5 = 0$
$G_4 = 2$
$G_3 = 4$
$G_2 = 8$
$G_1 = 6$
$G_0 = 2$
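These values can be reproduced with a few lines of Python, working backwards from $G_5 = 0$ with the reward sequence given in the exercise ($R_1=-1$, $R_2=2$, $R_3=6$, $R_4=3$, $R_5=2$):

```python
# Work backwards with G_t = R_{t+1} + gamma * G_{t+1}, starting from G_5 = 0.
gamma = 0.5
rewards = {1: -1, 2: 2, 3: 6, 4: 3, 5: 2}  # R_1 ... R_5, with T = 5

G = {5: 0.0}
for t in range(4, -1, -1):
    G[t] = rewards[t + 1] + gamma * G[t + 1]

print(G)  # {5: 0.0, 4: 2.0, 3: 4.0, 2: 8.0, 1: 6.0, 0: 2.0}
```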
* ### Exercise 3.9:
$\gamma = 0.9$
For a continuing task:
$G_t = \Sigma_{i=t+1}^{\infty}\gamma^{i-t-1} R_i$
$G_1 = 7+0.9*7+0.9^2*7 +...$
$G_1 = 7(1+0.9+0.9^2 +...) = 70$
And, since $R_1 = 2$:
$G_0 = R_1 + \gamma G_1 = 2 + 0.9 \times 70 = 65$
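A quick numeric check of both numbers, truncating the infinite sum at a large horizon:

```python
gamma = 0.9
H = 1000  # truncation horizon; gamma**H is negligible

# G_1 = 7 * (1 + gamma + gamma^2 + ...) = 7 / (1 - gamma)
G1 = sum(7 * gamma ** k for k in range(H))
G0 = 2 + gamma * G1          # G_0 = R_1 + gamma * G_1

print(round(G1, 3), round(G0, 3))  # 70.0 65.0
```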
* ### Exercise 3.10:
This is just the sum of an infinite geometric progression with common ratio $\gamma$, which converges for $|\gamma|<1$.
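One way to write out the proof, the standard geometric-series argument:
$G_t = 1 + \gamma + \gamma^{2} + ... = 1 + \gamma(1 + \gamma + \gamma^{2} + ...) = 1 + \gamma G_t$
$\Rightarrow G_t(1-\gamma) = 1 \Rightarrow G_t = \frac{1}{1-\gamma}$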
* ### Exercise 3.11:
$E[R_{t+1} \mid S_t = s] = \Sigma_{a\in A}\pi(a|s)\Sigma_{s' \in S}\Sigma_{r \in R}\,p(s',r|s,a)\,r$
* ### Exercise 3.12:
$v_\pi(s) = \Sigma_{a\in A}\pi(a|s)q_\pi(s,a)$
* ### Exercise 3.13:
$q_\pi(s,a) = \Sigma_{s' \in S}\Sigma_{r \in R}\,p(s',r|s,a)[r+\gamma v_{\pi}(s')]$
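A small numpy sketch of the two identities above (3.12 and 3.13) on a made-up 2-state, 2-action MDP, using the equivalent expected-reward form $r(s,a) + \gamma\Sigma_{s'}p(s'|s,a)v_\pi(s')$; all the numbers are arbitrary illustrative values, not from the book.

```python
import numpy as np

# Arbitrary 2-state, 2-action MDP for illustration.
gamma = 0.9
P = np.array([[[0.7, 0.3], [0.2, 0.8]],   # P[s, a, s'] = p(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])    # R[s, a] = expected reward r(s, a)
pi = np.array([[0.6, 0.4], [0.3, 0.7]])   # pi[s, a] = pi(a | s)

# Solve v_pi exactly from the linear Bellman equations: v = r_pi + gamma * P_pi v
P_pi = np.einsum("sa,sat->st", pi, P)     # state-to-state transitions under pi
r_pi = (pi * R).sum(axis=1)
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

q = R + gamma * P @ v                          # 3.13: q_pi from p, r and v_pi
assert np.allclose((pi * q).sum(axis=1), v)    # 3.12: v_pi = sum_a pi(a|s) q_pi(s,a)
```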
* ### Exercise 3.14:
Bellman equation: $v_\pi(s)=\Sigma_{a\in A}\pi(a|s)\Sigma_{s',r}p(s',r|s,a)[r+\gamma v_\pi(s')]$
1. Since we are following the equiprobable random policy, $\pi(a|s) = 0.25$ for every state-action pair.
2. The environment is deterministic, so $p(s',r|s,a)$ is either 1 or 0. For example, going right from the (1,1) square takes you to the (2,1) square with a reward of 0, with probability 1 (taking the bottom-left square as the origin).
3. Hence, for the centre state (valued at $+0.7$), with $\gamma = 0.9$ and all immediate rewards 0:
$0.25[0.9(2.3)] + 0.25[0.9(0.4)] + 0.25[0.9(-0.4)] + 0.25[0.9(0.7)] = 0.675$
which matches the reported $+0.7$ to one decimal place.
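With the discount included, the check is one line of Python (the four neighbour values are the ones given in the exercise):

```python
gamma = 0.9
neighbours = [2.3, 0.4, -0.4, 0.7]   # values of the four neighbouring states

# v(centre) = sum over the 4 equiprobable actions of 0.25 * [0 + gamma * v(neighbour)]
v_centre = sum(0.25 * (0.0 + gamma * v) for v in neighbours)
print(v_centre)  # ~0.675, i.e. +0.7 to one decimal place
```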
* ### Exercise 3.15:
For a continuing task the return at any time-step is
$G_t = \Sigma_{i=t+1}^{\infty}\gamma^{i-t-1} R_i$
If we add a constant $c$ to all the rewards, then
$G_t^{'} = \Sigma_{i=t+1}^{\infty}\gamma^{i-t-1}R_i + c \Sigma_{i=t+1}^{\infty}\gamma^{i-t-1}$
$G_t^{'} = \Sigma_{i=t+1}^{\infty}\gamma^{i-t-1}R_i + c\Sigma_{i=0}^{\infty}\gamma^{i}$
$G_t^{'} = \Sigma_{i=t+1}^{\infty}\gamma^{i-t-1}R_i + \frac{c}{1-\gamma}$
The return at every time-step is incremented by the same constant, so the expected return from every state (its value) is also incremented by the same constant, $v_c = c/(1-\gamma)$. The relative values of the states, and hence the ordering of policies, are unchanged.
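A quick numeric illustration of this on a truncated reward stream; the rewards and the constant $c$ below are arbitrary:

```python
import random

gamma, c, H = 0.9, 5.0, 1000          # H-step truncation of the infinite sum
random.seed(0)
rewards = [random.uniform(-1, 1) for _ in range(H)]

G  = sum(gamma ** k * r       for k, r in enumerate(rewards))
Gc = sum(gamma ** k * (r + c) for k, r in enumerate(rewards))

print(Gc - G, c / (1 - gamma))  # both ~50.0, matching v_c = c / (1 - gamma)
```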
* ### Exercise 3.16:
Adding a constant to all the rewards in an episodic task can change the relative values of states, because the expected returns from different states change by different amounts.
The return from a state one step before termination changes only by $c$, while the return from, say, the starting state changes by a larger amount.
For an episodic task,
$G_t = \Sigma_{i=t+1}^{T}\gamma^{i-t-1}R_i$
the new returns $G_t'$ would be
$G_t' = \Sigma_{i=t+1}^{T}\gamma^{i-t-1} R_i + c\Sigma_{i=t+1}^{T}\gamma^{i-t-1}$
where $c$ is the constant. The second term, $c\,\Sigma_{i=t+1}^{T}\gamma^{i-t-1}$, depends on the number of steps $T-t$ remaining until termination, so states at different distances from the end of the episode are shifted by different amounts. In the maze example, a large positive $c$ would actually reward the agent for staying in the maze longer.
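A small numeric illustration that the shift depends on the number of remaining steps; $\gamma$, $c$ and the two horizons are arbitrary illustrative values:

```python
gamma, c = 0.9, 5.0

def added_return(steps_remaining):
    """Extra return c * sum_{k=0}^{steps_remaining - 1} gamma^k contributed by the constant."""
    return sum(c * gamma ** k for k in range(steps_remaining))

print(added_return(1), added_return(20))  # 5.0 vs ~43.9: different shifts, so relative values change
```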
* ### Exercise 3.17:
$q_\pi(s,a) = \Sigma_{s',r}\,p(s',r|s,a)[r+\gamma \Sigma_{a' \in A} \pi(a'|s')q_\pi(s',a')]$
* ### Exercise 3.18:
1. $v_\pi(s) = E_{\pi}[q_\pi(s,A_t) \mid S_t = s]$ for all $s \in S$.
2. $v_\pi(s) = \Sigma_{a\in A}\pi(a|s)q_\pi(s,a)$
* ### Exercise 3.19:
1. $q_\pi(s,a) = E[R_{t+1}+ \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$
2. $q_\pi(s,a) = \Sigma_{s',r}[p(s',r|s,a)(r+\gamma v_{\pi}(s'))]$
* ### Exercise 3.20:
1) It is important to note that $V_{putt} = q_{*}(s,putt)$.
2) Then, using the formula $V_*(s) = \max_{a}q_*(s,a)$, we can read off the solution.
3) The optimal value function needs an optimal policy. In the region where $q_*(s,driver) = -3$ it is better to use the driver, so for all states in this area $V_*(s)=-3$. In the region where $V_{putt}(s) = -3$ it is still better to use the driver, since from there it reaches the hole in two strokes rather than the putter's three, so $V_*(s)=-2$ there. In the band just outside the green, where $V_{putt}(s)=-2$ and $q_*(s,driver)=-2$, either choice works and $V_*(s)=-2$. On the green it is best to use the putter (in the inner circle of the green either club works), so $V_*(s)=-1$. Lastly, from the sand it is of course best to use the driver, so $V_*(s)=-2$ in the sandy region.
* ### Exercise 3.21:
As stated in the last solution, the contours of $q_{*}(s,putt)$ are equal to the contours of $V_{putt}(s)$.
* ### Exercise 3.22:
1) If $\gamma = 0$: $\pi_{left}$ is optimal.
2) If $\gamma = 0.5$: both policies are optimal.
3) If $\gamma = 0.9$: $\pi_{right}$ is optimal.
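A quick check of the three cases, reading the MDP off the exercise's figure: from the top state, $\pi_{left}$ earns the reward sequence $1,0,1,0,\dots$ and $\pi_{right}$ earns $0,2,0,2,\dots$, so $v_{left}(top) = 1/(1-\gamma^2)$ and $v_{right}(top) = 2\gamma/(1-\gamma^2)$.

```python
# Value of the top state under each deterministic policy.
def v_left(gamma):   # reward sequence 1, 0, 1, 0, ...
    return 1 / (1 - gamma ** 2)

def v_right(gamma):  # reward sequence 0, 2, 0, 2, ...
    return 2 * gamma / (1 - gamma ** 2)

for gamma in (0.0, 0.5, 0.9):
    print(gamma, v_left(gamma), v_right(gamma))
# gamma = 0.0: 1.00 vs 0.00  -> pi_left optimal
# gamma = 0.5: 1.33 vs 1.33  -> tie
# gamma = 0.9: 5.26 vs 9.47  -> pi_right optimal
```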
* ### Exercise 3.23:
$q_*(high,search) = r_{search} + \gamma[\alpha V_*(high) + (1-\alpha) V_*(low)]$
$q_*(high,wait) = r_{wait} + \gamma V_*(high)$
$q_*(low,search) = (1-\beta)[-3 + \gamma V_*(high)] + \beta[r_{search} + \gamma V_*(low)]$
$q_*(low,wait) = r_{wait} + \gamma V_*(low)$
$q_*(low,recharge) = \gamma V_*(high)$
where $V_*(s) = \max_{a}q_*(s,a)$.
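A minimal value-iteration sketch of these equations; the numbers used for $\alpha$, $\beta$, $r_{search}$, $r_{wait}$ and $\gamma$ are placeholders, since the book leaves them unspecified.

```python
# Minimal value iteration for the recycling robot, using the q* equations above.
# alpha, beta, r_search, r_wait, gamma are placeholder values for illustration.
alpha, beta, gamma = 0.8, 0.6, 0.9
r_search, r_wait = 2.0, 1.0

v = {"high": 0.0, "low": 0.0}
for _ in range(1000):
    q = {
        ("high", "search"):   r_search + gamma * (alpha * v["high"] + (1 - alpha) * v["low"]),
        ("high", "wait"):     r_wait + gamma * v["high"],
        ("low",  "search"):   (1 - beta) * (-3 + gamma * v["high"]) + beta * (r_search + gamma * v["low"]),
        ("low",  "wait"):     r_wait + gamma * v["low"],
        ("low",  "recharge"): gamma * v["high"],
    }
    # v*(s) = max_a q*(s, a)
    v = {s: max(val for (s_, _), val in q.items() if s_ == s) for s in v}

print(v)  # converged v*(high), v*(low)
```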
* ### Exercise 3.24:
Under the optimal policy, from state $A$ the agent receives $+10$, is teleported to $A'$, and then takes four steps (each with reward 0) back to $A$, so the reward sequence repeats with period 5: $10,0,0,0,0,10,0,0,0,0,\dots$ Hence
$V_*(A) = \Sigma_{k=0}^{\infty}\gamma^{5k}\cdot 10 = \frac{10}{1-\gamma^{5}} \approx 24.419$
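The closed form is easy to evaluate:

```python
gamma = 0.9

# From A the optimal policy earns +10 every 5 steps (jump to A', then 4 steps back to A).
v_A = 10 / (1 - gamma ** 5)
print(round(v_A, 3))  # 24.419
```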
* ### Exercise 3.25:
$V_*(s) = \max_{a}q_{*}(s,a)$
* ### Exercise 3.26:
$q_*(s,a) = \Sigma_{s',r}(p(s',r|s,a)[r+ \gamma V_*(s')])$
$q_*(s,a) = 10 + 0.9(-1) + 0.9^{2}(-1) + 0.9^{3}(-1) + 0.9^{4}(-1) + 0.9^{5}(-10) ...$
To find the exact value, run the code I've implemented [here](https://github.com/RajGhugare19/Classical-RL/tree/master/Unit2-Bellman-equations); standard value iteration will give the exact answer.
* ### Exercise 3.27:
$\pi_*(s) = \arg\max_{a}q_*(s,a)$
* ### Exercise 3.28:
$\pi_*(s) = \arg\max_{a}\Sigma_{s',r}\,p(s',r|s,a)[r +\gamma V_*(s')]$
* ### Exercise 3.29:
1) $v_{\pi}(s) = \Sigma_{a} \pi(a|s)[r(s,a) + \gamma\Sigma_{s'}p(s'|s,a)v_{\pi}(s')]$
2) $v_*(s) = \max_{a}[r(s,a) + \gamma\Sigma_{s'}p(s'|s,a)v_*(s')]$
3) $q_{\pi}(s,a) = r(s,a) + \gamma\Sigma_{s'}p(s'|s,a)\Sigma_{a'}\pi(a'|s')q_{\pi}(s',a')$
4) $q_*(s,a) = r(s,a) + \gamma\Sigma_{s'}p(s'|s,a)\max_{a'}q_*(s',a')$