# [Chapter 6 solutions](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)
Author
[Raj Ghugare](https://github.com/Raj19022000)
###### tags: `Sutton and Barto` `Solutions`
## Temporal Difference Learning
* ### Exercise 6.1:
$V_t$ is the array of state values used at time $t$ in the TD error and the TD update.
Derivation:
$G_t - V_t(s_t)$
$= R_{t+1} + \gamma G_{t+1}-V_{t}(s_{t}) +\gamma V_{t}(S_{t+1})-\gamma V_{t}(S_{t+1})$
$= \delta_{t} + \gamma(G_{t+1}-V_t({s_{t+1}}))$
$= \delta_{t} + \gamma(G_{t+1}-V_{t+1}({s_{t+1}}))+\gamma(V_{t+1}({s_{t+1}})- V_{t}(S_{t+1}))$
Just as we expanded $G_t - V_t(s_t)$, we will expand $G_{t+1} - V_{t+1}(s_{t+1})$
$=\delta_{t}+\gamma[\delta_{t+1}+\gamma(G_{t+2}-V_{t+2}(s_{t+2}))+\gamma(V_{t+2}(s_{t+2})-V_{t+1}(s_{t+2}))] + \gamma(V_{t+1}(s_{t+1})-V_{t}(s_{t+1}))$
$=\delta_{t}+\gamma \delta_{t+1} + \gamma^{2}(G_{t+2} - V_{t+2}(s_{t+2})) + \gamma^{2}(V_{t+2}(s_{t+2})-V_{t+1}(s_{t+2}))+\gamma(V_{t+1}(s_{t+1})-V_{t}(s_{t+1}))$
Unrolling this recursion all the way to the terminal time step $T$ separates the sum of TD errors from the extra correction terms:
$=[\delta_{t}+\gamma \delta_{t+1} +...+\gamma^{T-t-1}\delta_{T-1}] + [\gamma^{T-t}(V_{T}(s_{T})-V_{T-1}(s_{T})) + \gamma^{T-t-1}(V_{T-1}(s_{T-1})-V_{T-2}(s_{T-1})) +...+\gamma(V_{t+1}(s_{t+1})-V_{t}(s_{t+1}))]$
$=\sum_{k=t}^{T-1}[\gamma^{k-t}\delta_{k} + \gamma^{k-t+1}(V_{k+1}(s_{k+1})-V_{k}(s_{k+1}))]$
Extra term $= \sum_{k=t}^{T-1}\gamma^{k-t+1}(V_{k+1}(s_{k+1})-V_{k}(s_{k+1}))$
This extra term is what must be added to the sum of TD errors for it to equal the Monte Carlo error when the value array changes during the episode. A small numerical check of this identity is sketched below.
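The check below is purely illustrative (not from the book): it builds a random episode, lets the value array change arbitrarily between time steps, and verifies that the Monte Carlo error equals the sum of TD errors plus the extra term.

```python
# Verify: G_t - V_t(s_t) = sum_{k=t}^{T-1} [ gamma^{k-t} * delta_k
#                          + gamma^{k-t+1} * (V_{k+1}(s_{k+1}) - V_k(s_{k+1})) ],
# where delta_k = R_{k+1} + gamma * V_k(s_{k+1}) - V_k(s_k).
import numpy as np

rng = np.random.default_rng(0)
T, n_states, gamma = 6, 4, 0.9

# s_0, ..., s_{T-1} are random non-terminal states; state index n_states is terminal.
states = np.append(rng.integers(0, n_states, size=T), n_states)
rewards = rng.normal(size=T)                  # rewards[k] plays the role of R_{k+1}
V = rng.normal(size=(T + 1, n_states + 1))    # one value array V_k per time step
V[:, n_states] = 0.0                          # the terminal state always has value 0

G = np.zeros(T + 1)                           # Monte Carlo returns, G_T = 0
for k in reversed(range(T)):
    G[k] = rewards[k] + gamma * G[k + 1]

t = 0
lhs = G[t] - V[t, states[t]]
rhs = 0.0
for k in range(t, T):
    delta_k = rewards[k] + gamma * V[k, states[k + 1]] - V[k, states[k]]
    correction = V[k + 1, states[k + 1]] - V[k, states[k + 1]]
    rhs += gamma ** (k - t) * delta_k + gamma ** (k - t + 1) * correction

print(np.isclose(lhs, rhs))                   # True
```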
* ### Exercise 6.2:
A scenario where TD is much better is one with many possible starting states and plenty of experience of getting from one particular stretch of the route to the goal, just like the example in the question where we have a lot of experience driving home from the highway. In other words, we already have good approximations of the values of the states we normally pass through once we are on the highway. Now suppose that one day we start from a different building's parking lot but eventually end up on the same highway. TD methods are better here because the value of the new parking lot can be updated immediately by bootstrapping from the already accurate highway state values. With Monte Carlo methods we would have to observe many more trajectories, because the variance of all the rewards accumulated from the start state to the destination adds up. A tiny illustration of this bootstrapping is sketched below.
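The sketch below uses made-up state names and numbers (nothing here is from the book): a single TD(0) backup lets the brand-new starting state inherit the already accurate highway estimate, whereas Monte Carlo would have to average over complete, high-variance trips.

```python
# Values are predicted minutes of driving remaining; V("highway") is assumed to
# have already converged to an accurate estimate from past experience.
alpha, gamma = 1.0, 1.0                     # step size 1 adopts the bootstrapped target outright

V = {"parking_lot": 0.0, "highway": 30.0}   # new start state vs. well-estimated successor
elapsed = 10.0                              # observed minutes from the new lot to the highway

td_error = elapsed + gamma * V["highway"] - V["parking_lot"]
V["parking_lot"] += alpha * td_error
print(V["parking_lot"])                     # 40.0: a sensible estimate after one partial trip
```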
* ### Exercise 6.3:
During the first episode only the value of $V(A)$ was changed. This tells us that the episode ended at the left terminal state with a total return of $0$. We know that the TD(0) update for a state is:
$V(s_{t}) \leftarrow V(s_{t}) + \alpha(R_{t+1} + \gamma V(s_{t+1}) - V(s_{t}))$
For every state visited during the episode other than $A$, the next-step reward was always $0$ and the next state's value was $0.5$, so the TD error at those time steps was $0$ and no update was made. The terminal state's value is $0$, however, so on the final transition out of $A$ the TD error was:
$\delta_{t} = -V(A) = -0.5$, because $R_{t+1} = 0$ and $V(\text{terminal}) = 0$
Hence the value was changed by:
$\alpha \delta_{t} = 0.1 \times (-0.5) = -0.05$ (a minimal simulation of this first episode is sketched below).
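The simulation below assumes the setup of Example 6.2 ($\alpha = 0.1$, all non-terminal values initialised to $0.5$, undiscounted); everything in it is illustrative:

```python
# TD(0) on the 5-state random walk of Example 6.2. After the first episode only
# the value of the state from which the walk terminated changes: by -0.05 if it
# ended on the left (state A), or by +0.05 if it ended on the right (state E).
import random

random.seed(1)
alpha = 0.1
V = {s: 0.5 for s in "ABCDE"}
V["left"] = V["right"] = 0.0                 # terminal states have value 0

order = ["left", "A", "B", "C", "D", "E", "right"]
s = "C"                                      # every episode starts in the centre state
while s not in ("left", "right"):
    s_next = order[order.index(s) + random.choice((-1, 1))]
    r = 1.0 if s_next == "right" else 0.0    # the only non-zero reward is on the right exit
    V[s] += alpha * (r + V[s_next] - V[s])   # TD(0) update with gamma = 1
    s = s_next

print({k: round(V[k], 2) for k in "ABCDE"})  # exactly one value differs from 0.5
```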
* ### Exercise 6.4:
* ### Exercise 6.5:
For relatively high values of $\alpha$ the state-value estimates do not converge but keep oscillating around the true state values. This is because a constant $\alpha$ makes the estimate an exponential recency-weighted average of the sampled bootstrapped targets, in which the newest (and noisiest) targets get the most weight; this is why the RMS error eventually goes back up. A remedy is to slowly decay the learning rate as the number of episodes increases. The weighting is made explicit below.
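Writing the bootstrapped target used at the $n$-th update of a given state generically as $U_{n}$ and unrolling the constant-step-size update (this mirrors the exponential recency-weighted average of Chapter 2 of the book):
$V_{n+1} = V_{n} + \alpha(U_{n} - V_{n})$
$V_{n+1} = (1-\alpha)^{n}V_{1} + \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}U_{i}$
The most recent target always carries weight $\alpha$, so with a large constant step size the noise in the latest targets is never averaged out and the estimates keep fluctuating; decaying $\alpha$ removes this floor on the weight given to new samples.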
* ### Exercise 6.6:
The true values of the states could have been computed with Monte Carlo methods (first-visit or every-visit), but since the dynamics of the environment are known, they could simply have applied dynamic programming to find the true values, as sketched below.
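The sketch below uses iterative policy evaluation on the known dynamics of the random walk in Example 6.2; the code and its constants are illustrative:

```python
# From each non-terminal state the walk moves left or right with probability 0.5;
# the only non-zero reward is +1 when stepping off the right end.
import numpy as np

n = 5                                    # states A..E
V = np.zeros(n + 2)                      # indices 0 and n+1 are the terminal states (value 0)

for _ in range(1000):                    # in-place sweeps until convergence
    for s in range(1, n + 1):
        right_reward = 1.0 if s == n else 0.0
        V[s] = 0.5 * V[s - 1] + 0.5 * (right_reward + V[s + 1])

print(np.round(V[1:n + 1], 4))           # [0.1667 0.3333 0.5 0.6667 0.8333], i.e. 1/6 .. 5/6
```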
* ### Exercise 6.7:
Let $X$ be a random variable over any sample space $\Omega$. Let $P$ and $B$ be two probability measures defined over measurable subsets of $\Omega$. Then:
$E_{X \sim P}[X] = \int_{x} xP(x)dx$
$E_{X \sim P}[X] = \int_{x} x B(x) \frac{P(x)}{B(x)} dx$
$E_{X \sim P}[X] = E_{X \sim B}[X\frac{P(X)}{B(X)}]$
This way we can use samples obtained from the behaviour distribution to estimate an expectation under a different distribution.
In our MDP setting, the random variables are the reward obtained and the next state encountered, both of which depend on the action taken under the behaviour policy, so these random variables (or functions of them) have to be re-weighted by the importance sampling ratio. Hence the update, in my view, would be:
$V(s_{t}) \leftarrow V(s_{t}) + \alpha(\rho_{t:t}(R_{t+1} + \gamma V(s_{t+1})) - V(s_{t}))$
where $\rho_{t:t} = \frac{\pi(a_{t}|s_{t})}{b(a_{t}|s_{t})}$ is the single-step importance sampling ratio. A sketch of this update is given below.
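In the sketch, `pi` and `b` are assumed to hold the action probabilities of the target and behaviour policies, `V` is a table of state values, and the function name is my own:

```python
def off_policy_td0_update(V, pi, b, s, a, r, s_next, alpha=0.1, gamma=1.0):
    """One off-policy TD(0) backup: the bootstrapped target is re-weighted by the
    single-step importance sampling ratio rho_{t:t} = pi(a|s) / b(a|s)."""
    rho = pi[s][a] / b[s][a]
    V[s] += alpha * (rho * (r + gamma * V[s_next]) - V[s])
    return V
```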
* ### Exercise 6.8:
$G_{t} - Q(s_{t},a_{t}) = R_{t+1} + \gamma G_{t+1} - Q(s_{t},a_{t}) -\gamma Q(s_{t+1},a_{t+1}) + \gamma Q(s_{t+1},a_{t+1})$
$= \delta_{t} + \gamma(G_{t+1} - Q(s_{t+1},a_{t+1}))$
$= \delta_{t} + \gamma\delta_{t+1} + \gamma^{2}(G_{t+2} - Q(s_{t+2},a_{t+2}))$
We can keep on unrolling the same way:
$= \delta_{t} + \gamma\delta_{t+1} + \gamma^{2}\delta_{t+2} ... +\gamma^{T-t}(G_T - Q(s_T,a_T))$
$Q(s_T,a_T) = 0$ no matter what action is taken, since $s_T$ is the terminal state, and $G_T = 0$ by definition. Hence,
$G_{t}-Q(s_{t},a_{t}) = \sum_{k=t}^{T-1} \gamma^{k-t}\delta_{k}$
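As in Exercise 6.1, the identity can be sanity-checked numerically; here the action-value table $Q$ is held fixed over the episode, as the exercise assumes (an illustrative sketch, not from the book):

```python
# Verify: G_t - Q(s_t, a_t) = sum_{k=t}^{T-1} gamma^{k-t} * delta_k, with
# delta_k = R_{k+1} + gamma * Q(s_{k+1}, a_{k+1}) - Q(s_k, a_k) and Q fixed.
import numpy as np

rng = np.random.default_rng(0)
T, n_states, n_actions, gamma = 6, 4, 3, 0.9

states = np.append(rng.integers(0, n_states, size=T), n_states)  # s_T is terminal
actions = rng.integers(0, n_actions, size=T + 1)
rewards = rng.normal(size=T)                                     # rewards[k] is R_{k+1}
Q = rng.normal(size=(n_states + 1, n_actions))
Q[n_states, :] = 0.0                       # the terminal state's action values are 0

G = np.zeros(T + 1)                        # Monte Carlo returns, G_T = 0
for k in reversed(range(T)):
    G[k] = rewards[k] + gamma * G[k + 1]

t = 0
lhs = G[t] - Q[states[t], actions[t]]
rhs = sum(
    gamma ** (k - t)
    * (rewards[k] + gamma * Q[states[k + 1], actions[k + 1]] - Q[states[k], actions[k]])
    for k in range(t, T)
)
print(np.isclose(lhs, rhs))                # True
```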
* ### Exercise 6.11:
Q-learning is an off-policy algorithm because the behaviour policy is $\varepsilon$-greedy whereas the TD update evaluates, and improves towards, the fully greedy policy. In contrast, SARSA is an on-policy algorithm because the actions taken and the actions used in the TD updates are drawn from the same policy. The two update targets are contrasted in the sketch below.
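In the sketch, `Q` is assumed to be a NumPy array of shape `[n_states, n_actions]` and the function names are mine:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    """On-policy: bootstraps from the action a_next that the behaviour policy actually chose."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=1.0):
    """Off-policy: bootstraps from the best action in s_next, i.e. evaluates the greedy policy."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```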
* ### Exercise 6.12:
If the action selection is greedy, then Q-learning will behave almost exactly like SARSA, because the behaviour policy of both is the greedy policy and the updates made are then the same; however, these methods won't converge to the optimal values due to the lack of exploration. One subtle difference remains: SARSA selects $a_{t+1}$ before making its update, whereas Q-learning updates first and then selects the next action, so on a transition that leads back to the same state the update can change which action is greedy and the two algorithms can then select different actions.
* ### Exercise 6.13:
The algorithm would be similar to double Q-learning, with slightly different updates. Here $\pi_{1}$ and $\pi_{2}$ denote the $\varepsilon$-greedy policies computed from $Q_{1}$ and $Q_{2}$ respectively, so one table chooses the policy while the other evaluates it.
#### With $0.5$ probability:
$Q_{1}(s,a) \leftarrow Q_{1}(s,a) + \alpha(R + \gamma \sum_{a'}\pi_{1}(a'|s')Q_{2}(s',a') - Q_{1}(s,a))$
#### else:
$Q_{2}(s,a) \leftarrow Q_{2}(s,a) + \alpha(R + \gamma \sum_{a'}\pi_{2}(a'|s')Q_{1}(s',a')-Q_{2}(s,a))$
A sketch of one such update is given below.
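In the sketch, the helper `eps_greedy_probs` and all other names are my own, not from the book:

```python
import numpy as np

def eps_greedy_probs(Q, s, eps=0.1):
    """Epsilon-greedy action probabilities for state s under the value table Q."""
    probs = np.full(Q.shape[1], eps / Q.shape[1])
    probs[np.argmax(Q[s])] += 1.0 - eps
    return probs

def double_expected_sarsa_update(Q1, Q2, s, a, r, s_next, alpha=0.1, gamma=1.0, eps=0.1):
    """With probability 0.5 update Q1 towards the expectation of Q2 under the
    epsilon-greedy policy derived from Q1, and symmetrically otherwise."""
    if np.random.rand() < 0.5:
        pi = eps_greedy_probs(Q1, s_next, eps)
        Q1[s, a] += alpha * (r + gamma * np.dot(pi, Q2[s_next]) - Q1[s, a])
    else:
        pi = eps_greedy_probs(Q2, s_next, eps)
        Q2[s, a] += alpha * (r + gamma * np.dot(pi, Q1[s_next]) - Q2[s, a])
```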
* ### Exercise 6.14:
<----incomplete---->
Just as in board games, in Jack's car rental problem we have knowledge of the environment's dynamics: if we take a particular action, we know exactly which afterstate (the car configuration after the overnight moves) we end up in. The problem can therefore be formulated as choosing the action that leads to the best possible afterstate, which requires maintaining a value table over afterstates.
<----incomplete---->