# [Chapter 4 solutions](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)
Author
[Raj Ghugare](https://github.com/Raj19022000)
###### tags: `Sutton and Barto` `Solutions`
## Chapter 4 - Dynamic Programming:
* ### Exercise 4.1:
$q_{\pi}(11,down) = -1 + v_{\pi}(terminal) = -1 + 0 = -1$
$q_{\pi}(7,down) = -1 + v_{\pi}(11) = -1 + (-14) = -15$, using $v_{\pi}(11) = -14$ from Figure 4.1.
* ### Exercise 4.2:
1) For the random policy $\pi$,
$v_{\pi}(15) = -1 + 0.25\,v_{\pi}(12) + 0.25\,v_{\pi}(13) + 0.25\,v_{\pi}(14) + 0.25\,v_{\pi}(15)$, solved numerically below.
2) After the change to the dynamics of state 13, $v_{\pi}(15)$ under the equiprobable policy remains the same ($-20$): if we assume the values of states 12, 13, and 14 are unchanged, then $v_{\pi}(15) = v_{\pi}(13) = -20$, and the Bellman equation for state 13 still holds because action down now leads to state 15, which has exactly the value that state 13 had before, so the entire value function is self-consistent and unchanged.
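Plugging the converged random-policy values from Figure 4.1 of the book ($v_{\pi}(12)=-22$, $v_{\pi}(13)=-20$, $v_{\pi}(14)=-14$) into the equation of part 1 gives the number quoted above:
$$
v_{\pi}(15) = -1 + 0.25(-22) + 0.25(-20) + 0.25(-14) + 0.25\,v_{\pi}(15)
\;\Rightarrow\; 0.75\,v_{\pi}(15) = -15
\;\Rightarrow\; v_{\pi}(15) = -20
$$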
* ### Exercise 4.3:
1) $q_{\pi}(s,a) = E_{\pi}[R_{t+1} +\gamma q_{\pi}(S_{t+1},A_{t+1})|S_{t} = s,A_{t} = a]$
2) $q_{\pi}(s,a) = \Sigma_{s',r}p(s',r|s,a)[r + \gamma \Sigma_{a'}\pi(a'|s')q_{\pi}(s',a')]$
3) $q_{k+1}(s,a) = \Sigma_{s',r}p(s',r|s,a)[r + \gamma \Sigma_{a'}\pi(a'|s')q_{k}(s',a')]$
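For illustration, here is a minimal sketch of one synchronous sweep of the iterative update in 3), assuming the dynamics are given as a function `p(s, a)` that returns `(probability, next_state, reward)` triples and the policy as an array `pi[s, a]` (both names are placeholders, not from the book):
```python
import numpy as np

def q_evaluation_sweep(q, pi, p, states, actions, gamma=1.0):
    """One sweep of q_{k+1}(s,a) = sum_{s',r} p(s',r|s,a)[r + gamma * sum_{a'} pi(a'|s') q_k(s',a')]."""
    q_new = np.zeros_like(q)
    for s in states:
        for a in actions:
            total = 0.0
            for prob, s_next, r in p(s, a):  # enumerate (s', r) pairs with their probabilities
                total += prob * (r + gamma * np.dot(pi[s_next], q[s_next]))
            q_new[s, a] = total
    return q_new
```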
* ### Exercise 4.4:
Instead of a plain argmax in the policy-improvement step, we can break ties with a fixed deterministic rule, for example always choosing the lowest-indexed action among the optimal ones. The policy can then no longer keep switching between equally good actions, and the algorithm is guaranteed to terminate.
In Python this is exactly what `np.argmax` does, since it returns the first index at which the maximum is attained:
> np.argmax(PolicyOutput)
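For illustration, a short sketch of this tie-breaking, where `q_s` is a placeholder for the action values $q(s,\cdot)$ of a single state:
```python
import numpy as np

q_s = np.array([0.3, 0.7, 0.7, 0.1])    # action values for one state (example numbers)

# np.argmax returns the *first* index among ties, which is exactly the
# "lowest action index" tie-breaking rule described above.
greedy_action = np.argmax(q_s)           # -> 1, deterministically

# An equivalent, more explicit way to see all tied optimal actions:
tied = np.flatnonzero(q_s == q_s.max())  # -> array([1, 2])
greedy_action = tied[0]                  # pick the lowest index
```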
* ### Exercise 4.5:
1. Initialize
$q(s,a) \in R$ and $\pi(s) \in A(s)$ arbitrarily for all $s,a$.
2. Policy Evaluation
Loop:
$\Delta \leftarrow 0$
loop for each $s \in S$, $a \in A(s)$
$q \leftarrow q(s,a)$
$q(s,a) \leftarrow \Sigma_{s',r}p(s',r|s,a)[r + \gamma\, q(s',\pi(s'))]$
$\Delta \leftarrow max(\Delta,|q-q(s,a)|)$
repeat until $\Delta < \theta$ (a small positive number determining the accuracy of estimation)
3. Policy Improvement
$policyStable$ = True
* for each $s \in S$:
$OldAction = \pi(s)$
$\pi(s) \leftarrow argmax_{a}(q(s,a))$
if $OldAction$ $\neq \pi(s)$, then $policyStable$ $\leftarrow$ False
if $policyStable$ is True, then stop and return policy $\pi$ and action-value function $q(s,a)$; else go to step 2.
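A minimal Python sketch of the q-based policy iteration above, assuming (these are my placeholders, not from the book) that the dynamics are available as a function `p(s, a)` returning `(probability, next_state, reward)` triples and that terminal states contribute value 0:
```python
import numpy as np

def policy_iteration_q(p, n_states, n_actions, gamma=1.0, theta=1e-8):
    q = np.zeros((n_states, n_actions))
    policy = np.zeros(n_states, dtype=int)     # deterministic policy: pi(s) = action index

    while True:
        # 2. Policy evaluation of q(s, a) under the current deterministic policy
        while True:
            delta = 0.0
            for s in range(n_states):
                for a in range(n_actions):
                    old = q[s, a]
                    q[s, a] = sum(prob * (r + gamma * q[s_next, policy[s_next]])
                                  for prob, s_next, r in p(s, a))
                    delta = max(delta, abs(old - q[s, a]))
            if delta < theta:
                break

        # 3. Policy improvement: greedy w.r.t. q, lowest index breaks ties (Exercise 4.4)
        policy_stable = True
        for s in range(n_states):
            old_action = policy[s]
            policy[s] = int(np.argmax(q[s]))
            if old_action != policy[s]:
                policy_stable = False
        if policy_stable:
            return policy, q
```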
* ### Exercise 4.6:
1. Initialize
$v(s) \in R$ arbitrarily for all $s$, and $\pi$ to an arbitrary $\epsilon$-soft policy, i.e. $\pi(a|s) \geq \epsilon/|A(s)|$ for all $s,a$.
2. Policy Evaluation
Loop:
$\Delta \leftarrow 0$
loop for each $s \in S$
$v \leftarrow v(s)$
$v(s) \leftarrow \Sigma_{a}\pi(a|s)\Sigma_{s',r}p(s',r|s,a)[r + \gamma v(s')]$
$\Delta \leftarrow max(\Delta,|v-v(s)|)$
repeat until $\Delta < \theta$ (a small positive number determining the accuracy of estimation)
3. Policy Improvement
$policyStable$ = True
* for each $s \in S$:
$OldAction \leftarrow argmax_{a}\,\pi(a|s)$ (the current greedy action)
$a'\leftarrow argmax_{a}\Sigma_{s',r}p(s',r|s,a)[r + \gamma v(s')]$
$\pi(a'|s) \leftarrow 1-\epsilon +\epsilon/|A(s)|$
for all other actions $a \neq a'$,
$\pi(a|s) \leftarrow \epsilon/|A(s)|$
if $OldAction$ $\neq a'$, then $policyStable$ $\leftarrow$ False
if $policyStable$ is True, then stop and return policy $\pi$ and value function $v(s)$; else go to step 2.
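For the improvement step above, a small sketch of how the $\epsilon$-soft probabilities could be reassigned, assuming `q_s` holds the one-step look-ahead values $\Sigma_{s',r}p(s',r|s,a)[r + \gamma v(s')]$ for each action of a single state (the names are placeholders):
```python
import numpy as np

def epsilon_soft_improvement(q_s, epsilon):
    """Return pi(.|s): the greedy action gets 1 - eps + eps/|A(s)|, every other action gets eps/|A(s)|."""
    n_actions = len(q_s)
    greedy = int(np.argmax(q_s))                    # lowest-index tie-break
    pi_s = np.full(n_actions, epsilon / n_actions)  # eps / |A(s)| for every action
    pi_s[greedy] += 1.0 - epsilon                   # greedy action: 1 - eps + eps/|A(s)|
    return pi_s, greedy
```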
* ### Exercise 4.8:
The shape of this policy follows from the reward structure of the problem: a reward of +1 is obtained only on reaching \$100. Hence at \$50, the action value of betting the whole \$50 becomes relatively high in the very first sweep:
$q(s=50,a=50) = 0.4(1) = 0.4$
The action value of betting \$49 when holding \$51 is
$q(s=51,a=49) = 0.4(1) + 0.6\,v(s=2)$
which risks falling all the way down to \$2, a state of very low value. In contrast, the action value of betting only \$1 when holding \$51,
$q(s=51,a=1) = 0.6\,v(s=50) + 0.4\,v(s=52)$
turns out to be at least as high, because it keeps the gambler next to \$50, a relatively valuable state, instead of risking the drop to \$2.
This value keeps propagating backwards: in later sweeps, the action values of actions that reach state 50 or state 100 directly are increased. The resulting policy is good in the sense of maximizing the probability of obtaining the +1 reward by the end of the episode.
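To see this numerically, below is a minimal value-iteration sketch for the gambler's problem, assuming $p_h = 0.4$ and a +1 reward only on reaching \$100 (as in the book's Figure 4.3); all names are my own:
```python
import numpy as np

P_H = 0.4                    # probability the coin comes up heads
GOAL = 100

v = np.zeros(GOAL + 1)       # v[0] and v[GOAL] stay 0 (terminal states)

def action_values(s, v):
    """q(s, a) for every stake a in 1..min(s, GOAL - s)."""
    stakes = np.arange(1, min(s, GOAL - s) + 1)
    win = s + stakes
    lose = s - stakes
    reward = (win == GOAL).astype(float)   # +1 only for reaching the goal
    return P_H * (reward + v[win]) + (1 - P_H) * v[lose]

# In the first sweep q(50, 50) = 0.4 comes purely from the +1 reward;
# later sweeps propagate this value backwards to neighbouring states.
for sweep in range(50):                    # 50 sweeps is enough for a rough illustration
    for s in range(1, GOAL):
        v[s] = action_values(s, v).max()
```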
* ### Exercise 4.10:
The analog of the value-iteration update (4.10) for action values is:
$q_{k+1}(s,a) = \Sigma_{s',r}p(s',r|s,a)[r +\gamma\, max_{a'}\, q_{k}(s',a')]$
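In code, one synchronous sweep of this update could look like the following sketch, using the same assumed `p(s, a)` interface as the earlier sketches:
```python
import numpy as np

def q_value_iteration_sweep(q, p, states, actions, gamma=1.0):
    """One sweep of q_{k+1}(s,a) = sum_{s',r} p(s',r|s,a)[r + gamma * max_{a'} q_k(s',a')]."""
    q_new = np.zeros_like(q)
    for s in states:
        for a in actions:
            q_new[s, a] = sum(prob * (r + gamma * q[s_next].max())
                              for prob, s_next, r in p(s, a))
    return q_new
```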