# Notes on "[Approximately Optimal Approximate Reinforcement Learning](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/KakadeLangford-icml2002.pdf)"
###### tags: `incomplete` `research paper notes` `Conservative Policy iteration`
#### Author
[Raj Ghugare](https://github.com/RajGhugare19)
### Preliminaries:
This paper assumes access to a simulator of the MDP that can be restarted from states drawn from a chosen distribution; this is what the authors mean by a restart distribution. The definition of the MDP is standard.
Every MDP is described by the tuple $(S,A,P,r,\mu,\gamma)$ where:
$S \rightarrow$ set of all states (finite)
$A \rightarrow$ set of all actions (finite)
$P \rightarrow$ transition dynamics of the environment
$r \rightarrow$ reward function
$\mu \rightarrow$ probability distribution over initial states
$\gamma \rightarrow$ discount factor
They use a normalised value function:
$V_{\pi}(s) = (1-\gamma)E_{\pi}[\Sigma_{t=0}^{\infty}\gamma^{t} r_{t} \mid s_{0}=s]$
The expectation over $\pi$ is taken over trajectories generated by following policy $\pi$.
Similarly, the state-action value function is defined as:
$Q_{\pi}(s,a) = (1-\gamma)r(s,a) + \gamma E_{s' \sim P(\cdot|s,a)}[V_{\pi}(s')]$
Advantage function:
$A_{\pi}(s,a) = Q_{\pi}(s,a) - V_{\pi}(s)$
The state visitation distribution is discounted as well:
$d_{\pi}(s) = (1-\gamma)\Sigma_{t=0}^{\infty}\gamma^{t}p(s_{t}=s|\pi,\mu)$
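To make these definitions concrete, here is a minimal Monte Carlo sketch (my own, not code from the paper) that estimates the normalised value function and the discounted visitation distribution $d_{\pi}$. It assumes a hypothetical tabular simulator with `reset()` and `step(a)` methods and a policy array `pi` of shape `(num_states, num_actions)`.

```python
import numpy as np

def mc_normalised_value(env, pi, gamma, num_episodes=1000, horizon=200):
    """Monte Carlo estimate of V_pi(s0) = (1 - gamma) * E[sum_t gamma^t r_t]."""
    returns = []
    for _ in range(num_episodes):
        s = env.reset()                      # assumed simulator interface
        g, discount = 0.0, 1.0
        for _ in range(horizon):             # truncate the infinite sum
            a = np.random.choice(len(pi[s]), p=pi[s])
            s, r, done = env.step(a)         # assumed to return (state, reward, done)
            g += discount * r
            discount *= gamma
            if done:
                break
        returns.append((1.0 - gamma) * g)
    return np.mean(returns)

def mc_visitation(env, pi, num_states, gamma, num_episodes=1000, horizon=200):
    """Monte Carlo estimate of d_pi(s) = (1 - gamma) * sum_t gamma^t P(s_t = s)."""
    d = np.zeros(num_states)
    for _ in range(num_episodes):
        s = env.reset()
        discount = 1.0
        for _ in range(horizon):
            d[s] += (1.0 - gamma) * discount
            a = np.random.choice(len(pi[s]), p=pi[s])
            s, _, done = env.step(a)
            discount *= gamma
            if done:
                break
    return d / num_episodes
```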
### Three desirable questions:
1) Is there some performance measure that is guaranteed to improve at every policy update?
2) How difficult is it to verify whether a particular update improves this measure?
3) After a reasonable number of policy updates, what performance level is obtained?
The authors propose an algorithm (conservative policy iteration) that addresses all three questions.
## Problems with existing methods:
### Approximate Value-based methods:
Consider a function approximator $\tilde V(s)$ with:
$\epsilon = \|\tilde V - V_{\pi}\|_{\infty}$, the infinity norm taken over the entire state space,
where $\pi$ is any policy. Let $\tilde\pi$ be the greedy policy with respect to this approximation. Then the following bound holds for all states $s$:
$V_{\tilde\pi}(s) \geq V_{\pi}(s) - \frac{2\gamma\epsilon}{1-\gamma}$
This bound is proved only for finite (or countably infinite) state spaces.
Hence question 1 is only loosely answered for such methods, while questions 2 and 3 are not: there is no well-defined performance measure that is guaranteed to improve, and the bound holds only asymptotically, i.e. when all states are visited infinitely often.
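To see what this bound looks like in practice, here is a small numerical sketch (my own, not from the paper): it evaluates an arbitrary policy exactly on a randomly generated finite MDP, perturbs its value function by at most $\epsilon$ in the infinity norm, acts greedily with respect to the perturbed values, and checks the bound. The random MDP and all helper names are illustrative.

```python
import numpy as np

def evaluate(P, r, pi, gamma):
    """Exact normalised value: V_pi = (1 - gamma) (I - gamma P_pi)^{-1} r_pi."""
    S = r.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)      # state-to-state transitions under pi
    r_pi = np.sum(pi * r, axis=1)              # expected reward under pi
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def greedy_from_value(P, r, V, gamma):
    """Deterministic greedy policy w.r.t. an (approximate) normalised value function."""
    q = (1 - gamma) * r + gamma * np.einsum('sat,t->sa', P, V)
    pi = np.zeros_like(r)
    pi[np.arange(r.shape[0]), np.argmax(q, axis=1)] = 1.0
    return pi

rng = np.random.default_rng(0)
S, A, gamma = 6, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a, s'], a random MDP
r = rng.uniform(size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)         # an arbitrary stochastic policy

V_pi = evaluate(P, r, pi, gamma)
eps = 0.05
V_tilde = V_pi + rng.uniform(-eps, eps, size=S)   # ||V_tilde - V_pi||_inf <= eps
pi_tilde = greedy_from_value(P, r, V_tilde, gamma)
V_tilde_pi = evaluate(P, r, pi_tilde, gamma)

bound = V_pi - 2 * gamma * eps / (1 - gamma)
print(np.all(V_tilde_pi >= bound))             # the bound holds at every state
```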
### Policy gradient methods:
Before analysing policy gradient methods, let's get familiar with the following two MDPs from the paper:
#### MDP 1:

#### MDP 2:

Using the policy gradient theorem, we can compute an unbiased estimate of the gradient of the performance measure and guarantee improvement via gradient ascent. Hence policy gradient methods answer the first question.
For MDP 1, if we follow a random policy, then for large values of $n$ the expected time required to reach the end scales exponentially in $n$. Thus any on-policy method with random initial exploration has to run for at least an exponential (in $n$) amount of time before it can start improving.
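The paper's figure for MDP 1 is not reproduced in these notes, so the sketch below is not the exact MDP from the paper. It uses a standard "combination lock" chain with the same qualitative property: under a uniform random policy, only one action advances along the chain and any other action resets to the start, so the expected time to reach the end grows exponentially with $n$.

```python
import numpy as np

def random_policy_hitting_time(n, num_actions=2, num_episodes=200, seed=0):
    """Average number of steps a uniform random policy needs to reach state n.

    Assumed (illustrative) dynamics: in every state exactly one action moves
    one step forward along the chain, every other action resets to state 0.
    """
    rng = np.random.default_rng(seed)
    times = []
    for _ in range(num_episodes):
        s, t = 0, 0
        while s < n:
            t += 1
            if rng.integers(num_actions) == 0:   # picked the "correct" action
                s += 1
            else:                                # any other action resets
                s = 0
        times.append(t)
    return np.mean(times)

for n in [4, 6, 8, 10]:
    print(n, random_policy_hitting_time(n))      # grows roughly like 2^(n+1)
```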
### Approximately optimal RL:
The problem with standard policy gradient techniques is that they use $\nabla \eta(\pi)$ as the gradient estimate, where
$\nabla \eta(\pi) = \Sigma_{s,a}d_{\pi}(s)\nabla \pi_{\theta}(a|s)Q_{\pi}(s,a)$
$d_{\pi}(s) = (1-\gamma)\Sigma_{t=0}^{\infty}\gamma^{t}p(s_{t}=s|\pi,\mu)$
We can see that for states that are unlikely under the current policy, $d_{\pi}(s)$ is small, and hence the gradient contribution from those states is relatively small. But improving the policy at these currently unlikely states might be critical for reaching the optimal policy.
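As a rough illustration of this point (my own sketch, not code from the paper), the following computes the exact per-state gradient contributions for a tabular softmax policy on a small random MDP. Up to a constant factor, the contribution at state $s$ is $d_{\pi}(s)\,\pi_{\theta}(a|s)\,A_{\pi}(s,a)$, so rarely visited states contribute almost nothing to the gradient even if acting better there matters.

```python
import numpy as np

def exact_policy_gradient(P, r, theta, mu, gamma):
    """Exact d_pi and per-parameter gradient contributions for a softmax policy.

    Up to a constant factor, d eta / d theta[s, a] = d_pi(s) * pi(a|s) * A_pi(s, a).
    """
    S = r.shape[0]
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)                        # softmax policy
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.sum(pi * r, axis=1)
    V = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = (1 - gamma) * r + gamma * np.einsum('sat,t->sa', P, V)
    d = (1 - gamma) * np.linalg.solve((np.eye(S) - gamma * P_pi).T, mu)
    grad = d[:, None] * pi * (Q - V[:, None])                  # per-state contributions
    return d, grad

rng = np.random.default_rng(1)
S, A, gamma = 5, 2, 0.95
P = rng.dirichlet(np.ones(S), size=(S, A))     # random transition dynamics P[s, a, s']
r = rng.uniform(size=(S, A))
mu = rng.dirichlet(np.ones(S))                 # initial state distribution
theta = rng.normal(size=(S, A))                # softmax policy parameters

d, grad = exact_policy_gradient(P, r, theta, mu, gamma)
for s in range(S):
    print(f"state {s}: d_pi = {d[s]:.3f}, gradient magnitude = {np.abs(grad[s]).sum():.4f}")
```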